mwells
c283e85e40
add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00
mwells
ca6af65217
got dmoz navigation system working.
now we just need to index the urls to
populate dmoz.
2013-10-10 22:08:21 -07:00
mwells
7ba9994804
many dmoz fixes. but still more we need to do.
isn't printing subcategories right now.
2013-10-08 23:55:11 -07:00
mwells
d8e6ac8748
fixed bug of not putting meta tags
in all gbdmoz.urls.txt.* files in
dmozparse.cpp
2013-10-06 00:18:59 -06:00
mwells
000caa5a26
support for usefakeips meta tag
2013-10-06 00:10:07 -06:00
mwells
612f2872f7
use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
mwells
71d5d05f7c
use catdb/ subdir not cat/ for consistency.
2013-10-04 21:35:13 -06:00
mwells
78c4bda368
fix dmozparse urldump -s bugs
for dumping out urls in dmoz.
2013-10-04 00:00:26 -06:00
mwells
f562e6da9a
just ignore all urls with # (hashtag) in them
from the dmoz dump. we were truncating
http://twitter.com/#!/ronpaul to
http://twitter.com/ and when looking up
the catids of twitter.com got that ronpaul url.
so that's bad. people should respect the hashtag.
2013-10-03 23:33:55 -06:00
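A rough sketch of the problem described in the f562e6da9a message, not the actual dmozparse.cpp code: cutting a url at '#' maps http://twitter.com/#!/ronpaul onto http://twitter.com/, so the two entries collide on category lookup, while skipping such urls avoids the collision. The helper names hasFragment, truncateAtFragment and dropFragmentUrls are hypothetical and chosen only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if the url contains a '#' anywhere.
bool hasFragment(const std::string &url) {
    return url.find('#') != std::string::npos;
}

// Hypothetical helper: the old, lossy behavior of cutting the url at '#'.
// If there is no '#', find() returns npos and the whole url is returned.
std::string truncateAtFragment(const std::string &url) {
    return url.substr(0, url.find('#'));
}

// Hypothetical helper: the new behavior, keeping only urls without '#'.
std::vector<std::string> dropFragmentUrls(const std::vector<std::string> &urls) {
    std::vector<std::string> kept;
    for (const std::string &u : urls) {
        if (!hasFragment(u)) kept.push_back(u);
    }
    return kept;
}
```

With truncation, the ronpaul url and the twitter.com front page become the same string; with dropFragmentUrls, only the front page survives and keeps its own catids.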
mwells
9e1fee2cb9
dmozparse works with latest dmoz files now
2013-10-03 22:08:40 -06:00
mwells
6c2c9f7774
trying to bring back dmoz integration.
2013-10-02 22:34:21 -06:00
Matt Wells
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00