Commit Graph

12 Commits

Author SHA1 Message Date
mwells
c283e85e40 add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00
mwells
ca6af65217 got dmoz navigation system working.
now we just need to index the urls to
populate dmoz.
2013-10-10 22:08:21 -07:00
mwells
7ba9994804 many dmoz fixes. but still more we need to do.
isn't printing subcategories right now.
2013-10-08 23:55:11 -07:00
mwells
d8e6ac8748 fixed bug of not putting meta tags
in all gbdmoz.urls.txt.* files in
dmozparse.cpp
2013-10-06 00:18:59 -06:00
mwells
000caa5a26 support for usefakeips meta tag
2013-10-06 00:10:07 -06:00
mwells
612f2872f7 use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
mwells
71d5d05f7c use catdb/ subdir not cat/ for consistency.
2013-10-04 21:35:13 -06:00
mwells
78c4bda368 fix dmozparse urldump -s bugs
for dumping out urls in dmoz.
2013-10-04 00:00:26 -06:00
mwells
f562e6da9a just ignore all urls with # (hashtag) in them
from the dmoz dump. we were truncating
http://twitter.com/#!/ronpaul to
http://twitter.com/ and when looking up
the catids of twitter.com got that ronpaul url.
so that's bad. people should respect the hashtag.
2013-10-03 23:33:55 -06:00
mwells
9e1fee2cb9 dmozparse works with latest dmoz files now
2013-10-03 22:08:40 -06:00
mwells
6c2c9f7774 trying to bring back dmoz integration.
2013-10-02 22:34:21 -06:00
Matt Wells
f6e560c1f4 Initial file population.
2013-08-02 13:12:24 -07:00