parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
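
a minimal sketch of how the three new filter terms could feed a priority decision. the UrlFacts struct, the pickPriority() name, the threshold of 10 inlinks and the priority numbers are all assumptions for illustration, not the actual url filter table:

```cpp
#include <cstdint>

// hypothetical per-url facts. field names mirror the filter names
// mentioned above, but this struct is an assumption, not real code.
struct UrlFacts {
    bool    isParentSitemap; // url was found in a sitemap
    bool    isRssExt;        // url extension suggests an rss feed
    int32_t numInlinks;      // inlink count known for the url
};

// returns a spider priority, higher is spidered sooner.
// first matching rule wins, like an ordered filter list.
// all numeric values here are made up.
int32_t pickPriority(const UrlFacts &u) {
    if (u.isParentSitemap)   return 70;
    if (u.isRssExt)          return 60;
    if (u.numInlinks >= 10)  return 50;
    return 30; // default for everything else
}
```

rules are checked in order, so a sitemap url that also looks like an rss feed gets the sitemap priority in this sketch.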
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
fix for thinking a crawl is done when it is just taking
a while to populate doledb from the waiting tree for
that SpiderColl. we just call populateDoledbFromWaiting in
doneSleepingWrapper every 50ms. it loops over every coll
so it could be more efficient.
for better performance. cache empty dolebufs
if winner tree list was not from cache, so that
a huge spiderdb scan list of urls we aren't
spidering, like twitter.com, gets cached.
do not call strstr in getUrlFilterNum2() for
.css? or /print/ since it was taking way too
much cpu time.
than 10M unique urls for a given "firstip"
queued up to be spidered in spiderdb
that have never been spidered. should prevent us
from having 20GB spiderdbs for spidering those sites
that essentially have an infinite # of urls, black hole
sites, that seem to be plaguing crawls.
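
a rough sketch of a per-firstip cap on never-spidered urls. the FirstIpQuota class and its methods are hypothetical names, and counting in an in-memory map is an assumption made to keep the example short. the real accounting in spiderdb may work differently:

```cpp
#include <cstdint>
#include <unordered_map>

// hypothetical counter of queued, never-spidered urls per firstip
class FirstIpQuota {
public:
    explicit FirstIpQuota(uint64_t maxPerIp) : m_max(maxPerIp) {}

    // returns true and counts the url if the firstip is under the
    // cap, false if the cap is reached and the url should be dropped
    bool tryAdd(uint32_t firstIp) {
        auto it = m_counts.find(firstIp);
        uint64_t n = (it == m_counts.end()) ? 0 : it->second;
        if (n >= m_max) return false;
        m_counts[firstIp] = n + 1;
        return true;
    }

    // call once a queued url has actually been spidered
    void markSpidered(uint32_t firstIp) {
        auto it = m_counts.find(firstIp);
        if (it == m_counts.end()) return;
        if (--it->second == 0) m_counts.erase(it);
    }

private:
    uint64_t m_max;
    std::unordered_map<uint32_t, uint64_t> m_counts;
};
```

with a cap of 10M, a call like `FirstIpQuota q(10000000ULL);` would reject further urls for a black hole site once its firstip hits the limit, while other firstips are unaffected.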
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
then the possibly new json child objects
won't get indexed.
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
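
a small sketch of the idea behind a dup cache: remember recently seen request keys and skip the add if the same key shows up again. the DupCache class below, its direct-mapped layout and the 64-bit key are assumptions, not the actual m_dupCache implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical fixed-size, direct-mapped cache of request keys.
// slots must be > 0. a collision just evicts the older key, so a
// miss only costs a redundant add, never a lost url.
class DupCache {
public:
    explicit DupCache(size_t slots) : m_slots(slots, 0) {}

    // returns true if the key is in the cache (caller can skip the
    // add), otherwise stores the key and returns false
    bool seenRecently(uint64_t key) {
        if (key == 0) key = 1; // 0 marks an empty slot
        uint64_t &slot = m_slots[key % m_slots.size()];
        if (slot == key) return true;
        slot = key;
        return false;
    }

private:
    std::vector<uint64_t> m_slots;
};
```

the caller would hash the spiderrequest into a key and only add to spiderdb when seenRecently() returns false.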
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
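
the last two checks can be pictured as one gate in front of the spider. this is a simplified sketch: the real checks happen at different stages (adding to waitingtree vs. reading from doledb), and the Verdict enum, Candidate struct and gate() function are hypothetical names:

```cpp
#include <cstdint>

enum class Verdict { Spider, SkipFiltered, SkipIpBusy };

// hypothetical summary of what is known about a candidate request
struct Candidate {
    bool    filteredOrBanned; // url filter says filtered or banned
    int32_t outstandingForIp; // spiders already in flight for its ip
};

// decide whether a candidate may be spidered now, given the
// per-ip limit on outstanding spiders
Verdict gate(const Candidate &c, int32_t maxOutPerIp) {
    if (c.filteredOrBanned) return Verdict::SkipFiltered;
    if (c.outstandingForIp >= maxOutPerIp) return Verdict::SkipIpBusy;
    return Verdict::Spider;
}
```

checking the filtered/banned case first means such urls never take up a per-ip slot in this sketch.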