Commit Graph

9 Commits

Author SHA1 Message Date
Matt Wells
6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages, although it is currently
not enabled; we are still using the stricter one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
f64b53bfb3 almost done with rebalancing code 2014-01-10 14:12:58 -08:00
Matt Wells
b0d77a834a do not spider fake ip requests, just re-add them
with the right firstip
2013-12-20 12:22:02 -08:00
mwells
e04d596288 minor comments update. 2013-12-09 13:42:33 -08:00
Matt Wells
6495dfd86e try to fix json parser overflow error. needs
testing. tried to stop round num from incrementing
for little jobs, which i think is caused by server overload.
should be fixed properly at some point. for now just made wait
time 30 secs instead of 10 in Spider.cpp.
2013-11-15 11:30:16 -08:00
mwells
9bf8bf7712 add spider reply even on g_errno now with an error
code of EINTERNAL error in the spider reply.
no longer just sit on the lock. this was blocking
an entire ip when just lock sitting for 3 hrs.
and only do read rate timeouts if there was at least
one byte read. this was causing diffbot reply to
read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
Matt Wells
a50898649b various fixes. 2013-09-16 10:16:49 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00