18 Commits

Author SHA1 Message Date
Matt Wells
d6434191d1 nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
mwells
b6e5424e32 do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
Matt Wells
e351d2a6f1 get searching on token working
2014-03-06 17:01:41 -08:00
Matt Wells
8aef2ba8a0 take out potentially bad robots.txt
filter compression logic.
2014-01-28 18:26:16 -08:00
Matt Wells
321fc90ff6 fix some cores.
NOTE: emails disabled here... need to fix.
2014-01-24 12:07:28 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
6aac48e487 fix crawl delay wait queue logic.
if coll already exists trying to add, let it be. don't error out.
2013-12-27 14:35:51 -08:00
Matt Wells
5cdb73bc70 fix spider core
2013-12-27 15:28:44 -07:00
Matt Wells
bff0083538 ensured robots.txt redirects are cached as well
2013-12-27 13:01:01 -08:00
Matt Wells
6f2e552bcd fix core in linked list of msg13requests in
case one gets freed
2013-12-20 11:26:46 -08:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
57eb231a4e do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
2013-11-26 14:21:17 -08:00
Matt Wells
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00
Matt Wells
8bb086ac60 crawldelay works now but it measures
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
Matt Wells
e8065a0f0a enforce crawl delay perfectly.
2013-11-22 18:26:34 -08:00
Matt Wells
f21fb98c16 fix core when getting new spider reply
when g_errno was ECORRUPTDATA
2013-10-04 20:44:29 -07:00
Gigablast
6ac55d3b75 Update Msg13.cpp
Do not override the spider user agent with Gigabot, use the one specified by the admin in Parms.cpp as the default.
2013-08-05 10:34:08 -06:00
Matt Wells
f6e560c1f4 Initial file population.
2013-08-02 13:12:24 -07:00