Matt Wells
|
d6434191d1
|
nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
|
2014-03-31 15:02:17 -07:00 |
|
mwells
|
b6e5424e32
|
do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
|
2014-03-21 12:40:38 -07:00 |
|
Matt Wells
|
e351d2a6f1
|
get searching on token working
|
2014-03-06 17:01:41 -08:00 |
|
Matt Wells
|
8aef2ba8a0
|
take out potentially bad robots.txt
filter compression logic.
|
2014-01-28 18:26:16 -08:00 |
|
Matt Wells
|
321fc90ff6
|
fix some cores.
NOTE: emails disabled here... need to fix.
|
2014-01-24 12:07:28 -08:00 |
|
Matt Wells
|
e366c12470
|
Merge branch 'master' into diffbot
Conflicts:
Collectiondb.cpp
Msg13.cpp
Parms.cpp
Spider.h
|
2014-01-07 12:09:11 -08:00 |
|
Matt Wells
|
6aac48e487
|
fix crawl delay wait queue logic.
if the coll already exists when trying to add it, let it be. don't error out.
|
2013-12-27 14:35:51 -08:00 |
|
Matt Wells
|
5cdb73bc70
|
fix spider core
|
2013-12-27 15:28:44 -07:00 |
|
Matt Wells
|
bff0083538
|
ensured robots.txt redirects are cached as well
|
2013-12-27 13:01:01 -08:00 |
|
Matt Wells
|
6f2e552bcd
|
fix core in linked list of msg13requests in
case one gets freed
|
2013-12-20 11:26:46 -08:00 |
|
Matt Wells
|
5ee2be8fcf
|
fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
|
2013-11-27 14:18:15 -08:00 |
|
Matt Wells
|
57eb231a4e
|
do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
|
2013-11-26 14:21:17 -08:00 |
|
Matt Wells
|
0f3374e3f3
|
measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
|
2013-11-26 14:07:28 -08:00 |
|
Matt Wells
|
8bb086ac60
|
crawldelay works now but it measures
from the end of the download, not the
beginning.
|
2013-11-26 12:58:14 -08:00 |
|
Matt Wells
|
e8065a0f0a
|
enforce crawl delay perfectly.
|
2013-11-22 18:26:34 -08:00 |
|
Matt Wells
|
f21fb98c16
|
fix core when getting new spider reply
when g_errno was ECORRUPTDATA
|
2013-10-04 20:44:29 -07:00 |
|
Gigablast
|
6ac55d3b75
|
Update Msg13.cpp
Do not override the spider user agent with Gigabot, use the one specified by the admin in Parms.cpp as the default.
|
2013-08-05 10:34:08 -06:00 |
|
Matt Wells
|
f6e560c1f4
|
Initial file population.
|
2013-08-02 13:12:24 -07:00 |
|