25 Commits

Author SHA1 Message Date
Matt
86800a0656 if a root/seed url has no outlinks, assumed banned. 2015-05-04 14:23:28 -07:00
Matt Wells
1825f6bd27 retry download if was in the twitchy table
at start of download, and not using proxies at all.
2015-04-30 16:06:13 -07:00
Matt
6d8bb19962 checkpoint for auto proxy logic 2015-04-30 13:28:57 -07:00
Matt
6fc83566e2 more fixes 2015-02-02 14:06:38 -08:00
Matt
c15bd53e52 added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
2015-02-02 13:23:38 -08:00
Matt Wells
8e315504a2 fix empty rdbcache bug of not enough buf mem. 2014-11-27 13:17:00 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
65800b65cf fix so diffbot doesn't timeout due
to large floater/proxy backoff crawl delay.
append &timeout=MAXCRAWLDELAY to diffbot api url.
2014-10-07 14:32:38 -07:00
mwells
c2f98a81b6 fix floater bug from reading hashtable off disk.
force use floaters if ! useRobots and is diffbot crawl.
2014-09-26 15:30:42 -07:00
mwells
6a28250e94 get qa test working after nyt bug fix 2014-08-06 16:00:25 -07:00
mwells
947be58f10 Merge branch 'diffbot-testing' into testing
Conflicts:
	HttpRequest.cpp
	Msg13.cpp
	XmlDoc.cpp
2014-08-05 17:19:53 -07:00
mwells
cc1ceaaac2 fix nyt.com cookie redir bug.
fixed bug when POSTing injection request with multipart/form-data.
2014-08-05 17:04:11 -07:00
mwells
05fcef9651 more vote infusion and squid proxy fixes. 2014-07-09 14:57:58 -07:00
mwells
ea90e7f755 more fixes for sectiondb markup code 2014-06-12 13:05:45 -07:00
mwells
7d452a766c completed squid proxy simulation code 2014-06-09 12:42:05 -07:00
mwells
965d992f98 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Msg13.cpp
2014-06-06 15:14:41 -07:00
mwells
3f2dcda4e1 got new floater/proxy logic compiling. 2014-06-06 15:11:51 -07:00
Matt Wells
ce7294e9a9 more mem leak fixes for fake
bulk job empty http replies
2014-06-05 20:09:12 -07:00
mwells
ee5af6b30e more spider proxy fixes 2014-06-02 14:59:15 -07:00
mwells
ca450e6bbd using msg55 when done downloading through a proxy to record
stats for load balancing on host #0
2014-06-02 13:48:33 -07:00
mwells
b6e5424e32 do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
Matt Wells
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00
Matt Wells
e8065a0f0a enforce crawl delay perfectly. 2013-11-22 18:26:34 -08:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00