Matt
|
86800a0656
|
if a root/seed url has no outlinks, assume it is banned.
|
2015-05-04 14:23:28 -07:00 |
|
Matt Wells
|
1825f6bd27
|
retry download if was in the twitchy table
at start of download, and not using proxies at all.
|
2015-04-30 16:06:13 -07:00 |
|
Matt
|
6d8bb19962
|
checkpoint for auto proxy logic
|
2015-04-30 13:28:57 -07:00 |
|
Matt
|
6fc83566e2
|
more fixes
|
2015-02-02 14:06:38 -08:00 |
|
Matt
|
c15bd53e52
|
added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
|
2015-02-02 13:23:38 -08:00 |
|
Matt Wells
|
8e315504a2
|
fix empty rdbcache bug of not enough buf mem.
|
2014-11-27 13:17:00 -08:00 |
|
Matt
|
96b8197ad3
|
now it compiles with -m32
|
2014-11-10 14:45:11 -08:00 |
|
Matt Wells
|
e7dd8f7956
|
replace long long with int64_t
|
2014-10-30 13:36:39 -06:00 |
|
Matt Wells
|
65800b65cf
|
fix so diffbot doesn't time out due
to large floater/proxy backoff crawl delay.
append &timeout=MAXCRAWLDELAY to diffbot api url.
|
2014-10-07 14:32:38 -07:00 |
|
mwells
|
c2f98a81b6
|
fix floater bug from reading hashtable off disk.
force use of floaters if !useRobots and it is a diffbot crawl.
|
2014-09-26 15:30:42 -07:00 |
|
mwells
|
6a28250e94
|
get qa test working after nyt bug fix
|
2014-08-06 16:00:25 -07:00 |
|
mwells
|
947be58f10
|
Merge branch 'diffbot-testing' into testing
Conflicts:
HttpRequest.cpp
Msg13.cpp
XmlDoc.cpp
|
2014-08-05 17:19:53 -07:00 |
|
mwells
|
cc1ceaaac2
|
fix nyt.com cookie redir bug.
fixed bug when POSTing injection request with multipart/form-data.
|
2014-08-05 17:04:11 -07:00 |
|
mwells
|
05fcef9651
|
more vote infusion and squid proxy fixes.
|
2014-07-09 14:57:58 -07:00 |
|
mwells
|
ea90e7f755
|
more fixes for sectiondb markup code
|
2014-06-12 13:05:45 -07:00 |
|
mwells
|
7d452a766c
|
completed squid proxy simulation code
|
2014-06-09 12:42:05 -07:00 |
|
mwells
|
965d992f98
|
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Msg13.cpp
|
2014-06-06 15:14:41 -07:00 |
|
mwells
|
3f2dcda4e1
|
got new floater/proxy logic compiling.
|
2014-06-06 15:11:51 -07:00 |
|
Matt Wells
|
ce7294e9a9
|
more mem leak fixes for fake
bulk job empty http replies
|
2014-06-05 20:09:12 -07:00 |
|
mwells
|
ee5af6b30e
|
more spider proxy fixes
|
2014-06-02 14:59:15 -07:00 |
|
mwells
|
ca450e6bbd
|
using msg55 when done downloading through a proxy to record
stats for load balancing on host #0
|
2014-06-02 13:48:33 -07:00 |
|
mwells
|
b6e5424e32
|
do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
|
2014-03-21 12:40:38 -07:00 |
|
Matt Wells
|
0f3374e3f3
|
measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
|
2013-11-26 14:07:28 -08:00 |
|
Matt Wells
|
e8065a0f0a
|
enforce crawl delay perfectly.
|
2013-11-22 18:26:34 -08:00 |
|
Matt Wells
|
f6e560c1f4
|
Initial file population.
|
2013-08-02 13:12:24 -07:00 |
|