Matt Wells
b8d57dcd3a
fix bug of dumping too many files to disk and not
being able to merge, and corrupting RdbBase::m_files[]
array and associated arrays.
2015-11-17 09:52:41 -08:00
Matt Wells
8b84297392
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-11 08:27:25 -08:00
Matt Wells
6cf6abf3d9
fix spider proxy clean up algo a little
so it won't freeze up
2015-11-11 08:27:09 -08:00
Matt
5f1695fab8
fix url.cpp
2015-11-10 00:29:42 -07:00
Matt
5061e5d7b5
normalize utf8 url paths into url encoded sequences.
2015-11-09 13:54:32 -07:00
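The normalization named in the commit above can be illustrated with a minimal sketch. It assumes the simplest reading of the message: every non-ASCII byte of a UTF-8 path is replaced by its %XX escape. The function name encodeUtf8Path is hypothetical and is not taken from url.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not the code in url.cpp): percent-encode every
// byte >= 0x80 of a URL path and leave ASCII bytes untouched.
std::string encodeUtf8Path(const std::string &path) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(path.size());
    for (unsigned char c : path) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(hex[c >> 4]);    // high nibble
            out.push_back(hex[c & 0x0F]);  // low nibble
        }
    }
    return out;
}
```

Under this assumption, a path containing the two-byte UTF-8 sequence C3 A9 comes out with %C3%A9 in its place, and an all-ASCII path is returned unchanged.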
Matt
80991c943f
complete merge of ia code into testing.
make indexing warcs/arcs a switch in spider controls.
2015-11-09 12:46:06 -07:00
Matt
fe448173d5
Merge branch 'ia' into testing
2015-11-09 11:14:00 -07:00
Matt
37cc4f2ba8
Merge branch 'diffbot-testing' into testing
2015-11-09 11:13:42 -07:00
Zak Betz
1351d9f994
Code cleanup.
2015-11-09 09:01:20 -07:00
Matt
dbe93c2ccf
fix bug of not always dumping core?
2015-11-08 08:54:46 -07:00
Matt Wells
3db9ae5d4d
rebuild fix
2015-11-07 13:14:38 -08:00
Matt Wells
93ec3138c5
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-06 13:31:02 -08:00
Matt Wells
44e3b0ca19
try to fix spider proxy load table pruning bug.
2015-11-06 13:30:42 -08:00
Zak Betz
c1bbd0207d
Don't bias tagdb lookups to a single host, use the host with the lowest
number of outstanding requests. The original reasoning was that one
host would handle all lookups for a site and that lookup would remain
in cache. Given that there are mega hubs like youtube and
facebook there should be as many hosts as possible handling requests for
these sites and the tagdb entries should stay in cache in all of the hosts
that have the key.
2015-11-04 15:37:49 -07:00
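The host selection described in the commit above could look roughly like the following sketch. The HostLoad struct and pickLeastLoaded function are hypothetical names used for illustration only; the actual change lives in the engine's own host and tagdb lookup code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of one candidate host that holds the tagdb key.
struct HostLoad {
    int hostId;
    int outstanding;  // requests sent to this host and not yet answered
};

// Return the index of the candidate with the fewest outstanding
// requests, or -1 if there are no candidates. Ties go to the first one.
int pickLeastLoaded(const std::vector<HostLoad> &candidates) {
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (best < 0 ||
            candidates[i].outstanding < candidates[best].outstanding)
            best = static_cast<int>(i);
    }
    return best;
}
```

Choosing by current load rather than by a fixed per-site host spreads lookups for very large sites across every host that has the key, which is the trade-off the commit message argues for.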
Matt
afbedba858
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-03 18:49:49 -08:00
Matt
c8c29db56b
fix core
2015-11-03 18:49:42 -08:00
Zak Betz
baa817b51d
Fix load balance of msg22s to use the udp slots in pinginfo.
Fix sigchild interrupting popen, when pdftohtml segfaults
popen was hanging forever.
Fix another bug when content length in http header was one off.
2015-11-03 11:51:19 -07:00
Matt
7608c5c29c
default I/O error detection to enabled so we see
hosts with I/O errors in the hosts table.
2015-11-03 11:33:16 -07:00
Matt Wells
08b6fa67d7
improve spider performance when we have lots of collections.
fix core from corrupt titledb rec of some sort.
automatically turn off profiler when you get data back for
simplicity.
2015-11-01 20:23:18 -08:00
Zak Betz
ff6caf79a2
Increase time to mark item as stale in warc injector.
2015-11-01 19:45:29 -07:00
Matt
cc305eb73a
fix so we can generate posdb map for
headless data files.
2015-11-01 14:56:39 -08:00
Matt
23d376f6c7
fix core from a bad title rec fetch
2015-10-29 19:43:02 -07:00
Zak Betz
aeca57e9f4
Pass in the buffer size of an injection request so that if the content
length header field is bigger than the actual buffer we won't index
random memory. Fixes bug with truncated warc captures.
2015-10-28 00:38:08 -06:00
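The guard described in the commit above amounts to clamping the declared content length to the bytes actually present in the buffer. The sketch below assumes the body starts at a known offset inside the request buffer; usableContentLength and its parameters are hypothetical names, not the injector's actual interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of body bytes that are safe to index.
// headerContentLength is the value from the Content-Length header,
// bufferSize is the real size of the request buffer, and bodyOffset
// is where the body begins inside that buffer.
std::size_t usableContentLength(std::size_t headerContentLength,
                                std::size_t bufferSize,
                                std::size_t bodyOffset) {
    if (bodyOffset >= bufferSize) return 0;  // no body bytes at all
    return std::min(headerContentLength, bufferSize - bodyOffset);
}
```

With a truncated capture, the header may declare more bytes than were stored; the clamp keeps the reader inside the buffer instead of running into unrelated memory.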
Zak Betz
555844ce1f
Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream
Conflicts:
Msg40.cpp
2015-10-26 22:17:03 -06:00
Zak Betz
f7bb617b85
Fixes for bad content lengths when injecting warcs.
2015-10-26 22:15:03 -06:00
Zak Betz
18e0b9ea9c
Fix warc injection so that pdfs, xls, ps, docs work.
Crank up max warc rec size to 5mb because pdfs are rarely < 1MB.
2015-10-25 23:09:43 -06:00
Matt Wells
66145e4396
fix core when exiting while merging
2015-10-24 12:50:57 -07:00
Matt Wells
776b94396e
a new ban msg for http status 503
2015-10-22 13:23:02 -07:00
Matt
488db03f60
do not send summary requests to non queryable hosts
2015-10-22 11:46:13 -06:00
Matt Wells
998c25e29b
spider proxy fixes for negative ports
2015-10-21 15:32:58 -07:00
Matt Wells
b2af4a00ae
remove old code preventing proxies from being passed to diffbot
2015-10-21 14:46:38 -07:00
Matt Wells
5f965c2c9a
reset proxy table every hour
2015-10-21 13:30:48 -07:00
Zak Betz
5241f2e1c7
Fix double call of gotSummary when computing facets in msg40. Fixes
missing results on page > 1 when searching for facets.
2015-10-20 17:21:37 -06:00
Zak Betz
089b36e050
Injector fixes.
2015-10-20 17:01:05 -06:00
Matt
51d68c4b3d
pass proxy info back to diffbot
2015-10-20 15:53:16 -06:00
Matt Wells
2d8c84b29c
fix bug of not shutting down right away
2015-10-20 13:26:24 -07:00
Matt Wells
b0b716010e
turn off proxyauth stuff for now
2015-10-20 13:06:59 -07:00
Matt Wells
771f4d7799
Merge branch 'diffbot-testing' into diffbot
Conflicts:
Spider.cpp
2015-10-20 11:48:44 -07:00
Matt Wells
928511f036
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-10-20 09:45:15 -07:00
Matt Wells
b9451a1f6f
fix token expired bug
2015-10-20 09:44:50 -07:00
Zak Betz
925fea29f4
Bug fix for search with facets with s=N | N > 0
Make warc injector more resilient to advancedsearch.php failure.
2015-10-19 18:28:15 -06:00
Matt Wells
e6d2cb5962
backwards compatible fix
2015-10-19 16:12:41 -06:00
Matt Wells
3afd768a32
make rel no follow a separate switch, but still just use
the robots.txt switch for diffbot crawls.
2015-10-19 15:34:57 -06:00
Matt
2df573acd8
enable diffbot proxyauth stuff for http urls only
2015-10-19 10:18:20 -06:00
Zak Betz
667e65ce01
Progress bar for warc injector.
2015-10-19 10:08:04 -06:00
Matt
5d07e24c01
use rel no follow switch support.
2015-10-19 10:05:46 -06:00
Zak Betz
ea139a65e6
Warc stream busy loop fixes.
Load balance msg22 to the one with the least outstanding requests.
2015-10-15 22:30:07 -06:00
Matt
75b72cc233
fix add url seg fault
2015-10-14 13:57:47 -06:00
Matt
e57e3481b4
fix innerloop strangeness when counting keys in buckets
2015-10-14 13:52:42 -06:00
Matt
3e19d43aa5
fix core
2015-10-14 12:03:12 -06:00