3615 Commits

Author SHA1 Message Date
Matt Wells
b8d57dcd3a fix bug of dumping too many files to disk and not
being able to merge, and corrupting RdbBase::m_files[]
array and associated arrays.
2015-11-17 09:52:41 -08:00
Matt Wells
8b84297392 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-11 08:27:25 -08:00
Matt Wells
6cf6abf3d9 fix spider proxy clean up algo a little
so it won't freeze up
2015-11-11 08:27:09 -08:00
Matt
5f1695fab8 fix url.cpp 2015-11-10 00:29:42 -07:00
Matt
5061e5d7b5 normalize utf8 url paths into url encoded sequences. 2015-11-09 13:54:32 -07:00
Matt
80991c943f complete merge of ia code into testing.
make indexing warcs/arcs a switch in spider controls.
2015-11-09 12:46:06 -07:00
Matt
fe448173d5 Merge branch 'ia' into testing 2015-11-09 11:14:00 -07:00
Matt
37cc4f2ba8 Merge branch 'diffbot-testing' into testing 2015-11-09 11:13:42 -07:00
Zak Betz
1351d9f994 Code cleanup. 2015-11-09 09:01:20 -07:00
Matt
dbe93c2ccf fix bug of not always dumping core? 2015-11-08 08:54:46 -07:00
Matt Wells
3db9ae5d4d rebuild fix 2015-11-07 13:14:38 -08:00
Matt Wells
93ec3138c5 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-06 13:31:02 -08:00
Matt Wells
44e3b0ca19 try to fix spider proxy load table pruning bug. 2015-11-06 13:30:42 -08:00
Zak Betz
c1bbd0207d Don't bias tagdb lookups to a single host, use the host with the lowest
number of outstanding requests.  The original reasoning was that one
host would handle all lookups for a site and that lookup would remain
in cache.  Given that there are mega hubs like youtube and
facebook there should be as many hosts as possible handling requests for
these sites and the tagdb entries should stay in cache in all of the hosts
that have the key.
2015-11-04 15:37:49 -07:00
Matt
afbedba858 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-03 18:49:49 -08:00
Matt
c8c29db56b fix core 2015-11-03 18:49:42 -08:00
Zak Betz
baa817b51d Fix load balance of msg22s to use the udp slots in pinginfo.
Fix sigchild interrupting popen, when pdftohtml segfaults
popen was hanging forever.
Fix another bug where the content length in the http header was off by one.
2015-11-03 11:51:19 -07:00
Matt
7608c5c29c default I/O error detection to enabled so we see
hosts with I/O errors in the hosts table.
2015-11-03 11:33:16 -07:00
Matt Wells
08b6fa67d7 improve spider performance when we have lots of collections.
fix core from corrupt titledb rec of some sort.
automatically turn off profiler when you get data back for
simplicity.
2015-11-01 20:23:18 -08:00
Zak Betz
ff6caf79a2 Increase time to mark item as stale in warc injector. 2015-11-01 19:45:29 -07:00
Matt
cc305eb73a fix so we can generate posdb map for
headless data files.
2015-11-01 14:56:39 -08:00
Matt
23d376f6c7 fix core from a bad title rec fetch 2015-10-29 19:43:02 -07:00
Zak Betz
aeca57e9f4 Pass in the buffer size of an injection request so that if the content
length header field is bigger than the actual buffer we won't index
random memory.  Fixes bug with truncated warc captures.
2015-10-28 00:38:08 -06:00
Zak Betz
555844ce1f Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream
Conflicts:
	Msg40.cpp
2015-10-26 22:17:03 -06:00
Zak Betz
f7bb617b85 Fixes for bad content lengths when injecting warcs. 2015-10-26 22:15:03 -06:00
Zak Betz
18e0b9ea9c Fix warc injection so that pdfs, xls, ps, docs work.
Crank up max warc rec size to 5MB because pdfs are rarely < 1MB.
2015-10-25 23:09:43 -06:00
Matt Wells
66145e4396 fix core when exiting while merging 2015-10-24 12:50:57 -07:00
Matt Wells
776b94396e a new ban msg for http status 503 2015-10-22 13:23:02 -07:00
Matt
488db03f60 do not send summary requests to non queryable hosts 2015-10-22 11:46:13 -06:00
Matt Wells
998c25e29b spider proxy fixes for negative ports 2015-10-21 15:32:58 -07:00
Matt Wells
b2af4a00ae remove old code preventing proxies from being passed to diffbot 2015-10-21 14:46:38 -07:00
Matt Wells
5f965c2c9a reset proxy table every hour 2015-10-21 13:30:48 -07:00
Zak Betz
5241f2e1c7 Fix double call of gotSummary when computing facets in msg40. Fixes
missing results on page > 1 when searching for facets.
2015-10-20 17:21:37 -06:00
Zak Betz
089b36e050 Injector fixes. 2015-10-20 17:01:05 -06:00
Matt
51d68c4b3d pass proxy info back to diffbot 2015-10-20 15:53:16 -06:00
Matt Wells
2d8c84b29c fix bug of not shutting down right away 2015-10-20 13:26:24 -07:00
Matt Wells
b0b716010e turn off proxyauth stuff for now 2015-10-20 13:06:59 -07:00
Matt Wells
771f4d7799 Merge branch 'diffbot-testing' into diffbot
Conflicts:
	Spider.cpp
2015-10-20 11:48:44 -07:00
Matt Wells
928511f036 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-10-20 09:45:15 -07:00
Matt Wells
b9451a1f6f fix token expired bug 2015-10-20 09:44:50 -07:00
Zak Betz
925fea29f4 Bug fix for search with facets with s=N | N > 0
Make warc injector more resilient to advancedsearch.php failure.
2015-10-19 18:28:15 -06:00
Matt Wells
e6d2cb5962 backwards compatible fix 2015-10-19 16:12:41 -06:00
Matt Wells
3afd768a32 make rel no follow a separate switch, but still just use
the robots.txt switch for diffbot crawls.
2015-10-19 15:34:57 -06:00
Matt
2df573acd8 enable diffbot proxyauth stuff for http urls only 2015-10-19 10:18:20 -06:00
Zak Betz
667e65ce01 Progress bar for warc injector. 2015-10-19 10:08:04 -06:00
Matt
5d07e24c01 use rel no follow switch support. 2015-10-19 10:05:46 -06:00
Zak Betz
ea139a65e6 Warc stream busy loop fixes.
Load balance msg22 to the one with the least outstanding requests.
2015-10-15 22:30:07 -06:00
Matt
75b72cc233 fix add url seg fault 2015-10-14 13:57:47 -06:00
Matt
e57e3481b4 fix innerloop strangeness when counting keys in buckets 2015-10-14 13:52:42 -06:00
Matt
3e19d43aa5 fix core 2015-10-14 12:03:12 -06:00
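
Note on c1bbd0207d and ea139a65e6: both commits describe routing a lookup to the host with the lowest number of outstanding requests, rather than always to one fixed host. The selection step might look roughly like the sketch below. This is not the project's code: `HostLoad` and `pickLeastLoaded` are hypothetical names invented for illustration, and breaking ties in favour of the earliest candidate is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-host record; NOT the project's actual host structure.
struct HostLoad {
    int hostId;               // illustrative identifier
    int outstandingRequests;  // requests sent to this host and not yet answered
};

// Returns the index of the candidate with the fewest outstanding requests,
// or -1 if the candidate list is empty. Ties go to the earliest candidate
// (an assumption; the commits do not say how ties are broken).
int pickLeastLoaded(const std::vector<HostLoad>& candidates) {
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        // A negative count would indicate broken bookkeeping.
        assert(candidates[i].outstandingRequests >= 0);
        if (best < 0 ||
            candidates[i].outstandingRequests <
                candidates[static_cast<std::size_t>(best)].outstandingRequests) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With candidates carrying 5, 2, 2 and 9 outstanding requests, the function returns index 1. Under this scheme requests for a very popular site spread across every host that holds the key, which is the behaviour c1bbd0207d argues for.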