3645 Commits

Author SHA1 Message Date
Matt Wells
add6f84b79 Merge branch 'diffbot-testing' into testing 2015-11-21 10:44:14 -08:00
Matt Wells
398225dde1 Merge branch 'diffbot' into diffbot-testing 2015-11-21 10:44:05 -08:00
Matt Wells
d55932d0b6 fix spider proxy table bug that seemed to be the
reason for the table getting so full. but in case
it does get full again added a call to hashtablex::empty()
so we don't freeze up any more.
2015-11-21 10:43:23 -08:00
Matt Wells
b3729ed214 tune spider proxy table flushing logic a bit 2015-11-21 10:29:02 -08:00
Matt Wells
425fc699f8 Merge branch 'diffbot-testing' into testing 2015-11-21 10:21:04 -08:00
Matt Wells
0964fb9715 Merge branch 'diffbot' into diffbot-testing 2015-11-21 10:20:49 -08:00
Matt Wells
3c766451d1 try to fix the proxy load balancing table logic some more.
seems to not clean up after itself very well.
2015-11-21 10:20:20 -08:00
Zak Betz
dbe101d759 Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing 2015-11-20 10:42:53 -07:00
Zak Betz
a46c5b8f86 Fix anomalous link text detector to take into consideration the total
number of inlinkers instead of just counting the matched link texts.
2015-11-20 10:42:46 -07:00
Matt
aeaca04df3 fix bug of losing the line waiter header
in linkdb.cpp for incoming msg25 requests.
start to show more info in sockets table
by parsing the request.
2015-11-19 19:40:30 -07:00
Matt
bdceec1796 Merge branch 'master' into testing 2015-11-19 16:24:45 -07:00
Matt
7bc27a521e fix compiler error on 32bit arches 2015-11-19 16:24:29 -07:00
Matt
6e7e267cfb Merge branch 'master' into testing 2015-11-19 16:14:24 -07:00
Matt
cd875f4ab9 fix empty url condition in add url. 2015-11-19 16:14:12 -07:00
Matt
b4ef9ca29f Merge branch 'diffbot-testing' into testing 2015-11-19 16:11:54 -07:00
Matt
eb57f0a8c3 Merge branch 'master' into testing 2015-11-19 16:11:38 -07:00
Zak Betz
87af33db66 Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing 2015-11-19 12:25:33 -07:00
Zak Betz
0bc50deb42 Filter link text anomalies at query time.
If a search result only has a few matches for a term in link text,
then don't return it in search results for that query.
2015-11-19 12:25:25 -07:00
Gigablast
7783c4655e Merge pull request #66 from alc-privacore/master
Fix segfault when using add URL
2015-11-18 23:39:49 -07:00
Matt
68f41bd22a debug why we don't dump core sometimes. 2015-11-18 16:11:27 -07:00
Matt
e0f4ba65c1 remove fixme log comment 2015-11-18 08:11:45 -07:00
Matt
feff30b6dc Merge branch 'diffbot' into testing 2015-11-17 11:04:56 -07:00
Matt Wells
b8d57dcd3a fix bug of dumping too many files to disk and not
being able to merge, and corrupting RdbBase::m_files[]
array and associated arrays.
2015-11-17 09:52:41 -08:00
Matt
690b4c5069 fix core from bogus url some more. 2015-11-16 12:51:18 -07:00
Matt
1a3c69af6b fix core dump from empty url 2015-11-16 12:08:16 -07:00
Matt
296651d416 fix getLeastLoadedInShard() to only return
the appropriate nospider/noquery hosts when using
nospider/noquery in hosts.conf.
2015-11-16 09:53:40 -07:00
Matt
1b60cbd46e fix core in Url.cpp 2015-11-16 09:29:08 -07:00
Matt
6e12f96aea Merge branch 'testing'
Conflicts:
	Rdb.cpp
2015-11-14 10:57:27 -07:00
Zak Betz
9ff387a898 More fixes to prevent spider traffic from hitting hosts with nospider
directive.
Bug fix for msg20 lookups always being directed away from noquery hosts.
2015-11-13 15:03:02 -07:00
Matt Wells
8b84297392 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-11 08:27:25 -08:00
Matt Wells
6cf6abf3d9 fix spider proxy clean up algo a little
so it won't freeze up
2015-11-11 08:27:09 -08:00
Matt
5f1695fab8 fix url.cpp 2015-11-10 00:29:42 -07:00
Matt
5061e5d7b5 normalize utf8 url paths into url encoded sequences. 2015-11-09 13:54:32 -07:00
Matt
80991c943f complete merge of ia code into testing.
make indexing warcs/arcs a switch in spider controls.
2015-11-09 12:46:06 -07:00
Matt
fe448173d5 Merge branch 'ia' into testing 2015-11-09 11:14:00 -07:00
Matt
37cc4f2ba8 Merge branch 'diffbot-testing' into testing 2015-11-09 11:13:42 -07:00
Zak Betz
1351d9f994 Code cleanup. 2015-11-09 09:01:20 -07:00
Matt
dbe93c2ccf fix bug of not always dumping core? 2015-11-08 08:54:46 -07:00
Matt Wells
3db9ae5d4d rebuild fix 2015-11-07 13:14:38 -08:00
Matt Wells
93ec3138c5 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-06 13:31:02 -08:00
Matt Wells
44e3b0ca19 try to fix spider proxy load table pruning bug. 2015-11-06 13:30:42 -08:00
Zak Betz
c1bbd0207d Don't bias tagdb lookups to a single host, use the host with the lowest
number of outstanding requests.  The original reasoning was that one
host would handle all lookups for a site and that lookup would remain
in cache.  Given that there are mega hubs like youtube and
facebook there should be as many hosts as possible handling requests for
these sites and the tagdb entries should stay in cache in all of the hosts
that have the key.
2015-11-04 15:37:49 -07:00
Matt
afbedba858 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-11-03 18:49:49 -08:00
Matt
c8c29db56b fix core 2015-11-03 18:49:42 -08:00
Zak Betz
baa817b51d Fix load balance of msg22s to use the udp slots in pinginfo.
Fix sigchild interrupting popen, when pdftohtml segfaults
popen was hanging forever.
Fix another bug when content length in http header was one off.
2015-11-03 11:51:19 -07:00
Matt
7608c5c29c default I/O error detection to enabled so we see
hosts with I/O errors in the hosts table.
2015-11-03 11:33:16 -07:00
Matt
95d70b110e fix bug in rebuild pipeline. need to merge the
files lest we max the # of files out.
2015-11-03 11:12:39 -07:00
Ai Lin Chia
6526accb50 Fix coredump when using add URL 2015-11-02 17:23:01 +01:00
Matt Wells
08b6fa67d7 improve spider performance when we have lots of collections.
fix core from corrupt titledb rec of some sort.
automatically turn off profiler when you get data back for
simplicity.
2015-11-01 20:23:18 -08:00
Zak Betz
ff6caf79a2 Increase time to mark item as stale in warc injector. 2015-11-01 19:45:29 -07:00