Matt Wells
add6f84b79
Merge branch 'diffbot-testing' into testing
2015-11-21 10:44:14 -08:00
Matt Wells
398225dde1
Merge branch 'diffbot' into diffbot-testing
2015-11-21 10:44:05 -08:00
Matt Wells
d55932d0b6
fix spider proxy table bug that seemed to be the
reason for the table getting so full. but in case
it does get full again added a call to hashtablex::empty()
so we don't freeze up any more.
2015-11-21 10:43:23 -08:00
Matt Wells
b3729ed214
tune spider proxy table flushing logic a bit
2015-11-21 10:29:02 -08:00
Matt Wells
425fc699f8
Merge branch 'diffbot-testing' into testing
2015-11-21 10:21:04 -08:00
Matt Wells
0964fb9715
Merge branch 'diffbot' into diffbot-testing
2015-11-21 10:20:49 -08:00
Matt Wells
3c766451d1
try to fix the proxy load balancing table logic some more.
seems to not clean up after itself very well.
2015-11-21 10:20:20 -08:00
Zak Betz
dbe101d759
Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing
2015-11-20 10:42:53 -07:00
Zak Betz
a46c5b8f86
Fix anomalous link text detector to take into consideration the total
number of inlinkers instead of just counting the matched link texts.
2015-11-20 10:42:46 -07:00
Matt
aeaca04df3
fix bug of losing the line waiter header
in linkdb.cpp for incoming msg25 requests.
start to show more info in sockets table
by parsing the request.
2015-11-19 19:40:30 -07:00
Matt
bdceec1796
Merge branch 'master' into testing
2015-11-19 16:24:45 -07:00
Matt
7bc27a521e
fix compiler error on 32bit arches
2015-11-19 16:24:29 -07:00
Matt
6e7e267cfb
Merge branch 'master' into testing
2015-11-19 16:14:24 -07:00
Matt
cd875f4ab9
fix empty url condition in add url.
2015-11-19 16:14:12 -07:00
Matt
b4ef9ca29f
Merge branch 'diffbot-testing' into testing
2015-11-19 16:11:54 -07:00
Matt
eb57f0a8c3
Merge branch 'master' into testing
2015-11-19 16:11:38 -07:00
Zak Betz
87af33db66
Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing
2015-11-19 12:25:33 -07:00
Zak Betz
0bc50deb42
Filter link text anomalies at query time.
If a search result only has a few matches for a term in link text,
then don't return it in search results for that query.
2015-11-19 12:25:25 -07:00
Gigablast
7783c4655e
Merge pull request #66 from alc-privacore/master
Fix segfault when using add URL
2015-11-18 23:39:49 -07:00
Matt
68f41bd22a
debug why we don't dump core sometimes.
2015-11-18 16:11:27 -07:00
Matt
e0f4ba65c1
remove fixme log comment
2015-11-18 08:11:45 -07:00
Matt
feff30b6dc
Merge branch 'diffbot' into testing
2015-11-17 11:04:56 -07:00
Matt Wells
b8d57dcd3a
fix bug of dumping too many files to disk and not
being able to merge, and corrupting RdbBase::m_files[]
array and associated arrays.
2015-11-17 09:52:41 -08:00
Matt
690b4c5069
fix core from bogus url some more.
2015-11-16 12:51:18 -07:00
Matt
1a3c69af6b
fix core dump from empty url
2015-11-16 12:08:16 -07:00
Matt
296651d416
fix getLeastLoadedInShard() to only return
the appropriate nospider/noquery hosts when using
nospider/noquery in hosts.conf.
2015-11-16 09:53:40 -07:00
Matt
1b60cbd46e
fix core in Url.cpp
2015-11-16 09:29:08 -07:00
Matt
6e12f96aea
Merge branch 'testing'
Conflicts:
Rdb.cpp
2015-11-14 10:57:27 -07:00
Zak Betz
9ff387a898
More fixes to prevent spider traffic from hitting hosts with nospider
directive.
Bug fix for msg20 lookups always being directed away from noquery hosts.
2015-11-13 15:03:02 -07:00
Matt Wells
8b84297392
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-11 08:27:25 -08:00
Matt Wells
6cf6abf3d9
fix spider proxy clean up algo a little
so it won't freeze up
2015-11-11 08:27:09 -08:00
Matt
5f1695fab8
fix url.cpp
2015-11-10 00:29:42 -07:00
Matt
5061e5d7b5
normalize utf8 url paths into url encoded sequences.
2015-11-09 13:54:32 -07:00
Matt
80991c943f
complete merge of ia code into testing.
make indexing warcs/arcs a switch in spider controls.
2015-11-09 12:46:06 -07:00
Matt
fe448173d5
Merge branch 'ia' into testing
2015-11-09 11:14:00 -07:00
Matt
37cc4f2ba8
Merge branch 'diffbot-testing' into testing
2015-11-09 11:13:42 -07:00
Zak Betz
1351d9f994
Code cleanup.
2015-11-09 09:01:20 -07:00
Matt
dbe93c2ccf
fix bug of not always dumping core?
2015-11-08 08:54:46 -07:00
Matt Wells
3db9ae5d4d
rebuild fix
2015-11-07 13:14:38 -08:00
Matt Wells
93ec3138c5
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-06 13:31:02 -08:00
Matt Wells
44e3b0ca19
try to fix spider proxy load table pruning bug.
2015-11-06 13:30:42 -08:00
Zak Betz
c1bbd0207d
Don't bias tagdb lookups to a single host, use the host with the lowest
number of outstanding requests. The original reasoning was that one
host would handle all lookups for a site and that lookup would remain
in cache. Given that there are mega hubs like youtube and
facebook there should be as many hosts as possible handling requests for
these sites and the tagdb entries should stay in cache in all of the hosts
that have the key.
2015-11-04 15:37:49 -07:00
Matt
afbedba858
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-11-03 18:49:49 -08:00
Matt
c8c29db56b
fix core
2015-11-03 18:49:42 -08:00
Zak Betz
baa817b51d
Fix load balance of msg22s to use the udp slots in pinginfo.
Fix sigchild interrupting popen: when pdftohtml segfaulted,
popen was hanging forever.
Fix another bug where the content length in the http header was off by one.
2015-11-03 11:51:19 -07:00
Matt
7608c5c29c
default I/O error detection to enabled so we see
hosts with I/O errors in the hosts table.
2015-11-03 11:33:16 -07:00
Matt
95d70b110e
fix bug in rebuild pipeline. need to merge the
files lest we max the # of files out.
2015-11-03 11:12:39 -07:00
Ai Lin Chia
6526accb50
Fix coredump when using add URL
2015-11-02 17:23:01 +01:00
Matt Wells
08b6fa67d7
improve spider performance when we have lots of collections.
fix core from corrupt titledb rec of some sort.
automatically turn off profiler when you get data back for
simplicity.
2015-11-01 20:23:18 -08:00
Zak Betz
ff6caf79a2
Increase time to mark item as stale in warc injector.
2015-11-01 19:45:29 -07:00