Matt Wells
0fdbaa4196
makefile optimizations
2016-03-14 16:34:24 -07:00
Matt
0dbc304bbf
fix to allow us to gather ip-only url outlinks again
2016-03-14 10:56:33 -06:00
Matt
2c167aada7
fix redirect to self bug that requires setting cookie
2016-03-14 10:33:05 -06:00
Matt Wells
d6fe684b99
fix another core caused by deleted coll
2016-03-07 10:20:25 -08:00
Matt Wells
d4e16a4dab
pass a crawlbot nightly smoke test
2016-03-04 13:14:28 -08:00
Matt Wells
e75d80abbe
ignore meta redirect tags in html comment tags.
2016-02-22 12:41:03 -08:00
Matt Wells
412b04bbd4
fix neverending crawl rounds by only trying each url
once per round. updated url filters.
2016-02-22 09:28:46 -08:00
Matt Wells
da9949f462
try to fix a couple more core dumps.
2016-02-19 08:54:48 -08:00
Matt Wells
c7696a69eb
fix core from a federated query and null msg20
2016-02-18 10:53:20 -08:00
Matt Wells
f649944573
if spidered time is in future, consider the spiderreply
corrupt and ignore it. if you set back the OS clock then
you might end up ignoring some spider replies but hopefully
it won't be such a big deal.
2016-02-16 12:25:49 -08:00
Matt Wells
f11595efc3
fix core dump from deleting an active/dumping
collection
2016-02-12 16:54:03 -08:00
Matt Wells
e68406f073
fix core in posdbtable from docid of 0.
no idea why docid was 0, but why core?
2016-02-09 22:43:09 -08:00
Matt Wells
e376b97814
let's generalize it. if a redirect sets cookies
then follow it through, don't stop in the middle
because we think it is 'simplified'.
2016-02-09 13:47:12 -08:00
Matt Wells
d7a6a0a1ff
fix gap.com redirects that require us
setting multiple cookies in the spider request.
2016-02-09 13:38:59 -08:00
Matt
92934bbd5c
use HTTP/1.0 since we don't support chunked transfer encoding
2016-02-09 12:04:05 -07:00
Matt
ef97462acc
thanks for the bug fix, ivan!
2016-02-09 10:38:46 -07:00
Matt Wells
1d2dfe1456
bring back max doc len parms.
index gbssIsContentTruncated field.
fix 30-day wait for >= 3 errors.
fix gbss formatting some more.
2016-02-08 14:10:04 -08:00
Matt Wells
97c2f09225
move gbssFirstIndexed up so the new shard fields
don't mess up printing of the human readable dates
2016-02-08 10:26:02 -08:00
Matt Wells
cdb8a5f86a
fix core from treating corrupted titlerecs
as non-existent.
2016-02-08 09:49:04 -08:00
Matt Wells
c1a72213d7
watch out for negative datasize spider requests
in doledb when calling xmldoc::set4 so we don't core
any more.
2016-02-08 09:18:33 -08:00
Matt Wells
3b6704d318
also show what shardnum stores the docid
so we can track down titlerec corruption easier.
2016-02-08 08:58:37 -08:00
Matt Wells
d183525db6
treat corrupted titlerecs as not founds so
spidering can continue despite it.
update gbss result display to have human readable
dates and a link to the docid.
added gbssSpideredByHostId so we can track down
issues faster.
2016-02-08 08:42:04 -08:00
Matt Wells
9247e15210
make spider use HTTP/1.1 not 1.0 since
some sites have been found to return
406 Not Acceptable periodically because of it.
2016-02-05 10:14:09 -08:00
Matt Wells
4eec8eb5b7
fix invalid <base href=/> tag
2016-02-02 12:52:45 -08:00
Matt Wells
1a0378b76e
do not report edocunchanged for bulk jobs ever.
more skipped shards bug fixes.
remove some log spam.
2016-01-30 11:14:12 -08:00
Matt Wells
3e9ee2f6d0
bulk robots hack fix
2016-01-28 09:14:38 -08:00
Matt Wells
04bdda20cf
fix for empty queries saying a shard is down
2016-01-28 08:51:04 -08:00
Matt Wells
4c7f969988
fix critical spider issue of an IP corking an
entire spider priority. also exit faster on
'save & exit' if in evalIpLoop().
2016-01-23 10:32:14 -08:00
Matt
7e13b147e4
updated dmoz docs
2016-01-23 08:54:35 -07:00
Matt Wells
fb3b179666
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2016-01-21 09:58:38 -08:00
Matt Wells
a652d60b2b
detect more corrupted spider records caused
by saving memory to disk after a segv corrupts the memory.
2016-01-21 09:57:43 -08:00
Matt Wells
4d88ca5d28
improve spiderrequest::isCorrupt()
2016-01-19 22:24:52 -08:00
Matt
1e248218da
added 4 more diffbot errors so hopefully
no more 'unknown diffbot error' error codes
in crawlbot.
2016-01-11 16:12:33 -08:00
Matt
032f597a16
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
Errno.cpp
Errno.h
2016-01-11 15:30:53 -08:00
Matt
33a40480b4
try to fix Unknown Diffbot Error error.
2016-01-11 15:28:30 -08:00
Matt Wells
422ffae8e3
fix a core from dumping doledb out to stdout
2016-01-11 10:32:47 -08:00
Matt Wells
4fb900da03
fix strange corruption in doledb core
2016-01-11 09:40:55 -08:00
Matt Wells
6d73b57243
fix core dump from bad langid of 99
2016-01-05 14:32:02 -08:00
Matt
03aad9db54
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2016-01-05 13:50:18 -07:00
Matt
5804e28230
fix bug of losing the hopcount 0 spiderrequest
because it gets overridden by a link to itself
then it becomes hopcount 1.
2016-01-05 13:49:54 -07:00
Matt Wells
bb316b0e88
respider all diffbot urls even if 404 or 500.
that sometimes happens TEMPORARILY to a website.
2016-01-05 12:47:04 -08:00
Matt Wells
953d636448
added trivial link on cached page to gb root page
2016-01-03 11:27:24 -08:00
Matt Wells
60e0306d0d
fix xmldoc::getRootXmlDoc() to put an http:// or https://
in front of the root url before putting into spiderrequest
so it is not perceived as corrupted.
2015-12-29 08:22:14 -08:00
Matt Wells
dd46a82bb7
fix again
2015-12-28 14:50:11 -08:00
Matt Wells
1091c5c4ea
fix core
2015-12-28 14:48:03 -08:00
Matt Wells
21c337be27
fix the spider status msg fix some more
2015-12-28 12:03:50 -08:00
Matt Wells
d6a9db35d2
instead of showing SP_ROUNDDONE, show SP_MAXROUNDS
if necessary so we can pass the crawlbot nightly smoke tests.
2015-12-28 11:50:56 -08:00
Matt
5d049213b9
try to fix host 31 from coring. possible corrupt spider request.
2015-12-28 12:22:39 -07:00
Matt Wells
9b423690ee
fix compiler error for maxpp
2015-12-28 11:21:36 -08:00
Matt
40ff79c8ff
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-12-23 22:18:49 -07:00