Commit Graph

3838 Commits

Author SHA1 Message Date
Matt Wells
97c2f09225 move gbssFirstIndexed up so the new shard fields
don't mess up printing of the human readable dates
2016-02-08 10:26:02 -08:00
Matt Wells
cdb8a5f86a fix core from treating corrupted titlerecs
as non-existent.
2016-02-08 09:49:04 -08:00
Matt Wells
c1a72213d7 watch out for negative datasize spider requests
in doledb when calling xmldoc::set4 so we don't core
any more.
2016-02-08 09:18:33 -08:00
Matt Wells
3b6704d318 also show what shardnum stores the docid
so we can track down titlerec corruption easier.
2016-02-08 08:58:37 -08:00
Matt Wells
d183525db6 treat corrupted titlerecs as not found so
spidering can continue despite it.
update gbss result display to have human readable
dates and a link to the docid.
added gbssSpideredByHostId so we can track down
issues faster.
2016-02-08 08:42:04 -08:00
Matt Wells
9247e15210 make spider use HTTP/1.1 not 1.0 since
some sites have been found to return
406 Not Acceptable periodically because of it.
2016-02-05 10:14:09 -08:00
appchecker
701b14ada7 Fix: possible double free 2016-02-05 16:11:53 +03:00
Matt Wells
4eec8eb5b7 fix invalid <base href=/> tag 2016-02-02 12:52:45 -08:00
Matt
ef8dc25b55 fix compiler warning 2016-01-30 14:31:01 -07:00
Matt
19de1d2eae Merge branch 'diffbot-testing' into testing 2016-01-30 14:27:08 -07:00
Matt Wells
1a0378b76e do not report edocunchanged for bulk jobs ever.
more skipped shards bug fixes.
remove some log spam.
2016-01-30 11:14:12 -08:00
Matt Wells
3e9ee2f6d0 bulk robots hack fix 2016-01-28 09:14:38 -08:00
Matt Wells
04bdda20cf fix for empty queries saying a shard is down 2016-01-28 08:51:04 -08:00
Matt Wells
4c7f969988 fix critical spider issue of an IP corking an
entire spider priority. also exit faster on
'save & exit' if in evalIpLoop().
2016-01-23 10:32:14 -08:00
Matt
7e13b147e4 updated dmoz docs 2016-01-23 08:54:35 -07:00
Matt Wells
fb3b179666 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2016-01-21 09:58:38 -08:00
Matt Wells
a652d60b2b detect more corrupted spider records caused
by saving memory to disk after a segv corrupts the memory.
2016-01-21 09:57:43 -08:00
Matt Wells
4d88ca5d28 improve spiderrequest::isCorrupt() 2016-01-19 22:24:52 -08:00
Matt
1e248218da added 4 more diffbot errors so hopefully
no more 'unknown diffbot error' error codes
in crawlbot.
2016-01-11 16:12:33 -08:00
Matt
032f597a16 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
	Errno.cpp
	Errno.h
2016-01-11 15:30:53 -08:00
Matt
33a40480b4 try to fix Unknown Diffbot Error error. 2016-01-11 15:28:30 -08:00
Matt Wells
422ffae8e3 fix a core from dumping doledb out to stdout 2016-01-11 10:32:47 -08:00
Matt Wells
4fb900da03 fix strange corruption in doledb core 2016-01-11 09:40:55 -08:00
Zak Betz
344732ac19 Don't try to match implicit non-required phrases when verifying doc
has query terms.
2016-01-08 10:09:34 -07:00
Zak Betz
008b21ee6b Fix query "the" and "the" not matching all of the terms.
Stop words were supposed to be ignored but we were still
trying to match them when looking for anomalous link text.
This was causing all summaries to be rejected at query time causing
a long wait for no results.
2016-01-07 15:30:45 -07:00
Matt Wells
6d73b57243 fix core dump from bad langid of 99 2016-01-05 14:32:02 -08:00
Matt
03aad9db54 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2016-01-05 13:50:18 -07:00
Matt
5804e28230 fix bug of losing the hopcount 0 spiderrequest
because it gets overridden by a link to itself
then it becomes hopcount 1.
2016-01-05 13:49:54 -07:00
Matt Wells
bb316b0e88 respider all diffbot urls even if 404 or 500.
that sometimes happens TEMPORARILY to a website.
2016-01-05 12:47:04 -08:00
Matt Wells
953d636448 added trivial link on cached page to gb root page 2016-01-03 11:27:24 -08:00
Matt Wells
60e0306d0d fix xmldoc::getRootXmlDoc() to put an http:// or https://
in front of the root url before putting into spiderrequest
so it is not perceived as corrupted.
2015-12-29 08:22:14 -08:00
Matt Wells
dd46a82bb7 fix again 2015-12-28 14:50:11 -08:00
Matt Wells
1091c5c4ea fix core 2015-12-28 14:48:03 -08:00
Matt Wells
21c337be27 fix the spider status msg fix some more 2015-12-28 12:03:50 -08:00
Matt Wells
d6a9db35d2 instead of showing SP_ROUNDDONE, show SP_MAXROUNDS
if necessary so we can pass the crawlbot nightly smoke tests.
2015-12-28 11:50:56 -08:00
Matt
5d049213b9 try to fix host 31 from coring. possible corrupt spider request. 2015-12-28 12:22:39 -07:00
Matt Wells
9b423690ee fix compiler error for maxpp 2015-12-28 11:21:36 -08:00
Matt
40ff79c8ff Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2015-12-23 22:18:49 -07:00
Matt
7be4224817 change try agains recvd to try agains sent
in the hosts table.
2015-12-23 22:18:24 -07:00
Matt Wells
da5f0c8d76 Merge branch 'testing' into diffbot-testing 2015-12-23 14:23:05 -08:00
Matt Wells
9147d6bb02 fix some diffbot crawls.
do not spider pages at the hopcount limit
when 'only spider urls if new' is enabled.
meaning only spider each url once. (unless there is
a temporary error)
fix malformed url bug some more.
added some commented out code for indexing spider replies
(gbss docs) for certain fatal/critical errors, in which
case they are not being indexed.
2015-12-23 13:49:21 -08:00
Matt
a5e6a12ff8 added support for TLS SNI (Server Name Indication) 2015-12-23 13:30:49 -07:00
Matt
6f9f177d7c another fix for the parms not getting updated fix. 2015-12-17 14:30:35 -07:00
Matt Wells
770e94b4cc fix it so we don't call the page handler
before all parms have been digested.
2015-12-17 13:11:57 -08:00
Matt Wells
a07c840a6a join with threads when exiting -- to no avail
exit status is still foobar.
2015-12-17 10:15:39 -08:00
Matt Wells
c5f21a721f zero out crazy local spider stats.
corrupted from saving after memory got corrupted.
2015-12-17 09:43:41 -08:00
Matt
f2be319dcd try to fix exiting w/ pthreads some more (part 2) 2015-12-17 08:38:12 -07:00
Matt
73e9ed0719 try to fix cores not being dropped. 2015-12-17 06:44:25 -07:00
Matt
ede7a78594 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
	UdpSlot.h
2015-12-17 06:08:47 -07:00
Matt Wells
83ade5c58d fix spider req corruption 2015-12-16 20:33:45 -08:00