Matt Wells
4c7f969988
fix critical spider issue of an IP corking an
entire spider priority. also exit faster on
'save & exit' if in evalIpLoop().
2016-01-23 10:32:14 -08:00
Matt
7e13b147e4
updated dmoz docs
2016-01-23 08:54:35 -07:00
Matt Wells
fb3b179666
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2016-01-21 09:58:38 -08:00
Matt Wells
a652d60b2b
detect more corrupted spider records caused
by saving memory to disk after a segv corrupts the memory.
2016-01-21 09:57:43 -08:00
Matt Wells
4d88ca5d28
improve spiderrequest::isCorrupt()
2016-01-19 22:24:52 -08:00
Matt
1e248218da
added 4 more diffbot errors so hopefully
no more 'unknown diffbot error' error codes
in crawlbot.
2016-01-11 16:12:33 -08:00
Matt
032f597a16
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
Errno.cpp
Errno.h
2016-01-11 15:30:53 -08:00
Matt
33a40480b4
try to fix Unknown Diffbot Error error.
2016-01-11 15:28:30 -08:00
Matt Wells
422ffae8e3
fix a core from dumping doledb out to stdout
2016-01-11 10:32:47 -08:00
Matt Wells
4fb900da03
fix strange corruption in doledb core
2016-01-11 09:40:55 -08:00
Matt Wells
6d73b57243
fix core dump from bad langid of 99
2016-01-05 14:32:02 -08:00
Matt
03aad9db54
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2016-01-05 13:50:18 -07:00
Matt
5804e28230
fix bug of losing the hopcount 0 spiderrequest
because it gets overridden by a link to itself
then it becomes hopcount 1.
2016-01-05 13:49:54 -07:00
Matt Wells
bb316b0e88
respider all diffbot urls even if 404 or 500.
that sometimes happens TEMPORARILY to a website.
2016-01-05 12:47:04 -08:00
Matt Wells
953d636448
added trivial link on cached page to gb root page
2016-01-03 11:27:24 -08:00
Matt Wells
60e0306d0d
fix xmldoc::getRootXmlDoc() to put an http:// or https://
in front of the root url before putting into spiderrequest
so it is not perceived as corrupted.
2015-12-29 08:22:14 -08:00
Matt Wells
dd46a82bb7
fix again
2015-12-28 14:50:11 -08:00
Matt Wells
1091c5c4ea
fix core
2015-12-28 14:48:03 -08:00
Matt Wells
21c337be27
fix the spider status msg fix some more
2015-12-28 12:03:50 -08:00
Matt Wells
d6a9db35d2
instead of showing SP_ROUNDDONE, show SP_MAXROUNDS
if necessary so we can pass the crawlbot nightly smoke tests.
2015-12-28 11:50:56 -08:00
Matt
5d049213b9
try to fix host 31 from coring. possible corrupt spider request.
2015-12-28 12:22:39 -07:00
Matt Wells
9b423690ee
fix compiler error for maxpp
2015-12-28 11:21:36 -08:00
Matt
40ff79c8ff
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-12-23 22:18:49 -07:00
Matt
7be4224817
change try agains recvd to try agains sent
in the hosts table.
2015-12-23 22:18:24 -07:00
Matt Wells
da5f0c8d76
Merge branch 'testing' into diffbot-testing
2015-12-23 14:23:05 -08:00
Matt Wells
9147d6bb02
fix some diffbot crawls.
do not spider pages at the hopcount limit
when 'only spider urls if new' is enabled.
meaning only spider each url once. (unless there is
a temporary error)
fix malformed url bug some more.
added some commented out code for indexing spider replies
(gbss docs) for certain fatal/critical errors, in which
case they are not being indexed.
2015-12-23 13:49:21 -08:00
Matt
a5e6a12ff8
added support for TLS SNI (Server Name Indication)
2015-12-23 13:30:49 -07:00
Matt
6f9f177d7c
another fix for the parms not getting updated fix.
2015-12-17 14:30:35 -07:00
Matt Wells
770e94b4cc
fix it so we don't call the page handler
before all parms have been digested.
2015-12-17 13:11:57 -08:00
Matt Wells
a07c840a6a
join with threads when exiting -- to no avail
exit status is still foobar.
2015-12-17 10:15:39 -08:00
Matt Wells
c5f21a721f
zero out crazy local spider stats.
corrupted from saving after memory got corrupted.
2015-12-17 09:43:41 -08:00
Matt
f2be319dcd
try to fix exiting w/ pthreads some more (part 2)
2015-12-17 08:38:12 -07:00
Matt
73e9ed0719
try to fix cores not being dropped.
2015-12-17 06:44:25 -07:00
Matt
ede7a78594
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
UdpSlot.h
2015-12-17 06:08:47 -07:00
Matt Wells
83ade5c58d
fix spider req corruption
2015-12-16 20:33:45 -08:00
Matt Wells
becc244e12
add new link to page crawlbot to see spider attempt
gbss docs.
2015-12-15 16:22:53 -08:00
Matt Wells
9a20a4387f
improve spiderreq corruption detection
2015-12-15 14:23:21 -08:00
Matt Wells
879ef32db0
fix for all urls getting malformed url (EBADURL)
while spidering. had to add 'errorcode==' to
urlfilters to just redo those pages otherwise
having that error, a non-temporary error, would have
barred them from being retried in the future.
2015-12-15 10:06:06 -08:00
Matt Wells
23a20ff639
more fixes for ebadurl bug
2015-12-14 17:49:06 -08:00
Matt Wells
e1cfeb4c82
have diffbot retry non tmp errors to make up
for bug of calling valid urls malformed EBADURLs
2015-12-14 17:30:41 -08:00
Matt Wells
4b9240e42e
fix a fix
2015-12-14 17:06:58 -08:00
Matt Wells
6d30f21ad9
use small dgrams to avoid splitting at the kernel level
down to the mtu. increase 0xc1 msg request delay from 3 to 20 secs.
need to make it linear order.
2015-12-14 10:47:07 -08:00
Matt Wells
cb0d75b343
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
XmlDoc.cpp
2015-12-12 19:21:06 -08:00
Matt Wells
c161f17196
fix punycode bug in firsturl of bad
punycode.
2015-12-12 19:18:28 -08:00
Matt
dc505aefa8
Merge branch 'testing' into diffbot-testing
2015-12-11 12:35:15 -07:00
Matt
f548b6a728
smaller dgram size works on more networks
and over the internet.
2015-12-11 12:34:51 -07:00
Matt
b92763ebc7
Merge branch 'testing' into diffbot-testing
2015-12-09 23:11:37 -07:00
Matt Wells
657c27a0ee
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
Process.cpp
2015-12-09 21:35:30 -08:00
Matt Wells
2b60faf5df
Merge branch 'diffbot' into diffbot-testing
Conflicts:
Process.cpp
2015-12-09 21:34:54 -08:00
Matt
a5b8276299
Merge branch 'diffbot' into diffbot-testing
Conflicts:
Process.cpp
2015-12-09 17:56:44 -07:00