Commit Graph

21 Commits

Author SHA1 Message Date
mwells
9730e5f3ef fix lost spiders from updating crawl info.
fix maxspidersperip limitation not being obeyed.
removed fakedb.
only add "0" time waiting tree keys to waiting tree.
only scanSpiderdb() will change their times to
a future time or add them to doledb directly.
confirmLockAcquisition() will not add to waitingtree
if max spiders per ip limit would be exceeded.
an incoming spider reply will trigger the add to
waiting tree with a time of "0".
2013-09-28 13:12:33 -06:00
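The rule in commit 9730e5f3ef can be sketched as follows. This is a minimal illustration, not the code in Spider.h: `SpiderSchedulerSketch`, `addZeroTimeEntry`, `confirmLock`, and `onSpiderReply` are hypothetical names, the waiting tree is modelled as a plain `std::set`, and the part where scanSpiderdb() moves entries to a future time or into doledb is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Hypothetical stand-in for the spider scheduler state; not the real Spider.h types.
struct SpiderSchedulerSketch {
    int maxSpidersPerIp;                                 // assumed per-IP concurrency cap
    std::map<uint32_t, int> outstandingByIp;             // firstIp -> spiders currently running
    std::set<std::pair<int64_t, uint32_t>> waitingTree;  // (spiderTimeMs, firstIp)

    explicit SpiderSchedulerSketch(int maxPerIp) : maxSpidersPerIp(maxPerIp) {
        assert(maxPerIp > 0);
    }

    // Add a "0"-time entry, meaning "scan this IP as soon as possible".
    // Returns false when the entry is withheld because the IP is at its cap.
    bool addZeroTimeEntry(uint32_t firstIp) {
        auto it = outstandingByIp.find(firstIp);
        int running = (it == outstandingByIp.end()) ? 0 : it->second;
        if (running >= maxSpidersPerIp) return false;
        waitingTree.insert({0, firstIp});
        return true;
    }

    // A lock was confirmed: one more spider is running on this IP.
    bool confirmLock(uint32_t firstIp) {
        ++outstandingByIp[firstIp];
        return addZeroTimeEntry(firstIp);
    }

    // A spider reply arrived: one spider finished, so try to requeue the IP.
    bool onSpiderReply(uint32_t firstIp) {
        auto it = outstandingByIp.find(firstIp);
        if (it != outstandingByIp.end() && it->second > 0) --it->second;
        return addZeroTimeEntry(firstIp);
    }
};
```

With a cap of 2, the first confirmed lock on an IP requeues it, the second does not, and the next spider reply requeues it again.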
mwells
0b5a45e8aa more api updates. added m_avoidSpiderLinks to
spider request so urldata=xxxx can turn link
spidering off. probably desirable so its default.
so &spiderlinks=[0|1] applies to urldata as well
as injecturl=
2013-09-25 17:51:43 -06:00
mwells
40192249f9 spider speedups and fixes. 2013-09-25 11:58:03 -06:00
mwells
b16d8519fc more spider fixes. still need more speedups
when spidering multiple spiders on same ip.
2013-09-24 16:40:14 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
mwells
05400a0c25 updated spider code documentation. 2013-09-20 11:19:24 -07:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00
Matt Wells
d982997b0c streamline crawl stats. 2013-09-13 17:34:39 -07:00
Matt Wells
a412c798bf Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
2013-09-13 09:24:28 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
Matt Wells
03706131fe documentation updates in Spider.h. 2013-09-08 13:42:02 -07:00
Matt Wells
9696c7936a Merge branch 'master' into diffbot 2013-08-30 16:33:00 -07:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited by the sizeof(collnum_t) only now,
which is 16bits, 15bits unsigned, which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
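The 32k figure in commit 94e6492916 follows from the width of the index type. The sketch below assumes collnum_t is a signed 16-bit integer, which the message implies but does not state outright. `collnum_sketch_t`, `maxCollectionsFor`, and `toCollnum` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: the collection index is a signed 16-bit integer,
// leaving 15 bits for non-negative collection numbers.
typedef int16_t collnum_sketch_t;

// Number of distinct non-negative values an index type T can hold.
template <typename T>
constexpr long long maxCollectionsFor() {
    return static_cast<long long>(std::numeric_limits<T>::max()) + 1;
}

// Convert a plain index to the narrow type, refusing out-of-range values.
inline collnum_sketch_t toCollnum(long long index) {
    assert(index >= 0 && index < maxCollectionsFor<collnum_sketch_t>());
    return static_cast<collnum_sketch_t>(index);
}
```

Under that assumption the ceiling is 32768 collections (indices 0 through 32767). Widening the typedef to `int32_t` would raise it to 2147483648.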
mwells
2e9c8f7c6e Merge branch 'master' of github.com:gigablast/open-source-search-engine 2013-08-29 21:17:46 -06:00
mwells
84fae9a3c6 Fix issue of reading spiderrequests from
doledb at the very first key in spiderdb.
causes lots of positive/negative key annihilations.
we end up re-reading like 300 times in some
cases just to get a url from a doledb priority.
2013-08-29 21:16:59 -06:00
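The annihilation mentioned in commit 84fae9a3c6 can be illustrated with a small sketch. It assumes a key layout in which the low bit marks a record as positive (1, an add) or negative (0, a delete), and in which a negative key cancels the positive key that shares its remaining bits. `KeySketch`, `isPositive`, `identity`, and `annihilate` are hypothetical names, not the repository's own types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record key: low bit 1 = positive (add), 0 = negative (delete).
struct KeySketch { uint64_t bits; };

inline bool isPositive(KeySketch k) { return (k.bits & 1ULL) != 0; }
inline uint64_t identity(KeySketch k) { return k.bits >> 1; }

// Remove each negative key together with the positive key it cancels.
// After sorting, a negative key sits immediately before its positive twin.
std::vector<KeySketch> annihilate(std::vector<KeySketch> keys) {
    std::sort(keys.begin(), keys.end(),
              [](KeySketch a, KeySketch b) { return a.bits < b.bits; });
    std::vector<KeySketch> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        bool cancels = i + 1 < keys.size() &&
                       identity(keys[i]) == identity(keys[i + 1]) &&
                       !isPositive(keys[i]) && isPositive(keys[i + 1]);
        if (cancels) { ++i; continue; }  // skip both halves of the pair
        out.push_back(keys[i]);
    }
    return out;
}
```

When most keys in a read cancel out this way, the surviving list can be nearly empty, which would be consistent with the repeated re-reads the message describes.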
mwells
ca2a024d04 fixed up thread/spider log msgs.
fixed core from calling fprintf in
alarm signal missed quickpoll handler.
2013-08-29 21:15:42 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00