Commit Graph

48 Commits

Author SHA1 Message Date
Matt Wells
cb111a1efa fix doledb empty logic 2013-12-19 13:06:35 -08:00
Matt Wells
16e91375f4 bring in changes from live beta from ~/github.
limit spiders to 50, not 500 to prevent oom.
resume killed merges that had num files shrunk even
if down to one file. show collnum in spider queue.
remove back-to-back whitespace, and make all space
a ' ' for getting the doc checksum for deduping.
2013-12-12 12:58:58 -08:00
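The whitespace handling described in the commit above can be sketched as follows. This is a minimal illustration, not the repository's code: `normalizeWhitespace` and `docChecksum` are hypothetical names, and the 32-bit FNV-1a hash is a stand-in for whatever checksum the engine actually computes.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// newlines) into a single ' ' so layout changes do not alter the checksum.
std::string normalizeWhitespace(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    bool prevSpace = false;
    for (unsigned char c : in) {
        if (std::isspace(c)) {
            if (!prevSpace) out.push_back(' ');  // first char of a run
            prevSpace = true;
        } else {
            out.push_back(static_cast<char>(c));
            prevSpace = false;
        }
    }
    assert(out.size() <= in.size());  // normalizing never grows the text
    return out;
}

// Stand-in checksum (32-bit FNV-1a) computed over the normalized text.
// The real engine may use a different hash; this one is an assumption.
uint32_t docChecksum(const std::string &text) {
    uint32_t h = 2166136261u;
    for (unsigned char c : normalizeWhitespace(text)) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}
```

With this sketch, `docChecksum("hello   world")` and `docChecksum("hello\n\tworld")` return the same value, so two documents that differ only in whitespace would be treated as duplicates.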
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
c3517ee019 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Spider.cpp
2013-11-22 17:37:42 -08:00
Matt Wells
9d9a976b4f fix bug of perpetual round incrementing ad nauseam. 2013-11-22 11:14:03 -08:00
Matt Wells
dcae4682e8 new api. tossed action/expression
and added urlCrawlPattern/urlProcessPattern/apiUrl
2013-11-20 16:41:28 -08:00
Matt Wells
a8ffc6e50b indicate diffbot processing errors in the urls csv 2013-11-18 17:38:14 -08:00
Matt Wells
7d3b52fb3a if intersect thread takes forever
was causing msg5 reads to block forever
and spider round was getting incremented.
fixed a few bugs around that issue.
2013-11-18 16:20:30 -08:00
Matt Wells
8d9f000f11 make getNumSpidersOutPerIp() specific to a coll
so another coll does not prevent a coll from populating
its own waiting tree.
2013-11-18 14:13:28 -08:00
Matt Wells
34bffc2cc6 1-second crawl info sleep wrapper update 2013-11-04 16:02:03 -08:00
Matt Wells
7b319e5948 show more info in the urls csv file.
record whether we processed the url or not
in the SpiderReply. normalize /index.html etc.
to / for the outlinks. in Links.cpp class.
2013-11-04 10:49:31 -08:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
c39b45ff88 fix crawl round end detection etc.
inc round counter even if not repeating crawl
2013-10-23 15:53:59 -07:00
Matt Wells
1fb85db307 url filters fixes. 2013-10-21 13:44:30 -07:00
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
mwells
be01041e36 added support for new url filter:
"lastspidertime>={roundstart} --> IGNORE"
so we can spider all urls before we advance
to the next spider round and re-spider everything
again. CollectionRec::m_spiderRoundStartTime and
CollectionRec::m_spiderRoundNum are the new
collection rec parms. show the round stuff
on url filters page.
2013-10-10 18:47:46 -06:00
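One way to read the `lastspidertime>={roundstart} --> IGNORE` rule in the commit above: a url already spidered since the round began is skipped, and the round can advance once every url is skipped. The sketch below only illustrates that reading. `UrlState`, `RoundInfo`, `ignoredThisRound`, `roundComplete` and `advanceRound` are hypothetical names; the two `RoundInfo` fields merely mirror `CollectionRec::m_spiderRoundStartTime` and `CollectionRec::m_spiderRoundNum`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-url state: when the url was last spidered (Unix time).
struct UrlState {
    long long lastSpiderTime;
};

// Hypothetical stand-in for the two new collection rec parms.
struct RoundInfo {
    long long startTime;  // mirrors m_spiderRoundStartTime
    int num;              // mirrors m_spiderRoundNum
};

// The filter "lastspidertime>={roundstart} --> IGNORE": a url spidered
// at or after the round start is ignored for the rest of the round.
bool ignoredThisRound(const UrlState &u, const RoundInfo &r) {
    return u.lastSpiderTime >= r.startTime;
}

// The round is complete once every url matches the IGNORE rule.
bool roundComplete(const std::vector<UrlState> &urls, const RoundInfo &r) {
    return std::all_of(urls.begin(), urls.end(),
                       [&r](const UrlState &u) { return ignoredThisRound(u, r); });
}

// Start the next round: all urls become eligible again because their
// lastSpiderTime is now older than the new start time.
void advanceRound(RoundInfo &r, long long now) {
    r.startTime = now;
    r.num += 1;
}
```

After `advanceRound`, urls spidered in the previous round no longer satisfy the IGNORE rule, which is what lets the crawl re-spider everything.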
mwells
2bb8b818d6 more bug fixes with notification system. 2013-10-09 16:28:15 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
mwells
612f2872f7 use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
e1bde7b7fe fixed bug of getting lock from the
wrong group.
2013-10-04 12:42:01 -06:00
mwells
d4aa65c0fe try to fix spiders with m_msg5StartKey logic. 2013-10-04 09:39:05 -06:00
mwells
0edcbcc7d8 printlocktable() function 2013-09-29 10:20:14 -06:00
mwells
c216f7b2a7 use 48 bit url hash for lock keys again.
query reindex recs can just use their
prob docids as fake uh48s. we need it so we
can avoid the fakedb record and just use
the spider reply to trigger a 5-second
lock expiration. a little simpler. added
logdebugspiderwait for waiting tree debugging.
fixed per ip spider limiting. fixed losing
spiders down blackhole from updateCrawlInfo.
check UrlLock::m_confirmed when counting outstanding
spiders on one ip since may have a lock on one host
but not get granted on all! it calls
confirmLockAcquisition() when it gets fully granted
the lock so it can set UrlLock::confirmed.
2013-09-29 00:09:46 -06:00
mwells
9730e5f3ef fix lost spiders from updating crawl info.
fix maxspidersperip limitation not being obeyed.
removed fakedb.
only add "0" time waiting tree keys to waiting tree.
only scanSpiderdb() will change their times to
a future time or add them to doledb directly.
confirmLockAcquisition() will not add to waitingtree
if max spiders per ip limit would be exceeded.
an incoming spider reply will trigger the add to
waiting tree with a time of "0".
2013-09-28 13:12:33 -06:00
mwells
0b5a45e8aa more api updates. added m_avoidSpiderLinks to
spider request so urldata=xxxx can turn link
spidering off. probably desirable so its default.
so &spiderlinks=[0|1] applies to urldata as well
as injecturl=
2013-09-25 17:51:43 -06:00
mwells
40192249f9 spider speedups and fixes. 2013-09-25 11:58:03 -06:00
mwells
b16d8519fc more spider fixes. still need more speedups
when spidering multiple spiders on same ip.
2013-09-24 16:40:14 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
mwells
05400a0c25 updated spider code documentation. 2013-09-20 11:19:24 -07:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00
Matt Wells
d982997b0c streamline crawl stats. 2013-09-13 17:34:39 -07:00
Matt Wells
a412c798bf Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
2013-09-13 09:24:28 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
Matt Wells
03706131fe documentation updates in Spider.h. 2013-09-08 13:42:02 -07:00
Matt Wells
9696c7936a Merge branch 'master' into diffbot 2013-08-30 16:33:00 -07:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited by the sizeof(collnum_t) only now,
which is 16bits, 15bits unsigned, which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
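The "more than 32k collections" ceiling in the commit above follows from the width of the type. The arithmetic can be checked with a short sketch that assumes `collnum_t` is a signed 16-bit integer; the real typedef is not shown in this log, and `kMaxCollnum` and `toCollnum` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: collnum_t is a signed 16-bit integer; the real typedef may differ.
typedef int16_t collnum_t;

static_assert(sizeof(collnum_t) * 8 == 16,
              "collnum_t is assumed to be 16 bits wide");

// Largest collection number a signed 16-bit type can hold: one bit is the
// sign, leaving 15 bits, so the maximum is 2^15 - 1.
constexpr long kMaxCollnum = std::numeric_limits<collnum_t>::max();

// Hypothetical helper: convert an index to a collnum_t, rejecting values
// that would not fit.
collnum_t toCollnum(long n) {
    assert(n >= 0 && n <= kMaxCollnum);
    return static_cast<collnum_t>(n);
}
```

Under that assumption `kMaxCollnum` is 32767, so collection numbers 0 through 32767 are representable. Widening the typedef, for example to a 32-bit type, would raise the limit.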
mwells
2e9c8f7c6e Merge branch 'master' of github.com:gigablast/open-source-search-engine 2013-08-29 21:17:46 -06:00
mwells
84fae9a3c6 Fix issue of reading spiderrequests from
doledb at the very first key in spiderdb.
causes lots of positive/negative key annihilations.
we end up re-reading like 300 times in some
cases just to get a url from a doledb priority.
2013-08-29 21:16:59 -06:00
mwells
ca2a024d04 fixed up thread/spider log msgs.
fixed core from calling fprintf in
alarm signal missed quickpoll handler.
2013-08-29 21:15:42 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00