Commit Graph

60 Commits

Author SHA1 Message Date
Matt Wells
5c86d8a122 simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
471599e9e7 formatting 2014-01-19 10:44:19 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
49c935cf6d added SpiderReply::m_wasIndexedValid
so we know whether to count m_wasIndexed and m_isIndexed
for page counting quota purposes.
2014-01-06 14:27:38 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
5fcfff6729 fixes for spiders getting stuck 2013-12-19 20:04:06 -07:00
Matt Wells
cb111a1efa fix doledb empty logic 2013-12-19 13:06:35 -08:00
Matt Wells
16e91375f4 bring in changes from live beta from ~/github.
limit spiders to 50, not 500 to prevent oom.
resume killed merges that had num files shrunk even
if down to one file. show collnum in spider queue.
remove back-to-back whitespace, and make all space
a ' ' for getting the doc checksum for deduping.
2013-12-12 12:58:58 -08:00
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
c3517ee019 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Spider.cpp
2013-11-22 17:37:42 -08:00
Matt Wells
9d9a976b4f fix bug of perpetual round incrementing ad nauseam. 2013-11-22 11:14:03 -08:00
Matt Wells
dcae4682e8 new api. tossed action/expression
and added urlCrawlPattern/urlProcessPattern/apiUrl
2013-11-20 16:41:28 -08:00
Matt Wells
a8ffc6e50b indicate diffbot processing errors in the urls csv 2013-11-18 17:38:14 -08:00
Matt Wells
7d3b52fb3a an intersect thread taking forever
was causing msg5 reads to block forever
and spider round was getting incremented.
fixed a few bugs around that issue.
2013-11-18 16:20:30 -08:00
Matt Wells
8d9f000f11 make getNumSpidersOutPerIp() specific to a coll
so another coll does not prevent a coll from populating
its own waiting tree.
2013-11-18 14:13:28 -08:00
Matt Wells
34bffc2cc6 1-second crawl info sleep wrapper update 2013-11-04 16:02:03 -08:00
Matt Wells
7b319e5948 show more info in the urls csv file.
record whether we processed the url or not
in the SpiderReply. normalize /index.html etc.
to / for the outlinks. in Links.cpp class.
2013-11-04 10:49:31 -08:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
c39b45ff88 fix crawl round end detection etc.
inc round counter even if not repeating crawl
2013-10-23 15:53:59 -07:00
Matt Wells
1fb85db307 url filters fixes. 2013-10-21 13:44:30 -07:00
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
mwells
be01041e36 added support for new url filter:
"lastspidertime>={roundstart} --> IGNORE"
so we can spider all urls before we advance
to the next spider round and re-spider everything
again. CollectionRec::m_spiderRoundStartTime and
CollectionRec::m_spiderRoundNum are the new
collection rec parms. show the round stuff
on url filters page.
2013-10-10 18:47:46 -06:00
mwells
2bb8b818d6 more bug fixes with notification system. 2013-10-09 16:28:15 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
mwells
612f2872f7 use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
e1bde7b7fe fixed bug of getting lock from the
wrong group.
2013-10-04 12:42:01 -06:00
mwells
d4aa65c0fe try to fix spiders with m_msg5StartKey logic. 2013-10-04 09:39:05 -06:00
mwells
0edcbcc7d8 printlocktable() function 2013-09-29 10:20:14 -06:00
mwells
c216f7b2a7 use 48 bit url hash for lock keys again.
query reindex recs can just use their
prob docids as fake uh48s. we need it so we
can avoid the fakedb record and just use
the spider reply to trigger a 5-second
lock expiration. a little simpler. added
logdebugspiderwait for waiting tree debugging.
fixed per ip spider limiting. fixed losing
spiders down blackhole from updateCrawlInfo.
check UrlLock::m_confirmed when counting outstanding
spiders on one ip since may have a lock on one host
but not get granted on all! it calls
confirmLockAcquisition() when it gets fully granted
the lock so it can set UrlLock::m_confirmed.
2013-09-29 00:09:46 -06:00
mwells
9730e5f3ef fix lost spiders from updating crawl info.
fix maxspidersperip limitation not being obeyed.
removed fakedb.
only add "0" time waiting tree keys to waiting tree.
only scanSpiderdb() will change their times to
a future time or add them to doledb directly.
confirmLockAcquisition() will not add to waitingtree
if max spiders per ip limit would be exceeded.
an incoming spider reply will trigger the add to
waiting tree with a time of "0".
2013-09-28 13:12:33 -06:00
mwells
0b5a45e8aa more api updates. added m_avoidSpiderLinks to
spider request so urldata=xxxx can turn link
spidering off. probably desirable so it's the default.
so &spiderlinks=[0|1] applies to urldata as well
as injecturl=
2013-09-25 17:51:43 -06:00
mwells
40192249f9 spider speedups and fixes. 2013-09-25 11:58:03 -06:00
mwells
b16d8519fc more spider fixes. still need more speedups
when spidering multiple spiders on same ip.
2013-09-24 16:40:14 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
mwells
05400a0c25 updated spider code documentation. 2013-09-20 11:19:24 -07:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00