73 Commits

Author SHA1 Message Date
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
9c26b85c2f fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e added winnertable to avoid dups in winnertree. 2014-02-04 20:09:43 -08:00
Matt Wells
3312400fee checkpoint for faster spider code. 2014-02-04 16:15:27 -08:00
Matt Wells
93021b2f13 Merge branch 'diffbot'
Conflicts:

	Collectiondb.cpp
	Spider.cpp
	Spider.h
2014-02-01 11:31:00 -07:00
Matt Wells
3a6a271dd9 make crawl sync bug fixes.
fix Puz crawl from dying out on host 9
because spider reply did not resuscitate waiting
tree for its ip.
fix mike's zola crawl with a repeat of 3 days
from not incrementing the round because it had
maxrounds 0, which means to ignore... assume 0
means to ignore now. send out 0xc1 crawl info
requests to even dead hosts so we can at least use
their last known good info.
2014-01-25 13:47:03 -08:00
Matt Wells
dd663eb9f7 fix round based spidering some more 2014-01-23 15:03:37 -08:00
Matt Wells
5f890f5d4f minor doc update 2014-01-22 15:52:04 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
9354d06493 menu updates. 2014-01-21 13:01:37 -08:00
Matt Wells
41cdfcef96 inc spider limits in various places 2014-01-20 18:51:15 -08:00
Matt Wells
946a683e39 quite a few spider fixes 2014-01-20 16:45:27 -08:00
Matt Wells
5c86d8a122 simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
471599e9e7 formatting 2014-01-19 10:44:19 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
49c935cf6d added SpiderReply::m_wasIndexedValid
so we know whether to count m_wasIndexed and m_isIndexed
for page counting quota purposes.
2014-01-06 14:27:38 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
5fcfff6729 fixes for spiders getting stuck 2013-12-19 20:04:06 -07:00
Matt Wells
cb111a1efa fix doledb empty logic 2013-12-19 13:06:35 -08:00
Matt Wells
16e91375f4 bring in changes from live beta from ~/github.
limit spiders to 50, not 500 to prevent oom.
resume killed merges that had num files shrunk even
if down to one file. show collnum in spider queue.
remove back-to-back whitespace, and make all space
a ' ' for getting the doc checksum for deduping.
2013-12-12 12:58:58 -08:00
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
c3517ee019 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Spider.cpp
2013-11-22 17:37:42 -08:00
Matt Wells
9d9a976b4f fix bug of perpetual round incrementing ad nauseam. 2013-11-22 11:14:03 -08:00
Matt Wells
dcae4682e8 new api. tossed action/expression
and added urlCrawlPattern/urlProcessPattern/apiUrl
2013-11-20 16:41:28 -08:00
Matt Wells
a8ffc6e50b indicate diffbot processing errors in the urls csv 2013-11-18 17:38:14 -08:00
Matt Wells
7d3b52fb3a if intersect thread takes forever
was causing msg5 reads to block forever
and spider round was getting incremented.
fixed a few bugs around that issue.
2013-11-18 16:20:30 -08:00
Matt Wells
8d9f000f11 make getNumSpidersOutPerIp() specific to a coll
so another coll does not prevent a coll from populating
its own waiting tree.
2013-11-18 14:13:28 -08:00
Matt Wells
34bffc2cc6 1-second crawl info sleep wrapper update 2013-11-04 16:02:03 -08:00
Matt Wells
7b319e5948 show more info in the urls csv file.
record whether we processed the url or not
in the SpiderReply. normalize /index.html etc.
to / for the outlinks. in Links.cpp class.
2013-11-04 10:49:31 -08:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
c39b45ff88 fix crawl round end detection etc.
inc round counter even if not repeating crawl
2013-10-23 15:53:59 -07:00
Matt Wells
1fb85db307 url filters fixes. 2013-10-21 13:44:30 -07:00
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
mwells
be01041e36 added support for new url filter:
"lastspidertime>={roundstart} --> IGNORE"
so we can spider all urls before we advance
to the next spider round and re-spider everything
again. CollectionRec::m_spiderRoundStartTime and
CollectionRec::m_spiderRoundNum are the new
collection rec parms. show the round stuff
on url filters page.
2013-10-10 18:47:46 -06:00
mwells
2bb8b818d6 more bug fixes with notification system. 2013-10-09 16:28:15 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
mwells
612f2872f7 use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
e1bde7b7fe fixed bug of getting lock from the
wrong group.
2013-10-04 12:42:01 -06:00
mwells
d4aa65c0fe try to fix spiders with m_msg5StartKey logic. 2013-10-04 09:39:05 -06:00