Commit Graph

117 Commits

Author SHA1 Message Date
Matt
90456222b6 now we add the spider status docs as json documents.
so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -06:00
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt Wells
485c600c7c fix for changing maxtocrawl/process/rounds.
fix for thinking a crawl is done when it is just taking
a while to populate doledb from the waiting tree for
that SpiderColl. we just call populateDoledbFromWaiting in
doneSleepingWrapper every 50ms. it loops over every coll
so it could be more efficient.
2015-03-12 13:15:52 -07:00
Matt
dfd6d8b2cf fix critical spider bug that was deleting pages
because of bogus SpiderReply::m_langId values!
2015-03-05 08:49:39 -08:00
mwells
d4f67285ce keep spiders maxed out all the time. 2015-02-25 18:40:50 -07:00
mwells
4e485b6649 increase dolebuf cache time from 2 to 5 mins
for better performance. cache empty dolebufs
if winner tree list was not from cache, so
in case we have a huge spiderdb scan list
of urls we aren't spidering we can cache it,
like twitter.com e.g. do not call strstr
in getUrlFilterNum2() for .css? or /print/
since it was taking way too much cpu time.
2015-02-21 15:17:28 -07:00
Matt
579a08d287 fixed link overflow logic. 2015-02-12 15:03:01 -08:00
Matt
415c96fc56 added overflow checks to ensure we don't have more
than 10M unique urls for a given "firstip"
queued up to be spidered in spiderdb
that have never been spidered. should prevent us
from having 20GB spiderdbs for spidering those sites
that essentially have an infinite # of urls, black hole
sites, that seem to be plaguing crawls.
2015-02-12 13:41:40 -08:00
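
The overflow check described in 415c96fc56 can be illustrated with a minimal C++ sketch. This is not the code from that commit: the class FirstIpQuota and its methods are hypothetical names, and the sketch assumes the caller has already de-duplicated urls and only counts ones that have never been spidered. The cap is a constructor argument; the commit message gives 10M as the limit.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical helper (not from the source tree): caps how many
// never-spidered urls may be queued for a single "firstip".
class FirstIpQuota {
public:
    explicit FirstIpQuota(uint64_t maxPending) : m_maxPending(maxPending) {}

    // Returns true if the url may be queued, false if the firstip
    // already holds the maximum number of pending urls.
    bool tryAdd(uint32_t firstIp) {
        uint64_t &n = m_pending[firstIp];
        if (n >= m_maxPending) return false;
        ++n;
        return true;
    }

    // Call once a queued url has been spidered, freeing one slot.
    void markSpidered(uint32_t firstIp) {
        auto it = m_pending.find(firstIp);
        if (it == m_pending.end() || it->second == 0) return;
        if (--it->second == 0) m_pending.erase(it);
    }

private:
    uint64_t m_maxPending;
    std::unordered_map<uint32_t, uint64_t> m_pending;  // firstip -> pending count
};
```

With a cap of 2, a third tryAdd for the same firstip is rejected until markSpidered frees a slot, while other firstips are unaffected.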
Matt
c009430b6c more fixes for new spider updates 2015-02-11 21:54:36 -08:00
Matt
9ea53ed89e bug fixes. spidering seems to work somewhat again. 2015-02-11 19:23:36 -08:00
Matt
30a77dd422 checkpoint on massive spidering speed ups. 2015-02-11 17:55:28 -08:00
Matt
f6723ddaa3 new much faster spider. cache the winner tree
basically. TODO: need to update cache if
new spiderrequests are added that should be
in the cached winner tree.
2015-02-10 21:27:21 -08:00
Matt Wells
c8c56a24da fixed query reindex for diffbot json docs.
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
2014-12-15 16:49:20 -08:00
Matt
adcef39376 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Collectiondb.h
	Conf.cpp
	Conf.h
	Msg39.cpp
	PageEvents.cpp
	PageResults.cpp
	PageTurk.cpp
	Pages.cpp
	Parms.cpp
	Posdb.cpp
	Proxy.cpp
	Query.cpp
	Query.h
	RdbBase.cpp
	RdbMap.cpp
	Repair.cpp
	Repair.h
	SafeBuf.cpp
	Spider.cpp
	Tagdb.cpp
	TopTree.cpp
	XmlDoc.cpp
	main.cpp
2014-11-20 16:53:07 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
69ef3c14ef fixes for repair/rebuild functionality.
more to come.
2014-11-13 13:04:28 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
f483fccc2e if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
mwells
5a508cad69 upped MAX_SPIDERS from 100 to 300.
watch out for oom though.
2014-09-03 07:25:40 -07:00
Matt Wells
d2b1196a85 Merge branch 'diffbot-testing' into testing 2014-07-22 10:47:33 -07:00
Matt Wells
248b02ea9e fix another spiderdb corruption core 2014-07-22 06:34:34 -07:00
mwells
cd48799030 try to fix core on neo 2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
3162c83473 add some debug msgs 2014-06-27 08:28:28 -07:00
mwells
651f0f27ac only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
mwells
c314e61968 make sectiondb stats just a special case of facets 2014-06-17 16:39:02 -06:00
mwells
6dcbc10e92 spider proxy updates. 2014-06-03 11:38:44 -07:00
mwells
ba2329808b fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
7ad9058f77 when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
f8e561e6f4 more new site list api fixes 2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878 new site list updates 2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a more site list based spider fixes to be more like gsa 2014-03-08 20:52:31 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:
	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
48b5330d9c only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
9d0dca71db fix rapid coll delete bug some more. 2014-02-16 20:13:06 -08:00
Matt Wells
f8135e628e fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
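
The tie-break described in f8135e628e can be sketched as a comparator. This is an illustration only, not the code from the commit: the Candidate struct and spiderBefore function are hypothetical names, and the sketch assumes a higher priority value wins and that both candidates are already due to be spidered.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a queued spider request.
struct Candidate {
    int32_t priority;  // assumed: higher value is spidered first
    int32_t hopCount;  // link distance from a seed url
};

// Returns true if a should be spidered before b.
// Priority decides first; on a tie the lower hop count wins,
// which yields breadth-first ordering within a priority level.
bool spiderBefore(const Candidate &a, const Candidate &b) {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.hopCount < b.hopCount;
}
```

Because equal candidates compare false in both directions, the function is a strict weak ordering and could be passed to std::sort or a similar container comparator.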
Matt Wells
6c9a44367f code checkpoint 2014-02-09 12:38:40 -07:00
Matt Wells
e60576c8eb another code checkpoint 2014-02-08 22:57:30 -07:00
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
9c26b85c2f fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e added winnertable to avoid dups in winnertree. 2014-02-04 20:09:43 -08:00
Matt Wells
3312400fee checkpoint for faster spider code. 2014-02-04 16:15:27 -08:00
Matt Wells
93021b2f13 Merge branch 'diffbot'
Conflicts:
	Collectiondb.cpp
	Spider.cpp
	Spider.h
2014-02-01 11:31:00 -07:00