Commit Graph

129 Commits

Author SHA1 Message Date
Matt
09de59f026 do not store cblock, etc. tags into tagdb to save
disk space. added tagdb file cache for better performance,
less disk accesses. will help reduce disk load.
put file cache sizes in master controls and if they change
then update the cache size dynamically.
2015-09-10 12:46:00 -06:00
Matt Wells
2483b73cbd fix infinite scanning loops caused by corrupt spiderdb records. 2015-09-09 14:28:02 -07:00
Matt Wells
0bbb493199 limit crawl delay to 60 seconds 2015-09-05 10:38:44 -07:00
Matt
5e7a06229c print special message if no seeds were able to be crawled. 2015-07-17 08:42:01 -06:00
Matt
1e3c52a0ef fix infinite loop bug from performance enhancement
using active list for spidering i put in a few days back.
2015-06-23 13:52:02 -07:00
Matt Wells
bdebd79f4f spiderloop active list bug fix.
change diffbot ip max from 1 to 7 again.
2015-06-18 15:05:16 -07:00
Matt Wells
e9f1ab1150 make donesleepingwrapper in spider.cpp faster
using the active list of colls to save time.
2015-06-18 08:38:46 -07:00
Matt
ef604e6d5e introduce SpiderRequest::m_discoveryTime 2015-05-05 21:23:30 -07:00
Matt
7a7dacc56d revive spiderwaited keyword for url filters 2015-05-05 20:36:45 -07:00
Matt Wells
d3ca12ab0a fix corrupt spider replies from causing a url with
error to be spidered over and over again.
2015-05-01 14:50:05 -07:00
Matt
09a79d230c check for .css?* better as media extensions.
do it when adding outlinks in xmldoc.cpp.
2015-04-28 14:42:04 -07:00
Matt Wells
61af961dfd use m_sentToDiffbotThisTime in SpiderReply now too 2015-04-14 15:23:12 -07:00
Matt
90456222b6 now we add the spider status docs as json documents.
so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -06:00
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt Wells
485c600c7c fix for changing maxtocrawl/process/rounds.
fix for thinking a crawl is done when it is just taking
a while to populate doledb from the waiting tree for
that SpiderColl. we just call populateDoledbFromWaiting in
doneSleepingWrapper every 50ms. it loops over every coll
so it could be more efficient.
2015-03-12 13:15:52 -07:00
Matt
dfd6d8b2cf fix critical spider bug that was deleting pages
because of bogus SpiderReply::m_langId values!
2015-03-05 08:49:39 -08:00
mwells
d4f67285ce keep spiders maxed out all the time. 2015-02-25 18:40:50 -07:00
mwells
4e485b6649 increase dolebuf cache time from 2 to 5 mins
for better performance. cache empty dolebufs
if winner tree list was not from cache, so
in case we have a huge spiderdb scan list
of urls we aren't spidering we can cache it,
like twitter.com e.g. do not call strstr
in getUrlFilterNum2() for .css? or /print/
since it was taking way too much cpu time.
2015-02-21 15:17:28 -07:00
Matt
579a08d287 fixed link overflow logic. 2015-02-12 15:03:01 -08:00
Matt
415c96fc56 added overflow checks to ensure we don't have more
than 10M unique urls for a given "firstip"
queued up to be spidered in spiderdb
that have never been spidered. should prevent us
from having 20GB spiderdbs for spidering those sites
that essentially have an infinite # of urls, black hole
sites, that seems to be plaguing crawls.
2015-02-12 13:41:40 -08:00
Matt
c009430b6c more fixes for new spider updates 2015-02-11 21:54:36 -08:00
Matt
9ea53ed89e bug fixes. spidering seems to work somewhat again. 2015-02-11 19:23:36 -08:00
Matt
30a77dd422 checkpoint on massive spidering speed ups. 2015-02-11 17:55:28 -08:00
Matt
f6723ddaa3 new much faster spider. cache the winner tree
basically. TODO: need to update cache if
new spiderrequests are added that should be
in the cached winner tree.
2015-02-10 21:27:21 -08:00
Matt Wells
c8c56a24da fixed query reindex for diffbot json docs.
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
2014-12-15 16:49:20 -08:00
Matt
adcef39376 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Collectiondb.h
	Conf.cpp
	Conf.h
	Msg39.cpp
	PageEvents.cpp
	PageResults.cpp
	PageTurk.cpp
	Pages.cpp
	Parms.cpp
	Posdb.cpp
	Proxy.cpp
	Query.cpp
	Query.h
	RdbBase.cpp
	RdbMap.cpp
	Repair.cpp
	Repair.h
	SafeBuf.cpp
	Spider.cpp
	Tagdb.cpp
	TopTree.cpp
	XmlDoc.cpp
	main.cpp
2014-11-20 16:53:07 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
69ef3c14ef fixes for repair/rebuild functionality.
more to come.
2014-11-13 13:04:28 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
f483fccc2e if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
mwells
5a508cad69 upped MAX_SPIDERS from 100 to 300.
watch out for oom though.
2014-09-03 07:25:40 -07:00
Matt Wells
d2b1196a85 Merge branch 'diffbot-testing' into testing 2014-07-22 10:47:33 -07:00
Matt Wells
248b02ea9e fix another spiderdb corruption core 2014-07-22 06:34:34 -07:00
mwells
cd48799030 try to fix core on neo 2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
3162c83473 add some debug msgs 2014-06-27 08:28:28 -07:00
mwells
651f0f27ac only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
mwells
c314e61968 make sectiondb stats just a special case of facets 2014-06-17 16:39:02 -06:00
mwells
6dcbc10e92 spider proxy updates. 2014-06-03 11:38:44 -07:00
mwells
ba2329808b fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
7ad9058f77 when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
f8e561e6f4 more new site list api fixes 2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878 new site list updates 2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a more site list based spider fixes to be more like gsa 2014-03-08 20:52:31 -07:00