Matt
5e7a06229c
print special message if no seeds were able to be crawled.
2015-07-17 08:42:01 -06:00
Matt
1e3c52a0ef
fix infinite loop bug from performance enhancement
...
using active list for spidering i put in a few days back.
2015-06-23 13:52:02 -07:00
Matt Wells
bdebd79f4f
spiderloop active list bug fix.
...
change diffbot ip max from 1 to 7 again.
2015-06-18 15:05:16 -07:00
Matt Wells
e9f1ab1150
make donesleepingwrapper in spider.cpp faster
...
using the active list of colls to save time.
2015-06-18 08:38:46 -07:00
Matt
ef604e6d5e
introduce SpiderRequest::m_discoveryTime
2015-05-05 21:23:30 -07:00
Matt
7a7dacc56d
revive spiderwaited keyword for url filters
2015-05-05 20:36:45 -07:00
Matt Wells
d3ca12ab0a
keep corrupt spider replies from causing a url with
...
error to be spidered over and over again.
2015-05-01 14:50:05 -07:00
Matt
09a79d230c
check for .css?* better as media extensions.
...
do it when adding outlinks in xmldoc.cpp.
2015-04-28 14:42:04 -07:00
Matt Wells
61af961dfd
use m_sentToDiffbotThisTime in SpiderReply now too
2015-04-14 15:23:12 -07:00
Matt
90456222b6
now we add the spider status docs as json documents.
...
so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -06:00
Matt
a54471849b
sitemap.xml support for harvesting loc urls.
...
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt Wells
485c600c7c
fix for changing maxtocrawl/process/rounds.
...
fix for thinking a crawl is done when it is just taking
a while to populate doledb from the waiting tree for
that SpiderColl. we just call populateDoledbFromWaiting in
doneSleepingWrapper every 50ms. it loops over every coll
so it could be more efficient.
2015-03-12 13:15:52 -07:00
Matt
dfd6d8b2cf
fix critical spider bug that was deleting pages
...
because of bogus SpiderReply::m_langId values!
2015-03-05 08:49:39 -08:00
mwells
d4f67285ce
keep spiders maxed out all the time.
2015-02-25 18:40:50 -07:00
mwells
4e485b6649
increase dolebuf cache time from 2 to 5 mins
...
for better performance. cache empty dolebufs
if winner tree list was not from cache, so
in case we have a huge spiderdb scan list
of urls we aren't spidering we can cache it,
like twitter.com e.g. do not call strstr
in getUrlFilterNum2() for .css? or /print/
since it was taking way too much cpu time.
2015-02-21 15:17:28 -07:00
Matt
579a08d287
fixed link overflow logic.
2015-02-12 15:03:01 -08:00
Matt
415c96fc56
added overflow checks to ensure we don't have more
...
than 10M unique urls for a given "firstip"
queued up to be spidered in spiderdb
that have never been spidered. should prevent us
from having 20GB spiderdbs for spidering those sites
that essentially have an infinite # of urls, black hole
sites, that seem to be plaguing crawls.
2015-02-12 13:41:40 -08:00
Matt
c009430b6c
more fixes for new spider updates
2015-02-11 21:54:36 -08:00
Matt
9ea53ed89e
bug fixes. spidering seems to work somewhat again.
2015-02-11 19:23:36 -08:00
Matt
30a77dd422
checkpoint on massive spidering speed ups.
2015-02-11 17:55:28 -08:00
Matt
f6723ddaa3
new much faster spider. cache the winner tree
...
basically. TODO: need to update cache if
new spiderrequests are added that should be
in the cached winner tree.
2015-02-10 21:27:21 -08:00
Matt Wells
c8c56a24da
fixed query reindex for diffbot json docs.
...
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
2014-12-15 16:49:20 -08:00
Matt
adcef39376
Merge branch 'diffbot-testing' into diffbot-matt
...
Conflicts:
Collectiondb.cpp
Collectiondb.h
Conf.cpp
Conf.h
Msg39.cpp
PageEvents.cpp
PageResults.cpp
PageTurk.cpp
Pages.cpp
Parms.cpp
Posdb.cpp
Proxy.cpp
Query.cpp
Query.h
RdbBase.cpp
RdbMap.cpp
Repair.cpp
Repair.h
SafeBuf.cpp
Spider.cpp
Tagdb.cpp
TopTree.cpp
XmlDoc.cpp
main.cpp
2014-11-20 16:53:07 -08:00
Matt
4e8a42e024
text replacements for bad int32_t substitutions
2014-11-17 18:24:38 -08:00
Matt
69ef3c14ef
fixes for repair/rebuild functionality.
...
more to come.
2014-11-13 13:04:28 -08:00
Matt
4c19453ea9
working with -m32 for basic testing.
...
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7
replaced unsigned long long with uint64_t
2014-10-30 13:30:39 -06:00
mwells
f483fccc2e
if no crawl regex, and it has a crawl pattern consisting of
...
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
mwells
5a508cad69
upped MAX_SPIDERS from 100 to 300.
...
watch out for oom though.
2014-09-03 07:25:40 -07:00
Matt Wells
d2b1196a85
Merge branch 'diffbot-testing' into testing
2014-07-22 10:47:33 -07:00
Matt Wells
248b02ea9e
fix another spiderdb corruption core
2014-07-22 06:34:34 -07:00
mwells
cd48799030
try to fix core on neo
2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9
fix core from adding/deleting collection
2014-07-12 08:23:40 -07:00
Matt Wells
98b317b421
Merge branch 'diffbot-testing' into diffbot-matt
...
Conflicts:
Parms.cpp
Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
3162c83473
add some debug msgs
2014-06-27 08:28:28 -07:00
mwells
651f0f27ac
only send localcrawlinfo if it has
...
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
mwells
c314e61968
make sectiondb stats just a special case of facets
2014-06-17 16:39:02 -06:00
mwells
6dcbc10e92
spider proxy updates.
2014-06-03 11:38:44 -07:00
mwells
ba2329808b
fix siteListIsEmpty bug causing spider to
...
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
7ad9058f77
when doing a query reindex on a json
...
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
...
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
mwells
6048ae849b
added support for spidering a particular language
...
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
f8e561e6f4
more new site list api fixes
2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878
new site list updates
2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a
more site list based spider fixes to be more like gsa
2014-03-08 20:52:31 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
...
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
48b5330d9c
only skip checking to spider a url if its
...
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
9d0dca71db
fix rapid coll delete bug some more.
2014-02-16 20:13:06 -08:00