
102 Commits

Author SHA1 Message Date
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
f483fccc2e if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
mwells
5a508cad69 upped MAX_SPIDERS from 100 to 300.
watch out for oom though.
2014-09-03 07:25:40 -07:00
Matt Wells
d2b1196a85 Merge branch 'diffbot-testing' into testing 2014-07-22 10:47:33 -07:00
Matt Wells
248b02ea9e fix another spiderdb corruption core 2014-07-22 06:34:34 -07:00
mwells
cd48799030 try to fix core on neo 2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
3162c83473 add some debug msgs 2014-06-27 08:28:28 -07:00
mwells
651f0f27ac only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
mwells
c314e61968 make sectiondb stats just a special case of facets 2014-06-17 16:39:02 -06:00
mwells
6dcbc10e92 spider proxy updates. 2014-06-03 11:38:44 -07:00
mwells
ba2329808b fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
7ad9058f77 when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
f8e561e6f4 more new site list api fixes 2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878 new site list updates 2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a more site list based spider fixes to be more like gsa 2014-03-08 20:52:31 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
48b5330d9c only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
9d0dca71db fix rapid coll delete bug some more. 2014-02-16 20:13:06 -08:00
Matt Wells
f8135e628e fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
Matt Wells
6c9a44367f code checkpoint 2014-02-09 12:38:40 -07:00
Matt Wells
e60576c8eb another code checkpoint 2014-02-08 22:57:30 -07:00
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
9c26b85c2f fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e added winnertable to avoid dups in winnertree. 2014-02-04 20:09:43 -08:00
Matt Wells
3312400fee checkpoint for faster spider code. 2014-02-04 16:15:27 -08:00
Matt Wells
93021b2f13 Merge branch 'diffbot'
Conflicts:

	Collectiondb.cpp
	Spider.cpp
	Spider.h
2014-02-01 11:31:00 -07:00
Matt Wells
3a6a271dd9 make crawl sync bug fixes.
fix Puz crawl from dying out on host 9
because spider reply did not resuscitate waiting
tree for its ip.
fix mike's zola crawl with a repeat of 3 days
from not incrementing the round because it had
maxrounds 0, which means to ignore... assume 0
means to ignore now. send out 0xc1 crawl info
requests to even dead hosts so we can at least use
their last known good info.
2014-01-25 13:47:03 -08:00
Matt Wells
dd663eb9f7 fix round based spidering some more 2014-01-23 15:03:37 -08:00
Matt Wells
5f890f5d4f minor doc update 2014-01-22 15:52:04 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
9354d06493 menu updates. 2014-01-21 13:01:37 -08:00
Matt Wells
41cdfcef96 inc spider limits in various places 2014-01-20 18:51:15 -08:00
Matt Wells
946a683e39 quite a few spider fixes 2014-01-20 16:45:27 -08:00
Matt Wells
5c86d8a122 simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
471599e9e7 formatting 2014-01-19 10:44:19 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-07 12:10:59 -08:00