Matt
4e8a42e024
text replacements for bad int32_t substitutions
2014-11-17 18:24:38 -08:00
Matt
4c19453ea9
working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7
replaced unsigned long long with uint64_t
2014-10-30 13:30:39 -06:00
mwells
f483fccc2e
if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
mwells
5a508cad69
upped MAX_SPIDERS from 100 to 300.
watch out for oom though.
2014-09-03 07:25:40 -07:00
Matt Wells
d2b1196a85
Merge branch 'diffbot-testing' into testing
2014-07-22 10:47:33 -07:00
Matt Wells
248b02ea9e
fix another spiderdb corruption core
2014-07-22 06:34:34 -07:00
mwells
cd48799030
try to fix core on neo
2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9
fix core from adding/deleting collection
2014-07-12 08:23:40 -07:00
Matt Wells
98b317b421
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Parms.cpp
Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
3162c83473
add some debug msgs
2014-06-27 08:28:28 -07:00
mwells
651f0f27ac
only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
mwells
c314e61968
make sectiondb stats just a special case of facets
2014-06-17 16:39:02 -06:00
mwells
6dcbc10e92
spider proxy updates.
2014-06-03 11:38:44 -07:00
mwells
ba2329808b
fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
7ad9058f77
when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
mwells
6048ae849b
added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
f8e561e6f4
more new site list api fixes
2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878
new site list updates
2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a
more site list based spider fixes to be more like gsa
2014-03-08 20:52:31 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
48b5330d9c
only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
9d0dca71db
fix rapid coll delete bug some more.
2014-02-16 20:13:06 -08:00
Matt Wells
f8135e628e
fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
Matt Wells
6c9a44367f
code checkpoint
2014-02-09 12:38:40 -07:00
Matt Wells
e60576c8eb
another code checkpoint
2014-02-08 22:57:30 -07:00
Matt Wells
ecc10c2cb9
dup cache fixes. do not add dups to spiderdb either.
2014-02-05 14:09:35 -08:00
Matt Wells
9c26b85c2f
fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb
do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e
added winnertable to avoid dups in winnertree.
2014-02-04 20:09:43 -08:00
Matt Wells
3312400fee
checkpoint for faster spider code.
2014-02-04 16:15:27 -08:00
Matt Wells
93021b2f13
Merge branch 'diffbot'
Conflicts:
Collectiondb.cpp
Spider.cpp
Spider.h
2014-02-01 11:31:00 -07:00
Matt Wells
3a6a271dd9
make crawl sync bug fixes.
fix Puz crawl from dying out on host 9
because spider reply did not resuscitate waiting
tree for its ip.
fix mike's zola crawl with a repeat of 3 days
from not incrementing the round because it had
maxrounds 0, which means to ignore... assume 0
means to ignore now. send out 0xc1 crawl info
requests to even dead hosts so we can at least use
their last known good info.
2014-01-25 13:47:03 -08:00
Matt Wells
dd663eb9f7
fix round based spidering some more
2014-01-23 15:03:37 -08:00
Matt Wells
5f890f5d4f
minor doc update
2014-01-22 15:52:04 -08:00
Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
9354d06493
menu updates.
2014-01-21 13:01:37 -08:00
Matt Wells
41cdfcef96
inc spider limits in various places
2014-01-20 18:51:15 -08:00
Matt Wells
946a683e39
quite a few spider fixes
2014-01-20 16:45:27 -08:00
Matt Wells
5c86d8a122
simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f
took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
089d7f34a0
more spiderdb spider request fixes
2014-01-19 18:00:56 -08:00
Matt Wells
471599e9e7
formatting
2014-01-19 10:44:19 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
5b7170e8c6
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
Json.cpp
PageAddUrl.cpp
PageStats.cpp
Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee
tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
909022642d
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-07 12:10:59 -08:00