Matt Wells
f8135e628e
fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
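A minimal sketch of the tie-break this commit describes, assuming both candidates are already due to be spidered. The struct, its field names, and the "higher number wins" priority convention are hypothetical and not taken from Spider.cpp.

```cpp
#include <cstdint>

// Hypothetical, simplified view of a url waiting to be spidered.
struct SpiderCandidate {
    int32_t priority;  // assumption: a larger value means more urgent
    int32_t hopCount;  // link distance from a seed url
};

// Returns true if a should be spidered before b.
// Priority decides first; on a tie the smaller hop count wins,
// which yields breadth-first order within one priority level.
inline bool spiderBefore(const SpiderCandidate &a, const SpiderCandidate &b) {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.hopCount < b.hopCount;
}
```

Two candidates with equal priority and equal hop count compare as equivalent here, so any further ordering between them is left to the caller.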
Matt Wells
0a4963f597
do not allow spot/seeds to be added to collnum being repaired
or rebuilt.
2014-02-16 15:18:50 -08:00
Matt Wells
fe0f2d3537
allow coll delete if not the one being repaired
2014-02-16 10:55:34 -08:00
Matt Wells
32526a9b25
more checksum fixes for json. fixes for
repair/rebuild procedure.
2014-02-16 10:46:41 -08:00
Matt Wells
df59d3946a
fix content hash issues for json. do not
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
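One way an order-independent content hash over field/value pairs could look, as a sketch only. The commit does not say which hash function or combining step is used; FNV-1a and wrapping addition are swapped in here for illustration, and the names fnv1a, Field and contentHash are hypothetical. Only the skipped field names (url, resolved_url, html) come from the commit message.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// 64-bit FNV-1a over a byte string, continuing from hash state h.
inline uint64_t fnv1a(const std::string &s,
                      uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

using Field = std::pair<std::string, std::string>;  // (name, value)

// Hashes each kept pair on its own, then adds the results.
// Addition is commutative, so field order does not affect the sum.
inline uint64_t contentHash(const std::vector<Field> &fields) {
    uint64_t sum = 0;
    for (const Field &f : fields) {
        // These fields are excluded from the hash, per the commit message.
        if (f.first == "url" || f.first == "resolved_url" ||
            f.first == "html")
            continue;
        uint64_t h = fnv1a(f.first);
        h = fnv1a(std::string(1, '\0'), h);  // separator between name and value
        h = fnv1a(f.second, h);
        sum += h;  // unsigned overflow wraps, which is fine for a hash
    }
    return sum;
}
```

With this shape, two JSON replies that differ only in field order or in the skipped fields hash the same, which is the property a check like EDOCUNCHANGED would rely on.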
Matt Wells
734ce1fc55
fix core from a high priority
injection insert records at the same
time as a lower priority spider.
2014-02-14 10:51:02 -08:00
Matt Wells
3271f22995
Merge branch 'diffbot-testing' into diffbot
2014-02-13 11:25:59 -08:00
Matt Wells
dc8b9090e8
fix out of alloc slots core
2014-02-13 11:21:39 -08:00
Matt Wells
08b103f3a4
Merge branch 'diffbot-testing' into diffbot
Conflicts:
Spider.cpp
2014-02-13 10:11:56 -08:00
Matt Wells
c3d8a143be
fix bug of process regex being ignored
when crawl regex was specified.
2014-02-13 10:06:14 -08:00
Matt Wells
4eee547391
do not do fuzzy deduping if &icc=1 (include cached copy)
is true for search results.
2014-02-13 08:51:03 -08:00
Matt Wells
cd6069e5a6
send single space to socket if not streaming
and search results still not ready after 10 seconds.
send it every 10 seconds to prevent client from closing socket.
sped up all downloads, json and csv, by not doing "fuzzy"
deduping of search results, but just deduping on page
content hash. added TcpSocket::m_numDestroys to ensure we
do not send heartbeat on a socket that was closed and
re-opened for another client.
2014-02-13 08:45:13 -08:00
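A sketch of how a destroy counter can guard a delayed heartbeat against a reused socket. Only the member name m_numDestroys comes from the commit message; SocketSlot, HeartbeatTicket and the helper functions are hypothetical, and the byte counter merely stands in for writing a single space to the socket.

```cpp
#include <cstdint>

// Hypothetical stand-in for a reusable socket slot.
struct SocketSlot {
    int32_t m_numDestroys = 0;  // incremented every time the slot is destroyed
    bool    m_open = false;
    int32_t m_bytesSent = 0;    // stands in for data written to the client
};

// Remembers which "generation" of the slot the heartbeat was started for.
struct HeartbeatTicket {
    SocketSlot *slot;
    int32_t numDestroysAtStart;
};

inline void openSlot(SocketSlot &s) { s.m_open = true; s.m_bytesSent = 0; }
inline void destroySlot(SocketSlot &s) { s.m_open = false; ++s.m_numDestroys; }

inline HeartbeatTicket startHeartbeat(SocketSlot &s) {
    return HeartbeatTicket{&s, s.m_numDestroys};
}

// Called by a timer (every 10 seconds in the commit's description).
// Returns true only if the space was "sent" to the original client.
inline bool sendHeartbeat(const HeartbeatTicket &t) {
    SocketSlot &s = *t.slot;
    if (!s.m_open || s.m_numDestroys != t.numDestroysAtStart)
        return false;  // slot was closed, possibly re-opened for someone else
    s.m_bytesSent += 1;  // one space character
    return true;
}
```

Comparing the counter rather than only checking that the slot is open is what prevents a stale timer from writing into a connection that now belongs to a different client.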
Matt Wells
5f0ebb4aef
fix stack overflow
2014-02-13 00:01:49 -08:00
Matt Wells
5db23c2eec
fix infinite loop core thing.
2014-02-12 21:43:23 -08:00
Matt Wells
a9737ea97d
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-02-12 21:20:01 -08:00
Matt Wells
d42e2377e7
return json download as search results now.
all smokes have passed.
2014-02-12 21:19:32 -08:00
Matt Wells
8bb17de3c5
pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
so diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
25eae3da39
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-02-12 13:21:57 -08:00
Matt Wells
0e48bbcea9
fix a core from bad return values
2014-02-12 13:21:30 -08:00
Matt Wells
51d514f276
use supplied mime if supplied when injecting
2014-02-11 13:02:30 -08:00
Matt Wells
69fa6662bc
EDOCUNCHANGED fixes for diffbot
2014-02-10 16:23:39 -08:00
Matt Wells
44a9e08d38
fix EDOCUNCHANGED logic.
2014-02-10 14:56:22 -08:00
Matt Wells
debd9089e8
better logging msg when updating parm.
2014-02-10 11:29:24 -08:00
Matt Wells
b634d06287
fix some cores. use olddoc contenthash
for msg13 call for EDOCUNCHANGED errors.
2014-02-07 18:28:09 -08:00
Matt Wells
252d24dc2a
fix core of page spiders
2014-02-07 10:46:10 -08:00
Matt Wells
573a04bccd
fix bug in gbminint.
2014-02-06 21:36:47 -08:00
Matt Wells
edef3acf37
remove buggy line
2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e
take out buggy statement.
2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d
formatting fixes
2014-02-06 21:06:31 -08:00
Matt Wells
2d4af1aefe
index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
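The 128-second figure follows from 32-bit float precision: a float carries 24 significant bits, and Unix timestamps from this period lie between 2^30 and 2^31, so adjacent representable values are 2^(30-23) = 128 apart. The sketch below illustrates that, assuming IEEE-754 single precision; the helper names are hypothetical and unrelated to the indexing code.

```cpp
#include <cmath>
#include <cstdint>

// Distance from the float nearest to unixTime up to the next larger float.
inline float floatStepAt(uint32_t unixTime) {
    float f = static_cast<float>(unixTime);
    return std::nextafter(f, INFINITY) - f;
}

// Stores a timestamp as a float and reads it back as an integer.
// Anything finer than the float step is lost on the way.
inline uint32_t roundTripThroughFloat(uint32_t unixTime) {
    return static_cast<uint32_t>(static_cast<float>(unixTime));
}
```

For example, 1391749074 (this commit's timestamp in UTC seconds) cannot be stored exactly as a float: the round trip lands on a multiple of 128, so two spider dates less than about two minutes apart may sort as equal unless an integer copy is indexed too.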
Matt Wells
63e95c3b2d
show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
Matt Wells
8d534b8ed8
many more fixes for streaming mode
2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52
fixes for streaming mode.
2014-02-06 16:28:42 -08:00
Matt Wells
5787b15884
Merge branch 'diffbot' into diffbot-testing
2014-02-06 15:26:21 -08:00
Matt Wells
8f6a4ee9b6
do not save collrecs all the time.
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
845611ae1b
&stream=1 stream mode fixes.
2014-02-06 15:23:53 -08:00
Matt Wells
4cfe69a96f
minor link updates
2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056
get streaming time sliced results working
2014-02-06 14:25:44 -08:00
Matt Wells
106077c163
fix spiderrequest deduping some more
2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937
more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f
raise spiderdb minfilestomerge from 2 to 3
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99
delete from winnertable when deleting from winnertree
2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8
fix too many spiders out per ip some more
2014-02-05 17:11:45 -08:00
Matt Wells
2842350e6d
gb.conf spiders back on
2014-02-05 16:59:06 -08:00
Matt Wells
5c8b9af1d3
fix rdbcache corruption from -O2 compile bug.
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00
Matt Wells
951e9d5068
wait 180 secs for diffbot reply
2014-02-05 15:46:26 -08:00
Matt Wells
c60dcf4ecb
show userobots for bulk jobs
2014-02-05 15:45:39 -08:00
Matt Wells
d9f0d57c0c
core fixes. csv fixes.
2014-02-05 14:56:22 -08:00
Matt Wells
ecc10c2cb9
dup cache fixes. do not add dups to spiderdb either.
2014-02-05 14:09:35 -08:00
Matt Wells
7806a8a68c
fix excessive dupcache deduping.
2014-02-05 13:41:15 -08:00