1029 Commits

Author SHA1 Message Date
Matt Wells
f8135e628e fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
Matt Wells
0a4963f597 do not allow spot/seeds to be added to collnum being repaired
or rebuilt.
2014-02-16 15:18:50 -08:00
Matt Wells
fe0f2d3537 allow coll delete if not the one being repaired 2014-02-16 10:55:34 -08:00
Matt Wells
32526a9b25 more checksum fixes for json. fixes for
repair/rebuild procedure.
2014-02-16 10:46:41 -08:00
Matt Wells
df59d3946a fix content hash issues for json. do not
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
Matt Wells
734ce1fc55 fix core from a high priority
injection inserting records at the same
time as a lower priority spider.
2014-02-14 10:51:02 -08:00
Matt Wells
3271f22995 Merge branch 'diffbot-testing' into diffbot 2014-02-13 11:25:59 -08:00
Matt Wells
dc8b9090e8 fix out of alloc slots core 2014-02-13 11:21:39 -08:00
Matt Wells
08b103f3a4 Merge branch 'diffbot-testing' into diffbot
Conflicts:
	Spider.cpp
2014-02-13 10:11:56 -08:00
Matt Wells
c3d8a143be fix bug of process regex being ignored
when crawl regex was specified.
2014-02-13 10:06:14 -08:00
Matt Wells
4eee547391 do not do fuzzy deduping if &icc=1 (include cached copy)
is true for search results.
2014-02-13 08:51:03 -08:00
Matt Wells
cd6069e5a6 send single space to socket if not streaming
and search results still not ready after 10 seconds.
send it every 10 seconds to prevent client from closing socket.
sped up all downloads, json and csv, by not doing "fuzzy"
deduping of search results, but just deduping on page
content hash. added TcpSocket::m_numDestroys to ensure we
do not send heartbeat on a socket that was closed and
re-opened for another client.
2014-02-13 08:45:13 -08:00
Matt Wells
5f0ebb4aef fix stack overflow 2014-02-13 00:01:49 -08:00
Matt Wells
5db23c2eec fix infinite loop core thing. 2014-02-12 21:43:23 -08:00
Matt Wells
a9737ea97d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-12 21:20:01 -08:00
Matt Wells
d42e2377e7 return json download as search results now.
all smokes have passed.
2014-02-12 21:19:32 -08:00
Matt Wells
8bb17de3c5 pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
so diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
25eae3da39 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-12 13:21:57 -08:00
Matt Wells
0e48bbcea9 fix a core from bad return values 2014-02-12 13:21:30 -08:00
Matt Wells
51d514f276 use supplied mime if supplied when injecting 2014-02-11 13:02:30 -08:00
Matt Wells
69fa6662bc EDOCUNCHANGED fixes for diffbot 2014-02-10 16:23:39 -08:00
Matt Wells
44a9e08d38 fix EDOCUNCHANGED logic. 2014-02-10 14:56:22 -08:00
Matt Wells
debd9089e8 better logging msg when updating parm. 2014-02-10 11:29:24 -08:00
Matt Wells
b634d06287 fix some cores. use olddoc contenthash
for msg13 call for EDOCUNCHANGED errors.
2014-02-07 18:28:09 -08:00
Matt Wells
252d24dc2a fix core of page spiders 2014-02-07 10:46:10 -08:00
Matt Wells
573a04bccd fix bug in gbminint. 2014-02-06 21:36:47 -08:00
Matt Wells
edef3acf37 remove buggy line 2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e take out buggy statement. 2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d formatting fixes 2014-02-06 21:06:31 -08:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
Matt Wells
63e95c3b2d show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
Matt Wells
8d534b8ed8 many more fixes for streaming mode 2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52 fixes for streaming mode. 2014-02-06 16:28:42 -08:00
Matt Wells
5787b15884 Merge branch 'diffbot' into diffbot-testing 2014-02-06 15:26:21 -08:00
Matt Wells
8f6a4ee9b6 do not save collrecs all the time.
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
845611ae1b &stream=1 stream mode fixes. 2014-02-06 15:23:53 -08:00
Matt Wells
4cfe69a96f minor link updates 2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056 get streaming time sliced results working 2014-02-06 14:25:44 -08:00
Matt Wells
106077c163 fix spiderrequest deduping some more 2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937 more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f raise spiderdb minfilestomerge from 2 to 3
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99 delete from winnertable when deleting from winnertree 2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8 fix too many spiders out per ip some more 2014-02-05 17:11:45 -08:00
Matt Wells
2842350e6d gb.conf spiders back on 2014-02-05 16:59:06 -08:00
Matt Wells
5c8b9af1d3 fix rdbcache corruption from -O2 compile bug.
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00
Matt Wells
951e9d5068 wait 180 secs for diffbot reply 2014-02-05 15:46:26 -08:00
Matt Wells
c60dcf4ecb show userobots for bulk jobs 2014-02-05 15:45:39 -08:00
Matt Wells
d9f0d57c0c core fixes. csv fixes. 2014-02-05 14:56:22 -08:00
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
7806a8a68c fix excessive dupcache deduping. 2014-02-05 13:41:15 -08:00
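
Note on the "128 seconds of resolution" mentioned in commits 63e95c3b2d and 2d4af1aefe: the figure follows from float arithmetic. This is a minimal sketch, assuming the spider date is a Unix timestamp and the float is a 32-bit IEEE 754 single-precision value with a 24-bit significand. Timestamps from 2014 fall between 2^30 and 2^31, where adjacent floats are 2^(30-23) = 128 apart. The helper names and sample timestamps below are hypothetical and do not come from the repository.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: store an integer timestamp in a 32-bit float and
// read it back, as would happen if the value were indexed only as a float.
inline int32_t roundTripThroughFloat(int32_t t) {
    float f = static_cast<float>(t);   // rounds to the nearest representable float
    return static_cast<int32_t>(f);
}

// Spacing between adjacent floats for values in [2^30, 2^31):
// 24 significant bits leave 2^(30-23) = 128 between neighbours.
inline int32_t floatSpacingFor2014Timestamps() {
    return 1 << (30 - 23);
}

// Hypothetical check: do two distinct timestamps collapse to the same
// value once stored as floats?
inline bool collideAfterFloatStorage(int32_t a, int32_t b) {
    return roundTripThroughFloat(a) == roundTripThroughFloat(b);
}

// Usage sketch: two sample crawl times in February 2014, 40 seconds apart,
// become indistinguishable after float storage.
inline void demo() {
    assert(collideAfterFloatStorage(1391749060, 1391749100));
}
```

Indexing the same number as an integer as well, as commit 2d4af1aefe describes, avoids this rounding because a 32-bit integer holds every second in that range exactly.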