1460 Commits

Author SHA1 Message Date
mwells
a811462d5f spider proxy stuff compiles now 2014-05-30 15:05:00 -07:00
mwells
8fb8669da1 more spider proxy updates. 2014-05-29 21:17:51 -06:00
Matt Wells
d928a16211 Merge branch 'diffbot-testing' into diffbot-matt 2014-05-27 15:22:38 -07:00
Matt Wells
f341dba0c8 got the general framework for load-balanced/reliable
floaters in place for the distributed spider network.
need to fill in the blanks now.
2014-05-27 15:21:12 -07:00
Daniel Steinberg
7448e8a1ff don't use "expand" for mode= requests or non-analyze requests 2014-05-26 20:38:44 -07:00
Matt Wells
2d4fb483b2 disambiguate error msg 2014-05-26 10:46:10 -07:00
Matt Wells
8234aaed23 put lastspidertimeutc back in because we need
it for debugging.
2014-05-23 09:43:46 -07:00
Matt Wells
e3b6f6b74e a second fix for crawls saying they're done and
then resuming. it seems to happen when we turn
spiders off then back on again. so hack that.
2014-05-23 07:29:18 -07:00
Matt Wells
1f4dc2df97 fix bug in spider scan
of spiderdb for unique firstips
2014-05-22 13:08:01 -07:00
Matt Wells
68fcffb2da speed up scan of spiderdb
to repopulate waiting tree by jumping over
last firstip.
2014-05-22 12:20:03 -07:00
Matt Wells
e9c4c9bb9a fix possible loss of data when doing reads
on especially doledb.
2014-05-22 11:06:56 -07:00
Matt Wells
1660805f66 more useful logging for debugging 2014-05-22 10:36:44 -07:00
Matt Wells
32735677d2 wait 45 seconds before ending round, not 30
to try to fix some issues...
2014-05-22 08:32:19 -07:00
Matt Wells
935cc72e19 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-21 13:55:29 -07:00
Matt Wells
b8886c399c show start/end job times on pagecrawlbot. 2014-05-21 13:55:01 -07:00
Matt Wells
61fc015014 fix potential diffbot injection bug 2014-05-21 12:21:29 -07:00
Matt Wells
b0c87b355c log update 2014-05-21 10:09:50 -07:00
Matt Wells
45df139ccb update logging 2014-05-21 10:05:49 -07:00
Matt Wells
7ad9058f77 when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
then the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
34afc7c7cf Merge branch 'diffbot-dan' into diffbot-testing 2014-05-21 05:30:56 -07:00
Daniel Steinberg
e39dffadcf use "expand" option when calling Diffbot 2014-05-20 22:00:46 -07:00
Matt Wells
4b587f168b fix bug of not including empty responses when &icc=1 2014-05-20 21:07:21 -07:00
Matt Wells
c729b51ae5 fixed exact # search results hit count
when using min/max/sort operators.
2014-05-20 13:45:00 -07:00
Matt Wells
6664faa792 fix printing back-to-back commas when showing
results in json with &icc=1.
2014-05-20 13:23:29 -07:00
Matt Wells
cd3e11b6ee Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-16 18:48:06 -07:00
Matt Wells
d2cc117d82 fix oops 2014-05-16 18:47:52 -07:00
Matt Wells
526be98ec8 fix core scenario when diffbot reply that was injected
using &diffbotreply= contains the http mime.
2014-05-16 18:46:39 -07:00
Matt Wells
baf1ccb7d5 note updates 2014-05-16 09:52:41 -07:00
Matt Wells
eea5dff0f5 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-16 09:38:42 -07:00
Matt Wells
a22396c344 quick doc update 2014-05-16 09:38:32 -07:00
Matt Wells
2484147403 fix core 2014-05-16 09:30:46 -07:00
Matt Wells
1af8ca846f Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-16 08:08:42 -07:00
Matt Wells
a81f2145bd fix sendmail ip to 127.0.0.1 2014-05-16 08:08:20 -07:00
Matt Wells
4684298965 minor doc update 2014-05-16 08:01:29 -07:00
Matt Wells
2ce6ed266a fix another core from a 0 docid 2014-05-16 07:59:04 -07:00
Matt Wells
6d9fdc975b fix core from not setting m_gotClusterRecs in Msg39.cpp 2014-05-16 06:32:51 -07:00
Matt Wells
5c2cc973a8 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-15 18:27:13 -07:00
Matt Wells
a303bda1f8 fix core 2014-05-15 15:10:57 -07:00
Matt Wells
b38f62c7dc nothing 2014-05-15 14:15:05 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Daniel Steinberg
fc5cfa2a62 move list of bulk urls to new directory earlier. May fix Defect #2218 if there is something that is causing the bulk job to restart before this function returns 2014-05-15 13:35:32 -07:00
Daniel Steinberg
6afa3f2561 save spots to disk as space separated 2014-05-14 14:40:46 -07:00
Matt Wells
00b652581f fix boolean query containing quoted phrase 2014-05-14 11:22:07 -07:00
Matt Wells
8ac7fdfa24 Msg39::controlLoop now works 2014-05-14 11:02:09 -07:00
Matt Wells
d95cbb42d6 Merge branch 'diffbot-testing' into diffbot-matt 2014-05-14 10:52:45 -07:00
Matt Wells
db543ddd9f nothing 2014-05-14 09:37:59 -07:00
Matt Wells
40bca5d120 try to fix msg22 core some more 2014-05-14 08:16:47 -07:00
Matt Wells
48df53e74f Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
	Msg22.cpp
2014-05-14 07:48:23 -07:00
Matt Wells
0242fe88ff try to fix msg22 based cores 2014-05-14 07:46:32 -07:00
Matt Wells
88eb44827f fix avail docid logic some more for indexing
spider replies
2014-05-13 21:27:05 -07:00