Commit Graph

1376 Commits

Author SHA1 Message Date
Daniel Steinberg
0efac8c156 Defect #2080: seed URLs duplicated 2014-03-25 17:25:55 -07:00
Daniel Steinberg
e1b1b15a38 bigger buffer 2014-03-25 16:34:40 -07:00
Daniel Steinberg
9846061dff when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder 2014-03-25 16:20:24 -07:00
Daniel Steinberg
ab90c06d8d add TODO for regex checking 2014-03-25 13:05:43 -07:00
Daniel Steinberg
1ff6c1fae0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan 2014-03-25 12:53:37 -07:00
Daniel Steinberg
b8836745f0 use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122) 2014-03-25 12:51:08 -07:00
mwells
b6e5424e32 do not download bulkjob urls in crawlbot. just return a fake http reply. however, do use crawl-delay throttling logic. deduping is already turned off for bulk jobs so it should be ok. 2014-03-21 12:40:38 -07:00
mwells
502752aba4 doc updates 2014-03-21 08:59:13 -07:00
Matt Wells
b33121af7d make all field names lower case without spaces when we hash them to make the prefixhash. since json names often have mixed case field names and spaces. 2014-03-20 16:08:02 -07:00
Matt Wells
98a10d4936 Merge branch 'testing' into diffbot-testing 2014-03-20 15:50:49 -07:00
Matt Wells
bbc8fc0c79 always show admin link 2014-03-20 15:48:51 -07:00
Matt Wells
99bd9319fd temp hack to reduce network comm between trinity and neo 2014-03-20 15:42:34 -07:00
Matt Wells
67202f3731 Merge branch 'diffbot' into diffbot-testing 2014-03-20 15:39:03 -07:00
Matt Wells
5ed19026d9 temp debug comments 2014-03-20 15:33:37 -07:00
Matt Wells
b8d0e95035 Merge branch 'diffbot' into diffbot-testing 2014-03-20 10:26:55 -07:00
mwells
ca0843aa8b more bool query fixes. 2014-03-20 10:03:25 -07:00
mwells
cfbec626e8 more righteous fixes for bool queries 2014-03-19 13:51:32 -07:00
mwells
ab3368b5a0 more bool fixes. not operator support. 2014-03-19 09:38:45 -07:00
mwells
1bb91149d6 more bool fixes 2014-03-18 14:42:50 -07:00
mwells
652892dc10 more bool fixes 2014-03-18 14:37:59 -07:00
mwells
f392826b1e nested bool query fixes 2014-03-18 14:08:59 -07:00
mwells
b7d80fd02d more bool query fixes 2014-03-18 13:41:36 -07:00
mwells
b31eaee9fd simple bool queries work 2014-03-18 12:07:29 -07:00
Matt Wells
d4302e3301 fix core 2014-03-18 11:12:50 -07:00
Matt Wells
3b97682cc3 more bool query fixes 2014-03-18 10:44:56 -07:00
Matt Wells
6e23d37e47 Merge branch 'diffbot' into diffbot-testing 2014-03-17 17:27:28 -07:00
mwells
54cc8088fb more bool query fixes. hopefully this will do it, but still can do some optimizations for speed. 2014-03-17 17:00:08 -07:00
Matt Wells
9d3c35ad17 nothing 2014-03-17 13:53:19 -07:00
Matt Wells
4abf56a75d cleanups 2014-03-16 18:06:22 -07:00
Matt Wells
d2511d0bef host table cleanups 2014-03-16 17:14:47 -07:00
Matt Wells
5057fdaf14 aesthetic cleanups 2014-03-16 17:12:04 -07:00
Matt Wells
d320bf9d75 spidering back on in main's coll.conf 2014-03-16 15:06:39 -07:00
Matt Wells
c513ad9418 Merge branch 'diffbot' into testing 2014-03-16 14:51:22 -07:00
Matt Wells
acd05aa740 fix a few minor bugs. /master/->/admin/ and crawl type mismatch. 2014-03-16 10:34:58 -07:00
Matt Wells
edbd61b0c5 thread fixes. if pthread_create fails then keep thread queue and just return. will try to relaunch later. do not count delete keys towards shard rebalance count. 2014-03-15 20:07:02 -07:00
Matt Wells
5ca411e3e2 tuning the rebalance loop 2014-03-15 14:56:11 -07:00
Matt Wells
86147fe22c tight merge during rebalance to save disk space, so neg recs annihilate pos recs. 2014-03-14 23:37:30 -07:00
Matt Wells
6c704f6fdf Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-03-14 22:16:40 -07:00
Matt Wells
e37eebd76f when rebalancing wait for merge to complete before scanning more 2014-03-14 22:16:25 -07:00
Matt Wells
82ac3fab6c merge fixes 2014-03-14 22:15:08 -07:00
Matt Wells
df46a6fc1d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot-matt 2014-03-14 19:32:10 -07:00
Matt Wells
1f162ce7b2 update localhosts.conf too 2014-03-14 19:20:23 -07:00
Matt Wells
553aefdb55 keep files tightly merged when doing rebalanced to avoid running out of disk space 2014-03-14 19:19:41 -07:00
mwells
cb483c42ea more fixes for bool searching before using a slightly different and simpler approach 2014-03-13 16:00:23 -07:00
mwells
7812f5c746 more bool fixes. still needs a little more work 2014-03-13 13:54:23 -07:00
mwells
3b2d981dff more fixes for new boolean logic. 2014-03-13 13:09:33 -07:00
Matt Wells
fb0123ad53 nothing 2014-03-13 11:27:28 -07:00
Matt Wells
9acb7ef0f4 fix core &token= core 2014-03-13 07:57:06 -07:00
Daniel Steinberg
7b5816f194 updated error message 2014-03-12 20:56:27 -07:00
Matt Wells
018258bcaa Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-12 20:55:21 -07:00