Daniel Steinberg
41e3988fbc
not a conf file
2014-03-10 13:57:13 -07:00
Daniel Steinberg
4a7bf5d4d0
Story #2040 : store raw URL submissions for customer bulk jobs
2014-03-10 13:50:30 -07:00
Matt Wells
bfcb7082f4
fix bug from nuking doledb on a new collection.
2014-03-10 13:48:00 -07:00
Matt Wells
bd4484db3c
Merge branch 'testing' into diffbot-testing
2014-03-10 12:08:23 -07:00
Matt Wells
9debee20dc
Merge branch 'diffbot' into testing
2014-03-09 20:44:09 -07:00
Matt Wells
662b6d4b32
doc updates
2014-03-09 20:43:49 -07:00
Matt Wells
90ff2c2a25
update example site lists
2014-03-09 20:35:45 -07:00
Matt Wells
82db7240a3
simple print update
2014-03-09 19:43:32 -07:00
Matt Wells
f7b7274ff1
replace "exact:" directive with "seed:"
really the same thing.
2014-03-09 19:35:20 -07:00
Matt Wells
f8e561e6f4
more new site list api fixes
2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878
new site list updates
2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a
more site list based spider fixes to be more like gsa
2014-03-08 20:52:31 -07:00
Matt Wells
aab165ed20
fix bad return value from function
2014-03-08 19:32:56 -08:00
Matt Wells
4cb66c31bf
get this new api spidering
2014-03-08 12:02:20 -07:00
Matt Wells
624c1d4e68
nuke doledb fixes
2014-03-08 10:51:15 -07:00
Matt Wells
29694f4efe
startup fixes
2014-03-08 10:25:56 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
14817df7a9
new site patterns api stuff
2014-03-08 09:23:32 -07:00
Matt Wells
72fab5b61e
Do not end a crawl while urls are still being spidered
because they might add more links to spiderdb when
they finally complete.
2014-03-07 09:30:12 -08:00
Matt Wells
7cdd411ef1
Merge branch 'diffbot' into diffbot-testing
2014-03-07 09:26:47 -08:00
Matt Wells
c143ee1fba
fix core when creating a new collection because
we incremented m_numRecs but did not grow the ptr buffer.
also added support for localgb.conf so we can use that
instead of gb.conf to avoid git push/pull conflicts.
2014-03-07 09:05:14 -08:00
Matt Wells
dcd42e455e
Merge branch 'diffbot' into diffbot-testing
2014-03-07 09:02:29 -08:00
Matt Wells
f777e6cccd
Merge branch 'diffbot' into diffbot-testing
2014-03-07 08:23:21 -08:00
Matt Wells
d6177019ec
minor fix
2014-03-07 08:07:09 -08:00
Matt Wells
434dd182d4
fix mem leak. always harvest links
for custom crawls.
2014-03-06 21:24:39 -08:00
Matt Wells
e351d2a6f1
get searching on token working
2014-03-06 17:01:41 -08:00
Matt Wells
27e8e810d2
use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
d74f748e93
search all collections under a token if "&token" is
given but not "&c=..."
2014-03-06 11:00:43 -08:00
Matt Wells
ca2d307229
revert gb.conf
2014-03-06 10:47:03 -08:00
Matt Wells
97e46dbf4e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-03-06 10:45:45 -08:00
Matt Wells
efa92b16fd
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-03-06 10:45:35 -08:00
Matt Wells
25cf0efdbf
first compiled stab at multi collection searching.
2014-03-06 10:45:13 -08:00
Matt Wells
451a092378
fix core from changing parms while evaluating
a url.
2014-03-06 07:47:43 -08:00
Matt Wells
0962e243a4
Merge branch 'diffbot' into diffbot-testing
2014-03-05 07:43:25 -08:00
Matt Wells
58a1feeea5
specify &header=1 explicitly to get json serp header
lest we break our clients' parsers
2014-03-05 07:41:59 -08:00
Matt Wells
13e33bc261
fix jezebel crawl from hanging.
2014-03-04 19:45:26 -08:00
Matt Wells
1b62f1582b
print memtable when almost full so we can see
where the leak is. more spiders for ethan.
do not try to get diffbot reply if page is already json.
likely it is an injected diffbot json reply.
2014-03-04 18:19:50 -08:00
Matt Wells
603cd67758
fix csv downloads some more
2014-03-04 12:07:46 -08:00
Matt Wells
2ab9aaeeaa
streaming csv fixes
2014-03-04 11:04:26 -08:00
Matt Wells
866b09d25e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-03-04 10:46:28 -08:00
Matt Wells
b1381cc610
make csv streamable, faster and take almost no memory.
2014-03-04 10:45:57 -08:00
Matt Wells
280dcb85cf
fix for passing testUpdatedContent smoketest
2014-03-04 09:09:51 -08:00
Matt Wells
ab9f2b33c1
definition updates
2014-03-04 08:37:39 -07:00
Matt Wells
1acb16b1ee
tweak empty doledb priority logic.
anchor it more to m_doleIpTable for more
reliability. seems like it was causing some
slowdowns during spidering. seems more
continuous now.
2014-03-03 13:48:59 -08:00
Matt Wells
48b5330d9c
only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
282dad6cef
deal with no coll recs when getting
link text using msg25. do not share
g_lineTable between collections.
2014-03-03 08:04:24 -08:00
Matt Wells
ff8a0b4ef1
do not let all collections share the same line table
in linkdb.cpp
2014-03-03 07:50:11 -08:00
Matt Wells
a82abe8260
added ^ operator to url crawl patterns.
good for tmz crawl.
2014-03-02 14:57:59 -08:00
Matt Wells
7fd6bbd7f5
added ^ support to url crawl expressions
2014-03-02 14:41:25 -08:00
Matt Wells
e4d425c18f
fix coll being deleted when getting link text.
2014-03-02 14:24:49 -08:00