1358 Commits

Author SHA1 Message Date
Daniel Steinberg
41e3988fbc not a conf file 2014-03-10 13:57:13 -07:00
Daniel Steinberg
4a7bf5d4d0 Story #2040: store raw URL submissions for customer bulk jobs 2014-03-10 13:50:30 -07:00
Matt Wells
bfcb7082f4 fix bug from nuking doledb on a new collection. 2014-03-10 13:48:00 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
9debee20dc Merge branch 'diffbot' into testing 2014-03-09 20:44:09 -07:00
Matt Wells
662b6d4b32 doc updates 2014-03-09 20:43:49 -07:00
Matt Wells
90ff2c2a25 update example site lists 2014-03-09 20:35:45 -07:00
Matt Wells
82db7240a3 simple print update 2014-03-09 19:43:32 -07:00
Matt Wells
f7b7274ff1 replace "exact:" directive with "seed:"
really the same thing.
2014-03-09 19:35:20 -07:00
Matt Wells
f8e561e6f4 more new site list api fixes 2014-03-09 18:15:57 -07:00
Matt Wells
11e8c16878 new site list updates 2014-03-09 17:53:24 -07:00
Matt Wells
ed626b162a more site list based spider fixes to be more like gsa 2014-03-08 20:52:31 -07:00
Matt Wells
aab165ed20 fix bad return value from function 2014-03-08 19:32:56 -08:00
Matt Wells
4cb66c31bf get this new api spidering 2014-03-08 12:02:20 -07:00
Matt Wells
624c1d4e68 nuke doledb fixes 2014-03-08 10:51:15 -07:00
Matt Wells
29694f4efe startup fixes 2014-03-08 10:25:56 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
14817df7a9 new site patterns api stuff 2014-03-08 09:23:32 -07:00
Matt Wells
72fab5b61e Do not end a crawl while urls are still being spidered
because they might add more links to spiderdb when
they finally complete.
2014-03-07 09:30:12 -08:00
Matt Wells
7cdd411ef1 Merge branch 'diffbot' into diffbot-testing 2014-03-07 09:26:47 -08:00
Matt Wells
c143ee1fba fix core when creating a new collection because
we incremented m_numRecs but did not grow the ptr buffer.
also added support for localgb.conf so we can use that
instead of gb.conf to avoid git push/pull conflicts.
2014-03-07 09:05:14 -08:00
Matt Wells
dcd42e455e Merge branch 'diffbot' into diffbot-testing 2014-03-07 09:02:29 -08:00
Matt Wells
f777e6cccd Merge branch 'diffbot' into diffbot-testing 2014-03-07 08:23:21 -08:00
Matt Wells
d6177019ec minor fix 2014-03-07 08:07:09 -08:00
Matt Wells
434dd182d4 fix mem leak. always harvest links
for custom crawls.
2014-03-06 21:24:39 -08:00
Matt Wells
e351d2a6f1 get searching on token working 2014-03-06 17:01:41 -08:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
d74f748e93 search all collections under a token if "&token" is
given but not "&c=..."
2014-03-06 11:00:43 -08:00
Matt Wells
ca2d307229 revert gb.conf 2014-03-06 10:47:03 -08:00
Matt Wells
97e46dbf4e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-06 10:45:45 -08:00
Matt Wells
efa92b16fd Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-06 10:45:35 -08:00
Matt Wells
25cf0efdbf first compiled stab at multi collection searching. 2014-03-06 10:45:13 -08:00
Matt Wells
451a092378 fix core from changing parms while evaluating
a url.
2014-03-06 07:47:43 -08:00
Matt Wells
0962e243a4 Merge branch 'diffbot' into diffbot-testing 2014-03-05 07:43:25 -08:00
Matt Wells
58a1feeea5 specify &header=1 explicitly to get json serp header
lest we break our clients' parsers
2014-03-05 07:41:59 -08:00
Matt Wells
13e33bc261 fix jezebel crawl from hanging. 2014-03-04 19:45:26 -08:00
Matt Wells
1b62f1582b print memtable when almost full so we can see
where the leak is. more spiders for ethan.
do not try to get diffbot reply if page is already json.
likely it is an injected diffbot json reply.
2014-03-04 18:19:50 -08:00
Matt Wells
603cd67758 fix csv downloads some more 2014-03-04 12:07:46 -08:00
Matt Wells
2ab9aaeeaa streaming csv fixes 2014-03-04 11:04:26 -08:00
Matt Wells
866b09d25e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-04 10:46:28 -08:00
Matt Wells
b1381cc610 make csv streamable, faster and take almost no memory. 2014-03-04 10:45:57 -08:00
Matt Wells
280dcb85cf fix for passing testUpdatedContent smoketest 2014-03-04 09:09:51 -08:00
Matt Wells
ab9f2b33c1 definition updates 2014-03-04 08:37:39 -07:00
Matt Wells
1acb16b1ee tweak empty doledb priority logic.
anchor it more to m_doleIpTable for more
reliability. seems like it was causing some
slowdowns during spidering. seems more
continuous now.
2014-03-03 13:48:59 -08:00
Matt Wells
48b5330d9c only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
282dad6cef deal with no coll recs when getting
link text using msg25. do not share
g_lineTable between collections.
2014-03-03 08:04:24 -08:00
Matt Wells
ff8a0b4ef1 do not let all collections share the same line table
in linkdb.cpp
2014-03-03 07:50:11 -08:00
Matt Wells
a82abe8260 added ^ operator to url crawl patterns.
good for tmz crawl.
2014-03-02 14:57:59 -08:00
Matt Wells
7fd6bbd7f5 added ^ support to url crawl expressions 2014-03-02 14:41:25 -08:00
Matt Wells
e4d425c18f fix coll being deleted when getting link text. 2014-03-02 14:24:49 -08:00