1244 Commits

Author SHA1 Message Date
Matt Wells
aab165ed20 fix bad return value from function 2014-03-08 19:32:56 -08:00
Matt Wells
4cb66c31bf get this new api spidering 2014-03-08 12:02:20 -07:00
Matt Wells
624c1d4e68 nuke doledb fixes 2014-03-08 10:51:15 -07:00
Matt Wells
29694f4efe startup fixes 2014-03-08 10:25:56 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
14817df7a9 new site patterns api stuff 2014-03-08 09:23:32 -07:00
Matt Wells
72fab5b61e Do not end a crawl while urls are still being spidered
because they might add more links to spiderdb when
they finally complete.
2014-03-07 09:30:12 -08:00
Matt Wells
7cdd411ef1 Merge branch 'diffbot' into diffbot-testing 2014-03-07 09:26:47 -08:00
Matt Wells
c143ee1fba fix core when creating a new collection because
we incremented m_numRecs but did not grow the ptr buffer.
also added support for localgb.conf so we can use that
instead of gb.conf to avoid git push/pull conflicts.
2014-03-07 09:05:14 -08:00
Matt Wells
dcd42e455e Merge branch 'diffbot' into diffbot-testing 2014-03-07 09:02:29 -08:00
Matt Wells
f777e6cccd Merge branch 'diffbot' into diffbot-testing 2014-03-07 08:23:21 -08:00
Matt Wells
d6177019ec minor fix 2014-03-07 08:07:09 -08:00
Matt Wells
434dd182d4 fix mem leak. always harvest links
for custom crawls.
2014-03-06 21:24:39 -08:00
Matt Wells
e351d2a6f1 get searching on token working 2014-03-06 17:01:41 -08:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
d74f748e93 search all collections under a token if "&token" is
given but not "&c=..."
2014-03-06 11:00:43 -08:00
Matt Wells
ca2d307229 revert gb.conf 2014-03-06 10:47:03 -08:00
Matt Wells
97e46dbf4e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-06 10:45:45 -08:00
Matt Wells
efa92b16fd Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-06 10:45:35 -08:00
Matt Wells
25cf0efdbf first compiled stab at multi collection searching. 2014-03-06 10:45:13 -08:00
Matt Wells
451a092378 fix core from changing parms while evaluating
a url.
2014-03-06 07:47:43 -08:00
Matt Wells
0962e243a4 Merge branch 'diffbot' into diffbot-testing 2014-03-05 07:43:25 -08:00
Matt Wells
58a1feeea5 specify &header=1 explicitly to get json serp header
lest we break our clients' parsers
2014-03-05 07:41:59 -08:00
Matt Wells
13e33bc261 fix jezebel crawl from hanging. 2014-03-04 19:45:26 -08:00
Matt Wells
1b62f1582b print memtable when almost full so we can see
where the leak is. more spiders for ethan.
do not try to get diffbot reply if page is already json.
likely it is an injected diffbot json reply.
2014-03-04 18:19:50 -08:00
Matt Wells
603cd67758 fix csv downloads some more 2014-03-04 12:07:46 -08:00
Matt Wells
2ab9aaeeaa streaming csv fixes 2014-03-04 11:04:26 -08:00
Matt Wells
866b09d25e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-03-04 10:46:28 -08:00
Matt Wells
b1381cc610 make csv streamable, faster and take almost no memory. 2014-03-04 10:45:57 -08:00
Matt Wells
280dcb85cf fix for passing testUpdatedContent smoketest 2014-03-04 09:09:51 -08:00
Matt Wells
ab9f2b33c1 definition updates 2014-03-04 08:37:39 -07:00
Matt Wells
1acb16b1ee tweak empty doledb priority logic.
anchor it more to m_doleIpTable for more
reliability. seems like it was causing some
slowdowns during spidering. seems more
continuous now.
2014-03-03 13:48:59 -08:00
Matt Wells
48b5330d9c only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
282dad6cef deal with no coll recs when getting
link text using msg25. do not share
g_lineTable between collections.
2014-03-03 08:04:24 -08:00
Matt Wells
ff8a0b4ef1 do not let all collections share the same line table
in linkdb.cpp
2014-03-03 07:50:11 -08:00
Matt Wells
a82abe8260 added ^ operator to url crawl patterns.
good for tmz crawl.
2014-03-02 14:57:59 -08:00
Matt Wells
7fd6bbd7f5 added ^ support to url crawl expressions 2014-03-02 14:41:25 -08:00
Matt Wells
e4d425c18f fix coll being deleted when getting link text. 2014-03-02 14:24:49 -08:00
Daniel Steinberg
bb5016e88b add the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned 2014-03-01 11:16:17 -08:00
Matt Wells
aeb2833d20 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-28 11:46:44 -08:00
Matt Wells
11efab9862 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-28 08:23:59 -08:00
Matt Wells
c596d38e60 fix core from getting title of json object 2014-02-28 08:18:09 -08:00
Matt Wells
5f3aa24805 took out restrictDomain logic. now we always
only follow links on the same domain as the seed
UNLESS a url crawl pattern or a url crawl regex
was specified.
2014-02-27 19:53:17 -08:00
Matt Wells
42f254125e fix core in new link text logic.
empty msg25 replies are ok if g_errno is set.
2014-02-27 13:56:32 -08:00
Matt Wells
365fc16606 fix core in "wait in line" logic
when getting link info in Linkdb.cpp.
2014-02-27 09:22:35 -08:00
Matt Wells
af9eb8fb73 need to allow clients to not restrict to
seed domains.
2014-02-26 22:27:22 -08:00
Matt Wells
927f4626ee Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-26 22:26:13 -08:00
Matt Wells
eaca38cbfd fix new result streaming logic some more 2014-02-26 21:42:43 -08:00
Matt Wells
0933884191 fix super fast and mem efficient search
results streaming code.
2014-02-26 21:18:08 -08:00
Matt Wells
f11e25024a Merge branch 'diffbot' into diffbot-testing 2014-02-26 20:34:06 -08:00