353 Commits

Author SHA1 Message Date
Matt Wells
61fc015014 fix potential diffbot injection bug 2014-05-21 12:21:29 -07:00
Matt Wells
7ad9058f77 when doing a query reindex on a json
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
34afc7c7cf Merge branch 'diffbot-dan' into diffbot-testing 2014-05-21 05:30:56 -07:00
Daniel Steinberg
e39dffadcf use "expand" option when calling Diffbot 2014-05-20 22:00:46 -07:00
Matt Wells
5c2cc973a8 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-05-15 18:27:13 -07:00
Matt Wells
a303bda1f8 fix core 2014-05-15 15:10:57 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
db543ddd9f nothing 2014-05-14 09:37:59 -07:00
Matt Wells
40bca5d120 try to fix msg22 core some more 2014-05-14 08:16:47 -07:00
Matt Wells
88eb44827f fix avail docid logic some more for indexing
spider replies
2014-05-13 21:27:05 -07:00
Matt Wells
c5ae5ca4b5 v3 support for tokenized diffbot replies
using the "objects" array in the json.
2014-05-12 16:13:24 -07:00
Matt Wells
8d1c4e3097 Merge branch 'diffbot-testing' into diffbot-dan 2014-05-12 15:33:15 -07:00
Matt Wells
5f7bbe7523 fix diffbot smoke tests. do not index spider replies
for custom crawls.
2014-05-12 15:14:11 -07:00
Daniel Steinberg
78e2bd8171 start implementing handling for array of "objects" 2014-05-12 15:04:36 -07:00
mwells
467e70bd98 improvements for thumbnail generator. 2014-05-11 08:44:38 -07:00
mwells
35a94afcc9 thumb display fixes 2014-05-10 14:30:30 -07:00
mwells
1a13342782 fix thumbnail printing. 2014-05-10 14:24:13 -07:00
mwells
898ffa40bc image core fix. image log cleanups. 2014-05-10 12:46:10 -07:00
mwells
6e922722da tree repair logic. 2014-05-10 12:32:01 -07:00
mwells
7e1429cc30 more bug fixes 2014-05-10 08:22:26 -07:00
mwells
4c2a6a2519 minor fix 2014-05-10 07:58:26 -07:00
mwells
2b37f56e4c Merge branch 'diffbot-matt' into testing 2014-05-10 07:56:45 -07:00
mwells
38a79888b6 Merge branch 'diffbot-testing' into testing 2014-05-10 07:49:29 -07:00
mwells
ed816b2c11 a few bug fixes 2014-05-10 07:48:23 -07:00
Matt Wells
e70f760d87 use gbstatus: and gbstatusmsg: field operators 2014-05-09 18:10:38 -07:00
Matt Wells
b1cd0cac86 indexing spider replies now working.
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
Matt Wells
941c8f1892 now added CT_STATUS type results into serps.
one for each spider reply we add so we can query
spider replies. using url: or type:status etc.
2014-05-09 13:52:12 -07:00
Matt Wells
eb49094343 try to start indexing spider replies
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
0daced51df Merge branch 'diffbot-testing' into diffbot-matt 2014-05-02 14:34:04 -07:00
Matt Wells
060f7da967 fix data corruption detection and repair bug.
do not core on corrupt http reply missing \0.
just set the g_errno to ECORRUPTDATA.
give more informative corruption log msgs.
2014-05-01 10:38:00 -07:00
Matt Wells
066a01cba6 Merge branch 'diffbot-testing' into diffbot-matt 2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c fixed bug for product title extraction.
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
de4a0a13a8 more thumbnail generation updates 2014-04-27 11:05:30 -07:00
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
82726879a2 support base64 generated thumbnails in serps. 2014-04-24 14:04:57 -07:00
Matt Wells
9edd5c8264 thumbnail generation support back in. 2014-04-24 10:13:45 -07:00
mwells
2adf5b9bc5 more awesome fixes 2014-04-09 13:31:11 -07:00
mwells
be99155986 more updates 2014-04-09 11:03:31 -07:00
mwells
9e1199f113 hack about 35%ish done 2014-04-08 19:34:43 -07:00
Matt Wells
d6434191d1 nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
582349334f do not use certain other json fields
when computing checksum for deduping.
like stats, querystring, ...
2014-03-27 12:20:53 -07:00
Matt Wells
402377d2e6 fix bug of gbmin, gbmax etc. not working.
floats were being rounded down to ints
in most cases it seems. so .9 -> 0 etc.
2014-03-26 11:56:06 -07:00
mwells
b6e5424e32 do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
Matt Wells
b33121af7d make all field names lower case without
spaces when we hash them to make the
prefixhash. since json names often have
mixed case field names and spaces.
2014-03-20 16:08:02 -07:00
Matt Wells
5ed19026d9 temp debug comments 2014-03-20 15:33:37 -07:00
Matt Wells
5c2e78e5fa Merge branch 'diffbot' into diffbot-testing 2014-03-10 20:26:30 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
82db7240a3 simple print update 2014-03-09 19:43:32 -07:00
Matt Wells
aab165ed20 fix bad return value from function 2014-03-08 19:32:56 -08:00