Matt Wells
61fc015014
fix potential diffbot injection bug
2014-05-21 12:21:29 -07:00
Matt Wells
7ad9058f77
when doing a query reindex on a json
...
child url we need to add the spider request
of the original parent url and make sure
it does not get "EDOCUNCHANGED" error.
otherwise the possibly new json child objects
won't get indexed.
2014-05-21 05:43:53 -07:00
Matt Wells
34afc7c7cf
Merge branch 'diffbot-dan' into diffbot-testing
2014-05-21 05:30:56 -07:00
Daniel Steinberg
e39dffadcf
use "expand" option when calling Diffbot
2014-05-20 22:00:46 -07:00
Matt Wells
5c2cc973a8
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-05-15 18:27:13 -07:00
Matt Wells
a303bda1f8
fix core
2014-05-15 15:10:57 -07:00
Matt Wells
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
...
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
db543ddd9f
nothing
2014-05-14 09:37:59 -07:00
Matt Wells
40bca5d120
try to fix msg22 core some more
2014-05-14 08:16:47 -07:00
Matt Wells
88eb44827f
fix avail docid logic some more for indexing
...
spider replies
2014-05-13 21:27:05 -07:00
Matt Wells
c5ae5ca4b5
v3 support for tokenized diffbot replies
...
using the "objects" array in the json.
2014-05-12 16:13:24 -07:00
Matt Wells
8d1c4e3097
Merge branch 'diffbot-testing' into diffbot-dan
2014-05-12 15:33:15 -07:00
Matt Wells
5f7bbe7523
fix diffbot smoke tests. do not index spider replies
...
for custom crawls.
2014-05-12 15:14:11 -07:00
Daniel Steinberg
78e2bd8171
start implementing handling for array of "objects"
2014-05-12 15:04:36 -07:00
mwells
467e70bd98
improvements for thumbnail generator.
2014-05-11 08:44:38 -07:00
mwells
35a94afcc9
thumb display fixes
2014-05-10 14:30:30 -07:00
mwells
1a13342782
fix thumbnail printing.
2014-05-10 14:24:13 -07:00
mwells
898ffa40bc
image core fix. image log cleanups.
2014-05-10 12:46:10 -07:00
mwells
6e922722da
tree repair logic.
2014-05-10 12:32:01 -07:00
mwells
7e1429cc30
more bug fixes
2014-05-10 08:22:26 -07:00
mwells
4c2a6a2519
minor fix
2014-05-10 07:58:26 -07:00
mwells
2b37f56e4c
Merge branch 'diffbot-matt' into testing
2014-05-10 07:56:45 -07:00
mwells
38a79888b6
Merge branch 'diffbot-testing' into testing
2014-05-10 07:49:29 -07:00
mwells
ed816b2c11
a few bug fixes
2014-05-10 07:48:23 -07:00
Matt Wells
e70f760d87
use gbstatus: and gbstatusmsg: field operators
2014-05-09 18:10:38 -07:00
Matt Wells
b1cd0cac86
indexing spider replies now working.
...
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
Matt Wells
941c8f1892
now added CT_STATUS type results into serps.
...
one for each spider reply we add so we can query
spider replies. using url: or type:status etc.
2014-05-09 13:52:12 -07:00
Matt Wells
eb49094343
try to start indexing spider replies
...
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
mwells
6048ae849b
added support for spidering a particular language
...
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
0daced51df
Merge branch 'diffbot-testing' into diffbot-matt
2014-05-02 14:34:04 -07:00
Matt Wells
060f7da967
fix data corruption detection and repair bug.
...
do not core on corrupt http reply missing \0.
just set the g_errno to ECORRUPTDATA.
give more informative corruption log msgs.
2014-05-01 10:38:00 -07:00
Matt Wells
066a01cba6
Merge branch 'diffbot-testing' into diffbot-matt
2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c
fixed bug for product title extraction.
...
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
de4a0a13a8
more thumbnail generation updates
2014-04-27 11:05:30 -07:00
Matt Wells
20a2729827
added jobCreationTimeUTC and jobCompletionTimeUTC
...
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
82726879a2
support base64 generated thumbnails in serps.
2014-04-24 14:04:57 -07:00
Matt Wells
9edd5c8264
thumbnail generation support back in.
2014-04-24 10:13:45 -07:00
mwells
2adf5b9bc5
more awesome fixes
2014-04-09 13:31:11 -07:00
mwells
be99155986
more updates
2014-04-09 11:03:31 -07:00
mwells
9e1199f113
hack about 35%ish done
2014-04-08 19:34:43 -07:00
Matt Wells
d6434191d1
nomenclature changes to reduce collisions.
...
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
582349334f
do not use certain other json fields
...
when computing checksum for deduping.
like stats, querystring, ...
2014-03-27 12:20:53 -07:00
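A minimal sketch of the idea in commit 582349334f, for illustration only: compute a dedup checksum over a JSON object while skipping volatile fields. The function name `dedup_checksum`, the use of SHA-256, and the exact spelling of the ignored field names are assumptions; the commit only names "stats" and "querystring" as examples, and the engine's real checksum routine is not shown in this log.

```python
import hashlib
import json

# Assumed list: the commit message mentions these two fields and elides the rest.
IGNORED_FIELDS = {"stats", "querystring"}


def dedup_checksum(obj: dict) -> str:
    """Hash a JSON object, skipping fields that vary between fetches."""
    # Compare names case-insensitively so "queryString" is skipped too.
    kept = {k: v for k, v in obj.items() if k.lower() not in IGNORED_FIELDS}
    # Sorted keys and fixed separators give a stable serialization to hash.
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"title": "Widget", "stats": {"fetchTime": 120}}
b = {"title": "Widget", "stats": {"fetchTime": 950}}
c = {"title": "Gadget", "stats": {"fetchTime": 120}}
same = dedup_checksum(a) == dedup_checksum(b)       # differ only in an ignored field
different = dedup_checksum(a) != dedup_checksum(c)  # differ in real content
```

With this approach two replies that differ only in timing or query-string data hash to the same value, so they are treated as duplicates.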
Matt Wells
402377d2e6
fix bug of gbmin, gbmax etc. not working.
...
floats were being rounded down to ints
in most cases it seems. so .9 -> 0 etc.
2014-03-26 11:56:06 -07:00
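The bug described in commit 402377d2e6 can be reproduced in a few lines. This is a hypothetical illustration, not the engine's code: `parse_as_int`, `parse_as_float`, and `in_range` are made-up helpers showing how converting a float to an int before a range comparison turns .9 into 0.

```python
def parse_as_int(text: str) -> int:
    """Buggy path: the fractional part is discarded, so ".9" becomes 0."""
    return int(float(text))


def parse_as_float(text: str) -> float:
    """Fixed path: keep the value as a float."""
    return float(text)


def in_range(value: float, lo: float, hi: float) -> bool:
    """Stand-in for a min/max range filter such as gbmin/gbmax."""
    return lo <= value <= hi


truncated = parse_as_int(".9")                          # 0
buggy_match = in_range(truncated, 0.5, 1.0)             # False: 0 is below 0.5
fixed_match = in_range(parse_as_float(".9"), 0.5, 1.0)  # True
```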
mwells
b6e5424e32
do not download bulkjob urls in crawlbot.
...
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
Matt Wells
b33121af7d
make all field names lower case without
...
spaces when we hash them to make the
prefixhash. since json names often have
mixed case field names and spaces.
2014-03-20 16:08:02 -07:00
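A rough sketch of the normalization described in commit b33121af7d, under stated assumptions: "without spaces" is read here as removing all whitespace, and SHA-1 truncated to 64 bits stands in for the engine's own hash function, which this log does not show. The names `normalize_field_name` and `prefix_hash` are hypothetical.

```python
import hashlib


def normalize_field_name(name: str) -> str:
    """Lowercase a JSON field name and remove all whitespace."""
    return "".join(name.split()).lower()


def prefix_hash(name: str) -> int:
    """Hash the normalized name (placeholder hash, not the engine's)."""
    digest = hashlib.sha1(normalize_field_name(name).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


normalized = normalize_field_name("Product Title")
same_hash = prefix_hash("Product Title") == prefix_hash("producttitle")
```

Normalizing before hashing means a query on `producttitle:` matches documents whose JSON used `Product Title`, `productTitle`, or any other case and spacing variant.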
Matt Wells
5ed19026d9
temp debug comments
2014-03-20 15:33:37 -07:00
Matt Wells
5c2e78e5fa
Merge branch 'diffbot' into diffbot-testing
2014-03-10 20:26:30 -07:00
Matt Wells
bd4484db3c
Merge branch 'testing' into diffbot-testing
2014-03-10 12:08:23 -07:00
Matt Wells
82db7240a3
simple print update
2014-03-09 19:43:32 -07:00
Matt Wells
aab165ed20
fix bad return value from function
2014-03-08 19:32:56 -08:00