Commit Graph

99 Commits

Author SHA1 Message Date
mwells
9b94ce2e40 Merge branch 'diffbot-testing' into testing
Conflicts:
	HttpRequest.cpp
	TcpServer.cpp
2014-08-07 15:15:08 -07:00
mwells
734973bb81 do not increment pagedownloadsuccesses if url
does not match crawl pattern.
2014-08-07 11:22:31 -07:00
mwells
947be58f10 Merge branch 'diffbot-testing' into testing
Conflicts:
	HttpRequest.cpp
	Msg13.cpp
	XmlDoc.cpp
2014-08-05 17:19:53 -07:00
mwells
cc1ceaaac2 fix nyt.com cookie redir bug.
fixed bug when POSTing injection request with multipart/form-data.
2014-08-05 17:04:11 -07:00
mwells
dff04eff45 fix facet/xpath lookup stuff. 2014-07-30 10:41:21 -07:00
mwells
3cc54b72cc qa updates 2014-07-28 19:15:31 -07:00
mwells
5ae476f34e print facets for each search result 2014-07-08 19:38:54 -07:00
mwells
1af75c5d88 send back facet field/value pairs in msg20reply 2014-07-08 14:22:55 -07:00
mwells
e658ebc8f6 fix up sections page some more. useful
for debugging sections stuff.
2014-07-08 10:31:42 -07:00
mwells
c8567f8a24 sectioning stuff working halfway decent.
still need to do docid-based stats perhaps.
need to scroll to section hash when clicking
the 'sections' link.
2014-07-07 16:46:38 -07:00
mwells
d9ae010371 shard gbfacetstr:gbxpathsitehash123456 terms by termid for speed.
got them working again multicasting a msg 0x39 to the appropriate shard.
set special msg39request flag for better performance for those guys.
2014-07-07 12:32:27 -07:00
mwells
6434e5cc04 Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
	Parms.h
2014-07-07 09:49:59 -07:00
mwells
81a89f5975 added support for &geth1tag=1 for xml feeds. 2014-07-05 16:08:48 -07:00
mwells
7bd37dfaa2 facet updates 2014-06-28 10:26:08 -06:00
mwells
222a454d67 sectiondb/facet updates 2014-06-28 09:00:55 -06:00
mwells
0033ac3407 more sectiondb faceting updates 2014-06-20 17:46:46 -07:00
mwells
b0e82edc93 new facet crap compiling now. 2014-06-20 12:28:50 -07:00
mwells
a09d4cd723 Merge branch 'master' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Pages.cpp
	XmlDoc.cpp
	gb.conf
2014-06-20 09:35:39 -07:00
Matt Wells
aaec46f612 added gbdocspiderdate and gbdocindexdate terms
just for docs and not spider reply "documents".
do not index plain terms for CT_STATUS spider reply
docs. create gb.conf if does not exist, take out of
repo.
2014-06-19 15:27:46 -07:00
mwells
c314e61968 make sectiondb stats just a special case of facets 2014-06-17 16:39:02 -06:00
mwells
d71922168e facetize the sectiondb stuff 2014-06-16 20:40:35 -07:00
mwells
5c0b371dc9 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	HttpServer.cpp
	Make.depend
	Parms.cpp
	Parms.h
2014-06-13 11:00:09 -07:00
mwells
20c4ac4205 got it marking up html now with sectiondb stats.
seems to work ok.
2014-06-12 14:42:08 -07:00
mwells
ea90e7f755 more fixes for sectiondb markup code 2014-06-12 13:05:45 -07:00
mwells
e4ce9bc9ac squidproxycache/floaters/sectiondbtagging all compiles.
need to do run-time debugging now.
2014-06-11 17:57:28 -07:00
mwells
6f70282ba2 almost got sectiondb integration compiling 2014-06-11 17:24:58 -07:00
mwells
1e10c676d5 parm updates for injecting 2014-06-11 17:24:33 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
40bca5d120 try to fix msg22 core some more 2014-05-14 08:16:47 -07:00
Matt Wells
8d1c4e3097 Merge branch 'diffbot-testing' into diffbot-dan 2014-05-12 15:33:15 -07:00
Daniel Steinberg
78e2bd8171 start implementing handling for array of "objects" 2014-05-12 15:04:36 -07:00
mwells
467e70bd98 improvements for thumbnail generator. 2014-05-11 08:44:38 -07:00
mwells
6e922722da tree repair logic. 2014-05-10 12:32:01 -07:00
Matt Wells
b1cd0cac86 indexing spider replies now working.
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
Matt Wells
941c8f1892 now added CT_STATUS type results into serps.
one for each spider reply we add so we can query
spider replies. using url: or type:status etc.
2014-05-09 13:52:12 -07:00
Matt Wells
eb49094343 try to start indexing spider replies
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
Matt Wells
de4a0a13a8 more thumbnail generation updates 2014-04-27 11:05:30 -07:00
Matt Wells
82726879a2 support base64 generated thumbnails in serps. 2014-04-24 14:04:57 -07:00
mwells
9e1199f113 hack about 35%ish done 2014-04-08 19:34:43 -07:00
Matt Wells
72f1312652 new linkdb code compiling. 2014-02-20 17:27:28 -08:00
Matt Wells
df59d3946a fix content hash issues for json. do not
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
Matt Wells
8bb17de3c5 pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
so diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
Matt Wells
6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages although currently
not enabled, we are still using the more strict one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
061bf70a51 show EXACT diffbot url used in logs
for easier replication
2014-01-22 18:25:18 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00