mwells
9b94ce2e40
Merge branch 'diffbot-testing' into testing
...
Conflicts:
HttpRequest.cpp
TcpServer.cpp
2014-08-07 15:15:08 -07:00
mwells
734973bb81
do not increment pagedownloadsuccesses if url
...
does not match crawl pattern.
2014-08-07 11:22:31 -07:00
mwells
947be58f10
Merge branch 'diffbot-testing' into testing
...
Conflicts:
HttpRequest.cpp
Msg13.cpp
XmlDoc.cpp
2014-08-05 17:19:53 -07:00
mwells
cc1ceaaac2
fix nyt.com cookie redir bug.
...
fixed bug when POSTing injection request with multipart/form-data.
2014-08-05 17:04:11 -07:00
mwells
dff04eff45
fix facet/xpath lookup stuff.
2014-07-30 10:41:21 -07:00
mwells
3cc54b72cc
qa updates
2014-07-28 19:15:31 -07:00
mwells
5ae476f34e
print facets for each search result
2014-07-08 19:38:54 -07:00
mwells
1af75c5d88
send back facet field/value pairs in msg20reply
2014-07-08 14:22:55 -07:00
mwells
e658ebc8f6
fix up sections page some more. useful
...
for debugging sections stuff.
2014-07-08 10:31:42 -07:00
mwells
c8567f8a24
sectioning stuff working halfway decent.
...
still need to do docid-based stats perhaps.
need to scroll to section hash when clicking
the 'sections' link.
2014-07-07 16:46:38 -07:00
mwells
d9ae010371
shard gbfacetstr:gbxpathsitehash123456 terms by termid for speed.
...
got them working again multicasting a msg 0x39 to the appropriate shard.
set special msg39request flag for better performance for those guys.
2014-07-07 12:32:27 -07:00
mwells
6434e5cc04
Merge branch 'testing' into diffbot-matt
...
Conflicts:
Errno.cpp
Errno.h
Parms.h
2014-07-07 09:49:59 -07:00
mwells
81a89f5975
added support for &geth1tag=1 for xml feeds.
2014-07-05 16:08:48 -07:00
mwells
7bd37dfaa2
facet updates
2014-06-28 10:26:08 -06:00
mwells
222a454d67
sectiondb/facet updates
2014-06-28 09:00:55 -06:00
mwells
0033ac3407
more sectiondb faceting updates
2014-06-20 17:46:46 -07:00
mwells
b0e82edc93
new facet crap compiling now.
2014-06-20 12:28:50 -07:00
mwells
a09d4cd723
Merge branch 'master' into diffbot-matt
...
Conflicts:
Collectiondb.cpp
Pages.cpp
XmlDoc.cpp
gb.conf
2014-06-20 09:35:39 -07:00
Matt Wells
aaec46f612
added gbdocspiderdate and gbdocindexdate terms
...
just for docs and not spider reply "documents".
do not index plain terms for CT_STATUS spider reply
docs. create gb.conf if does not exist, take out of
repo.
2014-06-19 15:27:46 -07:00
mwells
c314e61968
make sectiondb stats just a special case of facets
2014-06-17 16:39:02 -06:00
mwells
d71922168e
facetize the sectiondb stuff
2014-06-16 20:40:35 -07:00
mwells
5c0b371dc9
Merge branch 'testing' into diffbot-matt
...
Conflicts:
Collectiondb.cpp
HttpServer.cpp
Make.depend
Parms.cpp
Parms.h
2014-06-13 11:00:09 -07:00
mwells
20c4ac4205
got it marking up html now with sectiondb stats.
...
seems to work ok.
2014-06-12 14:42:08 -07:00
mwells
ea90e7f755
more fixes for sectiondb markup code
2014-06-12 13:05:45 -07:00
mwells
e4ce9bc9ac
squidproxycache/floaters/sectiondbtagging all compiles.
...
need to do run-time debugging now.
2014-06-11 17:57:28 -07:00
mwells
6f70282ba2
almost got sectiondb integration compiling
2014-06-11 17:24:58 -07:00
mwells
1e10c676d5
parm updates for injecting
2014-06-11 17:24:33 -07:00
Matt Wells
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
...
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
40bca5d120
try to fix msg22 core some more
2014-05-14 08:16:47 -07:00
Matt Wells
8d1c4e3097
Merge branch 'diffbot-testing' into diffbot-dan
2014-05-12 15:33:15 -07:00
Daniel Steinberg
78e2bd8171
start implementing handling for array of "objects"
2014-05-12 15:04:36 -07:00
mwells
467e70bd98
improvements for thumbnail generator.
2014-05-11 08:44:38 -07:00
mwells
6e922722da
tree repair logic.
2014-05-10 12:32:01 -07:00
Matt Wells
b1cd0cac86
indexing spider replies now working.
...
use type:status to see them or
gbstatus:success or gbstatus:tcp or gbstatus:0.
2014-05-09 18:07:38 -07:00
Matt Wells
941c8f1892
now added CT_STATUS type results into serps.
...
one for each spider reply we add so we can query
spider replies. using url: or type:status etc.
2014-05-09 13:52:12 -07:00
Matt Wells
eb49094343
try to start indexing spider replies
...
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
Matt Wells
de4a0a13a8
more thumbnail generation updates
2014-04-27 11:05:30 -07:00
Matt Wells
82726879a2
support base64 generated thumbnails in serps.
2014-04-24 14:04:57 -07:00
mwells
9e1199f113
hack about 35%ish done
2014-04-08 19:34:43 -07:00
Matt Wells
72f1312652
new linkdb code compiling.
2014-02-20 17:27:28 -08:00
Matt Wells
df59d3946a
fix content hash issues for json. do not
...
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
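The order-independent content hash described in the commit above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function names, the choice of FNV-1a, and the use of addition to combine per-pair hashes are all assumptions. Only the skipped field names (url, resolved_url, html) come from the commit message.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: 64-bit FNV-1a over a string, continuing from seed h.
static uint64_t fnv1a64(const std::string &s,
                        uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Fields the commit message says are excluded from the content hash.
static bool isSkippedField(const std::string &name) {
    return name == "url" || name == "resolved_url" || name == "html";
}

// Hypothetical sketch: hash each field/value pair on its own, then combine
// the per-pair hashes with addition. Addition is commutative, so the order
// in which the pairs appear in the json does not affect the result.
uint64_t contentHash(
    const std::vector<std::pair<std::string, std::string>> &fields) {
    uint64_t sum = 0;
    for (const auto &fv : fields) {
        if (isSkippedField(fv.first)) continue;
        uint64_t h = fnv1a64(fv.first);
        h = fnv1a64(std::string(1, '\0'), h); // separator between name and value
        h = fnv1a64(fv.second, h);
        sum += h; // wraps modulo 2^64, which is fine for a hash
    }
    return sum;
}
```

Two replies with the same remaining field/value pairs in a different order, or differing only in url/resolved_url/html, get the same hash, which is what an EDOCUNCHANGED-style check or a dedup pass would compare.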
Matt Wells
8bb17de3c5
pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
...
so diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
2d4af1aefe
index numbers as integers too, not just floats
...
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
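The "128 seconds of resolution" in the commit above matches what a 32-bit IEEE 754 float does to a current Unix timestamp: with a 24-bit significand, values between 2^30 and 2^31 are spaced 128 apart. A minimal sketch, assuming the dates are Unix timestamps stored as 32-bit floats (the helper names are hypothetical and not from the code base):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Distance from v to the next representable 32-bit float above it.
float floatSpacing(float v) {
    return std::nextafter(v, std::numeric_limits<float>::infinity()) - v;
}

// Round-trip a timestamp through a 32-bit float, as a float-only index would.
int64_t throughFloat(int32_t timestamp) {
    return static_cast<int64_t>(static_cast<float>(timestamp));
}

// Absolute error introduced by storing the timestamp as a float.
int64_t floatError(int32_t timestamp) {
    int64_t diff = throughFloat(timestamp) - static_cast<int64_t>(timestamp);
    return diff < 0 ? -diff : diff;
}
```

For a timestamp from early February 2014 such as 1391749074, floatSpacing returns 128, so spider dates a few seconds apart can collapse to the same float and sort as equal. Indexing the value as an integer as well keeps one-second resolution.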
Matt Wells
6a45e42128
added ability to treat <link xyz.com rel=canonical> as meta redirects.
...
should help us dedup.
added a function to do looser deduping of spider pages although currently
not enabled, we are still using the more strict one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
061bf70a51
show EXACT diffbot url used in logs
...
for easier replication
2014-01-22 18:25:18 -08:00
Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
...
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
58d0c444ac
fixes for the global index quota system
2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0
more spiderdb spider request fixes
2014-01-19 18:00:56 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
16f8af0d57
added awesome streaming mode support
...
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00