Matt Wells
72f1312652
new linkdb code compiling.
2014-02-20 17:27:28 -08:00
Matt Wells
df59d3946a
fix content hash issues for json. do not
...
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
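A minimal sketch of the order-independent content hash described in the commit above. The commit does not say which hash function or combining step is used, so SHA-256 truncated to 64 bits and addition mod 2^64 are assumptions here, as are the names `pair_hash` and `content_hash` and the flat field dictionary.

```python
import hashlib

# Fields the commit says are left out of the hash.
EXCLUDED_FIELDS = {"url", "resolved_url", "html"}

def pair_hash(field, value):
    """64-bit hash of one field/value pair (SHA-256 is an illustrative choice)."""
    data = f"{field}\x00{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def content_hash(fields):
    """Sum the pair hashes mod 2**64 so the order of the pairs does not matter."""
    total = 0
    for field, value in fields.items():
        if field in EXCLUDED_FIELDS:
            continue
        total = (total + pair_hash(field, value)) % (1 << 64)
    return total

# Same remaining fields in a different order, with different url/html values.
a = {"title": "Hello", "date": "2014-02-15",
     "url": "http://a.example/1", "html": "<p>x</p>"}
b = {"html": "<p>different</p>", "url": "http://b.example/2",
     "date": "2014-02-15", "title": "Hello"}
print(content_hash(a) == content_hash(b))
```

Because addition is commutative, two replies whose remaining field/value pairs match hash identically regardless of field order, which is what an unchanged-document check or a dedup pass needs.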
Matt Wells
8bb17de3c5
pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
...
so diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
2d4af1aefe
index numbers as integers too, not just floats
...
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
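A worked example of the 128-second figure in the commit above, assuming the floats in question are 32-bit IEEE 754 values (the commit only says "floats"). A 32-bit float has a 24-bit significand, so for Unix timestamps between 2^30 and 2^31 adjacent representable values are 2^(30-23) = 128 seconds apart. The helper name `to_float32` is hypothetical.

```python
import struct

def to_float32(x):
    """Round-trip a number through a 32-bit IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

t1 = 1391749000  # a Unix timestamp from early February 2014
t2 = t1 + 50     # 50 seconds later

f1 = int(to_float32(t1))
f2 = int(to_float32(t2))

# Both timestamps collapse onto the same multiple of 128.
print(f1, f2, f1 % 128)
```

Stored as integers, the two spider dates stay distinct and sort correctly.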
Matt Wells
6a45e42128
added ability to treat <link xyz.com rel=canonical> as meta redirects.
...
should help us dedup.
added a function to do looser deduping of spider pages, although it is currently
not enabled; we are still using the stricter one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
061bf70a51
show EXACT diffbot url used in logs
...
for easier replication
2014-01-22 18:25:18 -08:00
Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
...
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
58d0c444ac
fixes for the global index quota system
2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0
more spiderdb spider request fixes
2014-01-19 18:00:56 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
16f8af0d57
added awesome streaming mode support
...
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
0844dbf72a
added url process pattern and regex to
...
xmldoc.cpp.
2014-01-17 11:08:23 -08:00
Matt Wells
8a49e87a61
got code with shard rebalancing compiling.
...
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
128e055120
take out datedb. no longer used. we store
...
dates in posdb since it has larger keys than
indexdb.
2014-01-09 13:39:28 -08:00
Matt Wells
4f64677b4f
get new global preemptive cache
...
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
mwells
82494baa89
move CollectionRec stuff into Collectiondb files
...
for simplicity.
2013-12-10 15:28:04 -08:00
Matt Wells
78a4cfe6da
forgot to push the .h files
2013-12-07 22:12:48 -07:00
Matt Wells
5e4b5a112c
Merge branch 'master' into diffbot
...
Conflicts:
PageResults.cpp
Threads.cpp
XmlDoc.cpp
XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
1bc80ab552
fixed pagereindex. we now add spiderreplies
...
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
3155869fbf
added new log msg for
...
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf
fixed data corruption bug. m_finalCrawlDelay
...
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
8bb086ac60
crawldelay works now but it measures
...
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
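A small sketch of what measuring the crawl delay from the end of the download means in practice. The function name and parameters are hypothetical, and times are plain seconds.

```python
def next_fetch_time(download_start, download_end, crawl_delay, from_end=True):
    """Earliest time the next fetch may begin.

    from_end=True mirrors the behaviour the commit describes (delay counted
    from the end of the download); from_end=False is the alternative it
    mentions (delay counted from the beginning).
    """
    base = download_end if from_end else download_start
    return base + crawl_delay

# A download that starts at t=0 and takes 4 seconds, with a 10 second delay.
from_end_time = next_fetch_time(0, 4, 10, from_end=True)
from_start_time = next_fetch_time(0, 4, 10, from_end=False)
print(from_end_time, from_start_time)
```

Under this assumption slow downloads stretch the effective gap between request starts beyond the configured delay.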
mwells
dc226dde0e
fix LinkInfo mem leaks
2013-11-16 17:50:32 -08:00
Matt Wells
be213ca28f
now fix embedded products and images in the diffbot
...
json reply properly!
2013-11-14 12:51:34 -08:00
Matt Wells
7f038235e1
hack in a type:product or type:image
...
since product and image json elements
are taken from an array and lack those.
2013-11-12 16:57:14 -08:00
Matt Wells
fbcd6b8afd
display json objects that are not in arrays
...
in csv. show csv header. how to deal
with heterogeneous object lists?
index spiderdate: for gbsortby:spiderdate.
added gbrevsortby: support.
2013-11-12 13:51:52 -08:00
Matt Wells
09f28b2f26
now we index all numbers that have field names
...
(so can't just be a number in the body) but it
can be in a meta tag or json item. then use
like gbsortby:products.offerPrice to sort the
search results (json objects) by that.
2013-11-08 16:16:13 -08:00
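A rough sketch of the idea in the commit above: collect every number that has a field name from a json object, then sort objects by one such field. The dotted naming for nested fields follows the `products.offerPrice` example; the function names, the choice to skip array indices in names, the use of the smallest value when a field repeats, and the ascending order are all assumptions.

```python
def numeric_fields(obj, prefix=""):
    """Yield (dotted_field_name, number) pairs from a parsed JSON value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            yield from numeric_fields(value, name)
    elif isinstance(obj, list):
        # Array elements inherit the name of the array itself.
        for item in obj:
            yield from numeric_fields(item, prefix)
    elif prefix and isinstance(obj, (int, float)) and not isinstance(obj, bool):
        # Only numbers that carry a field name are reported.
        yield prefix, obj

def sort_by(docs, field):
    """Sort documents by the smallest value of `field`; docs without it go last."""
    def key(doc):
        values = [v for name, v in numeric_fields(doc) if name == field]
        return min(values) if values else float("inf")
    return sorted(docs, key=key)

docs = [
    {"title": "b", "products": [{"offerPrice": 20.0}]},
    {"title": "c"},
    {"title": "a", "products": [{"offerPrice": 5}]},
]
ordered = [d["title"] for d in sort_by(docs, "products.offerPrice")]
print(ordered)
```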
Matt Wells
8c9d5d824b
support for gbcontenthash:xxxxx for doing
...
exact match deduping. highest site rank
page wins, on ties, lowest docid wins.
2013-11-04 13:47:13 -08:00
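The tie-breaking rule in the commit above (highest site rank wins; on ties, lowest docid wins) can be sketched as follows. The record layout and the name `pick_winners` are hypothetical.

```python
def pick_winners(pages):
    """Keep one page per content hash: highest site rank, then lowest docid."""
    winners = {}
    for page in pages:
        best = winners.get(page["content_hash"])
        # Negating docid makes a lower docid compare as "greater" on rank ties.
        if best is None or (page["site_rank"], -page["docid"]) > (
            best["site_rank"], -best["docid"]
        ):
            winners[page["content_hash"]] = page
    return list(winners.values())

pages = [
    {"docid": 30, "site_rank": 5, "content_hash": "abc"},
    {"docid": 10, "site_rank": 7, "content_hash": "abc"},
    {"docid": 20, "site_rank": 7, "content_hash": "abc"},
    {"docid": 40, "site_rank": 1, "content_hash": "xyz"},
]
kept = sorted(p["docid"] for p in pick_winners(pages))
print(kept)
```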
Matt Wells
4477472903
just selecting a url to crawl should
...
count as a pagedownloadattempt --
the CrawlInfo counter. removed urlsExamined
count because it was too confusing.
2013-10-28 22:38:15 -07:00
mwells
469be5f216
moved email logic from xmldoc into spider.cpp.
...
add maxCrawlRounds parm. added crawlStatus
msg in json output to indicate why crawl stopped.
2013-10-23 12:49:32 -07:00
Matt Wells
b589b17e63
fix collection resetting.
2013-10-18 15:21:00 -07:00
Matt Wells
fe8ebd23a3
added simplified redirect urls to spiderdb
...
as a new spiderrequest. made XmlDoc::getLinks()
call m_links.set(redirUrl.getUrl()) so that it is
treated like an outlink on the page and gets added
from addOutlinkSpiderRecsToMetaList().
2013-10-17 12:06:12 -07:00
Matt Wells
74c2742ced
fix mem leak of LinkInfo.
...
fixed json output from injecting url.
2013-10-16 17:17:28 -07:00
Matt Wells
fc17521697
Merge branch 'master' into diffbot
...
Conflicts:
Hostdb.cpp
Makefile
PageResults.cpp
PageRoot.cpp
Pages.cpp
Rdb.cpp
SearchInput.cpp
SearchInput.h
Spider.cpp
Spider.h
XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
e565a861ae
give nice reply from seed in json.
...
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
mwells
a562c65627
another code checkpoint. new json api
...
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
c949bfe315
ignore certain errors and index the doc anyway
...
so we at least have it in our dmoz index with its
designated title and summary from dmoz.
2013-10-13 00:02:25 -07:00
mwells
c283e85e40
add support for noindex meta tag.
...
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00
mwells
d300dc42f7
added XmlDoc::getDmozTitles() and related
...
functions.
2013-10-12 21:56:25 -07:00
Matt Wells
283ec2f6b4
email and webhook alerts when spider runs out of urls
...
to spider.
2013-10-09 11:42:56 -07:00
Matt Wells
3702a05d64
add sendEmailThroughMandrill() to send
...
through the MailChimp http api.
2013-10-08 18:01:38 -07:00
Matt Wells
9eecfd378c
added support for pageprocesspattern again.
...
|| separated strings to find in m_content
before sending to diffbot.
2013-10-08 17:08:58 -07:00
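A minimal sketch of matching a "||"-separated page process pattern against page content. The commit only says the strings are searched for in the content; treating a match on any one string as sufficient, matching case-sensitively, and letting an empty pattern match everything are assumptions, as is the function name.

```python
def matches_process_pattern(content, pattern):
    """Return True if any "||"-separated substring of pattern occurs in content."""
    needles = [p for p in pattern.split("||") if p]
    if not needles:
        # Assumption: no pattern means every page qualifies.
        return True
    return any(needle in content for needle in needles)

page = "<html><body>Price: $10 <span class=product>Widget</span></body></html>"
print(matches_process_pattern(page, "class=article||class=product"))
print(matches_process_pattern(page, "class=article||class=review"))
```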
mwells
59b491f007
return fake tag recs for links if
...
usefakeips meta tag is given. saves
some lookups in tagdb when adding gbdmoz.urls.txt.*
files which have tons of links each. like 500,000.
2013-10-06 16:42:32 -07:00
mwells
000caa5a26
support for usefakeips meta tag
2013-10-06 00:10:07 -06:00
mwells
6c2c9f7774
trying to bring back dmoz integration.
2013-10-02 22:34:21 -06:00
mwells
3fecb3eb1f
got email and url notification code compiling.
...
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
mwells
20952eedbe
customizable api list in url filters
2013-09-30 09:18:22 -06:00
mwells
9bf8bf7712
add spider reply even on g_errno now with an error
...
code of EINTERNAL in the spider reply.
no longer just sit on the lock. this was blocking
an entire ip when just lock sitting for 3 hrs.
and only do read rate timeouts if there was at least
one byte read. this was causing diffbot reply to
read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
mwells
fd081478de
fix crawlbot to work on a distributed network
...
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier if a host went down, then it would
get the negative/positive colldb parm keys later.
so it could sync up on all your operations as long
as all your operations are expressed in terms of adding and deleting
database key/value pairs.
2013-09-26 22:41:05 -06:00
Matt Wells
f90d20f4dd
diffbot api integration updates
2013-09-18 15:07:47 -07:00