Commit Graph

60 Commits

Author SHA1 Message Date
Matt Wells
72f1312652 new linkdb code compiling. 2014-02-20 17:27:28 -08:00
Matt Wells
df59d3946a fix content hash issues for json. do not
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
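
A minimal sketch of the order-independent hashing described in the commit above (hypothetical helper names, not the actual XmlDoc code): hash each remaining field/value pair separately and combine with XOR so field order does not matter, skipping the url/resolved_url/html fields.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// hypothetical helper: order-independent hash of json field/value pairs,
// skipping fields that should not affect the EDOCUNCHANGED decision
static uint32_t hashJsonFields(const std::map<std::string, std::string> &fields) {
    std::hash<std::string> hs;
    uint32_t h = 0;
    for (const auto &kv : fields) {
        const std::string &name = kv.first;
        if (name == "url" || name == "resolved_url" || name == "html")
            continue;                               // excluded per the commit above
        h ^= (uint32_t)hs(name + "=" + kv.second);  // XOR makes field order irrelevant
    }
    return h;
}
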
Matt Wells
8bb17de3c5 pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
so the diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
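
A quick standalone illustration of the 128-second figure (demo code, not project code): a 2014-era unix timestamp is around 2^30, so a 32-bit float's 24-bit significand can only represent steps of 2^(30-23) = 128 seconds at that magnitude.

#include <cmath>
#include <cstdio>

int main() {
    int t = 1391731074;                     // a unix timestamp from Feb 2014
    float f = (float)t;                     // what a float-only index would store
    float next = std::nextafterf(f, 3e9f);  // closest larger representable float
    printf("int value:       %d\n", t);
    printf("float rounds to: %.0f (off by %d s)\n", (double)f, (int)((double)f - t));
    printf("float spacing:   %.0f s\n", (double)(next - f));  // prints 128
    return 0;
}
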
Matt Wells
6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages, although it is
currently not enabled; we are still using the stricter one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
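
A rough sketch of the canonical-as-redirect idea (naive string scan with hypothetical names, assuming a double-quoted href; the real Links/XmlDoc parsing is more involved): if the page declares a canonical url that differs from the url we fetched, treat it like a meta redirect so the duplicates collapse.

#include <string>

// hypothetical helper: returns the canonical target declared by a
// <link rel=canonical href=...> tag if it differs from the fetched url,
// otherwise an empty string
std::string getCanonicalRedirect(const std::string &html, const std::string &pageUrl) {
    size_t rel = html.find("canonical");
    if (rel == std::string::npos) return "";
    size_t tagStart = html.rfind("<link", rel);   // tag carrying rel=canonical
    if (tagStart == std::string::npos) return "";
    size_t h = html.find("href=\"", tagStart);
    if (h == std::string::npos) return "";
    h += 6;
    size_t e = html.find('"', h);
    if (e == std::string::npos) return "";
    std::string target = html.substr(h, e - h);
    return target != pageUrl ? target : "";       // only a "redirect" if it points elsewhere
}
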
Matt Wells
061bf70a51 show EXACT diffbot url used in logs
for easier replication
2014-01-22 18:25:18 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc() now adds
a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards,
and as we get them in small pieces, so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
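
A hedged sketch of the streaming idea (hypothetical interface, not the real TcpServer API): instead of buffering every shard's json reply and sending one giant blob, forward each piece to the client socket as it arrives so memory stays bounded.

#include <cstddef>
#include <functional>
#include <string>

struct StreamingReply {
    std::function<bool(const char *buf, size_t len)> sendToClient; // writes on the client socket
    bool m_sentHeader = false;
    bool m_sentFirstObject = false;

    // called once per json object (or fragment) received from a shard
    bool onShardPiece(const std::string &jsonObject) {
        if (!m_sentHeader) {
            m_sentHeader = true;
            if (!sendToClient("{\"objects\":[", 12)) return false;
        }
        if (m_sentFirstObject && !sendToClient(",", 1)) return false;
        m_sentFirstObject = true;
        return sendToClient(jsonObject.data(), jsonObject.size());
    }

    // called when every shard has finished replying
    bool onAllShardsDone() { return sendToClient("]}", 2); }
};
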
Matt Wells
0844dbf72a added url process pattern and regex to
xmldoc.cpp.
2014-01-17 11:08:23 -08:00
Matt Wells
8a49e87a61 got the shard rebalancing code compiling.
now we store a "sharded by termid" bit in the posdb
key for checksum keys, etc., that are not sharded
by docid. saves having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
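
A sketch of the routing idea above (hypothetical key layout and bit position): most posdb keys are routed to a shard by docid, but keys such as content checksums carry a "sharded by termid" bit so all keys for a given termid land on one shard and a dup check needs a single lookup instead of a seek on every host.

#include <cstdint>

const uint64_t SHARDED_BY_TERMID_BIT = 1ULL << 0;  // bit position is illustrative only

uint32_t pickShard(uint64_t keyFlags, uint64_t termId, uint64_t docId, uint32_t numShards) {
    if (keyFlags & SHARDED_BY_TERMID_BIT)
        return (uint32_t)(termId % numShards);  // one host holds all keys for this termid
    return (uint32_t)(docId % numShards);       // normal routing by docid
}
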
Matt Wells
128e055120 take out datedb. no longer used. we store
dates in posdb since it has larger keys than
indexdb.
2014-01-09 13:39:28 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
5e4b5a112c Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
	Threads.cpp
	XmlDoc.cpp
	XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
1bc80ab552 fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
3155869fbf added new log msg for
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
8bb086ac60 crawldelay works now but it measures
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
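
A small sketch of the crawl-delay bookkeeping described above (hypothetical names): the next request to a host is scheduled relative to when the previous download finished, not when it started.

#include <cstdint>

struct HostCrawlState {
    int64_t m_lastDownloadEndMs = 0;  // set when the previous download completes
    int64_t m_crawlDelayMs      = 0;  // from robots.txt or the collection parms
};

bool canDownloadNow(const HostCrawlState &h, int64_t nowMs) {
    // measure from the END of the last download, as in the commit above
    return nowMs >= h.m_lastDownloadEndMs + h.m_crawlDelayMs;
}
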
mwells
dc226dde0e fix LinkInfo mem leaks 2013-11-16 17:50:32 -08:00
Matt Wells
be213ca28f now fix embedded products and images in the diffbot
json reply properly!
2013-11-14 12:51:34 -08:00
Matt Wells
7f038235e1 hack in a type:product or type:image
since product and image json elements
are taken from an array and lack those.
2013-11-12 16:57:14 -08:00
Matt Wells
fbcd6b8afd display json objects that are not in arrays
in csv. show csv header. how to deal
with heterogeneous object lists?
index spiderdate: for gbsortby:spiderdate.
added gbrevsortby: support.
2013-11-12 13:51:52 -08:00
Matt Wells
09f28b2f26 now we index all numbers that have field names
(so they can't just be bare numbers in the body; the
field name can come from a meta tag or json item). then use
something like gbsortby:products.offerPrice to sort the
search results (json objects) by that field.
2013-11-08 16:16:13 -08:00
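
Usage sketch, based only on the query operators and field names appearing in these commits (the exact sort direction of the base operator is not stated here):

gbsortby:products.offerPrice      sort the json results by that numeric field
gbrevsortby:products.offerPrice   same sort, reverse order
gbsortby:spiderdate               sort by the indexed spider date
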
Matt Wells
8c9d5d824b support for gbcontenthash:xxxxx for doing
exact-match deduping. the highest site rank
page wins; on ties, the lowest docid wins.
2013-11-04 13:47:13 -08:00
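
A sketch of the dedup winner rule stated above (hypothetical struct, not the actual code): among docs sharing a gbcontenthash term, keep the page with the highest site rank and break ties by keeping the lowest docid.

#include <cstdint>

struct DupCandidate {
    int32_t m_siteRank;
    int64_t m_docId;
};

// returns true if a should win the dedup over b
bool beats(const DupCandidate &a, const DupCandidate &b) {
    if (a.m_siteRank != b.m_siteRank)
        return a.m_siteRank > b.m_siteRank;  // higher site rank wins
    return a.m_docId < b.m_docId;            // tie: lowest docid wins
}
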
Matt Wells
4477472903 just selecting a url to crawl should
count as a pagedownloadattempt in
the CrawlInfo counter. removed the urlsExamined
count because it was too confusing.
2013-10-28 22:38:15 -07:00
mwells
469be5f216 moved email logic from xmldoc into spider.cpp.
add maxCrawlRounds parm. added crawlStatus
msg in json output to indicate why crawl stopped.
2013-10-23 12:49:32 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
fe8ebd23a3 added simplified redirect urls to spiderdb
as a new spiderrequest. made XmlDoc::getLinks()
call m_links.set(redirUrl.getUrl()) so that it is
treated like an outlink on the page and gets added
from addOutlinkSpiderRecsToMetaList().
2013-10-17 12:06:12 -07:00
Matt Wells
74c2742ced fix mem leak of LinkInfo.
fixed json output from injecting url.
2013-10-16 17:17:28 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
e565a861ae give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
mwells
a562c65627 another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
c949bfe315 ignore certain errors and index the doc anyway
so we at least have it in our dmoz index with its
designated title and summary from dmoz.
2013-10-13 00:02:25 -07:00
mwells
c283e85e40 add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00
mwells
d300dc42f7 added XmlDoc::getDmozTitles() and related
functions.
2013-10-12 21:56:25 -07:00
Matt Wells
283ec2f6b4 email and webhook alerts when spider runs out of urls
to spider.
2013-10-09 11:42:56 -07:00
Matt Wells
3702a05d64 add sendEmailThroughMandrill() to send
through mail chimp http api.
2013-10-08 18:01:38 -07:00
Matt Wells
9eecfd378c added support for pageprocesspattern again:
||-separated strings to find in m_content
before sending to diffbot.
2013-10-08 17:08:58 -07:00
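
A sketch of how a ||-separated process pattern could be applied (hypothetical helper, not the actual XmlDoc code): the doc is only sent to diffbot if its content contains at least one of the listed substrings.

#include <string>

bool matchesProcessPattern(const std::string &content, const std::string &pattern) {
    size_t start = 0;
    while (start <= pattern.size()) {
        size_t end = pattern.find("||", start);
        if (end == std::string::npos) end = pattern.size();
        std::string needle = pattern.substr(start, end - start);
        if (!needle.empty() && content.find(needle) != std::string::npos)
            return true;                     // any one substring is enough
        start = end + 2;                     // skip past the "||" separator
    }
    return false;
}
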
mwells
59b491f007 return fake tag recs for links if the
usefakeips meta tag is given. saves
some lookups in tagdb when adding gbdmoz.urls.txt.*
files, which have tons of links each, like 500,000.
2013-10-06 16:42:32 -07:00
mwells
000caa5a26 support for usefakeips meta tag 2013-10-06 00:10:07 -06:00
mwells
6c2c9f7774 trying to bring back dmoz integration. 2013-10-02 22:34:21 -06:00
mwells
3fecb3eb1f got email and url notification code compiling.
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
mwells
20952eedbe customizable api list in url filters 2013-09-30 09:18:22 -06:00
mwells
9bf8bf7712 add a spider reply even on g_errno now, with an
error code of EINTERNAL in the spider reply.
no longer just sit on the lock; that was blocking
an entire ip by holding the lock for 3 hrs.
also, only do read rate timeouts if there was at least
one byte read. this was causing the diffbot reply to
hit a read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
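
A sketch of the read-rate timeout rule above (hypothetical fields): only apply the stall timeout once at least one byte has arrived, so a slow-to-start reply such as a 90-second diffbot call is governed by its overall timeout rather than the read-rate limit.

#include <cstdint>

struct ReadState {
    int64_t m_bytesRead       = 0;
    int64_t m_lastReadMs      = 0;  // time the last byte arrived
    int64_t m_startMs         = 0;
    int64_t m_readRateLimitMs = 0;  // e.g. 60000
    int64_t m_totalTimeoutMs  = 0;  // e.g. 90000
};

bool hasTimedOut(const ReadState &s, int64_t nowMs) {
    if (nowMs - s.m_startMs >= s.m_totalTimeoutMs) return true;  // overall timeout
    if (s.m_bytesRead == 0) return false;        // no read-rate timeout before the first byte
    return nowMs - s.m_lastReadMs >= s.m_readRateLimitMs;        // stalled mid-reply
}
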
mwells
fd081478de fix crawlbot to work on a distributed network
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier: if a host went down, it would
get the negative/positive colldb parm keys later,
so it could sync up on all your operations, as long
as all your operations are expressed as adding and deleting
database key/value pairs.
2013-09-26 22:41:05 -06:00
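
A sketch of the "each key is a parm" idea floated above (hypothetical record layout; the parm name is just one mentioned in an earlier commit): if every parm update were written as a positive (add) or negative (delete) key/value record, a host that was down could pull the missing records later and replay them to sync its collection parms.

#include <cstdint>
#include <string>

struct ParmRecord {
    std::string m_collection;   // which collection the parm belongs to
    std::string m_parmName;     // e.g. "maxCrawlRounds"
    std::string m_value;        // empty for a negative (delete) record
    bool        m_isNegative;   // true = delete the parm, false = add/overwrite
    int64_t     m_timestampMs;  // lets a recovering host replay updates in order
};
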
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00