Commit Graph

60 Commits

Author SHA1 Message Date
Matt Wells
72f1312652 new linkdb code compiling. 2014-02-20 17:27:28 -08:00
Matt Wells
df59d3946a fix content hash issues for json. do not
hash url/resolved_url/html fields. do exact
order-independent hashes of remaining field/value pairs.
used for setting EDOCUNCHANGED and doing spidertime/querytime
deduping. also do not index "html" json field because it
is huge, slow and redundant. convert "date" field
into a number so we can sort/constrain by article pub date.
2014-02-15 14:40:56 -08:00
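
A minimal sketch of the order-independent hashing described in the commit above (hypothetical helper names, not the actual XmlDoc code): hash each remaining field/value pair separately and combine with XOR so field order does not matter, skipping the url/resolved_url/html fields.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

// hypothetical helper: order-independent hash of json field/value pairs,
// skipping fields that should not affect the EDOCUNCHANGED decision
static uint32_t hashJsonFields(const std::map<std::string, std::string> &fields) {
    std::hash<std::string> hs;
    uint32_t h = 0;
    for (const auto &kv : fields) {
        const std::string &name = kv.first;
        if (name == "url" || name == "resolved_url" || name == "html")
            continue;                               // excluded per the commit above
        h ^= (uint32_t)hs(name + "=" + kv.second);  // XOR makes field order irrelevant
    }
    return h;
}
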
Matt Wells
8bb17de3c5 pass smoketest: TestOnlyProcessIfNew.testNotUpdatedContent
so the diffbot reply will not be updated in the index if it is unchanged
thereby keeping lastCrawlTimeUTC the same.
2014-02-12 18:42:14 -08:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
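
A quick standalone illustration of the 128-second figure (demo code, not project code): a 2014-era unix timestamp is around 2^30, so a 32-bit float's 24-bit significand can only represent steps of 2^(30-23) = 128 seconds at that magnitude.

#include <cmath>
#include <cstdio>

int main() {
    int t = 1391731074;                     // a unix timestamp from Feb 2014
    float f = (float)t;                     // what a float-only index would store
    float next = std::nextafterf(f, 3e9f);  // closest larger representable float
    printf("int value:       %d\n", t);
    printf("float rounds to: %.0f (off by %d s)\n", (double)f, (int)((double)f - t));
    printf("float spacing:   %.0f s\n", (double)(next - f));  // prints 128
    return 0;
}
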
Matt Wells
6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages, although it is
currently not enabled; we are still using the stricter one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
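
A rough sketch of the canonical-as-redirect idea (naive string scan with hypothetical names, assuming a double-quoted href; the real Links/XmlDoc parsing is more involved): if the page declares a canonical url that differs from the url we fetched, treat it like a meta redirect so the duplicates collapse.

#include <string>

// hypothetical helper: returns the canonical target declared by a
// <link rel=canonical href=...> tag if it differs from the fetched url,
// otherwise an empty string
std::string getCanonicalRedirect(const std::string &html, const std::string &pageUrl) {
    size_t rel = html.find("canonical");
    if (rel == std::string::npos) return "";
    size_t tagStart = html.rfind("<link", rel);   // tag carrying rel=canonical
    if (tagStart == std::string::npos) return "";
    size_t h = html.find("href=\"", tagStart);
    if (h == std::string::npos) return "";
    h += 6;
    size_t e = html.find('"', h);
    if (e == std::string::npos) return "";
    std::string target = html.substr(h, e - h);
    return target != pageUrl ? target : "";       // only a "redirect" if it points elsewhere
}
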
Matt Wells
061bf70a51 show EXACT diffbot url used in logs
for easier replication
2014-01-22 18:25:18 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc() now adds
a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards,
and as we get them in small pieces, so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
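
A hedged sketch of the streaming idea (hypothetical interface, not the real TcpServer API): instead of buffering every shard's json reply and sending one giant blob, forward each piece to the client socket as it arrives so memory stays bounded.

#include <cstddef>
#include <functional>
#include <string>

struct StreamingReply {
    std::function<bool(const char *buf, size_t len)> sendToClient; // writes on the client socket
    bool m_sentHeader = false;
    bool m_sentFirstObject = false;

    // called once per json object (or fragment) received from a shard
    bool onShardPiece(const std::string &jsonObject) {
        if (!m_sentHeader) {
            m_sentHeader = true;
            if (!sendToClient("{\"objects\":[", 12)) return false;
        }
        if (m_sentFirstObject && !sendToClient(",", 1)) return false;
        m_sentFirstObject = true;
        return sendToClient(jsonObject.data(), jsonObject.size());
    }

    // called when every shard has finished replying
    bool onAllShardsDone() { return sendToClient("]}", 2); }
};
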
Matt Wells
0844dbf72a added url process pattern and regex to
xmldoc.cpp.
2014-01-17 11:08:23 -08:00
Matt Wells
8a49e87a61 got the shard rebalancing code compiling.
now we store a "sharded by termid" bit in the posdb
key for checksum keys, etc., that are not sharded
by docid. saves having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
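
A sketch of the routing idea above (hypothetical key layout and bit position): most posdb keys are routed to a shard by docid, but keys such as content checksums carry a "sharded by termid" bit so all keys for a given termid land on one shard and a dup check needs a single lookup instead of a seek on every host.

#include <cstdint>

const uint64_t SHARDED_BY_TERMID_BIT = 1ULL << 0;  // bit position is illustrative only

uint32_t pickShard(uint64_t keyFlags, uint64_t termId, uint64_t docId, uint32_t numShards) {
    if (keyFlags & SHARDED_BY_TERMID_BIT)
        return (uint32_t)(termId % numShards);  // one host holds all keys for this termid
    return (uint32_t)(docId % numShards);       // normal routing by docid
}
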
Matt Wells
128e055120 take out datedb. no longer used. we store
dates in posdb since it has larger keys than
indexdb.
2014-01-09 13:39:28 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
5e4b5a112c Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
	Threads.cpp
	XmlDoc.cpp
	XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
1bc80ab552 fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
3155869fbf added new log msg for
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
8bb086ac60 crawldelay works now but it measures
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
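
A small sketch of the crawl-delay bookkeeping described above (hypothetical names): the next request to a host is scheduled relative to when the previous download finished, not when it started.

#include <cstdint>

struct HostCrawlState {
    int64_t m_lastDownloadEndMs = 0;  // set when the previous download completes
    int64_t m_crawlDelayMs      = 0;  // from robots.txt or the collection parms
};

bool canDownloadNow(const HostCrawlState &h, int64_t nowMs) {
    // measure from the END of the last download, as in the commit above
    return nowMs >= h.m_lastDownloadEndMs + h.m_crawlDelayMs;
}
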
mwells
dc226dde0e fix LinkInfo mem leaks 2013-11-16 17:50:32 -08:00
Matt Wells
be213ca28f now fix embedded products and images in the diffbot
json reply properly!
2013-11-14 12:51:34 -08:00
Matt Wells
7f038235e1 hack in a type:product or type:image
since product and image json elements
are taken from an array and lack those.
2013-11-12 16:57:14 -08:00
Matt Wells
fbcd6b8afd display json objects that are not in arrays
in csv. show csv header. how to deal
with heterogeneous object lists?
index spiderdate: for gbsortby:spiderdate.
added gbrevsortby: support.
2013-11-12 13:51:52 -08:00
Matt Wells
09f28b2f26 now we index all numbers that have field names
(so they can't just be bare numbers in the body; the
field name can come from a meta tag or json item). then use
something like gbsortby:products.offerPrice to sort the
search results (json objects) by that field.
2013-11-08 16:16:13 -08:00
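
Usage sketch, based only on the query operators and field names appearing in these commits (the exact sort direction of the base operator is not stated here):

gbsortby:products.offerPrice      sort the json results by that numeric field
gbrevsortby:products.offerPrice   same sort, reverse order
gbsortby:spiderdate               sort by the indexed spider date
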
Matt Wells
8c9d5d824b support for gbcontenthash:xxxxx for doing
exact-match deduping. the highest site rank
page wins; on ties, the lowest docid wins.
2013-11-04 13:47:13 -08:00
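
A sketch of the dedup winner rule stated above (hypothetical struct, not the actual code): among docs sharing a gbcontenthash term, keep the page with the highest site rank and break ties by keeping the lowest docid.

#include <cstdint>

struct DupCandidate {
    int32_t m_siteRank;
    int64_t m_docId;
};

// returns true if a should win the dedup over b
bool beats(const DupCandidate &a, const DupCandidate &b) {
    if (a.m_siteRank != b.m_siteRank)
        return a.m_siteRank > b.m_siteRank;  // higher site rank wins
    return a.m_docId < b.m_docId;            // tie: lowest docid wins
}
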
Matt Wells
4477472903 just selecting a url to crawl should
count as a pagedownloadattempt in
the CrawlInfo counter. removed the urlsExamined
count because it was too confusing.
2013-10-28 22:38:15 -07:00
mwells
469be5f216 moved email logic from xmldoc into spider.cpp.
add maxCrawlRounds parm. added crawlStatus
msg in json output to indicate why crawl stopped.
2013-10-23 12:49:32 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
fe8ebd23a3 added simplified redirect urls to spiderdb
as a new spiderrequest. made XmlDoc::getLinks()
call m_links.set(redirUrl.getUrl()) so that it is
treated like an outlink on the page and gets added
from addOutlinkSpiderRecsToMetaList().
2013-10-17 12:06:12 -07:00
Matt Wells
74c2742ced fix mem leak of LinkInfo.
fixed json output from injecting url.
2013-10-16 17:17:28 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
e565a861ae give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
mwells
a562c65627 another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
c949bfe315 ignore certain errors and index the doc anyway
so we at least have it in our dmoz index with its
designated title and summary from dmoz.
2013-10-13 00:02:25 -07:00
mwells
c283e85e40 add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00
mwells
d300dc42f7 added XmlDoc::getDmozTitles() and related
functions.
2013-10-12 21:56:25 -07:00
Matt Wells
283ec2f6b4 email and webhook alerts when spider runs out of urls
to spider.
2013-10-09 11:42:56 -07:00
Matt Wells
3702a05d64 add sendEmailThroughMandrill() to send
through mail chimp http api.
2013-10-08 18:01:38 -07:00
Matt Wells
9eecfd378c added support for pageprocesspattern again:
||-separated strings to find in m_content
before sending to diffbot.
2013-10-08 17:08:58 -07:00
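
A sketch of how a ||-separated process pattern could be applied (hypothetical helper, not the actual XmlDoc code): the doc is only sent to diffbot if its content contains at least one of the listed substrings.

#include <string>

bool matchesProcessPattern(const std::string &content, const std::string &pattern) {
    size_t start = 0;
    while (start <= pattern.size()) {
        size_t end = pattern.find("||", start);
        if (end == std::string::npos) end = pattern.size();
        std::string needle = pattern.substr(start, end - start);
        if (!needle.empty() && content.find(needle) != std::string::npos)
            return true;                     // any one substring is enough
        start = end + 2;                     // skip past the "||" separator
    }
    return false;
}
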
mwells
59b491f007 return fake tag recs for links if the
usefakeips meta tag is given. saves
some lookups in tagdb when adding gbdmoz.urls.txt.*
files, which have tons of links each, like 500,000.
2013-10-06 16:42:32 -07:00
mwells
000caa5a26 support for usefakeips meta tag 2013-10-06 00:10:07 -06:00
mwells
6c2c9f7774 trying to bring back dmoz integration. 2013-10-02 22:34:21 -06:00
mwells
3fecb3eb1f got email and url notification code compiling.
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
mwells
20952eedbe customizable api list in url filters 2013-09-30 09:18:22 -06:00
mwells
9bf8bf7712 add a spider reply even on g_errno now, with an
error code of EINTERNAL in the spider reply.
no longer just sit on the lock; that was blocking
an entire ip by holding the lock for 3 hrs.
also, only do read rate timeouts if there was at least
one byte read. this was causing the diffbot reply to
hit a read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
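
A sketch of the read-rate timeout rule above (hypothetical fields): only apply the stall timeout once at least one byte has arrived, so a slow-to-start reply such as a 90-second diffbot call is governed by its overall timeout rather than the read-rate limit.

#include <cstdint>

struct ReadState {
    int64_t m_bytesRead       = 0;
    int64_t m_lastReadMs      = 0;  // time the last byte arrived
    int64_t m_startMs         = 0;
    int64_t m_readRateLimitMs = 0;  // e.g. 60000
    int64_t m_totalTimeoutMs  = 0;  // e.g. 90000
};

bool hasTimedOut(const ReadState &s, int64_t nowMs) {
    if (nowMs - s.m_startMs >= s.m_totalTimeoutMs) return true;  // overall timeout
    if (s.m_bytesRead == 0) return false;        // no read-rate timeout before the first byte
    return nowMs - s.m_lastReadMs >= s.m_readRateLimitMs;        // stalled mid-reply
}
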
mwells
fd081478de fix crawlbot to work on a distributed network
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier: if a host went down, it would
get the negative/positive colldb parm keys later,
so it could sync up on all your operations, as long
as all your operations are expressed as adding and deleting
database key/value pairs.
2013-09-26 22:41:05 -06:00
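
A sketch of the "each key is a parm" idea floated above (hypothetical record layout; the parm name is just one mentioned in an earlier commit): if every parm update were written as a positive (add) or negative (delete) key/value record, a host that was down could pull the missing records later and replay them to sync its collection parms.

#include <cstdint>
#include <string>

struct ParmRecord {
    std::string m_collection;   // which collection the parm belongs to
    std::string m_parmName;     // e.g. "maxCrawlRounds"
    std::string m_value;        // empty for a negative (delete) record
    bool        m_isNegative;   // true = delete the parm, false = add/overwrite
    int64_t     m_timestampMs;  // lets a recovering host replay updates in order
};
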
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00