open-source-search-engine

mirror of https://github.com/gigablast/open-source-search-engine.git synced 2024-10-04 20:27:43 +03:00

Author	SHA1	Message	Date
Matt Wells	cb111a1efa	fix doledb empty logic	2013-12-19 13:06:35 -08:00
Matt Wells	16e91375f4	bring in changes from live beta from ~/github. limit spiders to 50, not 500 to prevent oom. resume killed merges that had num files shrunk even if down to one file. show collnum in spider queue. remove back-to-back whitespace, and make all space a ' ' for getting the doc checksum for deduping.	2013-12-12 12:58:58 -08:00
Matt Wells	144e2c898e	save resources by not doing reads on an empty doledb priority. stop saving allSpidersOn and Off parms.	2013-12-08 14:07:31 -07:00
Matt Wells	03219a3057	add regex support back in	2013-12-03 16:23:05 -08:00
Matt Wells	c3517ee019	Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Conflicts: Spider.cpp	2013-11-22 17:37:42 -08:00
Matt Wells	9d9a976b4f	fix bug of perpetual round incrementing ad nauseam.	2013-11-22 11:14:03 -08:00
Matt Wells	dcae4682e8	new api. tossed action/expression and added urlCrawlPattern/urlProcessPattern/apiUrl	2013-11-20 16:41:28 -08:00
Matt Wells	a8ffc6e50b	indicate diffbot processing errors in the urls csv	2013-11-18 17:38:14 -08:00
Matt Wells	7d3b52fb3a	an intersect thread taking forever was causing msg5 reads to block forever and the spider round to get incremented. fixed a few bugs around that issue.	2013-11-18 16:20:30 -08:00
Matt Wells	8d9f000f11	make getNumSpidersOutPerIp() specific to a coll so another coll does not prevent a coll from populating its own waiting tree.	2013-11-18 14:13:28 -08:00
Matt Wells	34bffc2cc6	1-second crawl info sleep wrapper update	2013-11-04 16:02:03 -08:00
Matt Wells	7b319e5948	show more info in the urls csv file. record whether we processed the url or not in the SpiderReply. normalize /index.html etc. to / for the outlinks. in Links.cpp class.	2013-11-04 10:49:31 -08:00
Matt Wells	adf4d258ae	better crawl status reporting. allow for _ in coll names.	2013-10-30 10:00:46 -07:00
Matt Wells	c39b45ff88	fix crawl round end detection etc. inc round counter even if not repeating crawl	2013-10-23 15:53:59 -07:00
Matt Wells	1fb85db307	url filters fixes.	2013-10-21 13:44:30 -07:00
Matt Wells	889583ec4b	now we can reset collection mid stream	2013-10-18 17:49:36 -07:00
Matt Wells	b589b17e63	fix collection resetting.	2013-10-18 15:21:00 -07:00
Matt Wells	fc17521697	Merge branch 'master' into diffbot Conflicts: Hostdb.cpp Makefile PageResults.cpp PageRoot.cpp Pages.cpp Rdb.cpp SearchInput.cpp SearchInput.h Spider.cpp Spider.h XmlDoc.cpp	2013-10-16 14:28:42 -07:00
mwells	be01041e36	added support for new url filter: "lastspidertime>={roundstart} --> IGNORE" so we can spider all urls before we advance to the next spider round and re-spider everything again. CollectionRec::m_spiderRoundStartTime and CollectionRec::m_spiderRoundNum are the new collection rec parms. show the round stuff on url filters page.	2013-10-10 18:47:46 -06:00
mwells	2bb8b818d6	more bug fixes with notification system.	2013-10-09 16:28:15 -06:00
mwells	c1c5c4e3d0	send notifications if no urls available for immediate spidering.	2013-10-09 15:24:35 -06:00
mwells	612f2872f7	use addurl to add the gbdmoz url files to gigablast. it should index just those dmoz urls, and not spider their links. it should ignore external errors like ETCPTIMEDOUT when indexing so it will be identical to dmoz.	2013-10-05 23:22:51 -06:00
Matt Wells	fe97e08281	move from groups to shards. got rid of annoying groupid bit mask thing.	2013-10-04 16:18:56 -07:00
mwells	e1bde7b7fe	fixed bug of getting lock from the wrong group.	2013-10-04 12:42:01 -06:00
mwells	d4aa65c0fe	try to fix spiders with m_msg5StartKey logic.	2013-10-04 09:39:05 -06:00
mwells	0edcbcc7d8	printlocktable() function	2013-09-29 10:20:14 -06:00
mwells	c216f7b2a7	use 48 bit url hash for lock keys again. query reindex recs can just use their prob docids as fake uh48s. we need it so we can avoid the fakedb record and just use the spider reply to trigger a 5-second lock expiration. a little simpler. added logdebugspiderwait for waiting tree debugging. fixed per ip spider limiting. fixed losing spiders down blackhole from updateCrawlInfo. check UrlLock::m_confirmed when counting outstanding spiders on one ip since may have a lock on one host but not get granted on all! it calls confirmLockAcquisition() when it gets fully granted the lock so it can set UrlLock::confirmed.	2013-09-29 00:09:46 -06:00
mwells	9730e5f3ef	fix lost spiders from updating crawl info. fix maxspidersperip limitation not being obeyed. removed fakedb. only add "0" time waiting tree keys to waiting tree. only scanSpiderdb() will change their times to a future time or add them to doledb directly. confirmLockAcquisition() will not add to waitingtree if max spiders per ip limit would be exceeded. an incoming spider reply will trigger the add to waiting tree with a time of "0".	2013-09-28 13:12:33 -06:00
mwells	0b5a45e8aa	more api updates. added m_avoidSpiderLinks to spider request so urldata=xxxx can turn link spidering off. probably desirable so it's the default. so &spiderlinks=[0|1] applies to urldata as well as injecturl=	2013-09-25 17:51:43 -06:00
mwells	40192249f9	spider speedups and fixes.	2013-09-25 11:58:03 -06:00
mwells	b16d8519fc	more spider fixes. still need more speedups when spidering multiple spiders on same ip.	2013-09-24 16:40:14 -06:00
mwells	e594af898a	seems like we can spider multiple urls from same ip at same time now.	2013-09-24 09:32:26 -06:00
mwells	b90ef3de0d	more spider fixes. right after getting lock, use msg12 to remove rec from doledb/doleiptable and add 0 entry to waiting table so doledb is again immediately repopulated with that firstIp so we can spider multiple urls from the same ip at the same time.	2013-09-23 20:25:28 -06:00
mwells	83e87fc755	fixed ability to spider multiple urls from the same IP at the same time. Also respects sameIpWait constraints.	2013-09-20 15:42:48 -07:00
mwells	05400a0c25	updated spider code documentation.	2013-09-20 11:19:24 -07:00
Matt Wells	bcc55dc46b	fixed a couple bugs. Added more documentation into Spider.h.	2013-09-19 18:21:52 -07:00
Matt Wells	47465f6d90	more fixes. trying to fix spiders to spider multiple urls from same ip...	2013-09-19 11:13:40 -07:00
Matt Wells	29f5c5d644	added isonsamesubdomain and isonsamedomain	2013-09-18 16:45:37 -07:00
Matt Wells	d982997b0c	streamline crawl stats.	2013-09-13 17:34:39 -07:00
Matt Wells	a412c798bf	Merge branch 'master' into diffbot Conflicts: PageResults.cpp	2013-09-13 09:24:28 -07:00
Matt Wells	5dc7bd2ab4	integrate diffbot from svn back into git.	2013-09-13 09:23:18 -07:00
Matt Wells	03706131fe	documentation updates in Spider.h.	2013-09-08 13:42:02 -07:00
Matt Wells	9696c7936a	Merge branch 'master' into diffbot	2013-08-30 16:33:00 -07:00
Matt Wells	94e6492916	removed MAX_COLL_RECS so we can have unlimited collections, really limited by the sizeof(collnum_t) only now, which is 16bits, 15bits unsigned, which is the limitation. can always expand this so we can have more than 32k collections.	2013-08-30 16:20:38 -07:00
mwells	2e9c8f7c6e	Merge branch 'master' of github.com:gigablast/open-source-search-engine	2013-08-29 21:17:46 -06:00
mwells	84fae9a3c6	Fix issue of reading spiderrequests from doledb at the very first key in spiderdb. causes lots of positive/negative key annihilations. we end up re-reading like 300 times in some cases just to get a url from a doledb priority.	2013-08-29 21:16:59 -06:00
mwells	ca2a024d04	fixed up thread/spider log msgs. fixed core from calling fprintf in alarm signal missed quickpoll handler.	2013-08-29 21:15:42 -06:00
Matt Wells	f6e560c1f4	Initial file population.	2013-08-02 13:12:24 -07:00

48 Commits