open-source-search-engine

mirror of https://github.com/gigablast/open-source-search-engine.git synced 2024-10-04 12:17:35 +03:00

Author	SHA1	Message	Date
mwells	257a7e3c10	first stab at swapping out collection recs to save memory when # of collections is high	2014-09-29 11:37:05 -07:00
mwells	10f897e5be	use gbsystem() not system() so it can turn off alarms since it forks.	2014-09-11 05:01:55 -07:00
mwells	38cef7d52e	fix # docs and recs bug.	2014-08-28 07:45:43 -07:00
mwells	c3699f0da5	fix bugs found from qa tests.	2014-08-25 14:34:30 -07:00
mwells	e45c0d32f6	Merge branch 'diffbot-testing' into testing	2014-08-15 17:05:22 -07:00
Matt Wells	2af299da2c	various fixes. prioritize process only urls over crawl urls to get data faster. do not merge on high negative rec concentration. we need to fix that more. allow simplified redirs again for custom crawls to avoid too many dups. raise crawlinfo delay from 1 sec to 5 secs to reduce network usage for now. add back in injection enabled parm, but hidden.	2014-08-15 10:27:50 -07:00
Matt Wells	d0bc187a77	more core fixes. more stability.	2014-07-16 12:52:51 -07:00
Matt Wells	6b797f5023	more core stability fixes. prevent core dumps	2014-07-16 12:07:39 -07:00
Matt Wells	8ac691f324	fix merging getting clogged by so many collections trying to merge tagdb at once	2014-06-05 21:27:33 -07:00
Matt Wells	4298e4e752	sanity checks for debugging duplicate titledb file bug.	2014-06-04 12:15:12 -07:00
mwells	45b8bb3421	log msg cleanups	2014-05-11 21:55:44 -07:00
mwells	6e922722da	tree repair logic.	2014-05-10 12:32:01 -07:00
mwells	7e1429cc30	more bug fixes	2014-05-10 08:22:26 -07:00
mwells	8e381504a1	fix makeTrashDir()	2014-05-10 08:02:46 -07:00
mwells	2b37f56e4c	Merge branch 'diffbot-matt' into testing	2014-05-10 07:56:45 -07:00
mwells	ed816b2c11	a few bug fixes	2014-05-10 07:48:23 -07:00
mwells	81369b786c	make trash dir for image thumbs automatically	2014-04-29 17:01:48 -06:00
Matt Wells	d4302e3301	fix core	2014-03-18 11:12:50 -07:00
Matt Wells	bd4484db3c	Merge branch 'testing' into diffbot-testing	2014-03-10 12:08:23 -07:00
Matt Wells	624c1d4e68	nuke doledb fixes	2014-03-08 10:51:15 -07:00
Matt Wells	27e8e810d2	use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum.	2014-03-06 15:48:11 -08:00
Matt Wells	a6b7e088f5	take out tfndb, unused. fix core from diffbot url too long.	2014-02-26 01:07:13 -08:00
Matt Wells	32526a9b25	more checksum fixes for json. fixes for repair/rebuild procedure.	2014-02-16 10:46:41 -08:00
Matt Wells	106077c163	fix spiderrequest deduping some more	2014-02-06 09:47:18 -08:00
Matt Wells	4029b0b937	more faster spider fixes. tried to fix corrupt rdbcache.	2014-02-06 09:25:27 -08:00
Matt Wells	ecc10c2cb9	dup cache fixes. do not add dups to spiderdb either.	2014-02-05 14:09:35 -08:00
Matt Wells	4606e88721	code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code.	2014-01-18 21:19:26 -08:00
Matt Wells	980d63632a	more msg5 re-read fixes. stop re-reading if increasing minrecsizes did nothing. fix tight merges so they work over all colls. fix merge counting to be fast and not loop over all rdbbases which could be thousands. add num mirrors to rebalance.txt. fix updateCrawlInfo to wait for all replies. critical error!	2014-01-16 13:38:22 -08:00
Matt Wells	f8c2329bd2	rebalancer fixes	2014-01-15 15:42:59 -08:00
Matt Wells	8a49e87a61	got code with shard rebalancing compiling. now we store a "sharded by termid" bit in posdb key for checksums, etc keys that are not sharded by docid. save having to do disk seeks on every host in the cluster to do a dup check, etc.	2014-01-11 16:08:42 -08:00
Matt Wells	c0447de3a1	watch out for NULL "base" after a coll delete.	2013-12-29 01:32:40 -08:00
Matt Wells	d8a9a3f4e3	fix parm sync code some more. added localhosts.conf to the 'gb install' dist.	2013-12-27 14:00:37 -08:00
Matt Wells	048b715962	if coll is deleted or reset in a middle of a dump or merge then stop the dump/merge with ENOCOLLREC error. avoid calling "base->" functions since it could be NULL if deleted.	2013-12-25 17:12:09 -08:00
Matt Wells	3f19ece776	parmdb updates	2013-12-16 17:07:15 -08:00
Matt Wells	617a0ff76e	parmdb fixes	2013-12-16 16:04:43 -08:00
Matt Wells	6c652c1cc6	more parmdb fixes	2013-12-16 15:39:24 -08:00
mwells	76bb3d05e1	clean up logging so i can see what's going on	2013-12-10 16:41:30 -08:00
mwells	82494baa89	move CollectionRec stuff into Collectiondb files for simplicity.	2013-12-10 15:28:04 -08:00
mwells	f2d5661965	parmdb overhaul. support collection add/del sync when host comes back online. use udp not tcp. host #0 can now handle a new incoming request while a parm change is currently outstanding. all missed "command" parms will be received when a dead host comes back online, too, like a tight merge for instance. does not use msg4, uses msg3e and msg3f for syncing and sending parms.	2013-12-10 13:09:55 -08:00
Matt Wells	06edfddf31	a bunch of bug fixes, mostly spider related. also some for pagereindex.	2013-12-07 21:56:37 -07:00
Matt Wells	fe1a7d1a75	rdbbase not fully resetting? it was trying to dump to coll directories that had been moved to trash folder. and printing out "deleted from under us". at least it was corrupting data in RdbMem this time because i added m_dumpErrno logic.	2013-11-15 09:01:58 -08:00
Matt Wells	eb719849a6	do not core on this dump error	2013-11-13 19:04:22 -08:00
Matt Wells	a31b13ad61	fix a few bugs.	2013-11-13 13:27:22 -08:00
Matt Wells	3afac4812d	fix bug of trying to del/reset coll while disable writing was engaged. we already had it check to see if tree was saving, but not if writes were disabled. so it gets ETRYAGAIN and retries later.	2013-11-10 09:40:32 -08:00
Matt Wells	396a88799a	fix bad bug of basically emptying out all our data on auto-save!	2013-11-06 19:49:20 -08:00
Matt Wells	0655160c26	fixed quite a few nasty bugs. collectionrec neg/pos key counting overruns.	2013-11-06 15:44:50 -08:00
Matt Wells	b83dd59913	fix bug when we nuke a collnum from a tree right in the middle of when saving rdb trees in process.cpp.	2013-10-30 12:27:08 -07:00
Matt Wells	2d413578f2	track down some nasty cores. fix for waiting tree out of sync.	2013-10-29 16:37:14 -07:00
Matt Wells	240da39873	Merge branch 'master' into diffbot	2013-10-25 12:32:02 -07:00
Matt Wells	605289e130	fix a couple collection related bugs causing cores in crawlbot.	2013-10-21 11:38:33 -07:00
Matt Wells	54915dc384	fix data corruption in RdbMem buffer when running with threads disabled.	2013-10-19 19:37:29 -07:00
Matt Wells	889583ec4b	now we can reset collection mid stream	2013-10-18 17:49:36 -07:00
Matt Wells	b589b17e63	fix collection resetting.	2013-10-18 15:21:00 -07:00
Matt Wells	57ee9739e5	fix addColl() logic for collectionless rdbs	2013-10-16 14:38:09 -07:00
Matt Wells	fc17521697	Merge branch 'master' into diffbot Conflicts: Hostdb.cpp Makefile PageResults.cpp PageRoot.cpp Pages.cpp Rdb.cpp SearchInput.cpp SearchInput.h Spider.cpp Spider.h XmlDoc.cpp	2013-10-16 14:28:42 -07:00
mwells	3374ce450a	fix a couple catdb generation bugs. MAX_CATIDS violation causing corruption. not saving catdb tree to catdb-saved.dat causing missing catdb recs.	2013-10-12 20:33:04 -07:00
mwells	71d5d05f7c	use catdb/ subdir not cat/ for consistency.	2013-10-04 21:35:13 -06:00
Matt Wells	fe97e08281	move from groups to shards. got rid of annoying groupid bit mask thing.	2013-10-04 16:18:56 -07:00
mwells	10dad2e6bd	fixed bug of not removing spider lock in addSpiderReply() because isAssignedToUs() was there.	2013-10-03 10:45:19 -06:00
mwells	6c2c9f7774	trying to bring back dmoz integration.	2013-10-02 22:34:21 -06:00
mwells	45941e4b2f	fix notification system.	2013-10-01 17:30:06 -06:00
mwells	9730e5f3ef	fix lost spiders from updating crawl info. fix maxspidersperip limitation not being obeyed. removed fakedb. only add "0" time waiting tree keys to waiting tree. only scanSpiderdb() will change their times to a future time or add them to doledb directly. confirmLockAcquisition() will not add to waitingtree if max spiders per ip limit would be exceeded. an incoming spider reply will trigger the add to waiting tree with a time of "0".	2013-09-28 13:12:33 -06:00
mwells	e594af898a	seems like we can spider multiple urls from same ip at same time now.	2013-09-24 09:32:26 -06:00
mwells	8461e33b53	fixed more spider bugs.	2013-09-23 21:26:27 -07:00
mwells	b90ef3de0d	more spider fixes. right after getting lock, use msg12 to remove rec from doledb/doleiptable and add 0 entry to waiting table so doledb is again immediately repopulated with that firstIp so we can spider multiple urls from the same ip at the same time.	2013-09-23 20:25:28 -06:00
mwells	7c31ecff4a	fixed fakedb key support.	2013-09-23 15:16:23 -06:00
mwells	4d33737ac1	fakedb fixes	2013-09-23 08:19:54 -07:00
mwells	83e87fc755	fixed ability to spider multiple urls from the same IP at the same time. Also respects sameIpWait constraints.	2013-09-20 15:42:48 -07:00
Matt Wells	c81f700bf0	get reset collection kinda working.	2013-09-17 14:13:44 -07:00
Matt Wells	4c11265a98	more updates to crawlbot api	2013-09-16 13:59:11 -07:00
Matt Wells	a412c798bf	Merge branch 'master' into diffbot Conflicts: PageResults.cpp	2013-09-13 09:24:28 -07:00
Matt Wells	5dc7bd2ab4	integrate diffbot from svn back into git.	2013-09-13 09:23:18 -07:00
mwells	dcf45dd69d	dump out doledb to disk when it has more than 50,000 negative keys to avoid positive/negative key annihilations delays.	2013-09-08 15:09:54 -06:00
Matt Wells	c58df10155	fix major bug causing spiders not to work.	2013-09-04 11:01:24 -07:00
Matt Wells	9696c7936a	Merge branch 'master' into diffbot	2013-08-30 16:33:00 -07:00
Matt Wells	94e6492916	removed MAX_COLL_RECS so we can have unlimited collections, really limited by the sizeof(collnum_t) only now, which is 16bits, 15bits unsigned, which is the limitation. can always expand this so we can have more than 32k collections.	2013-08-30 16:20:38 -07:00
mwells	900bbf8fba	try to fix the bug of the spiders kinda getting stuck and not spidering to their max potential because of doledb record annihilations at the top of the spider priority queue in spiderdb of SpiderRequests. was causing lots of re-reads in Msg5.cpp of doledb, like over 300 rounds, very slow.	2013-08-29 21:59:02 -06:00
mwells	84fae9a3c6	Fix issue of reading spiderrequests from doledb at the very first key in spiderdb. causes lots of positive/negative key annihilations. we end up re-reading like 300 times in some cases just to get a url from a doledb priority.	2013-08-29 21:16:59 -06:00
Matt Wells	f6e560c1f4	Initial file population.	2013-08-02 13:12:24 -07:00
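
Commit 94e6492916 says the collection count is now bounded only by the width of collnum_t, giving roughly 32k collections. A minimal sketch of that arithmetic follows. It assumes collnum_t is a signed 16-bit integer and that negative values are not used as collection numbers. The helper names (maxCollections, isValidCollnum, toCollnum) are hypothetical and are not taken from the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: collnum_t is a signed 16-bit type, as the commit message implies.
typedef int16_t collnum_t;

// Hypothetical helper: number of distinct non-negative collnum_t values.
// 15 usable bits gives 2^15 = 32768 slots, numbered 0..32767.
long maxCollections() {
    return static_cast<long>(std::numeric_limits<collnum_t>::max()) + 1;
}

// Hypothetical helper: true if n fits in the non-negative collnum_t range.
bool isValidCollnum(long n) {
    return n >= 0 && n < maxCollections();
}

// Hypothetical helper: narrow to collnum_t, asserting the value is in range.
collnum_t toCollnum(long n) {
    assert(isValidCollnum(n));
    return static_cast<collnum_t>(n);
}
```

Under these assumptions, widening the typedef to a 32-bit type would raise the ceiling, which matches the commit's remark that the limit "can always expand".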


129 Commits