Commit Graph

129 Commits

Author SHA1 Message Date
mwells
257a7e3c10 first stab at swapping out collection recs
to save memory when # of collections is high
2014-09-29 11:37:05 -07:00
mwells
10f897e5be use gbsystem() not system() so it can turn off alarms
since it forks.
2014-09-11 05:01:55 -07:00
mwells
38cef7d52e fix # docs and recs bug. 2014-08-28 07:45:43 -07:00
mwells
c3699f0da5 fix bugs found from qa tests. 2014-08-25 14:34:30 -07:00
mwells
e45c0d32f6 Merge branch 'diffbot-testing' into testing 2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
Matt Wells
d0bc187a77 more core fixes. more stability. 2014-07-16 12:52:51 -07:00
Matt Wells
6b797f5023 more core stability fixes. prevent core dumps 2014-07-16 12:07:39 -07:00
Matt Wells
8ac691f324 fix merging getting clogged by so many
collections trying to merge tagdb at once
2014-06-05 21:27:33 -07:00
Matt Wells
4298e4e752 sanity checks for debugging duplicate
titledb file bug.
2014-06-04 12:15:12 -07:00
mwells
45b8bb3421 log msg cleanups 2014-05-11 21:55:44 -07:00
mwells
6e922722da tree repair logic. 2014-05-10 12:32:01 -07:00
mwells
7e1429cc30 more bug fixes 2014-05-10 08:22:26 -07:00
mwells
8e381504a1 fix makeTrashDir() 2014-05-10 08:02:46 -07:00
mwells
2b37f56e4c Merge branch 'diffbot-matt' into testing 2014-05-10 07:56:45 -07:00
mwells
ed816b2c11 a few bug fixes 2014-05-10 07:48:23 -07:00
mwells
81369b786c make trash dir for image thumbs automatically 2014-04-29 17:01:48 -06:00
Matt Wells
d4302e3301 fix core 2014-03-18 11:12:50 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
624c1d4e68 nuke doledb fixes 2014-03-08 10:51:15 -07:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
a6b7e088f5 take out tfndb, unused. fix core
from diffbot url too long.
2014-02-26 01:07:13 -08:00
Matt Wells
32526a9b25 more checksum fixes for json. fixes for
repair/rebuild procedure.
2014-02-16 10:46:41 -08:00
Matt Wells
106077c163 fix spiderrequest deduping some more 2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937 more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
980d63632a more msg5 re-read fixes.
stop re-reading if increasing minrecsizes did nothing.
fix tight merges so they work over all colls.
fix merge counting to be fast and not loop over
all rdbbases which could be thousands.
add num mirrors to rebalance.txt.
fix updateCrawlInfo to wait for all replies. critical error!
2014-01-16 13:38:22 -08:00
Matt Wells
f8c2329bd2 rebalancer fixes 2014-01-15 15:42:59 -08:00
Matt Wells
8a49e87a61 got code with shard rebalancing compiling.
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
c0447de3a1 watch out for NULL "base" after a coll delete. 2013-12-29 01:32:40 -08:00
Matt Wells
d8a9a3f4e3 fix parm sync code some more.
added localhosts.conf  to the 'gb install' dist.
2013-12-27 14:00:37 -08:00
Matt Wells
048b715962 if coll is deleted or reset in the middle of a dump
or merge then stop the dump/merge with ENOCOLLREC
error. avoid calling "base->" functions since it
could be NULL if deleted.
2013-12-25 17:12:09 -08:00
Matt Wells
3f19ece776 parmdb updates 2013-12-16 17:07:15 -08:00
Matt Wells
617a0ff76e parmdb fixes 2013-12-16 16:04:43 -08:00
Matt Wells
6c652c1cc6 more parmdb fixes 2013-12-16 15:39:24 -08:00
mwells
76bb3d05e1 clean up logging so i can see what's going on 2013-12-10 16:41:30 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
mwells
f2d5661965 parmdb overhaul. support collection add/del
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
Matt Wells
06edfddf31 a bunch of bug fixes, mostly spider related.
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
fe1a7d1a75 rdbbase not fully resetting? it was
trying to dump to coll directories that
had been moved to trash folder.
and printing out "deleted from under us".
at least it was not corrupting data in RdbMem
this time because i added m_dumpErrno logic.
2013-11-15 09:01:58 -08:00
Matt Wells
eb719849a6 do not core on this dump error 2013-11-13 19:04:22 -08:00
Matt Wells
a31b13ad61 fix a few bugs. 2013-11-13 13:27:22 -08:00
Matt Wells
3afac4812d fix bug of trying to del/reset coll while
disable writing was engaged. we already
had it check to see if tree was saving,
but not if writes were disabled. so it
gets ETRYAGAIN and retries later.
2013-11-10 09:40:32 -08:00
Matt Wells
396a88799a fix bad bug of basically emptying out all our data
on auto-save!
2013-11-06 19:49:20 -08:00
Matt Wells
0655160c26 fixed quite a few nasty bugs.
collectionrec neg/pos key counting overruns.
2013-11-06 15:44:50 -08:00
Matt Wells
b83dd59913 fix bug when we nuke a collnum
from a tree right in the middle of when
saving rdb trees in process.cpp.
2013-10-30 12:27:08 -07:00
Matt Wells
2d413578f2 track down some nasty cores. fix
for waiting tree out of sync.
2013-10-29 16:37:14 -07:00
Matt Wells
240da39873 Merge branch 'master' into diffbot 2013-10-25 12:32:02 -07:00
Matt Wells
605289e130 fix a couple collection related bugs
causing cores in crawlbot.
2013-10-21 11:38:33 -07:00
Matt Wells
54915dc384 fix data corruption in RdbMem buffer
when running with threads disabled.
2013-10-19 19:37:29 -07:00
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
57ee9739e5 fix addColl() logic for collectionless rdbs 2013-10-16 14:38:09 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
mwells
3374ce450a fix a couple catdb generation bugs.
MAX_CATIDS violation causing corruption.
not saving catdb tree to catdb-saved.dat
causing missing catdb recs.
2013-10-12 20:33:04 -07:00
mwells
71d5d05f7c use catdb/ subdir not cat/ for consistency. 2013-10-04 21:35:13 -06:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
10dad2e6bd fixed bug of not removing spider lock
in addSpiderReply() because isAssignedToUs()
was there.
2013-10-03 10:45:19 -06:00
mwells
6c2c9f7774 trying to bring back dmoz integration. 2013-10-02 22:34:21 -06:00
mwells
45941e4b2f fix notification system. 2013-10-01 17:30:06 -06:00
mwells
9730e5f3ef fix lost spiders from updating crawl info.
fix maxspidersperip limitation not being obeyed.
removed fakedb.
only add "0" time waiting tree keys to waiting tree.
only scanSpiderdb() will change their times to
a future time or add them to doledb directly.
confirmLockAcquisition() will not add to waitingtree
if max spiders per ip limit would be exceeded.
an incoming spider reply will trigger the add to
waiting tree with a time of "0".
2013-09-28 13:12:33 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
8461e33b53 fixed more spider bugs. 2013-09-23 21:26:27 -07:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
7c31ecff4a fixed fakedb key support. 2013-09-23 15:16:23 -06:00
mwells
4d33737ac1 fakedb fixes 2013-09-23 08:19:54 -07:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
Matt Wells
c81f700bf0 get reset collection kinda working. 2013-09-17 14:13:44 -07:00
Matt Wells
4c11265a98 more updates to crawlbot api 2013-09-16 13:59:11 -07:00
Matt Wells
a412c798bf Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
2013-09-13 09:24:28 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
mwells
dcf45dd69d dump out doledb to disk when it has more than
50,000 negative keys to avoid positive/negative
key annihilation delays.
2013-09-08 15:09:54 -06:00
Matt Wells
c58df10155 fix major bug causing spiders not to work. 2013-09-04 11:01:24 -07:00
Matt Wells
9696c7936a Merge branch 'master' into diffbot 2013-08-30 16:33:00 -07:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited by the sizeof(collnum_t) only now,
which is 16bits, 15bits unsigned, which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
mwells
900bbf8fba try to fix the bug of the spiders kinda getting
stuck and not spidering to their max potential
because of doledb record annihilations at the
top of the spider priority queue in spiderdb of
SpiderRequests. was causing lots of re-reads
in Msg5.cpp of doledb, like over 300 rounds,
very slow.
2013-08-29 21:59:02 -06:00
mwells
84fae9a3c6 Fix issue of reading spiderrequests from
doledb at the very first key in spiderdb.
causes lots of positive/negative key annihilations.
we end up re-reading like 300 times in some
cases just to get a url from a doledb priority.
2013-08-29 21:16:59 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00