Commit Graph

48 Commits

Author SHA1 Message Date
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
10f897e5be use gbsystem() not system() so it can turn off alarms
since it forks.
2014-09-11 05:01:55 -07:00
mwells
d9ae010371 shard gbfacetstr:gbxpathsitehash123456 terms by termid for speed.
got them working again multicasting a msg 0x39 to the appropriate shard.
set special msg39request flag for better performance for those guys.
2014-07-07 12:32:27 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
e9ff8c48d8 try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
mwells
a09d4cd723 Merge branch 'master' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Pages.cpp
	XmlDoc.cpp
	gb.conf
2014-06-20 09:35:39 -07:00
mwells
494c43d5dd fix gb execution in main.cpp::getcwd2() function. 2014-06-19 06:03:11 -07:00
mwells
584af942d4 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Make.depend
	Parms.cpp
2014-06-16 20:42:28 -07:00
Matt Wells
549f8eb5bc fix bug in hosts.conf when expanding working dir. 2014-06-16 11:32:10 -07:00
mwells
4a2717a88f Merge branch 'diffbot-testing' into diffbot-matt 2014-06-09 12:42:54 -07:00
mwells
d57ce8a2df simplify compilation more. remove clones() 2014-06-07 14:26:11 -07:00
mwells
a1f1daad16 Merge branch 'master' into diffbot-matt
Conflicts:
	Spider.cpp
2014-06-03 11:41:46 -07:00
mwells
a811462d5f spider proxy stuff compiles now 2014-05-30 15:05:00 -07:00
Matt Wells
b0f9227bbc path fixes for gb startup 2014-05-25 10:28:13 -04:00
Matt Wells
037067170c fix for symlinks in host paths in hosts.conf 2014-05-12 20:50:11 -07:00
Matt Wells
5f7bbe7523 fix diffbot smoke tests. do not index spider replies
for custom crawls.
2014-05-12 15:14:11 -07:00
mwells
a9dc18c866 fix more bugs. 2014-05-11 19:44:41 -07:00
mwells
c3a1c674c3 now we run gb without a hostid.
we use its path and the local ip to identify its
hostid # in the hosts.conf.
2014-05-11 19:36:24 -07:00
mwells
463dc2159f more make install updates 2014-05-11 17:02:15 -07:00
mwells
2b37f56e4c Merge branch 'diffbot-matt' into testing 2014-05-10 07:56:45 -07:00
mwells
f19014cc6c fixed missing / 2014-05-10 06:39:36 -07:00
Matt Wells
9edd5c8264 thumbnail generation support back in. 2014-04-24 10:13:45 -07:00
mwells
72dc660598 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	HttpRequest.h
	PageBasic.cpp
	coll.main.0/coll.conf
2014-04-09 11:18:39 -07:00
mwells
1b5c6a6278 create hosts.conf into cwd if not there.
pretty up logging system.
update admin.html
2014-04-06 21:12:52 -07:00
mwells
23e5a94ddf move log file in the binary itself now. 2014-04-06 14:02:51 -07:00
Matt Wells
a6b7e088f5 take out tfndb, unused. fix core
from diffbot url too long.
2014-02-26 01:07:13 -08:00
Matt Wells
e8a6d8f345 fix another core from freeing wrong byte sized
crawl info reply.
2014-01-30 20:16:41 -08:00
Matt Wells
3a6a271dd9 make crawl sync bug fixes.
fix Puz crawl from dying out on host 9
because spider reply did not resuscitate waiting
tree for its ip.
fix mike's zola crawl with a repeat of 3 days
from not incrementing the round because it had
maxrounds 0, which means to ignore... assume 0
means to ignore now. send out 0xc1 crawl info
requests to even dead hosts so we can at least use
their last known good info.
2014-01-25 13:47:03 -08:00
Matt Wells
e3f769dffe fixes for sudden revitalization of dead crawls. 2014-01-25 11:03:15 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
d091c7e959 fix hostsinagreement bug 2014-01-14 11:24:32 -08:00
Matt Wells
8a49e87a61 got code with shard rebalancing compiling.
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
f64b53bfb3 almost done with rebalancing code 2014-01-10 14:12:58 -08:00
Matt Wells
141a76c322 try localhosts.conf before hosts.conf 2013-12-26 09:32:22 -08:00
Matt Wells
f7e7acb398 minor log msg updates.
updated admin.html to give some performance and
storage capacity info.
2013-12-09 23:16:24 -07:00
Matt Wells
fb7096dc5d num-mirrors: updates 2013-10-24 14:59:35 -07:00
Matt Wells
f65a2fd625 support num-mirrors: instead of index-splits:
directive.
2013-10-24 14:32:56 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
ddbacab12f fix shard mapping of spiderdb. 2013-10-08 16:35:37 -07:00
Matt Wells
a76e8e42c3 fix json parsing oopsy. 2013-10-08 16:28:25 -07:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
6c2c9f7774 trying to bring back dmoz integration. 2013-10-02 22:34:21 -06:00
mwells
9730e5f3ef fix lost spiders from updating crawl info.
fix maxspidersperip limitation not being obeyed.
removed fakedb.
only add "0" time waiting tree keys to waiting tree.
only scanSpiderdb() will change their times to
a future time or add them to doledb directly.
confirmLockAcquisition() will not add to waitingtree
if max spiders per ip limit would be exceeded.
an incoming spider reply will trigger the add to
waiting tree with a time of "0".
2013-09-28 13:12:33 -06:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
4d33737ac1 fakedb fixes 2013-09-23 08:19:54 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
mwells
be7aab78b7 Fixed bugs with running a proxy.
Added more comments into hosts.conf.
2013-08-08 14:41:38 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00