Commit Graph

46 Commits

Author SHA1 Message Date
Matt Wells
edbd61b0c5 thread fixes. if pthread_create fails then
keep thread queue and just return. will try to
relaunch later. do not count delete keys towards
shard rebalance count.
2014-03-15 20:07:02 -07:00
Matt Wells
68a14de031 security admin fixes 2014-02-12 00:36:09 -07:00
Matt Wells
c9be18615c more parm saving fixes 2014-02-10 22:04:22 -07:00
Matt Wells
ecdd167d9b code checkpoint 2014-02-09 16:41:43 -07:00
Matt Wells
dabd691626 basic admin controls page structure 2014-02-08 00:34:45 -07:00
Matt Wells
239811b024 take out confusing function no longer used 2014-01-28 11:10:59 -08:00
Matt Wells
8a9b1f7a19 added diffbot retry rules.
added maxTotalSpiders parm for
all colls to follow.
tried to fix msg 0x00 socket jam up.
2014-01-22 19:57:38 -08:00
Matt Wells
443bb26f01 disk page cache back on 2014-01-21 19:03:47 -08:00
Matt Wells
33c5d9c07f a lot of times rdb tree has invalid collection
numbers in it so fix our counting algo in case
the collection rec no longer exists!
2014-01-21 19:01:44 -08:00
Matt Wells
e6eb9003b5 more formatting 2014-01-19 01:09:38 -08:00
Matt Wells
1d6ba52dcd list collections in sidebar. 2014-01-09 21:13:41 -08:00
Matt Wells
6660dca57c default parm updates 2014-01-09 20:07:19 -08:00
Matt Wells
c596b6c5a6 default gb.conf update 2014-01-09 19:59:02 -08:00
Matt Wells
d76e7a9c8e highlight non-default value parms. 2014-01-09 19:37:17 -08:00
Matt Wells
2ac8ff2952 compile regex so it's case dependent 2013-12-23 09:30:35 -08:00
Matt Wells
6f2e552bcd fix core in linked list of msg13requests in
case one gets freed
2013-12-20 11:26:46 -08:00
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
3cc300bf03 spider log debug msg fix.
boost max cpu threads to 10, seems
to have many cores usually.
2013-11-22 14:17:10 -08:00
Matt Wells
43e40208b8 Merge branch 'master' into diffbot
Conflicts:
	SafeBuf.cpp
	SafeBuf.h
	SearchInput.cpp
	XmlDoc.cpp
2013-11-20 15:51:58 -08:00
Matt Wells
245264c2c9 fix respider frequency bug. 2013-10-21 15:06:23 -07:00
mwells
11897f09da turn off log debug msg. 2013-10-16 16:24:08 -06:00
mwells
6052f60c48 speed up dirty word detection since we added a bunch
of new dirty words/phrases.
2013-10-15 22:41:31 -07:00
mwells
2bb8b818d6 more bug fixes with notification system. 2013-10-09 16:28:15 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
mwells
259ec08e09 email hook now works but you have to
supply the IP address of your sendmail
server and it has to allow email
forwarding from host #0's IP. specify
the sendmail server's IP in the Master
Controls.
2013-10-02 09:36:44 -06:00
mwells
20952eedbe customizable api list in url filters 2013-09-30 09:18:22 -06:00
mwells
0edcbcc7d8 printlocktable() function 2013-09-29 10:20:14 -06:00
mwells
9bf8bf7712 add spider reply even on g_errno now with an error
code of EINTERNAL error in the spider reply.
no longer just sit on the lock. this was blocking
an entire ip when just lock sitting for 3 hrs.
and only do read rate timeouts if there was at least
one byte read. this was causing diffbot reply to
read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
mwells
c216f7b2a7 use 48 bit url hash for lock keys again.
query reindex recs can just use their
prob docids as fake uh48s. we need it so we
can avoid the fakedb record and just use
the spider reply to trigger a 5-second
lock expiration. a little simpler. added
logdebugspiderwait for waiting tree debugging.
fixed per ip spider limiting. fixed losing
spiders down blackhole from updateCrawlInfo.
check UrlLock::m_confirmed when counting outstanding
spiders on one ip since may have a lock on one host
but not get granted on all! it calls
confirmLockAcquisition() when it gets fully granted
the lock so it can set UrlLock::m_confirmed.
2013-09-29 00:09:46 -06:00
mwells
5884951190 only do certain things if running
on a machine in Matt Wells' datacenter.
like fan switching based on temps,
or printing seo links. made seo functions
weak overridable placeholder stubs so if
seo.o is linked in it will override.
include seo.o object if seo.cpp file exists
for automatic seo module building and linking.
2013-09-28 13:43:56 -06:00
mwells
e3c4ce189a fixed cores. fixed json. 2013-09-26 14:28:04 -06:00
mwells
0039b23064 almost done with json api. 2013-09-25 15:37:20 -06:00
mwells
1d92004e06 fix spider flow debug msgs 2013-09-25 12:07:11 -06:00
mwells
8461e33b53 fixed more spider bugs. 2013-09-23 21:26:27 -07:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
7c31ecff4a fixed fakedb key support. 2013-09-23 15:16:23 -06:00
mwells
4d33737ac1 fakedb fixes 2013-09-23 08:19:54 -07:00
Matt Wells
6af02119a1 use cookies to display url filters table. 2013-09-18 13:50:55 -07:00
Matt Wells
487d3f0a0e fix url filters bugs. 2013-09-18 11:02:09 -07:00
Matt Wells
78a334198b Merge branch 'master' into diffbot 2013-09-16 09:05:37 -07:00
Matt Wells
e6f87f5049 do not send email alerts to sysadmin@gigablast. 2013-09-16 08:10:18 -07:00
Matt Wells
6b330da240 cleanup warnings in log. 2013-09-13 14:37:35 -07:00
Matt Wells
19056fc3f2 show "processed" instead of "matched".
other fixes for spider stats. add
new crawl stats. attempts and successes.
2013-09-13 11:51:55 -07:00
mwells
37a6549a58 updates to developer.html developer
documentation. removed a lot of obsolete
information. still needs more work.
2013-08-21 13:09:55 -06:00
mwells
24af21394d dns ip fix in gb.conf. 2013-08-19 15:25:37 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00