59 Commits

Author SHA1 Message Date
Matt Wells
647d004c04 fix core from sending a url alert, then customer deleting
collection before email alert reply comes back. then it
comes back to a delete collrec and cores.
2015-09-08 15:57:46 -07:00
Matt Wells
74ec812959 try to fix core from adding a file that already exists.
just return an error now. hopefully merge will try again later.
also core if you try to write recs to an rdbmap that
has already had its memory footprint reduced so we can find
that overrun bug.
2015-08-21 14:00:40 -07:00
Matt
a1ed368d82 bring back max mem control into master controls.
it's useful to limit per process mem usage to prevent
oom killer because we can't save if we get killed.
overhaul diskpagecache to just use rdbcache. much simpler
and faster, but disabled for now until debugged more.
reduce min files to merge for crawlbot collections so
they stay more tightly merged to conserve fds and mem.
improved logDebugDisk msgs.
overhauled File.cpp fd pool. now it is way faster and
doesn't use any extra mem. much simpler too. although
could be sped up a little by using a linked list, but
probably is not significant enough to warrant doing right now.
increase mem ptr table from 3M to 8M slots. should really make
dynamic though. fix core from null msg20s[0]->m_r.
only call attemptMergeAll once every 60 seconds really.
do not attempt merge if already merging.
2015-08-14 12:58:54 -06:00
Matt
0970975a57 tested auto proxy use and auto spider (non-proxy) backoff to
3 second crawldelay successfully on the stamps site.
2015-04-30 15:31:09 -07:00
Matt
4a43e1387e better fixes for core from sig alarms 2015-04-13 10:28:43 -06:00
Matt Wells
97d3b185c1 just use INCOMING udp slots/sockets for jam detection.
this will highlight the slow nodes better.
2015-04-08 15:52:43 -06:00
Matt
2ce107e4be keep track of how many times the host exited/cored as an exponent
to the 'x' in the hosts table. this way we can detect hosts that
have restarted many times and fix them.
2015-04-01 16:28:58 -06:00
Matt
76ec7f3a4a add # of tcp connections to hosts table 2015-02-03 14:14:17 -08:00
Matt
fe14079ffe show shards with excessive udp slots to
detect jam up.
2015-01-22 14:47:30 -07:00
Matt Wells
51cda3bac0 fix malformed http reply header 2015-01-15 10:40:23 -08:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt Wells
e5b81cfb04 fix ping age being negative in hosts table bug. 2015-01-05 15:19:46 -08:00
Matt Wells
d57f2264c4 more indicator fixes 2014-12-17 15:11:49 -08:00
Matt Wells
f52e163fb0 fix a couple bugs.
added out of sync indicator.
2014-12-17 14:28:32 -08:00
Matt
465d30e0ee fix ping bug. 2014-12-17 10:43:00 -08:00
Matt Wells
2fd511f002 updates 2014-12-16 17:09:25 -08:00
Matt Wells
d4179634a1 crc fixes 2014-12-16 16:38:54 -08:00
Matt
730b131bbf added new indicators so we can make gb more stable.
now hosts table reports # ooms, disk read corruptions,
closed sockets from overloads, and # of outstanding
spiders. made ping request a class so we can easily add
new indicators.
2014-12-16 16:22:50 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
8cf5bdc8a2 force gb to recompile version every time
you do a make, so version is updated automatically.
2014-09-19 12:23:40 -07:00
Matt Wells
67ee615d1d log note to update version if differences detected. 2014-09-19 09:35:35 -07:00
mwells
9d69c1362d added proper version computation to gb 2014-09-19 10:25:48 -06:00
mwells
caee238c46 fixes to make it easier to compile on mac os x. 2014-08-28 12:55:02 -07:00
mwells
628fe2336f make code compile cleaner. 2014-06-07 14:11:12 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
cf6695f625 speed up getNumTotalRecs() by caching
it basically for 2 seconds since pingserver.cpp
calls it all the time.
2014-02-25 12:14:51 -08:00
Matt Wells
ca4aafa8a6 added host disk usage redbox and stats. 2014-02-12 09:47:44 -07:00
Matt Wells
f420bd2769 checkpoint 2014-02-09 15:09:48 -07:00
Matt Wells
4346fcee29 added recovery mode display in hosts table 2014-02-01 10:16:46 -08:00
Matt Wells
2faba0efd1 fix repeat rounds sticking bug
by adding PF_REBUILDURLFILTERS flag to
spiderroundastarttime parm
2014-01-17 17:17:10 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
9da106e7ca added emergency msg box on all admin pages 2014-01-11 20:35:13 -08:00
Matt Wells
eed606601e added emergency msg box on all admin pages 2014-01-11 20:14:44 -08:00
Matt Wells
8a49e87a61 got code with shard rebalancing compiling.
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
f64b53bfb3 almost done with rebalancing code 2014-01-10 14:12:58 -08:00
Matt Wells
a76f4c6974 just POST a full request for webhook now
so we can do application/json content type
2013-11-07 14:20:15 -08:00
Matt Wells
3e4db4f1bc show all crawl details in url webhook
notification in the post body.
2013-11-07 13:59:43 -08:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
20052e34fe made webhook return the crawl name
and status as X- fields in the mime.
2013-10-28 22:03:10 -07:00
Matt Wells
a5a7ab2434 added spider status msg to json output
to indicate if spider has hit a limit.
no longer disable spiders in xmldoc.cpp
when a crawl/process limit is hit. just
check for limit when spidering urls in
spider.cpp and if it is hit set
CollectionRec::m_spiderStatus[Msg] and
send email from there.
Added maxCrawlRounds parm.
2013-10-23 11:40:30 -07:00
Matt Wells
0e4d96b3f8 added "seeds" to json reply. store seed urls
(and dedup them) in collrec. fixed some respidering
issues. any time we re-enter url filters
then rebuild the waiting tree.
2013-10-21 17:35:14 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
a288217e9f a few bug fixes 2013-10-17 18:59:00 -07:00
mwells
ea859ef685 added 'gb emailmandrill' for testing.
got it working. it posts json, not url encoded.
2013-10-09 17:35:51 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
Matt Wells
283ec2f6b4 email and webhook alerts when spider runs out of urls
to spider.
2013-10-09 11:42:56 -07:00