Commit Graph

31 Commits

Author SHA1 Message Date
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
cf6695f625 speed up getNumTotalRecs() by caching
it basically for 2 seconds since pingserver.cpp
calls it all the time.
2014-02-25 12:14:51 -08:00
Matt Wells
ca4aafa8a6 added host disk usage redbox and stats. 2014-02-12 09:47:44 -07:00
Matt Wells
f420bd2769 checkpoint 2014-02-09 15:09:48 -07:00
Matt Wells
4346fcee29 added recovery mode display in hosts table 2014-02-01 10:16:46 -08:00
Matt Wells
2faba0efd1 fix repeat rounds sticking bug
by adding PF_REBUILDURLFILTERS flag to
spiderroundastarttime parm
2014-01-17 17:17:10 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
9da106e7ca added emergency msg box on all admin pages 2014-01-11 20:35:13 -08:00
Matt Wells
eed606601e added emergency msg box on all admin pages 2014-01-11 20:14:44 -08:00
Matt Wells
8a49e87a61 got code with shard rebalancing compiling.
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
f64b53bfb3 almost done with rebalancing code 2014-01-10 14:12:58 -08:00
Matt Wells
a76f4c6974 just POST a full request for webhook now
so we can do application/json content type
2013-11-07 14:20:15 -08:00
Matt Wells
3e4db4f1bc show all crawl details in url webhook
notification in the post body.
2013-11-07 13:59:43 -08:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
20052e34fe made webhook return the crawl name
and status as X- fields in the mime.
2013-10-28 22:03:10 -07:00
Matt Wells
a5a7ab2434 added spider status msg to json output
to indicate if spider has hit a limit.
no longer disable spiders in xmldoc.cpp
when a crawl/process limit is hit. just
check for limit when spidering urls in
spider.cpp and if it is hit set
CollectionRec::m_spiderStatus[Msg] and
send email from there.
Added maxCrawlRounds parm.
2013-10-23 11:40:30 -07:00
Matt Wells
0e4d96b3f8 added "seeds" to json reply. store seed urls
(and dedup them) in collrec. fixed some respidering
issues. any time we re-enter url filters
then rebuild the waiting tree.
2013-10-21 17:35:14 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
a288217e9f a few bug fixes 2013-10-17 18:59:00 -07:00
mwells
ea859ef685 added 'gb emailmandrill' for testing.
got it working. it posts json, not url encoded.
2013-10-09 17:35:51 -06:00
mwells
c1c5c4e3d0 send notifications if no urls available
for immediate spidering.
2013-10-09 15:24:35 -06:00
Matt Wells
283ec2f6b4 email and webhook alerts when spider runs out of urls
to spider.
2013-10-09 11:42:56 -07:00
Matt Wells
3702a05d64 add sendEmailThroughMandrill() to send
through mail chimp http api.
2013-10-08 18:01:38 -07:00
Matt Wells
fe97e08281 move from groups to shards. got rid of annoying
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
259ec08e09 email hook now works but you have to
supply the IP address of your sendmail
server and it has to allow email
forwarding from host #0's IP. specify
the sendmail server's IP in the Master
Controls.
2013-10-02 09:36:44 -06:00
mwells
45941e4b2f fix notification system. 2013-10-01 17:30:06 -06:00
mwells
3fecb3eb1f got email and url notification code compiling.
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
Matt Wells
a412c798bf Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
2013-09-13 09:24:28 -07:00
mwells
34b6d3e74a fixed some cores. brought in fixes from
old repo.
2013-09-08 16:16:13 -06:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited only by sizeof(collnum_t) now,
which is 16 bits (15 bits unsigned); that is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00