Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
...
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
cf6695f625
speed up getNumTotalRecs() by caching
...
it basically for 2 seconds since pingserver.cpp
calls it all the time.
2014-02-25 12:14:51 -08:00
Matt Wells
ca4aafa8a6
added host disk usage redbox and stats.
2014-02-12 09:47:44 -07:00
Matt Wells
f420bd2769
checkpoint
2014-02-09 15:09:48 -07:00
Matt Wells
4346fcee29
added recovery mode display in hosts table
2014-02-01 10:16:46 -08:00
Matt Wells
2faba0efd1
fix repeat rounds sticking bug
...
by adding PF_REBUILDURLFILTERS flag to
spiderroundastarttime parm
2014-01-17 17:17:10 -08:00
Matt Wells
16f8af0d57
added awesome streaming mode support
...
to tcpserver.cpp for sending back
json objects as we get them from shards,
and in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
9da106e7ca
added emergency msg box on all admin pages
2014-01-11 20:35:13 -08:00
Matt Wells
eed606601e
added emergency msg box on all admin pages
2014-01-11 20:14:44 -08:00
Matt Wells
8a49e87a61
got code with shard rebalancing compiling.
...
now we store a "sharded by termid" bit in posdb
key for checksum, etc. keys that are not sharded
by docid. saves having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
f64b53bfb3
almost done with rebalancing code
2014-01-10 14:12:58 -08:00
Matt Wells
a76f4c6974
just POST a full request for webhook now
...
so we can do application/json content type
2013-11-07 14:20:15 -08:00
Matt Wells
3e4db4f1bc
show all crawl details in url webhook
...
notification in the post body.
2013-11-07 13:59:43 -08:00
Matt Wells
adf4d258ae
better crawl status reporting.
...
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
20052e34fe
made webhook return the crawl name
...
and status as X- fields in the mime.
2013-10-28 22:03:10 -07:00
Matt Wells
a5a7ab2434
added spider status msg to json output
...
to indicate if spider has hit a limit.
no longer disable spiders in xmldoc.cpp
when a crawl/process limit is hit. just
check for limit when spidering urls in
spider.cpp and if it is hit set
CollectionRec::m_spiderStatus[Msg] and
send email from there.
Added maxCrawlRounds parm.
2013-10-23 11:40:30 -07:00
Matt Wells
0e4d96b3f8
added "seeds" to json reply. store seed urls
...
(and dedup them) in collrec. fixed some respidering
issues. any time we re-enter url filters
then rebuild the waiting tree.
2013-10-21 17:35:14 -07:00
Matt Wells
b589b17e63
fix collection resetting.
2013-10-18 15:21:00 -07:00
Matt Wells
a288217e9f
a few bug fixes
2013-10-17 18:59:00 -07:00
mwells
ea859ef685
added 'gb emailmandrill' for testing.
...
got it working. it posts json, not url encoded.
2013-10-09 17:35:51 -06:00
mwells
c1c5c4e3d0
send notifications if no urls available
...
for immediate spidering.
2013-10-09 15:24:35 -06:00
Matt Wells
283ec2f6b4
email and webhook alerts when spider runs out of urls
...
to spider.
2013-10-09 11:42:56 -07:00
Matt Wells
3702a05d64
add sendEmailThroughMandrill() to send
...
through the mailchimp http api.
2013-10-08 18:01:38 -07:00
Matt Wells
fe97e08281
move from groups to shards. got rid of annoying
...
groupid bit mask thing.
2013-10-04 16:18:56 -07:00
mwells
259ec08e09
email hook now works but you have to
...
supply the IP address of your sendmail
server and it has to allow email
forwarding from host #0's IP. specify
the sendmail server's IP in the Master
Controls.
2013-10-02 09:36:44 -06:00
mwells
45941e4b2f
fix notification system.
2013-10-01 17:30:06 -06:00
mwells
3fecb3eb1f
got email and url notification code compiling.
...
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
Matt Wells
a412c798bf
Merge branch 'master' into diffbot
...
Conflicts:
PageResults.cpp
2013-09-13 09:24:28 -07:00
mwells
34b6d3e74a
fixed some cores. brought in fixes from
...
old repo.
2013-09-08 16:16:13 -06:00
Matt Wells
94e6492916
removed MAX_COLL_RECS so we can have unlimited
...
collections, really limited only by sizeof(collnum_t) now,
which is a 16-bit type with 15 usable bits, hence the 32k limit.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00