200 Commits

Author SHA1 Message Date
mwells
00910e36d7 do not send "children" docs (robots.txt roots etc)
to diffbot.
2013-09-28 10:10:44 -06:00
mwells
321f5cf938 quite a few fixes. something still
overwrites CollectionRec::m_overflow/m_overflow2...
2013-09-27 21:00:40 -06:00
mwells
7cdb3d6f9c fix infinite loop from json parsing and
fix some core dumps.
2013-09-27 17:52:36 -06:00
mwells
83ec4f6de7 ignore a couple cores until we figure out
what happened
2013-09-27 14:25:30 -06:00
mwells
cb1b5f3fe4 fixed diffbot api display bug 2013-09-27 13:39:13 -06:00
mwells
e3bccfa706 use robots txt radio button fixes. 2013-09-27 12:17:22 -06:00
mwells
e7377d72ab fix robots.txt switch. fix collection rec saving.
require collname explicitly for injecturl urldata.
2013-09-27 11:39:23 -06:00
mwells
ac72e31f35 fixed a few more things. 2013-09-27 11:04:46 -06:00
mwells
eb3f657411 fixed distributed support for adding/deleting/resetting
collections. now need to specify collection name
like &addcoll=mycoll when adding a coll.
2013-09-27 10:49:24 -06:00
mwells
f043cc67e4 Merge branch 'master' into diffbot
Conflicts:
	Spider.cpp
2013-09-26 22:43:27 -06:00
mwells
fd081478de fix crawlbot to work on a distributed network
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier if a host went down, then it would
get the negative/positive colldb parm keys later.
so it could sync up on all your operations as long
as all your operations are expressed in terms of adding and deleting
database key/value pairs.
2013-09-26 22:41:05 -06:00
mwells
3a4e5da997 make "id" equivalent to "c". print out "id"
not "name" in the json output collection objects.
perhaps should be "collectionId" not just "id".
2013-09-26 15:32:11 -06:00
mwells
16ead85cfd added support for adding an alias to a collection
using &alias=xxxxx
2013-09-26 14:50:34 -06:00
mwells
e3c4ce189a fixed cores. fixed json. 2013-09-26 14:28:04 -06:00
mwells
8fde0c5343 added support for serialize/deserialize
of TYPE_SAFEBUF parms over distributed network.
2013-09-26 08:56:14 -06:00
mwells
f252dd9189 minor crawlbot gui updates 2013-09-25 19:41:20 -06:00
mwells
65df6dfe52 added some handy links 2013-09-25 18:00:16 -06:00
mwells
0b5a45e8aa more api updates. added m_avoidSpiderLinks to
spider request so urldata=xxxx can turn link
spidering off. probably desirable so it's the default.
so &spiderlinks=[0|1] applies to urldata as well
as injecturl=
2013-09-25 17:51:43 -06:00
mwells
01fa9fe383 make it proper json output 2013-09-25 17:12:01 -06:00
mwells
6fca32e4b5 minor oops fix. 2013-09-25 17:06:01 -06:00
mwells
5fbf323cb5 json api now shows all collections
and their relevant parms and stats
for /crawlbot?token=xxx&format=json
2013-09-25 16:59:31 -06:00
mwells
d14832f93e new json api code compiles. need to test now. 2013-09-25 16:04:16 -06:00
mwells
0039b23064 almost done with json api. 2013-09-25 15:37:20 -06:00
mwells
50ba93991b minor ui changes 2013-09-25 13:09:02 -06:00
mwells
9dc9114902 added stat page for all collections. 2013-09-25 12:57:07 -06:00
mwells
0fe0147913 fix invisible columns in url filters table. 2013-09-25 12:24:13 -06:00
mwells
1d92004e06 fix spider flow debug msgs 2013-09-25 12:07:11 -06:00
mwells
40192249f9 spider speedups and fixes. 2013-09-25 11:58:03 -06:00
Matt Wells
e34afd21ea fix bug of possibly not removing some locks 2013-09-25 09:28:35 -07:00
Matt Wells
a687380aeb fix a bug of not reading enough spiderdb
records for a given "ip" because short reads
were causing us to bail out early. still not
sure as to the cause of the short reads.
2013-09-24 20:48:48 -07:00
Matt Wells
fbd853fdf7 fix long-standing spider bug causing some
ip queues to not get fully spidered.
2013-09-24 20:44:55 -07:00
mwells
b16d8519fc more spider fixes. still need more speedups
when spidering multiple spiders on same ip.
2013-09-24 16:40:14 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
8461e33b53 fixed more spider bugs. 2013-09-23 21:26:27 -07:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
7c31ecff4a fixed fakedb key support. 2013-09-23 15:16:23 -06:00
mwells
4d33737ac1 fakedb fixes 2013-09-23 08:19:54 -07:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
mwells
05400a0c25 updated spider code documentation. 2013-09-20 11:19:24 -07:00
Matt Wells
fbd62cecba updated compilation instructions. need
to apt-get install gcc-multilib.
2013-09-20 10:06:01 -07:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
a3ea867305 update crawlbot api. 2013-09-18 17:13:36 -07:00
Matt Wells
022caeec04 use -diffbotxyz%li as a more unique appendage.
show token on crawlbot page.
2013-09-18 17:05:41 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00
Matt Wells
8de246d9c4 only show urls being spidered from your coll 2013-09-18 16:29:47 -07:00
Matt Wells
3bdd28ab1d fix spider bug 2013-09-18 16:17:08 -07:00
Matt Wells
7fdbd0f66a delete spider coll when deleting coll 2013-09-18 15:36:30 -07:00
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00
Matt Wells
70ff54ce03 hide the parms that might scare users away
in the url filters.
2013-09-18 14:27:59 -07:00