3838 Commits

Author SHA1 Message Date
mwells
16ead85cfd added support for adding an alias to a collection
using &alias=xxxxx
2013-09-26 14:50:34 -06:00
mwells
e3c4ce189a fixed cores. fixed json. 2013-09-26 14:28:04 -06:00
mwells
8fde0c5343 added support for serialize/deserialize
of TYPE_SAFEBUF parms over distributed network.
2013-09-26 08:56:14 -06:00
mwells
f252dd9189 minor crawlbot gui updates 2013-09-25 19:41:20 -06:00
mwells
65df6dfe52 added some handy links 2013-09-25 18:00:16 -06:00
mwells
0b5a45e8aa more api updates. added m_avoidSpiderLinks to
spider request so urldata=xxxx can turn link
spidering off. probably desirable so it's default.
so &spiderlinks=[0|1] applies to urldata as well
as injecturl=
2013-09-25 17:51:43 -06:00
mwells
01fa9fe383 make it proper json output 2013-09-25 17:12:01 -06:00
mwells
6fca32e4b5 minor oops fix. 2013-09-25 17:06:01 -06:00
mwells
5fbf323cb5 json api now shows all collections
and their relevant parms and stats
for /crawlbot?token=xxx&format=json
2013-09-25 16:59:31 -06:00
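For illustration only, the request described in this commit could be built as in the sketch below. The path and the `token` and `format` parameters come from the commit message. The base URL, the helper name `crawlbot_json_url`, and the token value are assumptions, not part of the project.

```python
from urllib.parse import urlencode


def crawlbot_json_url(base: str, token: str) -> str:
    """Build the /crawlbot URL that asks for JSON output.

    `base` (scheme, host, port) is a hypothetical value supplied by the
    caller; the commit message only gives the path and query parameters.
    """
    query = urlencode({"token": token, "format": "json"})
    return f"{base}/crawlbot?{query}"


if __name__ == "__main__":
    # "xxx" is the placeholder token used in the commit message.
    print(crawlbot_json_url("http://localhost:8000", "xxx"))
```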
mwells
d14832f93e new json api code compiles. need to test now. 2013-09-25 16:04:16 -06:00
mwells
0039b23064 almost done with json api. 2013-09-25 15:37:20 -06:00
mwells
50ba93991b minor ui changes 2013-09-25 13:09:02 -06:00
mwells
9dc9114902 added stat page for all collections. 2013-09-25 12:57:07 -06:00
mwells
0fe0147913 fix invisible columns in url filters table. 2013-09-25 12:24:13 -06:00
mwells
1d92004e06 fix spider flow debug msgs 2013-09-25 12:07:11 -06:00
mwells
40192249f9 spider speedups and fixes. 2013-09-25 11:58:03 -06:00
Matt Wells
e34afd21ea fix bug of possibly not removing some locks 2013-09-25 09:28:35 -07:00
Matt Wells
a687380aeb fix a bug of not reading enough spiderdb
records for a given "ip" because short reads
were causing us to bail out early. still not
sure as to the cause of the short reads.
2013-09-24 20:48:48 -07:00
Matt Wells
fbd853fdf7 fix long-standing spider bug causing some
ip queues to not get fully spidered.
2013-09-24 20:44:55 -07:00
mwells
b16d8519fc more spider fixes. still need more speedups
when spidering multiple spiders on same ip.
2013-09-24 16:40:14 -06:00
mwells
e594af898a seems like we can spider multiple urls
from same ip at same time now.
2013-09-24 09:32:26 -06:00
mwells
8461e33b53 fixed more spider bugs. 2013-09-23 21:26:27 -07:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
mwells
7c31ecff4a fixed fakedb key support. 2013-09-23 15:16:23 -06:00
mwells
4d33737ac1 fakedb fixes 2013-09-23 08:19:54 -07:00
mwells
83e87fc755 fixed ability to spider multiple urls from the
same IP at the same time. Also respects
sameIpWait constraints.
2013-09-20 15:42:48 -07:00
mwells
05400a0c25 updated spider code documentation. 2013-09-20 11:19:24 -07:00
Matt Wells
fbd62cecba updated compilation instructions. need
to apt-get install gcc-multilib.
2013-09-20 10:06:01 -07:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
a3ea867305 update crawlbot api. 2013-09-18 17:13:36 -07:00
Matt Wells
022caeec04 use -diffbotxyz%li as a more unique appendage.
show token on crawlbot page.
2013-09-18 17:05:41 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00
Matt Wells
8de246d9c4 only show urls being spidered from your coll 2013-09-18 16:29:47 -07:00
Matt Wells
3bdd28ab1d fix spider bug 2013-09-18 16:17:08 -07:00
Matt Wells
7fdbd0f66a delete spider coll when deleting coll 2013-09-18 15:36:30 -07:00
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00
Matt Wells
70ff54ce03 hide the parms that might scare users away
in the url filters.
2013-09-18 14:27:59 -07:00
Matt Wells
6af02119a1 use cookies to display url filters table. 2013-09-18 13:50:55 -07:00
Matt Wells
04b0a08ef9 propagate showtable=1 when submitting url filters table 2013-09-18 12:38:05 -07:00
Matt Wells
924d1320a2 fix bugs inserting and deleting rows
using TYPE_SAFEBUF parms.
2013-09-18 12:35:01 -07:00
Matt Wells
c1bcebb7bb url filter documentation update. 2013-09-18 12:00:29 -07:00
Matt Wells
459a7e98fb add diffbot dropdown to url filters table 2013-09-18 11:24:16 -07:00
Matt Wells
487d3f0a0e fix url filters bugs. 2013-09-18 11:02:09 -07:00
Matt Wells
39d9760e5d added ismedia url filter to
cover all the jpg,gif,mpeg,css rules.
2013-09-18 09:40:59 -07:00
Matt Wells
c77453348f Merge branch 'master' into diffbot
Conflicts:
	SearchInput.cpp
	XmlDoc.cpp
2013-09-18 09:23:48 -07:00
mwells
d6815f2c9d if family filter enabled (&ff=1) then
prepend "gbadult:0 |" to the query to
restrict to non-adult pages.
2013-09-18 00:11:55 -06:00
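A minimal sketch of the query rewrite this commit describes, assuming the prefix is joined to the user's query with a single space. The function name `apply_family_filter` is hypothetical; the real change lives in the project's C++ sources and may differ in detail.

```python
def apply_family_filter(query: str, ff: int) -> str:
    """Restrict a query to non-adult pages when the family filter is on.

    Mirrors the commit message: with ff=1, "gbadult:0 |" is prepended to
    the query. The separating space after "|" is an assumption.
    """
    if ff == 1:
        return "gbadult:0 | " + query
    return query


if __name__ == "__main__":
    print(apply_family_filter("cats", 1))
    print(apply_family_filter("cats", 0))
```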
mwells
a0032e0eb7 added another log statement for when
debugging the adult content detector.
we err on the side of caution for the most part.
2013-09-18 00:06:21 -06:00
mwells
119a4c0c22 fix adult content detector 2013-09-17 23:53:17 -06:00
mwells
5ec3803312 fix core in hashing gbisadult:[0|1] term. 2013-09-17 23:27:31 -06:00