Commit Graph

50 Commits

Author SHA1 Message Date
mwells
45941e4b2f fix notification system. 2013-10-01 17:30:06 -06:00
mwells
3fecb3eb1f got email and url notification code compiling.
when crawl hits a limit we do notifications.
2013-10-01 15:14:39 -06:00
mwells
b1809b1a08 just allow user to specify diffbot api
as a url string, not a menu item number selection
from a drop down. still print a fixed drop down
that will set the diffbot api url string directly
though. i.e. use &dapi=http://www.diffbot.com/api/article?
and then gigablast will append the &token=...&url=... to it
before fetching it.
2013-09-30 10:27:28 -06:00
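A minimal sketch of the URL assembly this commit describes: the user supplies the API URL as a string through &dapi=, and &token=...&url=... is appended before the fetch. The function name build_diffbot_request is hypothetical, and percent-encoding the appended values is an assumption, not something the commit message states.

```python
from urllib.parse import quote


def build_diffbot_request(dapi: str, token: str, page_url: str) -> str:
    """Append &token=...&url=... to the user-supplied API URL string.

    Hypothetical helper: percent-encoding the values is an assumption.
    """
    return f"{dapi}&token={quote(token, safe='')}&url={quote(page_url, safe='')}"


# The dapi value is the one quoted in the commit message; TOKEN and the
# page url are placeholders.
request = build_diffbot_request(
    "http://www.diffbot.com/api/article?",
    "TOKEN",
    "http://example.com/a b",
)
print(request)
```

Passing the API as a full URL string means a new API endpoint needs no new menu item, only a different &dapi= value.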
mwells
20952eedbe customizable api list in url filters 2013-09-30 09:18:22 -06:00
mwells
c216f7b2a7 use 48 bit url hash for lock keys again.
query reindex recs can just use their
prob docids as fake uh48s. we need it so we
can avoid the fakedb record and just use
the spider reply to trigger a 5-second
lock expiration. a little simpler. added
logdebugspiderwait for waiting tree debugging.
fixed per ip spider limiting. fixed losing
spiders down blackhole from updateCrawlInfo.
check UrlLock::m_confirmed when counting outstanding
spiders on one ip since we may have a lock on one host
but not get granted it on all! it calls
confirmLockAcquisition() when it gets fully granted
the lock so it can set UrlLock::m_confirmed.
2013-09-29 00:09:46 -06:00
mwells
afa2a87542 remove alias parm 2013-09-28 14:17:43 -06:00
mwells
321f5cf938 quite a few fixes. something still
overwrites CollectionRec::m_overflow/m_overflow2...
2013-09-27 21:00:40 -06:00
mwells
eb3f657411 fixed distributed support for adding/deleting/resetting
collections. now need to specify collection name
like &addcoll=mycoll when adding a coll.
2013-09-27 10:49:24 -06:00
mwells
fd081478de fix crawlbot to work on a distributed network
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier if a host went down, then it would
get the negative/positive colldb parm keys later.
so it could sync up on all your operations as long
as all your operations are expressed in terms of adding
and deleting database key/value pairs.
2013-09-26 22:41:05 -06:00
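The Colldb idea in this commit message can be sketched as a log of positive (add) and negative (delete) parm keys that a host replays after coming back up. This illustrates the concept only and is not Gigablast's Rdb code: the tuple layout, the apply_ops name, and the parm names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (collection, parm name, is_positive, value).
# A positive key adds or overwrites a parm; a negative key deletes it.
Op = Tuple[str, str, bool, str]
State = Dict[Tuple[str, str], str]


def apply_ops(state: State, ops: List[Op]) -> State:
    """Replay positive/negative parm keys, in order, onto a copy of state."""
    result = dict(state)
    for coll, parm, positive, value in ops:
        if positive:
            result[(coll, parm)] = value
        else:
            result.pop((coll, parm), None)
    return result


log: List[Op] = [
    ("mycoll", "maxToCrawl", True, "1000"),  # hypothetical parm name
    ("mycoll", "alias", True, "xxxxx"),
    ("mycoll", "alias", False, ""),          # negative key removes the alias
]

# A host that stayed up applies everything as it arrives.
live_host = apply_ops({}, log)

# A host that went down after the first op catches up by replaying the rest.
stale_host = apply_ops({}, log[:1])
caught_up = apply_ops(stale_host, log[1:])
print(caught_up == live_host)
```

Because every operation is a key add or delete, catching up is just replaying the missed keys in order.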
mwells
16ead85cfd added support for adding an alias to a collection
using &alias=xxxxx
2013-09-26 14:50:34 -06:00
mwells
8fde0c5343 added support for serialize/deserialize
of TYPE_SAFEBUF parms over distributed network.
2013-09-26 08:56:14 -06:00
mwells
6fca32e4b5 minor oops fix. 2013-09-25 17:06:01 -06:00
mwells
5fbf323cb5 json api now shows all collections
and their relevant parms and stats
for /crawlbot?token=xxx&format=json
2013-09-25 16:59:31 -06:00
mwells
0039b23064 almost done with json api. 2013-09-25 15:37:20 -06:00
mwells
0fe0147913 fix invisible columns in url filters table. 2013-09-25 12:24:13 -06:00
mwells
b90ef3de0d more spider fixes. right after getting lock,
use msg12 to remove rec from doledb/doleiptable
and add 0 entry to waiting table so doledb is
again immediately repopulated with that firstIp
so we can spider multiple urls from the same ip
at the same time.
2013-09-23 20:25:28 -06:00
Matt Wells
bcc55dc46b fixed a couple bugs. Added more documentation
into Spider.h.
2013-09-19 18:21:52 -07:00
Matt Wells
47465f6d90 more fixes. trying to fix spiders to
spider multiple urls from same ip...
2013-09-19 11:13:40 -07:00
Matt Wells
a3ea867305 update crawlbot api. 2013-09-18 17:13:36 -07:00
Matt Wells
29f5c5d644 added isonsamesubdomain and isonsamedomain 2013-09-18 16:45:37 -07:00
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00
Matt Wells
70ff54ce03 hide the parms that might scare users away
in the url filters.
2013-09-18 14:27:59 -07:00
Matt Wells
6af02119a1 use cookies to display url filters table. 2013-09-18 13:50:55 -07:00
Matt Wells
924d1320a2 fix bugs inserting and deleting rows
using TYPE_SAFEBUF parms.
2013-09-18 12:35:01 -07:00
Matt Wells
c1bcebb7bb url filter documentation update. 2013-09-18 12:00:29 -07:00
Matt Wells
459a7e98fb add diffbot dropdown to url filters table 2013-09-18 11:24:16 -07:00
Matt Wells
487d3f0a0e fix url filters bugs. 2013-09-18 11:02:09 -07:00
Matt Wells
39d9760e5d added ismedia url filter to
cover all the jpg,gif,mpeg,css rules.
2013-09-18 09:40:59 -07:00
Matt Wells
c77453348f Merge branch 'master' into diffbot
Conflicts:
	SearchInput.cpp
	XmlDoc.cpp
2013-09-18 09:23:48 -07:00
mwells
119a4c0c22 fix adult content detector 2013-09-17 23:53:17 -06:00
Matt Wells
10fcfb6987 minor updates 2013-09-17 17:32:49 -07:00
Matt Wells
98caa3225a fix query prepend logic for json searches 2013-09-17 17:16:39 -07:00
Matt Wells
5e3b727eb5 crawlbot api fixes. 2013-09-17 16:30:57 -07:00
Matt Wells
e50da4d012 crawlbot api fixes 2013-09-17 15:47:44 -07:00
Matt Wells
c16fe8601b more crawlbot api fixes 2013-09-17 15:32:28 -07:00
Matt Wells
4321f02e4e trying to get reset collection working 2013-09-17 12:21:09 -07:00
Matt Wells
fc692202ba fix integration of url filters into crawlbot page 2013-09-16 16:27:48 -07:00
Matt Wells
a034604cef clean up to remove g_conf.m_useDiffbot 2013-09-16 15:00:43 -07:00
Matt Wells
3dfba4de69 doc updates 2013-09-16 14:29:01 -07:00
Matt Wells
4c11265a98 more updates to crawlbot api 2013-09-16 13:59:11 -07:00
Matt Wells
df96f81e78 fix spidering and other things. 2013-09-16 11:22:07 -07:00
Matt Wells
78a334198b Merge branch 'master' into diffbot 2013-09-16 09:05:37 -07:00
mwells
107037c6a2 new &sites=xyz.com+abc.com+... functionality compiles ok. 2013-09-15 18:14:32 -06:00
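A short sketch of how a &sites=xyz.com+abc.com value reads once it is decoded as a query string: under standard form decoding a literal + becomes a space, so the list can be split on whitespace. Whether Gigablast splits it this way is an assumption, and parse_sites is a hypothetical name.

```python
from typing import List
from urllib.parse import parse_qs


def parse_sites(query: str) -> List[str]:
    """Return the site list from a query string containing sites=a+b+...

    parse_qs decodes '+' to a space, so splitting on whitespace recovers
    the individual sites. Returns an empty list if sites= is absent.
    """
    values = parse_qs(query).get("sites", [])
    return values[0].split() if values else []


sites = parse_sites("sites=xyz.com+abc.com")
print(sites)
```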
Matt Wells
93ce424d99 start working on the main gui for
crawlbot which is /crawlbot
2013-09-13 16:22:07 -07:00
Matt Wells
6b330da240 cleanup warnings in log. 2013-09-13 14:37:35 -07:00
Matt Wells
a412c798bf Merge branch 'master' into diffbot
Conflicts:
	PageResults.cpp
2013-09-13 09:24:28 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
mwells
7aa81abf91 use the "onsite" keyword in your url filters
instead of this "only spider links from same host"
switch to keep things simpler.
2013-09-06 09:37:17 -06:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited only by sizeof(collnum_t) now,
which is 16 bits, of which 15 bits hold the non-negative values.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
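The 32k figure follows from the type width. The sketch below works through the arithmetic, assuming collnum_t is a signed 16-bit integer; the commit message only says 16 bits with 15 usable, so the signedness is an inference.

```python
import ctypes

# Assumption: collnum_t behaves like a signed 16-bit integer (int16_t).
bits = ctypes.sizeof(ctypes.c_int16) * 8

# One bit is the sign bit, leaving 15 bits for non-negative collection numbers.
usable_bits = bits - 1
max_collnum = 2 ** usable_bits - 1   # largest non-negative value
max_collections = max_collnum + 1    # numbering starts at 0

print(bits, usable_bits, max_collnum, max_collections)
```

Widening the type to 32 bits would raise the ceiling the same way, to 2^31 collections.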
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00