Commit Graph

347 Commits

Author SHA1 Message Date
Matt Wells
f8256c3ef9 fix core from diffbot object
doc not having valid dmoz info
2013-10-16 15:14:39 -07:00
Matt Wells
57ee9739e5 fix addColl() logic for collectionless rdbs 2013-10-16 14:38:09 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
22ef91a6f1 show all colls in json
after deleteCrawl operation
2013-10-16 14:13:28 -07:00
Matt Wells
e565a861ae give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
Matt Wells
36e6e21ae3 url filter processing looking good. 2013-10-16 12:19:25 -07:00
Matt Wells
f5e5b0f5d3 fix crawlbot bugs 2013-10-16 12:12:22 -07:00
mwells
d2f39cd1a0 comment updates 2013-10-15 23:13:50 -07:00
mwells
6052f60c48 speed up dirty word detection since we added a bunch
of new dirty words/phrases.
2013-10-15 22:41:31 -07:00
Matt Wells
70e4a57449 add respider freq parm to crawlbot page 2013-10-15 16:57:34 -07:00
Matt Wells
f345c35927 crawlbot fixes. 2013-10-15 16:31:59 -07:00
Matt Wells
f892b828d9 crawlbot api fixes 2013-10-15 14:08:55 -07:00
Matt Wells
00029129fd crawlbot api fixes 2013-10-15 12:40:56 -07:00
Matt Wells
65d6af7791 show expression/action pairs in crawlbot json output. 2013-10-15 11:50:57 -07:00
mwells
ed4eff784b json output crawlbot fixes. 2013-10-15 12:45:23 -06:00
Matt Wells
70eaa542f3 minor change 2013-10-15 11:17:44 -07:00
mwells
a3a9a43ded api nomenclature 2013-10-15 12:31:02 -06:00
mwells
313eb1209e more crawlbot fixes 2013-10-15 12:22:59 -06:00
mwells
d8835acfef crawlbot api work. 2013-10-15 11:54:54 -06:00
mwells
37a9e82060 update the dirty word list. but we still
should remove tags, except maybe outlinks,
and detect the dirty words on what remains.
getting too many false positives in tags still.
2013-10-15 01:01:19 -07:00
mwells
3db726c22e take out references to AdultBit.cpp,
since it is no longer used.
2013-10-14 23:21:58 -07:00
mwells
12bff1e9b0 fix potential problem of tons of points in
our statsdb div graph. use hashtable to dedup
points and save from printing out too many <div>
tags.
2013-10-14 22:52:29 -07:00
mwells
90fca8c171 fix "search in category" link. 2013-10-14 22:39:42 -07:00
mwells
0096877127 fix "statsdb" graph so it seems to work
now.
2013-10-14 22:31:00 -07:00
mwells
9e9ef9c2cc still getting statsdb link to work. a little
better now.
2013-10-14 21:21:27 -07:00
mwells
8f93a72961 start using html div graph for
PageStatsdb.cpp now too.
2013-10-14 20:35:45 -07:00
mwells
a0808df2ae got new diffbot api compiled 2013-10-14 18:19:59 -06:00
mwells
c19310cb7e code checkpoint 2013-10-14 17:19:30 -06:00
mwells
a562c65627 another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
5a7d70f7b2 code checkpoint 2013-10-14 13:00:05 -06:00
mwells
80918ca6e3 remove old libplotter references
and files.
2013-10-13 23:48:07 -07:00
mwells
553c28fbe0 get performance graphing working again.
use absolute divs to draw the graph
instead of old gif plotter library.
2013-10-13 23:39:31 -07:00
mwells
81a09f9835 half way done fixing performance graph.
needs more work.
2013-10-13 22:02:21 -07:00
mwells
c7cf6a817a dmoz directory root page search box should
just search all sites in dmoz.
2013-10-13 20:13:15 -07:00
mwells
3ac5838b8f fix the search tabs for the dmoz directory search
box. allow more error types when spidering dmoz docs.
2013-10-13 18:43:45 -07:00
mwells
66364c581a minor fix when indexing dmoz urls. 2013-10-13 17:12:20 -07:00
mwells
876af6d8c6 dmoz support is now updated and re-integrated. 2013-10-13 16:53:28 -07:00
mwells
3bc85cf528 a few cleanups for the new dmoz code. 2013-10-13 16:48:59 -07:00
mwells
0cc78dc2e0 fix dup bug. 2013-10-13 16:06:38 -07:00
mwells
d41d5554da fix dmoz search. 2013-10-13 16:00:44 -07:00
mwells
4cbb31e180 added searchbox for dmoz pages/sites. 2013-10-13 15:45:12 -07:00
mwells
b60bdcc038 documentation updates. fixed sd=0. 2013-10-13 14:24:41 -07:00
mwells
2c7bc9031f documentation updates. 2013-10-13 13:15:31 -07:00
mwells
8547b8f802 print pretty dmoz pages. 2013-10-13 00:39:05 -07:00
mwells
d4b5c37f45 Merge branch 'master' into testing 2013-10-13 00:20:37 -07:00
mwells
fbcaefa6ff so we can spider https sites, add
the old gigablast private/public key file.
2013-10-13 00:15:39 -07:00
mwells
65bad44450 try to fix EBADIP stopping a page from
getting indexed into dmoz.
2013-10-13 00:14:16 -07:00
mwells
c949bfe315 ignore certain errors and index the doc anyway
so we at least have it in our dmoz index with its
designated title and summary from dmoz.
2013-10-13 00:02:25 -07:00
mwells
eeb10bb99a fix ip vector logic in xmldoc.cpp. 2013-10-12 23:14:39 -07:00
mwells
c283e85e40 add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
2013-10-12 22:50:23 -07:00