Commit Graph

362 Commits

Author SHA1 Message Date
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
ecab57ff0f change collnum of reset collection
so any adds in progress will fail.
2013-10-18 15:46:00 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
50313a815f use seeds and spots now 2013-10-18 11:53:14 -07:00
Matt Wells
a288217e9f a few bug fixes 2013-10-17 18:59:00 -07:00
Matt Wells
84a3aded94 spider round updates correction 2013-10-17 17:18:05 -07:00
Matt Wells
df7fd21253 spider rounds update. 2013-10-17 17:17:19 -07:00
Matt Wells
fe8ebd23a3 added simplified redirect urls to spiderdb
as a new spiderrequest. made XmlDoc::getLinks()
call m_links.set(redirUrl.getUrl()) so that it is
treated like an outlink on the page and gets added
from addOutlinkSpiderRecsToMetaList().
2013-10-17 12:06:12 -07:00
Matt Wells
b9f94d7d45 show cached json objects as application/json
without term highlighting and the disclaimer
2013-10-16 17:54:17 -07:00
Matt Wells
d9b132fd5a make : into . for indexing json names. 2013-10-16 17:43:46 -07:00
Matt Wells
74c2742ced fix mem leak of LinkInfo.
fixed json output from injecting url.
2013-10-16 17:17:28 -07:00
Matt Wells
70c4ef682d printing updates 2013-10-16 16:27:24 -07:00
Matt Wells
ee06428059 fix json indexing and searching 2013-10-16 16:15:28 -07:00
Matt Wells
9d6c3626d8 json indexing/hashing updates. 2013-10-16 15:41:12 -07:00
Matt Wells
bb09b4f742 do not store diffbot api url in
diffbot reply yet. later may want to
store in each diffbot object doc
maybe as part of the json content?
2013-10-16 15:24:22 -07:00
Matt Wells
f8256c3ef9 fix core from diffbot object
doc not having valid dmoz info
2013-10-16 15:14:39 -07:00
Matt Wells
57ee9739e5 fix addColl() logic for collectionless rdbs 2013-10-16 14:38:09 -07:00
Matt Wells
fc17521697 Merge branch 'master' into diffbot
Conflicts:
	Hostdb.cpp
	Makefile
	PageResults.cpp
	PageRoot.cpp
	Pages.cpp
	Rdb.cpp
	SearchInput.cpp
	SearchInput.h
	Spider.cpp
	Spider.h
	XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
22ef91a6f1 show all colls in json
after deleteCrawl operation
2013-10-16 14:13:28 -07:00
Matt Wells
e565a861ae give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
Matt Wells
36e6e21ae3 url filter processing looking good. 2013-10-16 12:19:25 -07:00
Matt Wells
f5e5b0f5d3 fix crawlbot bugs 2013-10-16 12:12:22 -07:00
mwells
d2f39cd1a0 comment updates 2013-10-15 23:13:50 -07:00
mwells
6052f60c48 speed up dirty word detection since we added a bunch
of new dirty words/phrases.
2013-10-15 22:41:31 -07:00
Matt Wells
70e4a57449 add respider freq parm to crawlbot page 2013-10-15 16:57:34 -07:00
Matt Wells
f345c35927 crawlbot fixes. 2013-10-15 16:31:59 -07:00
Matt Wells
f892b828d9 crawlbot api fixes 2013-10-15 14:08:55 -07:00
Matt Wells
00029129fd crawlbot api fixes 2013-10-15 12:40:56 -07:00
Matt Wells
65d6af7791 show expression/action pairs in crawlbot json output. 2013-10-15 11:50:57 -07:00
mwells
ed4eff784b json output crawlbot fixes. 2013-10-15 12:45:23 -06:00
Matt Wells
70eaa542f3 minor change 2013-10-15 11:17:44 -07:00
mwells
a3a9a43ded api nomenclature 2013-10-15 12:31:02 -06:00
mwells
313eb1209e more crawlbot fixes 2013-10-15 12:22:59 -06:00
mwells
d8835acfef crawlbot api work. 2013-10-15 11:54:54 -06:00
mwells
37a9e82060 update the dirty word list. but we still
should remove tags, except maybe outlinks,
and detect the dirty words on what remains.
getting too many false positives in tags still.
2013-10-15 01:01:19 -07:00
mwells
3db726c22e take out references to AdultBit.cpp,
since it is no longer used.
2013-10-14 23:21:58 -07:00
mwells
12bff1e9b0 fix potential problem of tons of points in
our statsdb div graph. use hashtable to dedup
points and save from printing out too many <div>
tags.
2013-10-14 22:52:29 -07:00
mwells
90fca8c171 fix "search in category" link. 2013-10-14 22:39:42 -07:00
mwells
0096877127 fix "statsdb" graph so it seems to work
now.
2013-10-14 22:31:00 -07:00
mwells
9e9ef9c2cc still getting statsdb link to work. a little
better now.
2013-10-14 21:21:27 -07:00
mwells
8f93a72961 start using html div graph for
PageStatsdb.cpp now too.
2013-10-14 20:35:45 -07:00
mwells
a0808df2ae got new diffbot api compiled 2013-10-14 18:19:59 -06:00
mwells
c19310cb7e code checkpoint 2013-10-14 17:19:30 -06:00
mwells
a562c65627 another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
5a7d70f7b2 code checkpoint 2013-10-14 13:00:05 -06:00
mwells
80918ca6e3 remove old libplotter references
and files.
2013-10-13 23:48:07 -07:00
mwells
553c28fbe0 get performance graphing working again.
use absolute divs to draw the graph
instead of old gif plotter library.
2013-10-13 23:39:31 -07:00
mwells
81a09f9835 half way done fixing performance graph.
needs more work.
2013-10-13 22:02:21 -07:00
mwells
c7cf6a817a dmoz directory root page search box should
just search all sites in dmoz.
2013-10-13 20:13:15 -07:00
mwells
3ac5838b8f fix the search tabs for the dmoz directory search
box. allow more error types when spidering dmoz docs.
2013-10-13 18:43:45 -07:00