Matt Wells
889583ec4b
now we can reset collection mid stream
2013-10-18 17:49:36 -07:00
Matt Wells
ecab57ff0f
change collnum of reset collection
so any adds in progress will fail.
2013-10-18 15:46:00 -07:00
Matt Wells
b589b17e63
fix collection resetting.
2013-10-18 15:21:00 -07:00
Matt Wells
50313a815f
use seeds and spots now
2013-10-18 11:53:14 -07:00
Matt Wells
a288217e9f
a few bug fixes
2013-10-17 18:59:00 -07:00
Matt Wells
84a3aded94
spider round updates correction
2013-10-17 17:18:05 -07:00
Matt Wells
df7fd21253
spider rounds update.
2013-10-17 17:17:19 -07:00
Matt Wells
fe8ebd23a3
added simplified redirect urls to spiderdb
as a new spiderrequest. made XmlDoc::getLinks()
call m_links.set(redirUrl.getUrl()) so that it is
treated like an outlink on the page and gets added
from addOutlinkSpiderRecsToMetaList().
2013-10-17 12:06:12 -07:00
Matt Wells
b9f94d7d45
show cached json objects as application/json
without term highlighting and the disclaimer
2013-10-16 17:54:17 -07:00
Matt Wells
d9b132fd5a
make : into . for indexing json names.
2013-10-16 17:43:46 -07:00
Matt Wells
74c2742ced
fix mem leak of LinkInfo.
fixed json output from injecting url.
2013-10-16 17:17:28 -07:00
Matt Wells
70c4ef682d
printing updates
2013-10-16 16:27:24 -07:00
Matt Wells
ee06428059
fix json indexing and searching
2013-10-16 16:15:28 -07:00
Matt Wells
9d6c3626d8
json indexing/hashing updates.
2013-10-16 15:41:12 -07:00
Matt Wells
bb09b4f742
do not store diffbot api url in
diffbot reply yet. later may want to
store in each diffbot object doc
maybe as part of the json content?
2013-10-16 15:24:22 -07:00
Matt Wells
f8256c3ef9
fix core from diffbot object
doc not having valid dmoz info
2013-10-16 15:14:39 -07:00
Matt Wells
57ee9739e5
fix addColl() logic for collectionless rdbs
2013-10-16 14:38:09 -07:00
Matt Wells
fc17521697
Merge branch 'master' into diffbot
Conflicts:
Hostdb.cpp
Makefile
PageResults.cpp
PageRoot.cpp
Pages.cpp
Rdb.cpp
SearchInput.cpp
SearchInput.h
Spider.cpp
Spider.h
XmlDoc.cpp
2013-10-16 14:28:42 -07:00
Matt Wells
22ef91a6f1
show all colls in json
after deleteCrawl operation
2013-10-16 14:13:28 -07:00
Matt Wells
e565a861ae
give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
2013-10-16 14:03:14 -07:00
Matt Wells
36e6e21ae3
url filter processing looking good.
2013-10-16 12:19:25 -07:00
Matt Wells
f5e5b0f5d3
fix crawlbot bugs
2013-10-16 12:12:22 -07:00
mwells
d2f39cd1a0
comment updates
2013-10-15 23:13:50 -07:00
mwells
6052f60c48
speed up dirty word detection since we added a bunch
of new dirty words/phrases.
2013-10-15 22:41:31 -07:00
Matt Wells
70e4a57449
add respider freq parm to crawlbot page
2013-10-15 16:57:34 -07:00
Matt Wells
f345c35927
crawlbot fixes.
2013-10-15 16:31:59 -07:00
Matt Wells
f892b828d9
crawlbot api fixes
2013-10-15 14:08:55 -07:00
Matt Wells
00029129fd
crawlbot api fixes
2013-10-15 12:40:56 -07:00
Matt Wells
65d6af7791
show expression/action pairs in crawlbot json output.
2013-10-15 11:50:57 -07:00
mwells
ed4eff784b
json output crawlbot fixes.
2013-10-15 12:45:23 -06:00
Matt Wells
70eaa542f3
minor change
2013-10-15 11:17:44 -07:00
mwells
a3a9a43ded
api nomenclature
2013-10-15 12:31:02 -06:00
mwells
313eb1209e
more crawlbot fixes
2013-10-15 12:22:59 -06:00
mwells
d8835acfef
crawlbot api work.
2013-10-15 11:54:54 -06:00
mwells
37a9e82060
update the dirty word list. but we still
should remove tags, except maybe outlinks,
and detect the dirty words on what remains.
getting too many false positives in tags still.
2013-10-15 01:01:19 -07:00
mwells
3db726c22e
take out references to AdultBit.cpp,
since it is no longer used.
2013-10-14 23:21:58 -07:00
mwells
12bff1e9b0
fix potential problem of tons of points in
our statsdb div graph. use hashtable to dedup
points and save from printing out too many <div>
tags.
2013-10-14 22:52:29 -07:00
mwells
90fca8c171
fix "search in category" link.
2013-10-14 22:39:42 -07:00
mwells
0096877127
fix "statsdb" graph so it seems to work
now.
2013-10-14 22:31:00 -07:00
mwells
9e9ef9c2cc
still getting statsdb link to work. a little
better now.
2013-10-14 21:21:27 -07:00
mwells
8f93a72961
start using html div graph for
PageStatsdb.cpp now too.
2013-10-14 20:35:45 -07:00
mwells
a0808df2ae
got new diffbot api compiled
2013-10-14 18:19:59 -06:00
mwells
c19310cb7e
code checkpoint
2013-10-14 17:19:30 -06:00
mwells
a562c65627
another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
2013-10-14 16:10:48 -06:00
mwells
5a7d70f7b2
code checkpoint
2013-10-14 13:00:05 -06:00
mwells
80918ca6e3
remove old libplotter references
and files.
2013-10-13 23:48:07 -07:00
mwells
553c28fbe0
get performance graphing working again.
use absolute divs to draw the graph
instead of old gif plotter library.
2013-10-13 23:39:31 -07:00
mwells
81a09f9835
halfway done fixing performance graph.
needs more work.
2013-10-13 22:02:21 -07:00
mwells
c7cf6a817a
dmoz directory root page search box should
just search all sites in dmoz.
2013-10-13 20:13:15 -07:00
mwells
3ac5838b8f
fix the search tabs for the dmoz directory search
box. allow more error types when spidering dmoz docs.
2013-10-13 18:43:45 -07:00