Matt Wells
|
f8256c3ef9
|
fix core from diffbot object
doc not having valid dmoz info
|
2013-10-16 15:14:39 -07:00 |
|
Matt Wells
|
57ee9739e5
|
fix addColl() logic for collectionless rdbs
|
2013-10-16 14:38:09 -07:00 |
|
Matt Wells
|
fc17521697
|
Merge branch 'master' into diffbot
Conflicts:
Hostdb.cpp
Makefile
PageResults.cpp
PageRoot.cpp
Pages.cpp
Rdb.cpp
SearchInput.cpp
SearchInput.h
Spider.cpp
Spider.h
XmlDoc.cpp
|
2013-10-16 14:28:42 -07:00 |
|
Matt Wells
|
22ef91a6f1
|
show all colls in json
after deleteCrawl operation
|
2013-10-16 14:13:28 -07:00 |
|
Matt Wells
|
e565a861ae
|
give nice reply from seed in json.
show how many outlinks from same domain were found
and how many outlinks were filtered.
same for addurls (bulk add).
|
2013-10-16 14:03:14 -07:00 |
|
Matt Wells
|
36e6e21ae3
|
url filter processing looking good.
|
2013-10-16 12:19:25 -07:00 |
|
Matt Wells
|
f5e5b0f5d3
|
fix crawlbot bugs
|
2013-10-16 12:12:22 -07:00 |
|
mwells
|
d2f39cd1a0
|
comment updates
|
2013-10-15 23:13:50 -07:00 |
|
mwells
|
6052f60c48
|
speed up dirty word detection since we added a bunch
of new dirty words/phrases.
|
2013-10-15 22:41:31 -07:00 |
|
Matt Wells
|
70e4a57449
|
add respider freq parm to crawlbot page
|
2013-10-15 16:57:34 -07:00 |
|
Matt Wells
|
f345c35927
|
crawlbot fixes.
|
2013-10-15 16:31:59 -07:00 |
|
Matt Wells
|
f892b828d9
|
crawlbot api fixes
|
2013-10-15 14:08:55 -07:00 |
|
Matt Wells
|
00029129fd
|
crawlbot api fixes
|
2013-10-15 12:40:56 -07:00 |
|
Matt Wells
|
65d6af7791
|
show expression/action pairs in crawlbot json output.
|
2013-10-15 11:50:57 -07:00 |
|
mwells
|
ed4eff784b
|
json output crawlbot fixes.
|
2013-10-15 12:45:23 -06:00 |
|
Matt Wells
|
70eaa542f3
|
minor change
|
2013-10-15 11:17:44 -07:00 |
|
mwells
|
a3a9a43ded
|
api nomenclature
|
2013-10-15 12:31:02 -06:00 |
|
mwells
|
313eb1209e
|
more crawlbot fixes
|
2013-10-15 12:22:59 -06:00 |
|
mwells
|
d8835acfef
|
crawlbot api work.
|
2013-10-15 11:54:54 -06:00 |
|
mwells
|
37a9e82060
|
update the dirty word list. but we still
should remove tags, except maybe outlinks,
and detect the dirty words on what remains.
getting too many false positives in tags still.
|
2013-10-15 01:01:19 -07:00 |
|
mwells
|
3db726c22e
|
take out references to AdultBit.cpp,
since it is no longer used.
|
2013-10-14 23:21:58 -07:00 |
|
mwells
|
12bff1e9b0
|
fix potential problem of tons of points in
our statsdb div graph. use hashtable to dedup
points and save from printing out too many <div>
tags.
|
2013-10-14 22:52:29 -07:00 |
|
mwells
|
90fca8c171
|
fix "search in category" link.
|
2013-10-14 22:39:42 -07:00 |
|
mwells
|
0096877127
|
fix "statsdb" graph so it seems to work
now.
|
2013-10-14 22:31:00 -07:00 |
|
mwells
|
9e9ef9c2cc
|
still getting statsdb link to work. a little
better now.
|
2013-10-14 21:21:27 -07:00 |
|
mwells
|
8f93a72961
|
start using html div graph for
PageStatsdb.cpp now too.
|
2013-10-14 20:35:45 -07:00 |
|
mwells
|
a0808df2ae
|
got new diffbot api compiled
|
2013-10-14 18:19:59 -06:00 |
|
mwells
|
c19310cb7e
|
code checkpoint
|
2013-10-14 17:19:30 -06:00 |
|
mwells
|
a562c65627
|
another code checkpoint. new json api
for crawlbot. new url filters for crawlbot.
|
2013-10-14 16:10:48 -06:00 |
|
mwells
|
5a7d70f7b2
|
code checkpoint
|
2013-10-14 13:00:05 -06:00 |
|
mwells
|
80918ca6e3
|
remove old libplotter references
and files.
|
2013-10-13 23:48:07 -07:00 |
|
mwells
|
553c28fbe0
|
get performance graphing working again.
use absolute divs to draw the graph
instead of old gif plotter library.
|
2013-10-13 23:39:31 -07:00 |
|
mwells
|
81a09f9835
|
half way done fixing performance graph.
needs more work.
|
2013-10-13 22:02:21 -07:00 |
|
mwells
|
c7cf6a817a
|
dmoz directory root page search box should
just search all sites in dmoz.
|
2013-10-13 20:13:15 -07:00 |
|
mwells
|
3ac5838b8f
|
fix the search tabs for the dmoz directory search
box. allow more error types when spidering dmoz docs.
|
2013-10-13 18:43:45 -07:00 |
|
mwells
|
66364c581a
|
minor fix when indexing dmoz urls.
|
2013-10-13 17:12:20 -07:00 |
|
mwells
|
876af6d8c6
|
dmoz support is now updated and re-integrated.
|
2013-10-13 16:53:28 -07:00 |
|
mwells
|
3bc85cf528
|
a few cleanups for the new dmoz code.
|
2013-10-13 16:48:59 -07:00 |
|
mwells
|
0cc78dc2e0
|
fix dup bug.
|
2013-10-13 16:06:38 -07:00 |
|
mwells
|
d41d5554da
|
fix dmoz search.
|
2013-10-13 16:00:44 -07:00 |
|
mwells
|
4cbb31e180
|
added searchbox for dmoz pages/sites.
|
2013-10-13 15:45:12 -07:00 |
|
mwells
|
b60bdcc038
|
documentation updates. fixed sd=0.
|
2013-10-13 14:24:41 -07:00 |
|
mwells
|
2c7bc9031f
|
documentation updates.
|
2013-10-13 13:15:31 -07:00 |
|
mwells
|
8547b8f802
|
print pretty dmoz pages.
|
2013-10-13 00:39:05 -07:00 |
|
mwells
|
d4b5c37f45
|
Merge branch 'master' into testing
|
2013-10-13 00:20:37 -07:00 |
|
mwells
|
fbcaefa6ff
|
so we can spider https sites, add
the old gigablast private/public key file.
|
2013-10-13 00:15:39 -07:00 |
|
mwells
|
65bad44450
|
try to fix EBADIP stopping a page from
getting indexed into dmoz.
|
2013-10-13 00:14:16 -07:00 |
|
mwells
|
c949bfe315
|
ignore certain errors and index the doc anyway
so we at least have it in our dmoz index with its
designated title and summary from dmoz.
|
2013-10-13 00:02:25 -07:00 |
|
mwells
|
eeb10bb99a
|
fix ip vector logic in xmldoc.cpp.
|
2013-10-12 23:14:39 -07:00 |
|
mwells
|
c283e85e40
|
add support for noindex meta tag.
use it in the gbdmoz.urls.txt.* files
that contain the dmoz urls we want to spider.
|
2013-10-12 22:50:23 -07:00 |
|