Matt
bcdecc63c6
expose "urlip" injection parm to provide ip of url
...
being injected to save gigablast from an ip lookup
if you want.
2015-09-16 09:43:15 -06:00
Matt
f01db79e5f
show inject requests in the spider queue table now
2015-09-11 14:16:26 -06:00
Zak Betz
36b8d384bd
Fixes to injector script.
...
New colors and metrics on performance graph.
2015-08-13 23:29:20 -06:00
Zak Betz
15eb7f659d
Fix some malformed html on hosts page.
...
Fix core when no collection record in injection request.
Add a script to test disk speed.
2015-07-16 12:02:14 -06:00
Matt
46af0e1bce
if url too long return the EURLTOOBIG error code.
...
it prints 'Too many chars in url' as the official error msg.
2015-07-08 21:36:18 -06:00
Matt
815bd7ce0a
quite a few bug fixes.
2015-07-02 17:42:05 -06:00
Zak Betz
32987e76ee
Add json metadata field to page inject.
...
Fix memory leak when spidering warc files.
Add script to inject warcs from internet archives search results.
2015-06-14 20:58:41 -06:00
Matt Wells
ee5ffef834
fix core
2015-05-05 02:53:42 +00:00
Matt
08e01b5ac8
fix more bugs. new injections seem somewhat stable now.
2015-05-03 21:58:26 -07:00
Matt
ff969d92bb
can inject a single doc now
2015-05-03 21:14:28 -07:00
Matt
bc54282339
complete overhaul of injection pipeline now compiles.
...
should distribute injection requests evenly over the cluster.
uses new InjectionRequest class which sets from httprequest
using parms in Parms.cpp. and easily serializes into a udp request.
very nice. we should use this model going forward.
2015-05-03 19:07:44 -07:00
Matt
b39a065259
checkpoint #2
2015-05-03 17:51:47 -07:00
Matt Wells
0df4abc759
checkpoint
2015-05-04 00:17:17 +00:00
Matt
b0abe597e7
more fixes from qa test.
2015-05-02 14:34:07 -07:00
Matt
16b73a9bdd
now we pass both injection tests in qa.cpp
2015-05-02 12:32:13 -07:00
Matt
ecb6d081d5
fix indexArc()
2015-05-01 23:24:40 -07:00
Matt
5c89bde956
now all container doc logic is in xmldoc
...
and out of pageinject. compiles. needs testing.
2015-05-01 20:32:54 -07:00
Matt
0ca27638bc
checkpoint. moved warc and arc looping into xmldoc.
...
now will pass any container doc from pageinject into
xmldoc. simplifies pageinject.cpp a lot. and sets up
a framework for dealing with container docs.
2015-05-01 19:11:13 -07:00
Matt
ce030fcfb0
now .arc and .arc.gz injections work
2015-04-30 20:25:26 -07:00
Matt
b4d0c53904
fix single url injects
2015-04-30 19:09:07 -07:00
Matt
fbfdde5195
fix for old delimiterized injects. was coring in gb smokes.
2015-04-30 19:07:12 -07:00
Matt
e387c0f154
yay test warc injecting working
2015-04-30 18:45:46 -07:00
Matt
f1663402d9
compiles again now
2015-04-30 18:23:46 -07:00
Matt
2479dd330d
ok, move all the warc/arc parsing/indexing logic into
...
pageinject.cpp and out of xmldoc.cpp. it makes more
sense there. since really all we need to do is download
the warc's content and it is like injecting a delimiterized
document in the loop already in pageinject.cpp.
2015-04-29 21:39:18 -07:00
Matt
45c0909cb7
injecting warc files nicely now
2015-04-29 19:55:06 -07:00
Matt
21948e15f6
more fixes
2015-04-28 23:30:14 -07:00
Matt
9370c8f52e
more fixes
2015-04-28 23:20:16 -07:00
Matt
0eb415d408
added preliminary support for spidering .warc.gz and .arc.gz files
2015-04-27 21:41:22 -06:00
Matt Wells
38caa517f2
add switches to disable injections or querying
...
from the master controls, for all collections.
2015-03-04 10:49:37 -08:00
Matt
b89f071f7c
quite a few bug fixes from adding the new query
...
syntax qa test.
2014-12-11 18:24:28 -08:00
Matt
0460335861
more permission system updates
2014-12-08 09:49:17 -08:00
mwells
a7462ed1f4
fix injection stuff
2014-12-04 09:29:17 -07:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
mwells
29f928a71e
import fixes
2014-09-25 20:48:34 -07:00
mwells
d4182cf4ed
fix importing function some
2014-09-25 20:33:42 -07:00
Matt Wells
fce036868b
only host #0 should read the import data.
2014-09-25 07:55:30 -07:00
mwells
cb32766645
fix data import function some more. added qa test.
2014-09-24 12:40:39 -07:00
mwells
538f6103d5
get qa tests working again.
...
fixed facet links.
made data import function actually work so we can
import data from one collection (files) into another.
made url filters profile compatible with UFP_ stuff.
2014-09-23 17:48:40 -07:00
Matt Wells
9dc11e53b1
fix core from too many facet strs
2014-09-21 09:26:13 -07:00
mwells
2ca303b7d7
new import code compiling. now needs runtime testing and
...
qa tests.
2014-09-20 20:12:28 -07:00
mwells
caee238c46
fixes to make it easier to compile on mac os x.
2014-08-28 12:55:02 -07:00
mwells
e45c0d32f6
Merge branch 'diffbot-testing' into testing
2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c
various fixes.
...
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
c4174a0ca6
fix bug causing qa json facet test to fail
2014-07-30 15:36:08 -07:00
mwells
f405760a25
fix query scraping
2014-07-29 19:51:41 -07:00
Matt Wells
3d1dcb08c1
fix core when getting sections while injecting
2014-07-22 14:23:41 -07:00
Matt Wells
2128b5af37
add support for &sections=1 for /admin/inject api
...
to return the posted content but with fresh sectiondb info
inserted.
2014-07-22 13:11:21 -07:00
mwells
d4218e01d7
inject docs that come through our squid proxy
2014-07-09 12:25:23 -07:00
mwells
dc6c97c59c
basic qa tests running
2014-07-06 18:53:05 -07:00