Commit Graph

86 Commits

Author SHA1 Message Date
Matt
bcdecc63c6 expose "urlip" injection parm to provide ip of url
being injected to save gigablast from an ip lookup
if you want.
2015-09-16 09:43:15 -06:00
Matt
f01db79e5f show inject requests in the spider queue table now 2015-09-11 14:16:26 -06:00
Zak Betz
36b8d384bd Fixes to injector script.
New colors and metrics on performance graph.
2015-08-13 23:29:20 -06:00
Zak Betz
15eb7f659d Fix some malformed html on hosts page.
Fix core when no collection record in injection request.
Add a script to test disk speed.
2015-07-16 12:02:14 -06:00
Matt
46af0e1bce if url too long return the EURLTOOBIG error code.
it prints 'Too many chars in url' as the official error msg.
2015-07-08 21:36:18 -06:00
Matt
815bd7ce0a quite a few bug fixes. 2015-07-02 17:42:05 -06:00
Zak Betz
32987e76ee Add json metadata field to page inject.
Fix memory leak when spidering warc files.
Add script to inject warcs from internet archives search results.
2015-06-14 20:58:41 -06:00
Matt Wells
ee5ffef834 fix core 2015-05-05 02:53:42 +00:00
Matt
08e01b5ac8 fix more bugs. new injections seem somewhat stable now. 2015-05-03 21:58:26 -07:00
Matt
ff969d92bb can inject a single doc now 2015-05-03 21:14:28 -07:00
Matt
bc54282339 complete overhaul of injection pipeline now compiles.
should distribute injection requests evenly over the cluster.
uses new InjectionRequest class which sets from httprequest
using parms in Parms.cpp. and easily serializes into a udp request.
very nice. we should use this model going forward.
2015-05-03 19:07:44 -07:00
Matt
b39a065259 checkpoint #2 2015-05-03 17:51:47 -07:00
Matt Wells
0df4abc759 checkpoint 2015-05-04 00:17:17 +00:00
Matt
b0abe597e7 more fixes from qa test. 2015-05-02 14:34:07 -07:00
Matt
16b73a9bdd now we pass both injection tests in qa.cpp 2015-05-02 12:32:13 -07:00
Matt
ecb6d081d5 fix indexArc() 2015-05-01 23:24:40 -07:00
Matt
5c89bde956 now all container doc logic is in xmldoc
and out of pageinject. compiles. needs testing.
2015-05-01 20:32:54 -07:00
Matt
0ca27638bc checkpoint. moved warc and arc looping into xmldoc.
now will pass any container doc from pageinject into
xmldoc. simplifies pageinject.cpp a lot. and sets up
a framework for dealing with container docs.
2015-05-01 19:11:13 -07:00
Matt
ce030fcfb0 now .arc and .arc.gz injections work 2015-04-30 20:25:26 -07:00
Matt
b4d0c53904 fix single url injects 2015-04-30 19:09:07 -07:00
Matt
fbfdde5195 fix for old delimiterized injects. was coring in gb smokes. 2015-04-30 19:07:12 -07:00
Matt
e387c0f154 yay test warc injecting working 2015-04-30 18:45:46 -07:00
Matt
f1663402d9 compiles again now 2015-04-30 18:23:46 -07:00
Matt
2479dd330d ok, move all the warc/arc parsing/indexing logic into
pageinject.cpp and out of xmldoc.cpp. it makes more
sense there. since really all we need to do is download
the warc's content and it is like injecting a delimiterized
document in the loop already in pageinject.cpp.
2015-04-29 21:39:18 -07:00
Matt
45c0909cb7 injecting warc files nicely now 2015-04-29 19:55:06 -07:00
Matt
21948e15f6 more fixes 2015-04-28 23:30:14 -07:00
Matt
9370c8f52e more fixes 2015-04-28 23:20:16 -07:00
Matt
0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files 2015-04-27 21:41:22 -06:00
Matt Wells
38caa517f2 add switches to disable injections or querying
from the master controls, for all collections.
2015-03-04 10:49:37 -08:00
Matt
b89f071f7c quite a few bug fixes from adding the new query
syntax qa test.
2014-12-11 18:24:28 -08:00
Matt
0460335861 more permission system updates 2014-12-08 09:49:17 -08:00
mwells
a7462ed1f4 fix injection stuff 2014-12-04 09:29:17 -07:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
mwells
29f928a71e import fixes 2014-09-25 20:48:34 -07:00
mwells
d4182cf4ed fix importing function some 2014-09-25 20:33:42 -07:00
Matt Wells
fce036868b only host #0 should read the import data. 2014-09-25 07:55:30 -07:00
mwells
cb32766645 fix data import function some more. added qa test. 2014-09-24 12:40:39 -07:00
mwells
538f6103d5 get qa tests working again.
fixed facet links.
made data import function actually work so we can
import data from one collection (files) into another.
made url filters profile compatible with UFP_ stuff.
2014-09-23 17:48:40 -07:00
Matt Wells
9dc11e53b1 fix core from too many facet strs 2014-09-21 09:26:13 -07:00
mwells
2ca303b7d7 new import code compiling. now needs runtime testing and
qa tests.
2014-09-20 20:12:28 -07:00
mwells
caee238c46 fixes to make easier to compile on mac os x. 2014-08-28 12:55:02 -07:00
mwells
e45c0d32f6 Merge branch 'diffbot-testing' into testing 2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
c4174a0ca6 fix bug causing qa json facet test to fail 2014-07-30 15:36:08 -07:00
mwells
f405760a25 fix query scraping 2014-07-29 19:51:41 -07:00
Matt Wells
3d1dcb08c1 fix core when getting sections while injecting 2014-07-22 14:23:41 -07:00
Matt Wells
2128b5af37 add support for &sections=1 for /admin/inject api
to return the posted content but with fresh sectiondb info
inserted.
2014-07-22 13:11:21 -07:00
mwells
d4218e01d7 inject docs that come through our squid proxy 2014-07-09 12:25:23 -07:00
mwells
dc6c97c59c basic qa tests running 2014-07-06 18:53:05 -07:00