169 Commits

Author SHA1 Message Date
Dmitry Smirnov
b1ace63607 codespell: spelling corrections 2021-05-06 01:52:55 +10:00
Matt
816d69b34c a lot of bug fixes thanks to isj. 2016-03-29 04:08:17 -06:00
Matt Wells
04a8433256 show gbssParentDocId in status doc for children docs,
like diffbot object docs.
2016-03-22 09:00:10 -07:00
Matt Wells
0b5f417349 if old title rec was corrupted we would get a random docid
when re-spidering the url causing some chaos. now things
should return to normal and we should overwrite the corrupted
titlerec on the next spidering. also, no longer do robots.txt
titlerec lookups. silly.
2016-03-15 23:26:57 -07:00
Matt Wells
8a65d21371 fix the source of lots of corruption in spiderdb and titledb.
rdbmem.cpp was storing in secondary mem which got reset when
dump completed. also do not add keys that are in collnum and
key range of list currently being dumped, return ETRYAGAIN.
added verify writes parm. clean out tree of titledb and spiderdb
corruption on startup.
2016-03-15 15:54:12 -07:00
Matt Wells
9147d6bb02 fix some diffbot crawls.
do not spider pages at the hopcount limit
when 'only spider urls if new' is enabled.
meaning only spider each url once. (unless there is
a temporary error)
fix malformed url bug some more.
added some commented out code for indexing spider replies
(gbss docs) for certain fatal/critical errors, in which
case they are not being indexed.
2015-12-23 13:49:21 -08:00
Matt
fe448173d5 Merge branch 'ia' into testing 2015-11-09 11:14:00 -07:00
Matt
37cc4f2ba8 Merge branch 'diffbot-testing' into testing 2015-11-09 11:13:42 -07:00
Zak Betz
aeca57e9f4 Pass in the buffer size of an injection request so that if the content
length header field is bigger than the actual buffer we won't index
random memory.  Fixes bug with truncated warc captures.
2015-10-28 00:38:08 -06:00
Matt
51d68c4b3d pass proxy info back to diffbot 2015-10-20 15:53:16 -06:00
Zak Betz
ac25435b54 Warc pipe fixes. Fix arcs not processing https. Fix nulls being left
in warc read buffer causing second pass to fail.
2015-10-12 00:30:28 -06:00
Zak Betz
45744d74f3 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream 2015-10-07 08:46:07 -06:00
Zak Betz
c947252fee Add gbcapturedate to individual doc's metadata when injecting warcs. 2015-10-04 01:53:54 -06:00
Zak Betz
6becb55a2b Stream warcs instead of downloading them and unzipping them on disk. 2015-09-30 22:25:59 -06:00
Matt
d4c677170f index metadata on EDOCUNCHANGED errors, and append new meta data
to XmlDoc::ptr_metadata.
2015-09-30 07:57:40 -06:00
Matt
fd6875b94c make warc reading use a thread in xmldoc.cpp 2015-09-12 11:42:27 -06:00
Matt
f01db79e5f show inject requests in the spider queue table now 2015-09-11 14:16:26 -06:00
Matt
09de59f026 do not store cblock, etc. tags into tagdb to save
disk space. added tagdb file cache for better performance,
less disk accesses. will help reduce disk load.
put file cache sizes in master controls and if they change
then update the cache size dynamically.
2015-09-10 12:46:00 -06:00
Zak Betz
36b8d384bd Fixes to injector script.
New colors and metrics on performance graph.
2015-08-13 23:29:20 -06:00
Matt
1966f36c00 fix clock candidate bug 2015-07-01 20:34:39 -06:00
Matt
0cb3a5d44e fix ptr_metadata issue 2015-07-01 20:01:57 -06:00
Matt
1327301a8d Merge branch 'testing' into ia
Conflicts:
	Errno.cpp
	Errno.h
2015-07-01 19:03:56 -06:00
Zak Betz
7b507a70ef Set value length to 0 for something that does not return a string value
in Json.cpp.
Fix the '-' -> '_' when indexing generic fields.
Add a StackBuf macro which is a Safebuf initialized with a small
stack buffer for use in a local scope.
2015-06-30 14:09:57 -06:00
Matt
5f5ce7d12c Merge branch 'diffbot-testing' into testing
Conflicts:
	Makefile
2015-06-18 11:02:21 -06:00
Zak Betz
fab62fab3f Fix gigabit corruption.
Add scaffolding to show json metadata in summaries. *WIP*
2015-06-17 00:27:23 -06:00
Matt Wells
d050fb81b5 fix rebuild code to rebuild spider status docs in index,
and to remove them from titledb if user has disabled
'index spider replies' in the spider controls to save disk.
made them off by default by now since they use some disk.
2015-06-16 16:29:26 -06:00
Zak Betz
32987e76ee Add json metadata field to page inject.
Fix memory leak when spidering warc files.
Add script to inject warcs from internet archives search results.
2015-06-14 20:58:41 -06:00
Matt Wells
0df4abc759 checkpoint 2015-05-04 00:17:17 +00:00
Matt
f63cccaf01 arc indexing works again 2015-05-03 13:10:17 -07:00
Matt
a07c94f85d checkpoint 2015-05-03 12:55:19 -07:00
Matt
a0192318c0 warcs with wget almost working right 2015-05-02 23:50:49 -07:00
Matt
7f75a5a5dc insert wget thread call 2015-05-02 20:46:54 -07:00
Matt Wells
9f27a5c4d1 inject warcs from file on disk since they are so big 2015-05-03 03:07:18 +00:00
Matt
16b73a9bdd now we pass both injection tests in qa.cpp 2015-05-02 12:32:13 -07:00
Matt
5c89bde956 now all container doc logic is in xmldoc
and out of pageinject. compiles. needs testing.
2015-05-01 20:32:54 -07:00
Matt
0ca27638bc checkpoint. moved warc and arc looping into xmldoc.
now will pass any container doc from pageinject into
xmldoc. simplifies pageinject.cpp a lot. and sets up
a framework for dealing with container docs.
2015-05-01 19:11:13 -07:00
Matt
df7dec9c74 Merge branch 'diffbot-testing' into ia
Conflicts:
	XmlDoc.cpp
2015-04-30 17:51:14 -07:00
Matt Wells
ad88ea8ba9 fix gbss related cores. fix bn.com crawling redir bug. 2015-04-30 11:11:27 -07:00
Matt
2479dd330d ok, move all the warc/arc parsing/indexing logic into
pageinject.cpp and out of xmldoc.cpp. it makes more
sense there. since really all we need to do is download
the warc's content and it is like injecting a delimiterized
document in the loop already in pageinject.cpp.
2015-04-29 21:39:18 -07:00
Matt
45c0909cb7 injecting warc files nicely now 2015-04-29 19:55:06 -07:00
Matt
0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files 2015-04-27 21:41:22 -06:00
Matt
71fbdf6518 time axis support 2015-04-24 22:09:10 -06:00
Matt Wells
644ad28912 debugging the hopcount bug 2015-04-19 15:51:29 -06:00
Matt Wells
99454bc8ca added
gbssSentToDiffbotThisTime
and
gbssSentToDiffbotAtSomeTime
to gbss docs to clarify if the url was sent to diffbot at
this crawl time, or any time.
makes it easier to see what is getting processed this
crawl round.
2015-04-14 14:50:39 -07:00
Matt Wells
b08d12a11e fix cores associated with new spider status docs. 2015-04-07 10:33:54 -07:00
Matt
90456222b6 now we add the spider status docs as json documents.
so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -06:00
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt
7cf549bf2a fix spider request overflow/dropping algo. 2015-03-10 13:07:00 -07:00
Matt
579a08d287 fixed link overflow logic. 2015-02-12 15:03:01 -08:00
Matt
3badbb69f4 fix injection bug 2015-02-03 13:00:47 -08:00