open-source-search-engine

mirror of https://github.com/gigablast/open-source-search-engine.git synced 2024-10-04 04:07:13 +03:00

Author	SHA1	Message	Date
Dmitry Smirnov	b1ace63607	codespell: spelling corrections	2021-05-06 01:52:55 +10:00
Matt	816d69b34c	a lot of bug fixes thanks to isj.	2016-03-29 04:08:17 -06:00
Matt Wells	04a8433256	show gbssParentDocId in status doc for children docs, like diffbot object docs.	2016-03-22 09:00:10 -07:00
Matt Wells	0b5f417349	if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly.	2016-03-15 23:26:57 -07:00
Matt Wells	8a65d21371	fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup.	2016-03-15 15:54:12 -07:00
Matt Wells	9147d6bb02	fix some diffbot crawls. do not spider pages at the hopcount limit when 'only spider urls if new' is enabled. meaning only spider each url once. (unless there is a temporary error) fix malformed url bug some more. added some commented out code for indexing spider replies (gbss docs) for certain fatal/critical errors, in which case they are not being indexed.	2015-12-23 13:49:21 -08:00
Matt	fe448173d5	Merge branch 'ia' into testing	2015-11-09 11:14:00 -07:00
Matt	37cc4f2ba8	Merge branch 'diffbot-testing' into testing	2015-11-09 11:13:42 -07:00
Zak Betz	aeca57e9f4	Pass in the buffer size of an injection request so that if the content length header field is bigger than the actual buffer we won't index random memory. Fixes bug with truncated warc captures.	2015-10-28 00:38:08 -06:00
Matt	51d68c4b3d	pass proxy info back to diffbot	2015-10-20 15:53:16 -06:00
Zak Betz	ac25435b54	Warc pipe fixes. Fix arcs not processing https. Fix nulls being left in warc read buffer causing second pass to fail.	2015-10-12 00:30:28 -06:00
Zak Betz	45744d74f3	Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream	2015-10-07 08:46:07 -06:00
Zak Betz	c947252fee	Add gbcapturedate to individual doc's metadata when injecting warcs.	2015-10-04 01:53:54 -06:00
Zak Betz	6becb55a2b	Stream warcs instead of downloading them and unzipping them on disk.	2015-09-30 22:25:59 -06:00
Matt	d4c677170f	index metadata on EDOCUNCHANGED errors, and append new meta data to XmlDoc::ptr_metadata.	2015-09-30 07:57:40 -06:00
Matt	fd6875b94c	make warc reading use a thread in xmldoc.cpp	2015-09-12 11:42:27 -06:00
Matt	f01db79e5f	show inject requests in the spider queue table now	2015-09-11 14:16:26 -06:00
Matt	09de59f026	do not store cblock, etc. tags into tagdb to save disk space. added tagdb file cache for better performance, less disk accesses. will help reduce disk load. put file cache sizes in master controls and if they change then update the cache size dynamically.	2015-09-10 12:46:00 -06:00
Zak Betz	36b8d384bd	Fixes to injector script. New colors and metrics on performance graph.	2015-08-13 23:29:20 -06:00
Matt	1966f36c00	fix clock candidate bug	2015-07-01 20:34:39 -06:00
Matt	0cb3a5d44e	fix ptr_metadata issue	2015-07-01 20:01:57 -06:00
Matt	1327301a8d	Merge branch 'testing' into ia Conflicts: Errno.cpp Errno.h	2015-07-01 19:03:56 -06:00
Zak Betz	7b507a70ef	Set value length to 0 for something that does not return a string value in Json.cpp. Fix the '-' -> '_' when indexing generic fields. Add a StackBuf macro which is a Safebuf initialized with a small stack buffer for use in a local scope.	2015-06-30 14:09:57 -06:00
Matt	5f5ce7d12c	Merge branch 'diffbot-testing' into testing Conflicts: Makefile	2015-06-18 11:02:21 -06:00
Zak Betz	fab62fab3f	Fix gigabit corruption. Add scaffolding to show json metadata in summaries. WIP	2015-06-17 00:27:23 -06:00
Matt Wells	d050fb81b5	fix rebuild code to rebuild spider status docs in index, and to remove them from titledb if user has disabled 'index spider replies' in the spider controls to save disk. made them off by default by now since they use some disk.	2015-06-16 16:29:26 -06:00
Zak Betz	32987e76ee	Add json metadata field to page inject. Fix memory leak when spidering warc files. Add script to inject warcs from internet archives search results.	2015-06-14 20:58:41 -06:00
Matt Wells	0df4abc759	checkpoint	2015-05-04 00:17:17 +00:00
Matt	f63cccaf01	arc indexing works again	2015-05-03 13:10:17 -07:00
Matt	a07c94f85d	checkpoint	2015-05-03 12:55:19 -07:00
Matt	a0192318c0	warcs with wget almost working right	2015-05-02 23:50:49 -07:00
Matt	7f75a5a5dc	insert wget thread call	2015-05-02 20:46:54 -07:00
Matt Wells	9f27a5c4d1	inject warcs from file on disk since they are so big	2015-05-03 03:07:18 +00:00
Matt	16b73a9bdd	now we pass both injection tests in qa.cpp	2015-05-02 12:32:13 -07:00
Matt	5c89bde956	now all container doc logic is in xmldoc and out of pageinject. compiles. needs testing.	2015-05-01 20:32:54 -07:00
Matt	0ca27638bc	checkpoint. moved warc and arc looping into xmldoc. now will any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs.	2015-05-01 19:11:13 -07:00
Matt	df7dec9c74	Merge branch 'diffbot-testing' into ia Conflicts: XmlDoc.cpp	2015-04-30 17:51:14 -07:00
Matt Wells	ad88ea8ba9	fix gbss related cores. fix bn.com crawling redir bug.	2015-04-30 11:11:27 -07:00
Matt	2479dd330d	ok, move all the warc/arc parsing/indexing logic into pageinject.cpp and out of xmldoc.cpp. it makes more sense there. since really all we need to do is download the warc's content and it is like injecting a delimeterized document in the loop already in pageinject.cpp.	2015-04-29 21:39:18 -07:00
Matt	45c0909cb7	injecting warc files nicely now	2015-04-29 19:55:06 -07:00
Matt	0eb415d408	added preliminary support for spidering .warc.gz and .arc.gz files	2015-04-27 21:41:22 -06:00
Matt	71fbdf6518	time axis support	2015-04-24 22:09:10 -06:00
Matt Wells	644ad28912	debugging the hopcount bug	2015-04-19 15:51:29 -06:00
Matt Wells	99454bc8ca	added gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round.	2015-04-14 14:50:39 -07:00
Matt Wells	b08d12a11e	fix cores associated with new spider status docs.	2015-04-07 10:33:54 -07:00
Matt	90456222b6	now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc.	2015-03-19 16:17:36 -06:00
Matt	a54471849b	sitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted.	2015-03-17 14:26:16 -06:00
Matt	7cf549bf2a	fix spider request overflow/dropping algo.	2015-03-10 13:07:00 -07:00
Matt	579a08d287	fixed link overflow logic.	2015-02-12 15:03:01 -08:00
Matt	3badbb69f4	fix injection bug	2015-02-03 13:00:47 -08:00

169 Commits