open-source-search-engine

mirror of https://github.com/gigablast/open-source-search-engine.git synced 2024-10-04 20:27:43 +03:00

Author	SHA1	Message	Date
Matt	c9b14b1b89	fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags.	2015-03-17 21:20:27 -06:00
Matt	a54471849b	sitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted.	2015-03-17 14:26:16 -06:00
Matt	83be5d7d46	fix links parser so it harvests outlinks from rss feeds' <link> tags. it was doing this before, now it is doing it again.	2015-03-12 17:35:47 -07:00
Matt	8e72d6e4cc	fix a couple critical xml parsing bugs. fixes parsing of rss feeds better and xml in general. fixed qa tests to ignore collection list when doing diff.	2015-03-10 19:13:21 -07:00
mwells	87285ba3cd	use gbmemcpy not memcpy so we can get profiler working again since memcpy can't be interrupted and backtrace() called.	2015-01-13 12:25:42 -07:00
Matt	4c19453ea9	working with -m32 for basic testing. compiles for 64-bit.	2014-11-12 11:38:37 -08:00
Matt	96b8197ad3	now it compiles with -m32	2014-11-10 14:45:11 -08:00
Matt Wells	e7dd8f7956	replace long long with int64_t	2014-10-30 13:36:39 -06:00
mwells	7a0f9fe370	fix support for indexing xml docs. no longer use hacks gbxmltitle and gbxmllinks. no longer convert html entities for xml docs using hacks since we have XmlDoc::hashXmlFields() function. added qaxml() qa test to test xml doc indexing and searching. ignore <?xml> tag when generating xml tag compound name.	2014-09-28 10:43:41 -07:00
Matt Wells	a2beb23d87	added Xml::getCompoundName()	2014-09-28 08:39:46 -07:00
mwells	a72c5dae51	fix <script> tags that immediately end in </script> or never end but hit another <script> or a </gbiframe> tag.	2014-07-14 17:24:20 -07:00
Matt Wells	bc78b21dc6	for json docs only give them a single xmlnode in the Xml.cpp class. hopefully will not get "malformed sections" error anymore. i think that was a result of the json having html tags in it and making unnested html structures which the sections class did not like. TODO: probably do this for CT_TEXT etc. as well.	2014-01-25 08:17:38 -08:00
Matt Wells	f6e560c1f4	Initial file population.	2013-08-02 13:12:24 -07:00

13 Commits