13 Commits

Author SHA1 Message Date
Matt
c9b14b1b89 fix 'delete' checkbox in url filters.
fix reading in of xml conf files that have </> tags.
2015-03-17 21:20:27 -06:00
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt
83be5d7d46 fix links parser so it harvests outlinks from rss feeds'
<link> tags. it was doing this before, now it is doing it again.
2015-03-12 17:35:47 -07:00
Matt
8e72d6e4cc fix a couple of critical xml parsing bugs. improves
parsing of rss feeds and of xml in general.
fixed qa tests to ignore collection list when doing diff.
2015-03-10 19:13:21 -07:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32
2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t
2014-10-30 13:36:39 -06:00
mwells
7a0f9fe370 fix support for indexing xml docs.
no longer use hacks gbxmltitle and gbxmllinks.
no longer convert html entities for xml docs using hacks
since we have XmlDoc::hashXmlFields() function.
added qaxml() qa test to test xml doc indexing and searching.
ignore <?xml> tag when generating xml tag compound name.
2014-09-28 10:43:41 -07:00
Matt Wells
a2beb23d87 added Xml::getCompoundName()
2014-09-28 08:39:46 -07:00
mwells
a72c5dae51 fix <script> tags that immediately end in </script> or
never end but hit another <script> or a </gbiframe> tag.
2014-07-14 17:24:20 -07:00
Matt Wells
bc78b21dc6 for json docs only give them a single
xmlnode in the Xml.cpp class. hopefully
will not get "malformed sections" error
anymore. i think that was a result of the
json having html tags in it and making
unnested html structures which the
sections class did not like.
TODO: probably do this for CT_TEXT etc.
as well.
2014-01-25 08:17:38 -08:00
Matt Wells
f6e560c1f4 Initial file population.
2013-08-02 13:12:24 -07:00