Commit Graph

63 Commits

Author SHA1 Message Date
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt
83be5d7d46 fix links parser so it harvests outlinks from rss feeds'
<link> tags. it was doing this before, now it is doing it again.
2015-03-12 17:35:47 -07:00
Matt
c6fd5571d2 if "links":[ is specified in diffbot reply then crawlbot
will parse out those links as if they were on the page.
2015-03-10 14:36:44 -07:00
mwells
5b538e7cee fix core in linkdb logic 2015-02-10 21:06:47 -07:00
Matt
0d0951284d fix core when host is down for > 1000 secs
while spidering
2015-01-31 09:22:24 -07:00
Matt
1eb9fdc658 fix some cores. fix debug log linkdb stuff. 2015-01-29 19:42:29 -07:00
Matt
e81b3a19ea fix time_t bugs. 2015-01-29 19:08:27 -07:00
Matt
aeaff79036 fix stack smash from 64-bit conversion some time ago 2015-01-22 12:47:36 -07:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt Wells
d19ee6ceea Merge branch 'diffbot' into diffbot-testing
Conflicts:
	Collectiondb.h
2014-12-11 08:40:55 -08:00
Matt Wells
654084f557 fix 64bit conversion bug. realloc offset should have
been 64bit not 32bit in Linkdb.cpp.
2014-12-03 07:35:14 -08:00
Matt
c5989f4c4c fix new simplified Inlinks code some more 2014-11-18 17:10:48 -08:00
Matt
2977845375 simplify Inlinks class in LinkInfo.cpp.
fix some more 64-bit related cores.
2014-11-18 16:50:31 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
8ca567e001 Merge branch 'master' into testing 2014-09-20 08:26:38 -06:00
mwells
b24071caee do not add crazy urls into spiderdb 2014-09-20 08:26:22 -06:00
Matt Wells
bd67256af6 hack fix for core from corrupted rss item. 2014-09-20 06:17:49 -07:00
mwells
060e887f08 misc/various bug fixes.
fix canonical redir url bug with iframes.
2014-08-28 18:07:22 -07:00
mwells
e45c0d32f6 Merge branch 'diffbot-testing' into testing 2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
388806c299 fix file.org prepending www all the time issue 2014-07-28 14:46:37 -07:00
Matt Wells
dc7a78687c fix long-standing core when getting linkinfo
from a collection that got nuked.
2014-07-16 10:40:12 -07:00
Matt Wells
2d4fb483b2 disambiguate error msg 2014-05-26 10:46:10 -07:00
Matt Wells
d6434191d1 nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
acd05aa740 fix a few minor bugs.
/master/->/admin/ and crawl type mismatch.
2014-03-16 10:34:58 -07:00
Matt Wells
edbd61b0c5 thread fixes. if pthread_create fails then
keep thread queue and just return. will try to
relaunch later. do not count delete keys towards
shard rebalance count.
2014-03-15 20:07:02 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
f777e6cccd Merge branch 'diffbot' into diffbot-testing 2014-03-07 08:23:21 -08:00
Matt Wells
434dd182d4 fix mem leak. always harvest links
for custom crawls.
2014-03-06 21:24:39 -08:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
25cf0efdbf first compiled stab at multi collection searching. 2014-03-06 10:45:13 -08:00
Matt Wells
282dad6cef deal with no coll recs when getting
link text using msg25. do not share
g_lineTable between collections.
2014-03-03 08:04:24 -08:00
Matt Wells
ff8a0b4ef1 do not let all collections share the same line table
in linkdb.cpp
2014-03-03 07:50:11 -08:00
Matt Wells
e4d425c18f fix coll being deleted when getting link text. 2014-03-02 14:24:49 -08:00
Matt Wells
42f254125e fix core in new link text logic.
empty msg25 replies are ok if g_errno is set.
2014-02-27 13:56:32 -08:00
Matt Wells
365fc16606 fix core in "wait in line" logic
when getting link info in Linkdb.cpp.
2014-02-27 09:22:35 -08:00
Matt Wells
6716d8f21b remove entry from linetable for linkinfo lookup 2014-02-26 00:27:29 -08:00
Matt Wells
33c8123288 more fixes for new link info code. 2014-02-25 13:53:41 -08:00
Matt Wells
94a55bf9a6 fixes for new link info code so it doesn't
bottleneck. got EFENCE_SIZE working so we
can use efence on large allocs only so we don't
go oom using it. might help finding some of
the out of bounds writing going on.
2014-02-25 10:55:05 -08:00
Matt Wells
72f1312652 new linkdb code compiling. 2014-02-20 17:27:28 -08:00
Matt Wells
9820f14066 checkpoint 2014-02-20 14:54:21 -08:00
Matt Wells
4d2eafe39b added some repair logic for 0001.dat files.
turn off spiderdb disk cache for now.
2014-02-01 10:14:25 -08:00
Matt Wells
bc78b21dc6 for json docs only give them a single
xmlnode in the Xml.cpp class. hopefully
will not get "malformed sections" error
anymore. i think that was a result of the
json having html tags in it and making
unnested html structures which the
sections class did not like.
TODO: probably do this for CT_TEXT etc.
as well.
2014-01-25 08:17:38 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
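
Note on the 64-bit conversion fixes: several entries above (654084f557, aeaff79036, 2977845375) fix bugs left over from the 64-bit conversion, and 654084f557 in particular says a realloc offset in Linkdb.cpp should have been 64-bit rather than 32-bit. The actual Linkdb.cpp code is not shown in this log, so the following is only a minimal sketch of that general class of bug and one common way to avoid it. The helper names offsetFitsInt32 and growAndRebase are hypothetical and do not come from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical helper: true if a byte offset can be stored in an
// int32_t without losing information. On a 64-bit build, offsets
// into large buffers can fail this check.
bool offsetFitsInt32(int64_t offset) {
    return offset >= std::numeric_limits<int32_t>::min() &&
           offset <= std::numeric_limits<int32_t>::max();
}

// Hypothetical helper: grow *buf to newSize bytes and return the
// cursor re-based onto the (possibly moved) block. The offset is
// taken BEFORE realloc and kept in a 64-bit integer, so it is not
// truncated the way a 32-bit offset could be.
// Returns nullptr and leaves *buf untouched if realloc fails.
char *growAndRebase(char **buf, char *cursor, std::size_t newSize) {
    assert(cursor >= *buf);                // cursor must point into buf
    int64_t offset = cursor - *buf;        // 64-bit, not int32_t
    char *grown = static_cast<char *>(std::realloc(*buf, newSize));
    if (!grown) return nullptr;
    *buf = grown;
    return grown + offset;
}
```

A caller would keep the base pointer and the cursor together and update both after every growth, for example cursor = growAndRebase(&buf, cursor, newSize). Storing the offset in a 64-bit type is an illustrative choice here; std::ptrdiff_t would serve the same purpose.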