Matt
a54471849b
sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt
83be5d7d46
fix links parser so it harvests outlinks from rss feeds'
<link> tags. it was doing this before, now it is doing it again.
2015-03-12 17:35:47 -07:00
Matt
c6fd5571d2
if "links":[ is specified in diffbot reply then crawlbot
will parse out those links as if they were on the page.
2015-03-10 14:36:44 -07:00
mwells
5b538e7cee
fix core in linkdb logic
2015-02-10 21:06:47 -07:00
Matt
0d0951284d
fix core when host is down for > 1000 secs
while spidering
2015-01-31 09:22:24 -07:00
Matt
1eb9fdc658
fix some cores. fix debug log linkdb stuff.
2015-01-29 19:42:29 -07:00
Matt
e81b3a19ea
fix time_t bugs.
2015-01-29 19:08:27 -07:00
Matt
aeaff79036
fix stack smash from 64-bit conversion some time ago
2015-01-22 12:47:36 -07:00
mwells
87285ba3cd
use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt Wells
d19ee6ceea
Merge branch 'diffbot' into diffbot-testing
Conflicts:
Collectiondb.h
2014-12-11 08:40:55 -08:00
Matt Wells
654084f557
fix 64bit conversion bug. realloc offset should have
been 64bit not 32bit in Linkdb.cpp.
2014-12-03 07:35:14 -08:00
Matt
c5989f4c4c
fix new simplified Inlinks code some more
2014-11-18 17:10:48 -08:00
Matt
2977845375
simplify Inlinks class in LinkInfo.cpp.
fix some more 64-bit related cores.
2014-11-18 16:50:31 -08:00
Matt
4e8a42e024
text replacements for bad int32_t substitutions
2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6
good checkpoint. quite a few fixes.
2014-11-17 18:13:36 -08:00
Matt
4c19453ea9
working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7
replaced unsigned long long with uint64_t
2014-10-30 13:30:39 -06:00
mwells
8ca567e001
Merge branch 'master' into testing
2014-09-20 08:26:38 -06:00
mwells
b24071caee
do not add crazy urls into spiderdb
2014-09-20 08:26:22 -06:00
Matt Wells
bd67256af6
hack fix for core from corrupted rss item.
2014-09-20 06:17:49 -07:00
mwells
060e887f08
misc/various bug fixes.
fix canonical redir url bug with iframes.
2014-08-28 18:07:22 -07:00
mwells
e45c0d32f6
Merge branch 'diffbot-testing' into testing
2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c
various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
388806c299
fix file.org prepending www all the time issue
2014-07-28 14:46:37 -07:00
Matt Wells
dc7a78687c
fix long-standing core when getting linkinfo
from a collection that got nuked.
2014-07-16 10:40:12 -07:00
Matt Wells
2d4fb483b2
disambiguate error msg
2014-05-26 10:46:10 -07:00
Matt Wells
d6434191d1
nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
acd05aa740
fix a few minor bugs.
/master/->/admin/ and crawl type mismatch.
2014-03-16 10:34:58 -07:00
Matt Wells
edbd61b0c5
thread fixes. if pthread_create fails then
keep thread queue and just return. will try to
relaunch later. do not count delete keys towards
shard rebalance count.
2014-03-15 20:07:02 -07:00
Matt Wells
bd4484db3c
Merge branch 'testing' into diffbot-testing
2014-03-10 12:08:23 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
f777e6cccd
Merge branch 'diffbot' into diffbot-testing
2014-03-07 08:23:21 -08:00
Matt Wells
434dd182d4
fix mem leak. always harvest links
for custom crawls.
2014-03-06 21:24:39 -08:00
Matt Wells
27e8e810d2
use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
25cf0efdbf
first compiled stab at multi collection searching.
2014-03-06 10:45:13 -08:00
Matt Wells
282dad6cef
deal with no coll recs when getting
link text using msg25. do not share
g_lineTable between collections.
2014-03-03 08:04:24 -08:00
Matt Wells
ff8a0b4ef1
do not let all collections share the same line table
in linkdb.cpp
2014-03-03 07:50:11 -08:00
Matt Wells
e4d425c18f
fix coll being deleted when getting link text.
2014-03-02 14:24:49 -08:00
Matt Wells
42f254125e
fix core in new link text logic.
empty msg25 replies are ok if g_errno is set.
2014-02-27 13:56:32 -08:00
Matt Wells
365fc16606
fix core in "wait in line" logic
when getting link info in Linkdb.cpp.
2014-02-27 09:22:35 -08:00
Matt Wells
6716d8f21b
remove entry from linetable for linkinfo lookup
2014-02-26 00:27:29 -08:00
Matt Wells
33c8123288
more fixes for new link info code.
2014-02-25 13:53:41 -08:00
Matt Wells
94a55bf9a6
fixes for new link info code so it doesn't
bottleneck. got EFENCE_SIZE working so we
can use efence on large allocs only so we don't
go oom using it. might help finding some of
the out of bounds writing going on.
2014-02-25 10:55:05 -08:00
Matt Wells
72f1312652
new linkdb code compiling.
2014-02-20 17:27:28 -08:00
Matt Wells
9820f14066
checkpoint
2014-02-20 14:54:21 -08:00
Matt Wells
4d2eafe39b
added some repair logic for 0001.dat files.
turn off spiderdb disk cache for now.
2014-02-01 10:14:25 -08:00
Matt Wells
bc78b21dc6
for json docs only give them a single
xmlnode in the Xml.cpp class. hopefully
will not get "malformed sections" error
anymore. i think that was a result of the
json having html tags in it and making
unnested html structures which the
sections class did not like.
TODO: probably do this for CT_TEXT etc.
as well.
2014-01-25 08:17:38 -08:00
Matt Wells
4606e88721
code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00