Matt Wells
e366c12470
Merge branch 'master' into diffbot
Conflicts:
Collectiondb.cpp
Msg13.cpp
Parms.cpp
Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
724af442d4
do not do simplified redirs if it's a bulk job
2014-01-06 15:55:41 -08:00
Matt Wells
c472635660
do not do simplified redirs for custom
crawls so clients see their original urls.
2014-01-06 15:46:39 -08:00
Matt Wells
599be55b81
return {} if no crawls, and just specify token.
2014-01-06 13:55:58 -08:00
Matt Wells
258c48bb98
increase hardware requirements to 4GB from 500MB until
we adjust in-memory structure mem usage dynamically.
2014-01-04 18:05:50 -07:00
mwells
9bf49884b9
fix compiler warning
2014-01-02 01:35:52 -07:00
Matt Wells
7df2111ceb
fixed 'gb inject titledb-DIR newhosts.conf' command
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
935a4faccf
fixed './gb inject titledb newhosts.conf'
You have to be in working directory of the instance
whose cached pages (titlerecs) you want to inject
into the new cluster defined by newhosts.conf.
2014-01-01 22:04:26 -07:00
Matt Wells
d77ddc19c3
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2013-12-31 09:46:47 -08:00
Matt Wells
5619d2a2c8
fix initializing status msg error
2013-12-31 09:46:35 -08:00
Matt Wells
1919ad7f95
gb.conf spiders enabled.
2013-12-31 09:22:46 -08:00
Matt Wells
471fc7a50a
fixed core from deleting a non-existent crawl.
it tried to add it ...
2013-12-30 10:53:45 -08:00
Matt Wells
71982c9919
fix bad csv output
2013-12-30 10:39:45 -08:00
Matt Wells
f92f190176
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2013-12-29 14:51:33 -08:00
Matt Wells
a70b280206
nothing
2013-12-29 14:51:24 -08:00
Matt Wells
c0447de3a1
watch out for NULL "base" after a coll delete.
2013-12-29 01:32:40 -08:00
Matt Wells
70fc63985b
nothing
2013-12-28 20:32:28 -08:00
Matt Wells
1c044235be
count EFAKEFIRSTIP errors when spidering as
page download attempts. should fix a couple
smoke tests.
2013-12-27 19:25:51 -08:00
Matt Wells
6aac48e487
fix crawl delay wait queue logic.
if coll already exists trying to add, let it be. don't error out.
2013-12-27 14:35:51 -08:00
Matt Wells
5cdb73bc70
fix spider core
2013-12-27 15:28:44 -07:00
Matt Wells
d8a9a3f4e3
fix parm sync code some more.
added localhosts.conf to the 'gb install' dist.
2013-12-27 14:00:37 -08:00
Matt Wells
bff0083538
ensured robots.txt redirects are cached as well
2013-12-27 13:01:01 -08:00
Matt Wells
534c9cf9db
fix parm sync core
2013-12-27 12:09:46 -08:00
Matt Wells
958becbdf0
fix parm checksum for syncing parms.
was not using gbstrlen() for strings.
2013-12-27 11:56:20 -08:00
Matt Wells
0181a32311
fix array count syncing.
fix parms that were not syncing.
2013-12-26 11:51:20 -08:00
Matt Wells
100af585a6
parm sync fixes
2013-12-26 11:20:19 -08:00
Matt Wells
93d62a1f9e
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2013-12-26 09:34:47 -08:00
Matt Wells
9b5e3016df
fix hosts.conf
2013-12-26 09:34:35 -08:00
Matt Wells
141a76c322
try localhosts.conf before hosts.conf
2013-12-26 09:32:22 -08:00
Matt Wells
7624a3db0a
if url is manually added and it is a simplified redirect
then re-add with the same manually added bit set
in the new spider request, otherwise seed url might
not get spidered since it might not match the regex.
2013-12-26 08:58:56 -08:00
Matt Wells
048b715962
if coll is deleted or reset in the middle of a dump
or merge then stop the dump/merge with ENOCOLLREC
error. avoid calling "base->" functions since it
could be NULL if deleted.
2013-12-25 17:12:09 -08:00
Matt Wells
f9d7b9dbc7
fix core
2013-12-23 18:50:46 -08:00
Matt Wells
8537a02008
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2013-12-23 10:31:00 -08:00
Matt Wells
6cc69106c2
fix hosts.conf
2013-12-23 10:30:45 -08:00
Matt Wells
3acd6a08d5
add the true spider request when
retrying to spider a fake-ip spider request.
add an EFAKEIP error reply for the fake ip request.
prevents us from double spidering the same url.
2013-12-23 10:27:42 -08:00
Matt Wells
11d6d5ad6a
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2013-12-23 09:30:52 -08:00
Matt Wells
2ac8ff2952
compile regex so it's case dependent
2013-12-23 09:30:35 -08:00
Matt Wells
b0d77a834a
do not spider fake ip requests, just re-add them
with the right firstip
2013-12-20 12:22:02 -08:00
Matt Wells
6f2e552bcd
fix core in linked list of msg13requests in
case one gets freed
2013-12-20 11:26:46 -08:00
Matt Wells
5fcfff6729
fixes for spiders getting stuck
2013-12-19 20:04:06 -07:00
Matt Wells
4c7ce819b9
fix core dump
2013-12-19 18:39:29 -08:00
Matt Wells
c2f8445a70
expand regex shortcuts like \d to [0-9]
2013-12-19 18:31:37 -08:00
Matt Wells
261f4feb9b
fixed cdata parsing issue
2013-12-19 16:04:53 -08:00
Matt Wells
3092dcecaa
rebuild url filters and regexes at startup
2013-12-19 15:56:27 -08:00
Matt Wells
99099505d8
call regfree before changing regex
2013-12-19 15:32:26 -08:00
Matt Wells
7f70e4e887
fix regex logic
2013-12-19 15:19:18 -08:00
Matt Wells
aad12f9fe3
minor print format fix
2013-12-19 14:30:56 -08:00
Matt Wells
ef5decb0b8
more fixing stuck spiders
2013-12-19 14:17:22 -08:00
Matt Wells
32db83ae47
try to keep spiders from petering out.
reset doledb next keys and empty flags
every 3 minutes.
2013-12-19 13:31:14 -08:00
Matt Wells
cb111a1efa
fix doledb empty logic
2013-12-19 13:06:35 -08:00