mwells
9244e172d0
fix calls to antiword and pdftohtml etc.
2014-06-15 17:44:52 -07:00
mwells
5c0b371dc9
Merge branch 'testing' into diffbot-matt
...
Conflicts:
Collectiondb.cpp
HttpServer.cpp
Make.depend
Parms.cpp
Parms.h
2014-06-13 11:00:09 -07:00
mwells
20c4ac4205
got it marking up html now with sectiondb stats.
...
seems to work ok.
2014-06-12 14:42:08 -07:00
mwells
1e10c676d5
parm updates for injecting
2014-06-11 17:24:33 -07:00
mwells
4a2717a88f
Merge branch 'diffbot-testing' into diffbot-matt
2014-06-09 12:42:54 -07:00
mwells
778e67130f
File::set() fix for //'s
2014-06-08 15:24:30 -07:00
mwells
628fe2336f
make code compile cleaner.
2014-06-07 14:11:12 -07:00
mwells
965d992f98
Merge branch 'diffbot-testing' into diffbot-matt
...
Conflicts:
Msg13.cpp
2014-06-06 15:14:41 -07:00
Matt Wells
cfda735194
print error stuff in spiderdb dump.
2014-06-05 16:14:32 -07:00
mwells
a1f1daad16
Merge branch 'master' into diffbot-matt
...
Conflicts:
Spider.cpp
2014-06-03 11:41:46 -07:00
mwells
f16414b774
fix stripAccentMarks() to use libiconv stuff
...
so all languages are now supported.
2014-05-31 08:14:39 -07:00
mwells
a811462d5f
spider proxy stuff compiles now
2014-05-30 15:05:00 -07:00
Matt Wells
2e7f32b01a
fix getcwd2() so it works on red hat.
...
defaults to /var/gigablast/data0/gb if
cmd is "gb" and the "gb" binary is not
in the current working directory.
2014-05-25 20:53:49 -04:00
Matt Wells
b0f9227bbc
path fixes for gb startup
2014-05-25 10:28:13 -04:00
Matt Wells
45df139ccb
update logging
2014-05-21 10:05:49 -07:00
Matt Wells
35f6652ceb
make gb start and kstart not use hostid
...
any more. it is now inferred from path of gb binary.
2014-05-12 21:24:41 -07:00
Matt Wells
32a95cca45
fix 'gb install'
2014-05-12 17:04:38 -07:00
mwells
45b8bb3421
log msg cleanups
2014-05-11 21:55:44 -07:00
mwells
a9dc18c866
fix more bugs.
2014-05-11 19:44:41 -07:00
mwells
c3a1c674c3
now we run gb without a hostid.
...
we use its path and the local ip to identify its
hostid # in the hosts.conf.
2014-05-11 19:36:24 -07:00
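A minimal sketch of how a host id could be inferred from the binary's directory and the local IP, as the commit above describes. The `HostEntry` struct, its fields, and `inferHostId()` are hypothetical names for illustration only; the real hosts.conf parsing and matching rules are not shown in this log.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of one hosts.conf entry.
struct HostEntry {
    int hostId;
    std::string ip;
    std::string workingDir;
};

// Strip trailing slashes so "/a/b/" and "/a/b" compare equal.
static std::string normalizeDir(std::string dir) {
    while (dir.size() > 1 && dir.back() == '/') dir.pop_back();
    return dir;
}

// Return the id of the entry whose ip and working dir both match,
// or -1 if no entry matches. Assumption: (ip, dir) pairs are unique.
int inferHostId(const std::vector<HostEntry>& hosts,
                const std::string& localIp,
                const std::string& binaryDir) {
    assert(!binaryDir.empty());
    const std::string want = normalizeDir(binaryDir);
    for (const HostEntry& h : hosts) {
        if (h.ip == localIp && normalizeDir(h.workingDir) == want)
            return h.hostId;
    }
    return -1;
}
```

Matching on both fields is what would let several instances share one machine, since each instance runs from its own directory.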
mwells
5df28bb147
more pkg fixes
2014-05-11 15:02:03 -07:00
mwells
7c30c6b970
make install fixes. getting ready for pkg build.
2014-05-11 14:20:24 -07:00
mwells
70016ec3a3
work on make install.
2014-05-11 12:48:56 -07:00
mwells
8e381504a1
fix makeTrashDir()
2014-05-10 08:02:46 -07:00
mwells
2b37f56e4c
Merge branch 'diffbot-matt' into testing
2014-05-10 07:56:45 -07:00
mwells
38a79888b6
Merge branch 'diffbot-testing' into testing
2014-05-10 07:49:29 -07:00
mwells
6048ae849b
added support for spidering a particular language
...
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
0daced51df
Merge branch 'diffbot-testing' into diffbot-matt
2014-05-02 14:34:04 -07:00
Matt Wells
a503cb35a2
fix gb installconf
2014-05-01 17:09:15 -07:00
Matt Wells
060f7da967
fix data corruption detection and repair bug.
...
do not core on corrupt http reply missing \0.
just set the g_errno to ECORRUPTDATA.
give more informative corruption log msgs.
2014-05-01 10:38:00 -07:00
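A rough sketch of the "report corruption instead of coring" idea from the commit above. It assumes a reply buffer is expected to end in a `\0` byte; `validateReply()` and `kErrCorruptData` are hypothetical stand-ins, not the engine's actual function or its `ECORRUPTDATA` value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the engine's ECORRUPTDATA error code.
constexpr int kErrCorruptData = 1;

// Return 0 if the reply buffer ends with '\0'. Otherwise return an
// error code so the caller can log and recover rather than abort.
int validateReply(const char* buf, std::size_t len) {
    if (buf == nullptr || len == 0) return kErrCorruptData;
    if (buf[len - 1] != '\0') return kErrCorruptData;
    return 0;
}
```

A caller would store the non-zero result in its error variable and skip the reply, which keeps one bad record from taking the process down.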
mwells
ff121a76d9
fix formatting bugs
2014-04-30 14:17:39 -06:00
mwells
81369b786c
make trash dir for image thumbs automatically
2014-04-29 17:01:48 -06:00
Matt Wells
82726879a2
support base64 generated thumbnails in serps.
2014-04-24 14:04:57 -07:00
Matt Wells
75032da5b9
fix pagination for &stream=1
2014-04-22 11:18:21 -07:00
Matt Wells
e5c58d4322
fix gb kstart/start
2014-04-15 23:28:14 -07:00
Matt Wells
a903c15285
updates. -d flag now runs as daemon.
2014-04-15 16:28:34 -07:00
Matt Wells
b4e7a0df71
default so gb start uses kstart, so gb
...
is kept alive in case it cores. use
gb nstart to start w/o keepalives now.
2014-04-11 11:44:45 -07:00
Matt Wells
36e26436b5
logging fixes
2014-04-10 23:04:00 -07:00
Matt Wells
f293045ace
more help/documentation updates
2014-04-10 22:41:20 -07:00
Matt Wells
39daca3df5
minor log msg.
2014-04-10 22:02:42 -07:00
mwells
0906d0ae38
update gb -h
2014-04-10 00:48:27 -07:00
mwells
bef076917d
use -g for debug mode not -d, that's working dir.
2014-04-10 00:36:00 -07:00
mwells
2adf5b9bc5
more awesome fixes
2014-04-09 13:31:11 -07:00
mwells
72dc660598
Merge branch 'testing' into diffbot-matt
...
Conflicts:
Collectiondb.cpp
HttpRequest.h
PageBasic.cpp
coll.main.0/coll.conf
2014-04-09 11:18:39 -07:00
mwells
1b5c6a6278
create hosts.conf into cwd if not there.
...
pretty up logging system.
update admin.html
2014-04-06 21:12:52 -07:00
mwells
5ee79a4c2f
daemonize on ./gb 0 etc.
2014-04-06 15:57:38 -07:00
mwells
c20c30c53f
Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing
2014-04-06 14:03:13 -07:00
mwells
23e5a94ddf
move log file in the binary itself now.
2014-04-06 14:02:51 -07:00
Matt Wells
d6434191d1
nomenclature changes to reduce collisions.
...
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
5057fdaf14
aesthetic cleanups
2014-03-16 17:12:04 -07:00
Matt Wells
edbd61b0c5
thread fixes. if pthread_create fails then
...
keep thread queue and just return. will try to
relaunch later. do not count delete keys towards
shard rebalance count.
2014-03-15 20:07:02 -07:00
Matt Wells
1f162ce7b2
update localhosts.conf too
2014-03-14 19:20:23 -07:00
Matt Wells
27e8e810d2
use collnum instead of coll string.
...
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
f11e25024a
Merge branch 'diffbot' into diffbot-testing
2014-02-26 20:34:06 -08:00
Matt Wells
a6b7e088f5
take out tfndb, unused. fix core
...
from diffbot url too long.
2014-02-26 01:07:13 -08:00
Matt Wells
94a55bf9a6
fixes for new link info code so it doesn't
...
bottleneck. got EFENCE_SIZE working so we
can use efence on large allocs only so we don't
go oom using it. might help finding some of
the out of bounds writing going on.
2014-02-25 10:55:05 -08:00
Matt Wells
7806a8a68c
fix excessive dupcache deduping.
2014-02-05 13:41:15 -08:00
Matt Wells
7bf8a2ac49
do not let glibc do malloc checks, we do that.
2014-02-02 13:41:59 -07:00
Matt Wells
0df697e56a
fix keep alive loop code to bail out if
...
fails to bind to socket as well as quick cores.
2014-02-02 12:11:18 -07:00
Matt Wells
4346fcee29
added recovery mode display in hosts table
2014-02-01 10:16:46 -08:00
Matt Wells
4a1ad74f79
test fix for keep alive infinite loop bug.
2014-01-30 14:16:16 -08:00
Matt Wells
83e291f12b
fix infinite keep alive restart bug some more
2014-01-30 14:12:32 -08:00
Matt Wells
03aa7842d0
do not enter into an infinite keep alive restart loop.
2014-01-30 14:40:03 -07:00
Matt Wells
b40f393f4c
fix a couple cores related to deleting collections
...
in progress. support termlist dump with terms
containing colons.
2014-01-29 15:56:07 -08:00
Matt Wells
7b424a6236
always use kstart.
...
fixed restrictDomain bug of not saving parm.
sped up csv download around 2x.
2014-01-28 14:37:21 -08:00
Matt Wells
8f39c41962
just print out cached page straight, it is
...
just the diffbot json reply pretty much
verbatim, except for being tokenized.
should no longer escape forward slashes.
2014-01-28 11:04:53 -08:00
Matt Wells
474676010c
fix gb install 1-15 logic
2014-01-27 14:28:48 -08:00
Matt Wells
bc78b21dc6
for json docs only give them a single
...
xmlnode in the Xml.cpp class. hopefully
will not get "malformed sections" error
anymore. i think that was a result of the
json having html tags in it and making
unnested html structures which the
sections class did not like.
TODO: probably do this for CT_TEXT etc.
as well.
2014-01-25 08:17:38 -08:00
Matt Wells
321fc90ff6
fix some cores.
...
NOTE: emails disabled here... need to fix.
2014-01-24 12:07:28 -08:00
Matt Wells
5c9b688f72
spiderdb fixes for injections
2014-01-19 14:33:27 -08:00
Matt Wells
36b93a1e92
minor cmdline fixes
2014-01-18 21:26:59 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
f9d0a02dbe
test and get gbparenturl: query working.
2014-01-18 09:28:58 -08:00
Matt Wells
16f8af0d57
added awesome streaming mode support
...
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
01a3282020
fix problem scanning spiderdb.
...
move dedup spiderdb code to
RdbMerge.cpp where it really should be.
2014-01-16 17:04:08 -08:00
Matt Wells
883487889d
make gb install only have 10 outstanding per an ip
...
since ssh seems to close connections if you have more
than 12 out.
2014-01-15 14:41:30 -08:00
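One way to cap outstanding operations per IP, as the commit above does for ssh during 'gb install', is a small counter keyed by address. This is only an illustrative sketch: `PerIpLimiter` is a hypothetical class, and the log does not show how the actual install code tracks its connections.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical limiter: allows at most maxPerIp outstanding
// operations for any single ip at a time.
class PerIpLimiter {
public:
    explicit PerIpLimiter(int maxPerIp) : m_max(maxPerIp) {
        assert(maxPerIp > 0);
    }

    // Reserve a slot for ip. Returns false if the ip is at its cap.
    bool tryAcquire(const std::string& ip) {
        int& n = m_outstanding[ip];
        if (n >= m_max) return false;
        ++n;
        return true;
    }

    // Give back a slot once the operation for ip has finished.
    void release(const std::string& ip) {
        auto it = m_outstanding.find(ip);
        if (it == m_outstanding.end()) return;
        if (--it->second <= 0) m_outstanding.erase(it);
    }

private:
    int m_max;
    std::map<std::string, int> m_outstanding;
};
```

With a cap of 10, an eleventh request to the same IP is refused until an earlier one is released, while other IPs are unaffected.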
Matt Wells
6de7abf6ba
display fixes.
...
./gb installgb and ./gb installgb2 now install 'gb'
if 'gb.new' is not present.
2014-01-11 17:16:20 -08:00
Matt Wells
8a49e87a61
got code with shard rebalancing compiling.
...
now we store a "sharded by termid" bit in posdb
key for checksums, etc keys that are not sharded
by docid. save having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
Matt Wells
1d6ba52dcd
list collections in sidebar.
2014-01-09 21:13:41 -08:00
Matt Wells
ebdf1f638a
fix ./gb installgb2 to be semi-sequential
2014-01-09 13:25:45 -08:00
Matt Wells
47327a0c41
Merge branch 'master' into diffbot
2014-01-09 13:07:59 -08:00
Matt Wells
70f8c416de
allow collections to be added when no colls exist.
...
fixed gb start2 etc. to be sequential.
2014-01-09 13:07:16 -08:00
Matt Wells
161a5c5d6b
logging cleanups
2014-01-09 12:38:38 -08:00
Matt Wells
5007dc8e0c
fix core in gb seektest
2014-01-09 11:17:05 -07:00
Matt Wells
909022642d
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470
Merge branch 'master' into diffbot
...
Conflicts:
Collectiondb.cpp
Msg13.cpp
Parms.cpp
Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
4f64677b4f
get new global preemptive cache
...
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
mwells
9bf49884b9
fix compiler warning
2014-01-02 01:35:52 -07:00
Matt Wells
7df2111ceb
fixed 'gb inject titledb-DIR newhosts.conf' command
...
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
935a4faccf
fixed './gb inject titledb newhosts.conf'
...
You have to be in working directory of the instance
whose cached pages (titlerecs) you want to inject
into the new cluster defined by newhosts.conf.
2014-01-01 22:04:26 -07:00
Matt Wells
d8a9a3f4e3
fix parm sync code some more.
...
added localhosts.conf to the 'gb install' dist.
2013-12-27 14:00:37 -08:00
Matt Wells
958becbdf0
fix parm checksum for syncing parms.
...
was not using gbstrlen() for strings.
2013-12-27 11:56:20 -08:00
Matt Wells
9b080ff89c
more parmdb bug fixes
2013-12-16 13:36:31 -08:00
Matt Wells
9be1ab6323
more parmdb fixes
2013-12-16 12:20:13 -08:00
Matt Wells
0615acff17
zero out url filters checkboxes on submit
2013-12-16 11:03:40 -08:00
mwells
f2d5661965
parmdb overhaul. support collection add/del
...
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
mwells
0e47d48d8c
test commit
2013-12-10 13:02:52 -08:00
Matt Wells
06edfddf31
a bunch of bug fixes, mostly spider related.
...
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
c669f8c138
fix file descriptor leak in Dir class.
...
try to fix core from Thread getting SIGALRM.
try to set NOFILES to 1024 at startup in case
more are allowed.
2013-11-19 13:41:56 -08:00
Matt Wells
e909b85638
Merge branch 'master' into diffbot
2013-11-19 00:45:49 -08:00