Commit Graph

988 Commits

Author SHA1 Message Date
Matt Wells
0be59b4c4d Merge branch 'master' into diffbot 2014-01-07 12:09:35 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
c529eaf1c3 fix alloc in a thread bug 2014-01-07 11:32:12 -07:00
Matt Wells
b0457f973d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-06 15:56:18 -08:00
Matt Wells
724af442d4 do not do simplified redirs if it's a bulk job 2014-01-06 15:55:41 -08:00
Matt Wells
7e5b9bc1e8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-06 15:49:47 -08:00
Matt Wells
ace49d0b16 fix sectiondb core 2014-01-06 15:49:35 -08:00
Matt Wells
c472635660 do not do simplified redirs for custom
crawls so clients see their original urls.
2014-01-06 15:46:39 -08:00
Matt Wells
50c99dd815 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-06 14:28:18 -08:00
Matt Wells
49c935cf6d added SpiderReply::m_wasIndexedValid
so we know whether to count m_wasIndexed and m_isIndexed
for page counting quota purposes.
2014-01-06 14:27:38 -08:00
Matt Wells
599be55b81 return {} if no crawls, and just specify token. 2014-01-06 13:55:58 -08:00
Matt Wells
4ed30a98ec bring back checkboxes. fix issue by
putting an input hidden box with value=0
before the checkbox to transmit it even
if unchecked.
2014-01-06 11:35:17 -08:00
Matt Wells
622790d0f8 radio button fixes. make them buttons now. 2014-01-06 10:23:44 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
258c48bb98 increase hardware requirements to 4GB from 500MB until
we adjust in-memory structure mem usage dynamically.
2014-01-04 18:05:50 -07:00
mwells
9bf49884b9 fix compiler warning 2014-01-02 01:35:52 -07:00
Matt Wells
7df2111ceb fixed 'gb inject titledb-DIR newhosts.conf' command
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
935a4faccf fixed './gb inject titledb newhosts.conf'
You have to be in working directory of the instance
whose cached pages (titlerecs) you want to inject
into the new cluster defined by newhosts.conf.
2014-01-01 22:04:26 -07:00
Matt Wells
b7e9b78c21 hash gbparenturl: for getting json
objects for the specified url in
the search results.
2013-12-31 10:21:08 -08:00
Matt Wells
d77ddc19c3 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-31 09:46:47 -08:00
Matt Wells
5619d2a2c8 fix initializing status msg error 2013-12-31 09:46:35 -08:00
Matt Wells
1919ad7f95 gb.conf spiders enabled. 2013-12-31 09:22:46 -08:00
Matt Wells
471fc7a50a fixed core from deleting a non-existent crawl.
it tried to add it ...
2013-12-30 10:53:45 -08:00
Matt Wells
71982c9919 fix bad csv output 2013-12-30 10:39:45 -08:00
Matt Wells
f92f190176 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-29 14:51:33 -08:00
Matt Wells
a70b280206 nothing 2013-12-29 14:51:24 -08:00
Matt Wells
c0447de3a1 watch out for NULL "base" after a coll delete. 2013-12-29 01:32:40 -08:00
Matt Wells
70fc63985b nothing 2013-12-28 20:32:28 -08:00
Matt Wells
1c044235be count EFAKEFIRSTIP errors when spidering as
page download attempts. should fix a couple
smoke tests.
2013-12-27 19:25:51 -08:00
Matt Wells
6aac48e487 fix crawl delay wait queue logic.
if the coll we are trying to add already exists, let it be. don't error out.
2013-12-27 14:35:51 -08:00
Matt Wells
5cdb73bc70 fix spider core 2013-12-27 15:28:44 -07:00
Matt Wells
d8a9a3f4e3 fix parm sync code some more.
added localhosts.conf to the 'gb install' dist.
2013-12-27 14:00:37 -08:00
Matt Wells
bff0083538 ensured robots.txt redirects are cached as well 2013-12-27 13:01:01 -08:00
Matt Wells
534c9cf9db fix parm sync core 2013-12-27 12:09:46 -08:00
Matt Wells
958becbdf0 fix parm checksum for syncing parms.
was not using gbstrlen() for strings.
2013-12-27 11:56:20 -08:00
Matt Wells
0181a32311 fix array count syncing.
fix parms that were not syncing.
2013-12-26 11:51:20 -08:00
Matt Wells
100af585a6 parm sync fixes 2013-12-26 11:20:19 -08:00
Matt Wells
93d62a1f9e Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-26 09:34:47 -08:00
Matt Wells
9b5e3016df fix hosts.conf 2013-12-26 09:34:35 -08:00
Matt Wells
141a76c322 try localhosts.conf before hosts.conf 2013-12-26 09:32:22 -08:00
Matt Wells
7624a3db0a if url is manually added and it is a simplified redirect
then re-add with the same manually added bit set
in the new spider request, otherwise seed url might
not get spidered since it might not match the regex.
2013-12-26 08:58:56 -08:00
Matt Wells
048b715962 if coll is deleted or reset in the middle of a dump
or merge then stop the dump/merge with ENOCOLLREC
error. avoid calling "base->" functions since it
could be NULL if deleted.
2013-12-25 17:12:09 -08:00
Matt Wells
f9d7b9dbc7 fix core 2013-12-23 18:50:46 -08:00
Matt Wells
8537a02008 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-23 10:31:00 -08:00
Matt Wells
6cc69106c2 fix hosts.conf 2013-12-23 10:30:45 -08:00
Matt Wells
3acd6a08d5 add the true spider request when
retrying to spider a fake-ip spider request.
add an EFAKEIP error reply for the fake ip request.
prevents us from double spidering the same url.
2013-12-23 10:27:42 -08:00
Matt Wells
11d6d5ad6a Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-23 09:30:52 -08:00
Matt Wells
2ac8ff2952 compile regex so it's case dependent 2013-12-23 09:30:35 -08:00
Matt Wells
b0d77a834a do not spider fake ip requests, just re-add them
with the right firstip
2013-12-20 12:22:02 -08:00
Matt Wells
6f2e552bcd fix core in linked list of msg13requests in
case one gets freed
2013-12-20 11:26:46 -08:00