Commit Graph

127 Commits

Author SHA1 Message Date
mwells
ca6af65217 got dmoz navigation system working.
now we just need to index the urls to
populate dmoz.
2013-10-10 22:08:21 -07:00
mwells
7ba9994804 many dmoz fixes. but still more we need to do.
isn't printing subcategories right now.
2013-10-08 23:55:11 -07:00
mwells
63c7764cd1 c=dmoz3 to c=dmoz 2013-10-06 17:12:45 -07:00
mwells
59b491f007 return fake tag recs for links if
usefakeips meta tag is given. saves
some lookups in tagdb when adding gbdmoz.urls.txt.*
files which have tons of links each. like 500,000.
2013-10-06 16:42:32 -07:00
mwells
2383905c80 start using fakeips flag to stop
ip tagrec lookups
2013-10-06 16:40:04 -06:00
mwells
183b7c372e make sections grow dynamically so we do not
OOM when trying to index a gbdmoz.urls.txt.* file
which can be 25MB.
2013-10-06 11:04:10 -06:00
mwells
d8e6ac8748 fixed bug of not putting meta tags
in all gbdmoz.urls.txt.* files in
dmozparse.cpp
2013-10-06 00:18:59 -06:00
mwells
000caa5a26 support for usefakeips meta tag 2013-10-06 00:10:07 -06:00
mwells
2935a143f0 if downloading a url on 127.0.0.1 or other local
ip then do not limit download size. should fix
downloading of gbdmoz.urls.txt.* files which can be
> 25MB big.
2013-10-05 23:43:00 -06:00
mwells
612f2872f7 use addurl to add the gbdmoz url
files to gigablast. it should index
just those dmoz urls, and not spider their links.
it should ignore external errors like
ETCPTIMEDOUT when indexing so it will be
identical to dmoz.
2013-10-05 23:22:51 -06:00
mwells
d464066da4 use catdb/ not cat/ 2013-10-04 22:39:41 -06:00
mwells
71d5d05f7c use catdb/ subdir not cat/ for consistency. 2013-10-04 21:35:13 -06:00
mwells
78c4bda368 fix dmozparse urldump -s bugs
for dumping out urls in dmoz.
2013-10-04 00:00:26 -06:00
mwells
f562e6da9a just ignore all urls with # (hashtag) in them
from the dmoz dump. we were truncating
http://twitter.com/#!/ronpaul to
http://twitter.com/ and when looking up
the catids of twitter.com got that ronpaul url.
so that's bad. people should respect the hashtag.
2013-10-03 23:33:55 -06:00
mwells
0176f8d6a7 fix cores in catdb logic. 2013-10-03 22:34:49 -06:00
mwells
9e1fee2cb9 dmozparse works with latest dmoz files now 2013-10-03 22:08:40 -06:00
mwells
a0c79932bb catdb is now generated successfully. 2013-10-02 23:36:49 -06:00
mwells
6c2c9f7774 trying to bring back dmoz integration. 2013-10-02 22:34:21 -06:00
Matt Wells
91b8921b9e have to use different ports if multiple gb
instances/processes on same server.
2013-10-02 16:12:17 -07:00
mwells
c03e862b99 use a better version of hosts.conf where we
specify the working directory for each host
entry. then we can use the exact same hosts.conf
file for each gb instance rather than having to
change the single "working-dir:" directive for
each instance, in the case where they each have
a different working directory.
2013-10-02 13:11:58 -06:00
Matt Wells
c911a606c9 renamed matches.h and matches.cpp to
matches2.h and matches2.cpp to avoid potential
confusion with Matches.h and Matches.cpp files.
2013-10-01 07:58:24 -07:00
mwells
5a52072888 recommended SSDs for optimal performance in admin.html. 2013-09-28 14:02:02 -06:00
mwells
a80cb52740 minor log msg. 2013-09-28 13:58:59 -06:00
mwells
737f3eae4d Merge branch 'master' of github.com:gigablast/open-source-search-engine 2013-09-28 13:45:39 -06:00
mwells
5884951190 only do certain things if running
on a machine in matt wells datacenter.
like fan switching based on temps,
or printing seo links. made seo functions
weak overridable placeholder stubs so if
seo.o is linked in it will override.
include seo.o object if seo.cpp file exists
for automatic seo module building and linking.
2013-09-28 13:43:56 -06:00
mwells
88677e1a15 fix bad engineer error that comes up sometimes
when viewing cached pages.
2013-09-27 18:15:59 -06:00
Matt Wells
e34afd21ea fix bug of possibly not removing some locks 2013-09-25 09:28:35 -07:00
Matt Wells
a687380aeb fix a bug of not reading enough spiderdb
records for a given "ip" because short reads
were causing us to bail out early. still not
sure as to the cause of the short reads.
2013-09-24 20:48:48 -07:00
Matt Wells
fbd853fdf7 fix long-standing spider bug causing some
ip queues to not get fully spidered.
2013-09-24 20:44:55 -07:00
Matt Wells
fbd62cecba updated compilation instructions. need
to apt-get install gcc-multilib.
2013-09-20 10:06:01 -07:00
mwells
d6815f2c9d if family filter enabled (&ff=1) then
prepend "gbadult:0 |" to the query to
restrict to non-adult pages.
2013-09-18 00:11:55 -06:00
mwells
a0032e0eb7 added another log statement for when
debugging the adult content detector.
we err on the side of caution for the most part.
2013-09-18 00:06:21 -06:00
mwells
119a4c0c22 fix adult content detector 2013-09-17 23:53:17 -06:00
mwells
5ec3803312 fix core in hashing gbisadult:[0|1] term. 2013-09-17 23:27:31 -06:00
Matt Wells
3005f904c7 index gbisadult:1 if adult content
gbisadult:0 if not.
2013-09-17 22:05:47 -07:00
Matt Wells
3ac79de92e fix typo adurl -> addurl. 2013-09-16 08:11:06 -07:00
Matt Wells
e6f87f5049 do not send email alerts to sysadmin@gigablast. 2013-09-16 08:10:18 -07:00
Matt Wells
5deda56ede minor documentation updates. 2013-09-15 22:16:14 -07:00
Matt Wells
3fdbae4b05 admin.html documentation update. 2013-09-15 22:05:01 -07:00
Matt Wells
68db2e6cc6 fix bug when checking the delete checkbox on
the injection page.
2013-09-15 21:47:42 -07:00
Matt Wells
965e23f192 fix core from hashtablex::set() not getting
enough buf space. now we force it to allocate
a minimum of 32 slots to fix another bug where
it was re-allocating immediately upon adding a
key because growTable() is ALWAYS called if there
are less than 20 slots!
2013-09-15 21:15:58 -07:00
Matt Wells
991e2f30f7 speed up whitelist hashtable like 20x
using hashtable key magic.
2013-09-15 21:10:53 -07:00
Matt Wells
928dc36a03 get "&site=abc.com+xyz.com"... working to restrict
search results to specified sites. tested a little.
2013-09-15 20:16:48 -07:00
mwells
2211881e59 take apt-get install ssl stuff out of admin.html
installation instructions since we supply the
ssl headers now.
2013-09-15 18:27:47 -06:00
mwells
01c2a6d381 we already include our own 32-bit
libssl.a and libcrypto.a so we can ensure
stability. so we have to include the header
files as well really.
2013-09-15 18:25:49 -06:00
mwells
107037c6a2 new &sites=xyz.com+abc.com+... functionality compiles ok. 2013-09-15 18:14:32 -06:00
mwells
b684414e16 almost done adding support for whitelists.
i.e. list of sites to restrict search results to,
for instance.
2013-09-15 15:15:56 -06:00
mwells
e152205765 make depend update 2013-09-09 02:37:47 -06:00
Matt Wells
1d63aa936c remove plotter.h includes causing
compiler errors on some machines.
2013-09-09 01:25:00 -07:00
Matt Wells
76b390aea2 fix typo 2013-09-08 19:51:57 -07:00
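
Note on commit f562e6da9a: the message describes skipping every dmoz URL that contains a '#', because truncating http://twitter.com/#!/ronpaul at the '#' made it collide with http://twitter.com/ in catid lookups. The following is a minimal sketch of that rule, not the code in dmozparse.cpp; the function names hasHashFragment and dropFragmentUrls are hypothetical and exist only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the repository): returns true if the
// URL contains a '#' anywhere, i.e. it carries a fragment.
bool hasHashFragment(const std::string &url) {
    return url.find('#') != std::string::npos;
}

// Hypothetical filter: keeps only the URLs that have no '#', so a
// fragment URL is dropped entirely instead of being truncated to its
// base URL and mistaken for it.
std::vector<std::string> dropFragmentUrls(const std::vector<std::string> &urls) {
    std::vector<std::string> kept;
    for (const std::string &u : urls) {
        if (!hasHashFragment(u)) kept.push_back(u);
    }
    return kept;
}
```

Dropping the URL outright, rather than stripping the fragment, avoids attributing the categories of a fragment URL to the site's root page.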