<!--<html>
<head>
<title>FAQ</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf8" />
</head>
<body text=#000000 bgcolor=#ffffff link=#000000 vlink=#000000 alink=#000000 >
<style>body,td,p,.h{font-family:arial,sans-serif; font-size: 15px;} </style>
<center>
<img border=0 width=500 height=122 src=/logo-med.jpg>
<br><br>
</center>
-->
<div style=max-width:700px;>
<br>
<h1>FAQ</h1>
Developer documentation is <a href=/developer.html>here</a>.
<br><br>
A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<br>
<br>
<h1>Table of Contents</h1>
<br>
<a href=#quickstart>Quick Start</a><br><br>
<a href=#src>Build from Source</a><br><br>
<a href=#features>Features</a><br><br>
<a href=/admin/api>API</a> - for doing searches, indexing documents and performing cluster maintenance<br><br>
<!--<a href=#weighting>Weighting Query Terms</a> - how to pass in your own query term weights<br><br>-->
<a href=#requirements>Hardware Requirements</a> - what is required to run gigablast
<br>
<br>
<a href=#perf>Performance Specifications</a> - various statistics.
<br>
<br>
<a href=#multisetup>Setting up a Cluster</a> - how to run multiple gb instances in a sharded cluster.
<br>
<br>
<a href=#scaling>Scaling the Cluster</a> - how to add more gb instances.
<br>
<br>
<a href=#scaling>Updating the Binary</a> - follow a similar procedure to <i>Scaling the Cluster</i>
<br>
<br>
<a href=#trouble>Cleaning Up after a Crash</a> - how do I make sure my data is intact after a host crashes?
<br>
<br>
<a href=#spider>The Spider</a> - how does the spider work?
<br>
<br>
<!--<a href=#files>List of Files</a> - the necessary files to run Gigablast
<br>
<br>-->
<a href=#cmdline>Command Line Options</a> - various command line options (coming soon)
<br><br>
<!--
<a href=#clustermaint>Cluster Maintenance</a> - running Gigablast on a cluster of computers.<br><br><a href=#trouble>Troubleshooting</a> - how to fix problems
-->
<!--<br><br><a href=#disaster>Disaster Recovery</a> - dealing with a crashed host-->
<!--<br><br>
<a href=#security>The Security System</a> - how to control access-->
<!--<a href=#build>Building an Index</a> - how to start building your index<br><br>
<a href=#spider>The Spider</a> - all about Gigabot, Gigablast's crawling agent<br><br>-->
<!--<a href=#quotas>Document Quotas</a> - how to limit documents into the index<br><br>-->
<a href=/api.html#/admin/inject>Injecting Documents</a> - inserting documents directly into Gigablast
<br><br>
<a href=/api.html#/admin/inject>Deleting Documents</a> - removing documents from the index
<br><br><a href=#metas>Indexing User-Defined Meta Tags</a> - how Gigablast indexes user-defined meta tags
<!--<br><br><a href=#bigdocs>Indexing Big Documents</a> - what controls the maximum size of a document that can be indexed?-->
<!--<br><br><a href=#rolling>Rolling the New Index</a> - merging the realtime files into the base file-->
<br><br>
<a href=#dmoz>Building a DMOZ Based Directory</a> - build a web directory based on open DMOZ data
<br><br>
<a href=#logs>The Log System</a> - how Gigablast logs information
<br><br>
<a href=#optimizing>Optimizing</a> - optimizing Gigablast's spider and query performance
<!--
<br><br>
<a href=#config>gb.conf</a> - describes the gb configuration file
<br><br>
<a href=#hosts>hosts.conf</a> - the file that describes all participating hosts in the network
<br><br>
<a href=#stopwords>Stopwords</a> - list of common words generally ignored at query time<br><br>
<a href=#phrasebreaks>Phrase Breaks</a> - list of punctuation that breaks a phrase<br><br>
-->
<br><br><a name=quickstart></a>
<h1>Quick Start</h1>
&lt;<i>Last Updated February 2015</i>&gt;
<br>
<br>
<b><font color=red>Requirements:</font></b>
<br><br>
<!--Until I get the binary packages ready, <a href=#src>build from the source code</a>, it should only take about 30 seconds to type the three commands.-->
You will need an Intel or AMD system with at least 4GB of RAM for every gigablast shard you want to run.
<br><br>
<br>
<b><font color=red>For Debian/Ubuntu Linux:</font></b>
<br><br>
1. Download a package: <a href=http://www.gigablast.com/gb_1.19-1_amd64.deb>Debian/Ubuntu 64-bit</a> &nbsp; ( <a href=http://www.gigablast.com/gb_1.19-1_i386.deb>Debian/Ubuntu 32-bit</a> )
<br><br>
2. Install the package by entering: <b>sudo dpkg -i <i>&lt;filename&gt;</i></b> where filename is the file you just downloaded.
<br><br>
3. Type <b>sudo gb -d</b> to run Gigablast in the background as a daemon.
<br><br>
4. If running for the first time, it could take up to 20 seconds to build some preliminary files.
<br><br>
5. Once running, visit <a href=http://127.0.0.1:8000/>port 8000</a> with your browser to access the Gigablast controls.
<br><br>
6. To list all packages you have installed do a <b>dpkg -l</b>.
<br><br>
7. If you ever want to remove the gb package type <b>sudo dpkg -r gb</b>.
<br><br>
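For reference, the download-and-install steps above boil down to something like this on the command line (using the 64-bit package linked above):
<pre>
# download, install and start Gigablast as a daemon
wget http://www.gigablast.com/gb_1.19-1_amd64.deb
sudo dpkg -i gb_1.19-1_amd64.deb
sudo gb -d
</pre>
<br>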
<br>
<b><font color=red>For RedHat/Fedora Linux:</font></b>
<br><br>
1. Download a package: <a href=http://www.gigablast.com/gb-1.19-2.x86_64.rpm>RedHat 64-bit</a> ( <a href=http://www.gigablast.com/gb-1.19-2.i386.rpm>RedHat 32-bit</a> )
<br><br>
2. Install the package by entering: <b>rpm -i --force --nodeps <i>&lt;filename&gt;</i></b> where filename is the file you just downloaded.
<br><br>
3. Type <b>sudo gb -d</b> to run Gigablast in the background as a daemon.
<br><br>
4. If running for the first time, it could take up to 20 seconds to build some preliminary files.
<br><br>
5. Once running, visit <a href=http://127.0.0.1:8000/>port 8000</a> with your browser to access the Gigablast controls.
<br><br>
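The equivalent RedHat/Fedora session looks roughly like this (sudo is assumed here if you are not running as root):
<pre>
# download, install and start Gigablast as a daemon
wget http://www.gigablast.com/gb-1.19-2.x86_64.rpm
sudo rpm -i --force --nodeps gb-1.19-2.x86_64.rpm
sudo gb -d
</pre>
<br>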
<br>
<b><font color=red>For Microsoft Windows:</font></b>
<br><br>
1. If you are running Microsoft Windows, then you will need to install Oracle's <a href=http://www.virtualbox.org/wiki/Downloads><b>VirtualBox for Windows hosts</b></a> software. That will allow you to run Linux in its own window on your Microsoft Windows desktop.
<br><br>
2. When configuring a new Linux virtual machine in VirtualBox, make sure you select at least 4GB of RAM.
<br><br>
3. Once VirtualBox is installed you can download either an
<!--<a href=http://virtualboxes.org/images/ubuntu/>Ubuntu</a> or <a href=http://virtualboxes.org/images/fedora/>RedHat Fedora</a>-->
<a href="http://www.ubuntu.com/download/desktop">Ubuntu CD-ROM Image (.iso file)</a> or a <a href="http://fedoraproject.org/get-fedora">Red Hat Fedora CD-ROM Image (.iso file)</a>.
The CD-ROM Images represent Linux installation CDs.
<br><br>
4. When you boot up Ubuntu or Fedora under VirtualBox for the first time, it will prompt you for the CD-ROM drive, and it will allow you to enter your .iso filename there.
<br><br>
5. Once you finish the Linux installation process
and then boot into Linux through VirtualBox, you can follow the Linux Quick Start instructions above.
<br><br>
<br>
<hr>
<br>
<table cellpadding=5><tr><td colspan=2><b>Installed Files</b></td></tr>
<tr><td><nobr>/var/gigablast/data0/</nobr></td><td>Directory of Gigablast binary and data files</td></tr>
<tr><td>/etc/init.d/gb</td><td>start up script link</td></tr>
<!--<tr><td>/etc/init/gb.conf</td><td>Ubuntu upstart conf file so you can type 'start gb' or 'stop gb', but that will only work on local instances of gb.</td></tr>-->
<tr><td>/usr/bin/gb</td><td>Link to /var/gigablast/data0/gb</td></tr>
</table>
<!--<br><br>
If you run into an bugs let me know so i can fix them right away: gigablast@mail.com.-->
<br>
<br>
<a name=src></a>
<h1>Build From Source</h1>
&lt;<i>Last Updated January 2015</i>&gt;
<br>
<br>
Requirements: You will need an Intel or AMD system running Linux and at least 4GB of RAM.<br><br>
<!--If you run into an bugs let me know so i can fix them right away: gigablast@mail.com.
<br><br>-->
<!--
You will need the following packages installed<br>
<ul>
<li>For Ubuntu do a <b>apt-get install make g++ gcc-multilib</b>
<br>
For RedHat do a <b>yum install gcc-c++ glibc-static libstdc++-static openssl-static</b>
-->
<!--<li>apt-get install g++
<li>apt-get install gcc-multilib <i>(for 32-bit compilation support)</i>
-->
<!--<li>apt-get install libssl-dev <i>(for the includes, 32-bit libs are here)</i>-->
<!--<li>apt-get install libplot-dev <i>(for the includes, 32-bit libs are here)</i>-->
<!--<li>apt-get install lib32stdc++6-->
<!--<li>apt-get install ia32-libs-->
<!--<li>I supply libstdc++.a but you might need the include headers and have to do <b>apt-get install lib32stdc++6</b> or something.
</ul>
-->
<b>1.0</b> For <u>Ubuntu 12.04 or 14.04</u>: do <b>sudo apt-get update ; sudo apt-get install make g++ libssl-dev binutils</b>
<br><br>
<!--<b>1.1</b> For <u>64-bit Ubuntu 12.02</u>: do <b>sudo apt-get update ; apt-get install make g++ libssl-dev</b>
<br><br>
<<b>1.1</b> For <u>32-bit Ubuntu 14.04</u>: do <b>sudo apt-get update ; apt-get install make g++ gcc-multilib g++-multilib</b>
<br><br>
<b>1.3</b> For <u>32-bit Ubuntu 12.02</u>: do <b>sudo apt-get update ; apt-get install make g++ gcc-multilib </b>
<br><br>
-->
<b>1.1.</b> For <u>RedHat</u> do <b>sudo yum install gcc-c++</b>
<br><br>
<b>2.</b> Download the <a href=https://github.com/gigablast/open-source-search-engine>Gigablast source code</a> using <b>wget --no-check-certificate "https://github.com/gigablast/open-source-search-engine/archive/master.zip"</b>, unzip it and cd into it. (optionally use <b>git clone https://github.com/gigablast/open-source-search-engine.git ./github</b> if you have <i>git</i> installed.)
<br><br>
<b>3.0</b> Run <b>make</b> to compile. (e.g. use 'make -j 4' to compile on four cores)
<br><br>
<b>3.1</b> If you want to compile a 32-bit version of gb for some reason,
run <b>make clean ; make gb32</b>.
<br><br>
<b>4.</b> Run <b>./gb -d</b> to start a single Gigablast node that listens on port 8000 and runs in daemon mode (-d).
<br><br>
<b>5.</b> The first time you run gb, wait about 30 seconds for it to build some files. Check the log file to see when it completes.
<br><br>
<b>6.</b> Go to the <a href=http://127.0.0.1:8000/>root page</a> to begin.
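<br><br>
Taken together, a typical Ubuntu build session might look like the sketch below (the name of the unzipped directory is whatever GitHub's master.zip produces and may differ):
<pre>
sudo apt-get update
sudo apt-get install make g++ libssl-dev binutils
wget --no-check-certificate "https://github.com/gigablast/open-source-search-engine/archive/master.zip"
unzip master.zip
cd open-source-search-engine-master
make -j 4        # compile on four cores
./gb -d          # start a single node on port 8000 in daemon mode
</pre>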
<br>
<br><br><a name=features></a>
<h1>Features</h1>
&lt;<i>Last Updated Jan 2015</i>&gt;
<br>
<ul>
<li> <b>The ONLY open source WEB search engine.</b>
<li> 64-bit architecture.
<li> Scalable to thousands of servers.
<li> Has scaled to over 12 billion web pages on over 200 servers.
<li> A dual quad core, with 32GB of RAM, and two 160GB Intel SSDs, running 8 Gigablast instances, can do about 8 qps (queries per second) on an index of 10 million pages. Drives will be close to maximum storage capacity. Doubling the index size will more or less halve the qps rate. (Performance metrics can be made about ten times faster but I have not gotten around to it yet. Drive space usage will probably remain about the same because it is already pretty efficient.)
<li> 1 million web pages requires 28.6GB of drive space. That includes the index, meta information and the compressed HTML of all the web pages. That is 28.6K of disk per HTML web page.
<li>Spider rate is around 1 page per second per core. So a dual quad core can spider and index 8 pages per second which is 691,200 pages per day.
<li> 4GB of RAM required per Gigablast instance. (instance = process)
<li> Live demo at <a href=http://www.gigablast.com/>http://www.gigablast.com/</a>
<li> Written in C/C++ for optimal performance.
<li> Over 500,000 lines of C/C++.
<li> 100% custom. A single binary. The web server, database and everything else are contained in this source code in a highly efficient manner, which makes administration and troubleshooting easier.
<li> Reliable. Has been tested in live production since 2002 on billions of
queries on an index of over 12 billion unique web pages, 24 billion mirrored.
<li> Super fast and efficient. One of a small handful of search engines that have hit such big numbers. The only open source search engine that has.
<li> Supports all languages. Can give results in specified languages a boost over others at query time. Uses UTF-8 representation internally.
<li> Track record. Has been used by many clients. Has been successfully used
in distributed enterprise software.
<li> Cached web pages with query term highlighting.
<li> Shows popular topics of search results (Gigabits), like a faceted search on all the possible phrases.
<li> Email alert monitoring. Lets you know when the system is down in whole or in part, or if a server is overheating, a drive has failed or a server keeps running out of memory, etc.
<li> "Synonyms" based on wiktionary data. Using query expansion method.
<li> Customizable "synonym" file: my-synonyms.txt
<li> No silly TF/IDF or Cosine. Stores position and format information (fancy bits) of each word in an indexed document. It uses this to return results that contain the query terms in close proximity rather than relying on the probabilistic tf/idf approach of other search engines. The older version of Gigablast used tf/idf on Indexdb, whereas it now uses Posdb to hold the index data.
<li> Complete scoring details are displayed in the search results.
<li> Indexes anchor text of inlinks to a web page and uses many techniques to flag pages as link spam thereby discounting their link weights.
<li> Demotes web pages if they are spammy.
<li> Can cluster results from same site.
<li> Duplicate removal from search results.
<li> Distributed web crawler/spider. Supports crawl delay and robots.txt.
<li> Crawler/Spider is highly programmable and URLs are binned into priority queues. Each priority queue has several throttles and knobs.
<li> Spider status monitor to see the URLs being spidered over the whole cluster in a real-time widget.
<li> Complete REST/XML API for doing queries as well as adding and deleting documents in real-time.
<li> Automated data corruption detection, fail-over and repair based on hardware failures.
<li> Custom Search. (aka Custom Topic Search). Using a cgi parm like &sites=abc.com+xyz.com you can restrict the search results to a list of up to 500 subdomains.
<li> DMOZ integration. Run DMOZ directory. Index and search over the pages in DMOZ. Tag all pages from all sites in DMOZ for searching and displaying of DMOZ topics under each search result.
<li> Collections. Build tens of thousands of different collections, each treated as a separate search engine. Each can spider and be searched independently.
<li> Federated search over multiple Gigablast collections using syntax like &c=mycoll1+mycoll2+mycoll3+...
<li> Plug-ins. For indexing any file format by calling Plug-ins to convert that format to HTML. Provided binary plug-ins: pdftohtml (PDF), ppthtml (PowerPoint), antiword (MS Word), pstotext (PostScript).
<li> Indexes JSON and XML natively. Provides ability to search individual structured fields.
<li> Sorting. Sort the search results by meta tags or JSON fields that contain numbers, simply by adding something like gbsortby:price or gbrevsortby:price as a query term, assuming you have meta price tags.
<li> Easy Scaling. Add new servers to the <a href=/hosts.conf.txt>hosts.conf</a> file then click 'rebalance shards' to automatically rebalance the sharded data.
<li> Using &stream=1, Gigablast can stream back millions of search results for a query without running out of memory.
<li> Makes and displays thumbnail images in the search results.
<li> Nested boolean queries using AND, OR, NOT operators.
<li> Built-in support for <a href=http://www.diffbot.com/products/automatic/>diffbot.com's api</a>, which extracts various entities from web sites, like products, articles, etc. But you will need to get a free token from them for access to their API.
<li> Facets over meta tags or X-Paths for HTML documents.
<li> Facets over JSON and XML fields.
<li> Sort and constrain by numeric fields in JSON or XML.
<li> Built-in real-time profiler.
<li> Built-in QA tester.
<li> Can inject WARC and ARC archive files.
<li> Spellchecker will be re-enabled shortly.
</ul>
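As a rough illustration of a few of the query-time parameters mentioned above (the /search path and the q parameter are assumed here; see the <a href=/admin/api>API</a> page for the authoritative parameter list):
<pre>
# restrict results to a list of subdomains
http://127.0.0.1:8000/search?q=gigablast&sites=abc.com+xyz.com

# search several collections at once
http://127.0.0.1:8000/search?q=gigablast&c=mycoll1+mycoll2

# sort by a numeric field and stream back all results
http://127.0.0.1:8000/search?q=camera+gbsortby:price&stream=1
</pre>
<br>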
<h2>Coming Soon</h2>
<ul>
<li> file:// "spidering" support
<li> smb:// "spidering" support
<li> Query completion
<li> Improved plug-in support
</ul>
<br>
<!--
<br><br><a name=weighting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Weighting Query Terms</td></tr></table>
<br><br>
Gigablast allows you to pass in weights for each term in the provided query. The query term weight operator, which is directly inserted into the query, takes the form: <b>[XY]</b>, where <i>X</i> is the weight you want to apply and <i>Y</i> is <b><i>a</i></b> if you want to make it an absolute weight or <b><i>r</i></b> for a relative weight. Absolute weights cancel any weights that Gigablast may place on the query term, like weights due to the term's popularity, for instance. The relative weight, on the other hand, is multiplied by any weight Gigablast may have already assigned.<br><br>
The query term weight operator will affect all query terms that follow it. To turn off the effects of the operator just use the blank operator, <b>[]</b>. Any weight operators you apply override any previous weight operators.<br><br>
The weight applied to a phrase is unaffected by the weights applied to its constituent terms. In order to weight a phrase you must use the <b>[XYp]</b> operator. To turn off the affects of a phrase weight operator, use the phrase blank operator, <b>[p]</b>.<br><br>
Applying a relative weight of 0 to a query term, like <b>[0r]</b>, has the effect of still requiring the term in the search results (if it was not ignored), but not allowing it to contribute to the ranking of the search results. However, when doing a default OR search, if a document contains two such terms, it will rank above a document that only contains one such term. <br><br>
Applying an absolute weight of 0 to a query term, like <b>[0a]</b>, causes it to be completely ignored and not used for generating the search results at all. But such ignored or devalued query terms may still be considered in a phrase context. To affect the phrases in a similar manner, use the phrase operators, <b>[0rp]</b> and <b>[0ap]</b>.<br><br>
Example queries:<br><br>
<b>[10r]happy [5rp][13r]day []lucky</b><br>
<i>happy</i> is weighted 10 times it's normal weight.<br>
<i>day</i> is weighted 13 times it's normal weight.<br>
<i>"day lucky"</i>, the phrase, is weighted 5 times it's normal weight.<br>
<i>lucky</i> is given it's normal weight assigned by Gigablast.<br><br>
Also, keep in mind not to use these weighting operators between another query operator, like '+', and its affecting query term. If you do, the '+' or '-' operator will not work.<br><br>
-->
<a name=requirements></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Hardware Requirements</td></tr></table>
<br>
&lt;<i>Last Updated January 2014</i>&gt;
<br>
<br>
At least one computer with 4GB RAM, 10GB of hard drive space and any distribution of Linux with the 2.4.25 kernel or higher. For decent performance invest in Intel Solid State Drives. I tested other brands around 2010 and found that they would freeze for up to 500ms every hour or so to do "garbage collection". That is unacceptable in general for a search engine.
Plus, Gigablast reads and writes a lot of data at the same time under heavy spider and query loads, so disk will probably be your MAJOR bottleneck.<br><br>
<br>
<a name=perf></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Performance Specifications</td></tr></table>
<br>
&lt;<i>Last Updated January 2014</i>&gt;
<br>
<br>
Gigablast can store 100,000 web pages (each around 25k in size) per gigabyte of disk storage. A typical single-CPU Pentium 4 machine can index one to two million web pages per day even when Gigablast is near its maximum document capacity for the hardware. A cluster of N such machines can index at N times that rate.<br><br>
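For example, at those densities an index of ten million pages occupies on the order of 100GB of disk, and a four-machine cluster of the class described above could build it in roughly one to three days.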
<br>
<!--
<a name=files></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>List of Files</td></tr></table>
<br>
<b>1.</b> Create one directory for every Gigablast process you would like to run. Each Gigablast process is also called a <i>host</i> or a <i>node</i>. Multiple processes can exist on one physical server and is usually done to take advantage of mutiple cores, one process per core.<br><br>
<b>2.</b>
Each directory should have the following files and subdirectories:<br><br>
<table cellpadding=3>
<tr><td><b>gb</b></td><td>The Gigablast executable. Contains the web server, the database and the spider. This file is required to run gb. It will be created when gb first runs.</td></tr>
<tr><td><b>hosts.conf</b></td><td><a href=/hosts.conf>example</a>. This file describes each host (gb process) in the Gigablast network. Every gb process uses the same hosts.conf file. This file is required to run gb.
</td></tr>
<tr><td><b>gb.conf</b></td><td><a href=/gb.conf.txt>example</a>. Each gb process is called a <i>host</i> and each gb process has its own gb.conf file. This file is required to run gb.<tr><td><b>coll.XXX.YYY/</b></td><td>For every collection there is a subdirectory of this form, where XXX is the name of the collection and YYY is the collection's unique id. Contained in each of these subdirectories is the data associated with that collection.</td></tr>
-->
<!--<tr><td><b>coll.XXX.YYY/coll.conf</b></td><td>Each collection contains a configuration file called coll.conf. This file allows you to configure collection specific parameters. Every parameter in this file is also controllable via your the administrative web pages as well.</td></tr>
-->
<!--
<tr><td><b>trash/</b></td><td>Deleted collections are moved into this subdirectory. A timestamp in milliseconds since the epoch is appended to the name of the deleted collection's subdirectory after it is moved into the trash sub directory. Gigablast doesn't physically delete collections in case it was a mistake.</td></tr>
<tr><td><b>html/</b></td><td>A subdirectory that holds all the html files and images used by Gigablast. Includes Logos and help files.</tr>
<tr><td><b>antiword</b></td><td>Executable called by gbfilter to convert Microsoft Word files to html for indexing.</tr>
<tr><td><b>antiword-dir/</b></td><td>A subdirectory that contains information needed by antiword.</tr>
<tr><td><b>pdftohtml</b></td><td>Executable called by gbfilter to convert PDF files to html for indexing.</tr>
<tr><td><b>pstotext</b></td><td>Executable called by gbfilter to convert PostScript files to text for indexing.</tr>
<tr><td><b>ppthtml</b></td><td>Executable called by gbfilter to convert PowerPoint files to html for indexing.</tr>
<tr><td><b>xlhtml</b></td><td>Executable called by gbfilter to convert Microsoft Excel files to html for indexing.</tr>
</table>
-->
<!--<tr><td><b>gbfilter</b></td><td>Simple executable called by Gigablast with document HTTP MIME header and document content as input. Output is an HTTP MIME and html or text that can be indexed by Gigablast.</tr>-->
<!--<tr><td><b><a href=#gbstart>gbstart</a></b></td><td>An optional simple script used to start up the gb process(es) on each computer in the network. Otherwise, iff you have passwordless ssh capability then you can just use './gb start' and it will spawn an ssh command to start up a gb process for each host listed in hosts.conf.</tr>-->
<!--
</table>
<br><br>
<br>
-->
<a name=multisetup></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Setting up a Cluster</td></tr></table>
<br>
&lt;<i>Last Updated July 2014</i>&gt;
<br>
<br>
1. Locate the <a href=/hosts.conf.txt>hosts.conf</a> file. If installing from binaries it should be in the /var/gigablast/data0/ directory. If it does not exist yet, run <b>gb</b> or <b>./gb</b>, which will create one, then exit gb once the file has been written.
<br><br>
2. Update the <b>num-mirrors</b> in the <a href=/hosts.conf.txt>hosts.conf</a> file. Leave it as 0 if you do not want redundancy. If you want each shard to be mirrored by one other gb instance, then set this to 1. I find that 1 is typically good enough, provided that the mirror is on a different physical server. So if one server gets trashed there is another to serve that shard. The sole advantage in not mirroring your cluster is that you will have twice the disk space for storing documents. Query speed should be unaffected because Gigablast is smart enough to split the load evenly between mirrors when processing queries. You can send your queries to any shard and it will communicate with all the other shards to aggregate the results. If one shard fails and you are not mirroring then you will lose that part of the index, unfortunately.
<br><br>
3. Make one entry in the <a href=/hosts.conf.txt>hosts.conf</a> per physical core you have on your server. If an entry is on the same server as another, then it will need a completely different set of ports. Each gb instance also requires 4GB of ram, so you may be limited by your RAM before being limited by your cores. You can of course run multiple gb instances on a single core if you have the RAM, but performance will not be optimal.
<br><br>
4. Continue following the instructions for <a href=#scaling>Scaling the Cluster</a> below in order to get the other shards set up and running.
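<br><br>
A minimal first-time sequence, assuming the packaged install layout under /var/gigablast/data0/ (adjust the path to your own working directory):
<pre>
cd /var/gigablast/data0/
./gb              # first run creates a default hosts.conf; exit gb once it has
vi hosts.conf     # set num-mirrors (0 = no redundancy, 1 = each shard mirrored once)
                  # and add one host entry per physical core, giving same-server
                  # entries completely different sets of ports
</pre>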
<br>
<br>
<br>
<a name=scaling></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Scaling the Cluster</td></tr></table>
<br>
&lt;<i>Last Updated June 2014</i>&gt;
<br>
<br>
1. If your spiders are active, then turn off spidering in the <a href=/admin/master>master controls</a>.
<br><br>
2. If your cluster is running, shut down the clustering by doing a <b>gb stop</b> command on the command line OR by clicking on "save & exit" in the <a href=/admin/master>master controls</a>.
<br><br>
3. Edit the <a href=/hosts.conf.txt>hosts.conf</a> file in the working directory of host #0 (the first host entry in the hosts.conf file) to add the new hosts.
<br><br>
4. Ensure you can do passwordless ssh from host #0 to each new IP address you added. This generally requires running <b>ssh-keygen -t dsa</b> on host #0 to create the files <i>~/.ssh/id_dsa</i> and <i>~/.ssh/id_dsa.pub</i>. Then you need to insert the key in <i>~/.ssh/id_dsa.pub</i> into the <i>~/.ssh/authorized_keys2</i> file on every host, including host #0, in your cluster. Furthermore, you must do a <b>chmod 700 ~/.ssh/authorized_keys2</b> on each one, otherwise the passwordless ssh will not work.
<br><br>
5. Run <b>gb install &lt;hostid&gt;</b> on host #0 for each new hostid to copy the required files from host #0 to the new hosts. This will do an <i>scp</i> which requires the passwordless ssh. &lt;hostid&gt; can be a range of hostids like <i>5-12</i> as well.
<br><br>
6. Run <b>gb start</b> on the command line to start up all gb instances/processes in the cluster.
<br><br>
7. If your index was not empty, then click on <b>rebalance shards</b> in the <a href=/admin/master>master controls</a> to begin moving data from the old shards to the new shards. The <a href=/admin/hosts>hosts table</a> will let you know when the rebalance operation is complete. Queries can still be served during the rebalancing, but spidering cannot resume until it has completed.
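<br><br>
On host #0, the whole sequence sketched above might look like this (the hostid range 5-12 is only an example):
<pre>
gb stop                           # shut down the running cluster
vi hosts.conf                     # add entries for the new hosts
ssh-keygen -t dsa                 # create ~/.ssh/id_dsa and ~/.ssh/id_dsa.pub if needed
# append ~/.ssh/id_dsa.pub to ~/.ssh/authorized_keys2 on every host, then:
chmod 700 ~/.ssh/authorized_keys2
gb install 5-12                   # scp the required files to the new hosts
gb start                          # bring all gb instances back up
# finally click 'rebalance shards' in the master controls if the index was not empty
</pre>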
<br>
<br>
<br>
<!--
<a name=clustermaint></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Cluster Maintenance</td></tr></table>
<br>
&lt;<i>Caution: Old Documentation</i>&gt;
<br>
<br>
For the purposes of this section, we assume the name of the cluster is gf and all hosts in the cluster are named gf*. The Master host of the cluster is gf0. The gigablast working directory is assumed to be /a/ . We all assume you can do passwordless ssh from one machine to another, otherwise administration of hundreds of servers is not fun!
<br>
<br>
<b>To setup dsh:</b>
<ul>
<li> Install the dsh package, on debian it would be:<br> <b> $ apt-get install dsh</b><br>
<li>Go to the working directory in your bash shell and type <b>./gb dsh hostname | sort | uniq > all</b> to add the hostname of each server to the file <i>all</i>.
<br></ul>
<b>To setup dsh on a machine on which we do not have root:</b>
<ul>
<li>cd to the working directory
<li>Copy /usr/lib/libdshconfig.so.1.0.0 to the working directory.
<li><b>export LD_PATH=.</b>
</ul>
<b>To use the dsh command:</b>
<ul>
<li>run <b>dsh -c -f all hostname</b> as a test. It should execute the hostname command on all servers listed in the file <i>all</i>.
<li>to copy a master configuration file to all hosts:<br>
<b>$ dsh -c -f all 'scp gf0:/a/coll.conf /a/coll.conf'</b><br>
<li>to check running processes on all machines concurrently (-c option):<br>
<b>$ dsh -c -f all 'ps auxww'</b><br>
</ul>
<b>To prepare a new cluster or erase an old cluster:</b><ul>
<li>Save <b>/a/gb.conf</b>, <b>/a/hosts.conf</b>, and <b>/a/coll.*.*/coll.conf</b> files somewhere besides on /dev/md0 if they exist and you want to keep them.
<li>cd to a directory not on /dev/md0
<li>Login as root using <b>su</b>
<li>Use <b>dsh -c -f all 'umount /dev/md0'</b> to unmount the working directory. All login shells must exit or cd to a different directory, and all processes with files opened in /dev/md0 must exit for the unmount to work.
<li>Use <b>dsh -c -f all 'umount /dev/md0'</b> to unmount the working directory.
<li>Use <b>dsh -c -f all 'mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0'</b> to revuild the filesystem on the raid. CAUTION!!! WARNING!! THIS COMPLETELY ERASES ALL DATA ON /dev/md0
<li>Use <b>dsh -c -f all 'mount /dev/md0'</b> to remount it.
<li>Use <b>dsh -c -f all 'mkdir /mnt/raid/a ; chown mwells:mwells /mnt/raid/a</b> to create the 'a' directory and let user mwells, or other search engine administrator username, own it.
<li>Recopy over the necessary gb files to every machine.
</ul>
<br>
<b>To test a new gigablast executable:</b><ul>
<li>Change to the gigablast working directory.<br> <b>$ cd /a</b><li>Stop all gb processes on hosts.conf.<br> <b>$ gb stop</b><li>Wait until all hosts have stopped and saved their data. (the following line should not print anything)<br> <b>$ dsh -a 'ps auxww' | grep gb</b>
<li>Copy the new executable onto gf0<br> <b>$ scp gb user@gf0:/a/</b><li>Install the executable on all machines.<br> <b>$ gb installgb</b><br><li>This will copy the gb executable to all hosts. You must wait until all of the scp processes have completed before starting the gb process. Run ps to verify that all of the scp processes have finished.<br> <b>$ ps auxww</b><li>Run gb start<br> <b>$ gb start </b><li>As soon as all of the hosts have started, you can use the web interface to gigablast.<br></ul>
<b>To switch the live cluster from the current (cluster1) to another (cluster2):</b><ul>
<li>Ensure that the gb.conf of cluster2 matches that of cluster1, excluding any desired changes.<br><li>Ensure that the coll.conf for each collection on cluster2 matches those of cluster1, excluding any desired changes.<br><li>Thoroughly test cluster2 using the blaster program.<br><li>Test duplicate queries between cluster1 and cluster2 and ensure results properly match, with the exception of any known new changes.<br><li>Make sure port 80 on cluster2 is directing to the correct port for gb.<br> <b>$ iptables -t nat -A PREROUTING -i eth0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 2.2.2.2:8000</b><br><li>Test that cluster2 works correctly by accessing it from a browser using only it's IP in the address.<br><li>For both primary and secondary DNS servers, perform the following:<br><ul><li>Edit /etc/bind/db.&lt;hostname&gt; (i.e. db.gigablast.com)<br> <b>$ vi /etc/bind/db.gigablast.com</b><br> <li>Change lines using cluster1's ip to have cluster2's ip. It is recommended that comment out the old line with a ; at the front.<br> <b>i.e.: "www&nbsp;&nbsp;IN&nbsp;&nbsp;A&nbsp;&nbsp;1.1.1.1" >> "www&nbsp;&nbsp;IN&nbsp;&nbsp;A&nbsp;&nbsp;2.2.2.2"</b><br> <li>Edit /etc/bind/db.64<br> <b>$ vi /etc/bind/db.64</b><br> <li>Change lines with cluster1's last IP number to have cluster2's last IP number.<br> <b>i.e.: "1&nbsp;&nbsp;IN&nbsp;&nbsp;PTR&nbsp;&nbsp;www.gigablast.com" >> "2&nbsp;&nbsp;IN&nbsp;&nbsp;PTR&nbsp;&nbsp;www.gigablast.com"</b><br> <li>Restart named.<br> <b>$ /etc/rc3.d/S15bind9 restart</b><br></ul><li>Again, test that cluster2 works correctly by accessing it from a browser using only it's IP in the address.<br><li>Check log0 of cluster2 to make sure it is recieving queries.<br> <b>$ tail -f /a/log0</b><br><li>Allow cluster1 to remain active until all users have switched over to cluster2.<br></ul><br>
-->
<a name=trouble></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Cleaning Up After a Crash</td></tr></table>
<br>
&lt;<i>Last Updated Sep 2014</i>&gt;
<br>
<!--
<br>
<a name=disaster></a>
<b>A host in the network crashed. How do I temporarily decrease query latency on the network until I get it up again?</b><br>You can go to the <i>Search Controls</i> page and cut all nine tier sizes in half. This will reduce search result recall, but should cut query latency times in half for slower queries until the crashed host is recovered.<br>-->
<br><b>A host in the network crashed. What is the recovery procedure?</b><br>First determine if the host's crash was clean or unclean. It was clean if the host was able to save all data in memory before it crashed. If the log ended with <i>allExit: dumping core after saving</i> then the crash was clean, otherwise it was not.<br><br>If the crash was clean then you can simply restart the crashed host by typing <b>gb start <i>i</i></b> where <i>i</i> is the hostId of the crashed host. However, if the crash was not clean, like in the case of a sudden power outage, then in order to ensure no data gets lost, you must copy the data of the crashed host's twin. If it does not have a twin then there may be some data loss and/or corruption. In that case try reading the section below, <i>How do I minimize the damage after an unclean crash with no twin?</i>, but you may be better off starting the index build from scratch. To recover from an unclean crash using the twin, follow the steps below: <br><br>a. Click on 'all spiders off' in the 'master controls' of host #0, or host #1 if host #0 was the host that crashed.<br>b. If you were injecting content directly into Gigablast, stop.<br>c. Click on 'all just save' in the 'master controls' of host #0 or host #1 if host #0 was the one that crashed.<br>d. Determine the twin of the crashed host by looking in the <a href=/hosts.conf.txt>hosts.conf</a> file or on the <a href=/admin/hosts>hosts</a> page. The twin will have the same shard number as the crashed host.<br>e. Recursively copy the working directory of the twin to the crashed host using rcp since it is much faster than scp.<br>f. Restart the crashed host by typing <b>gb start <i>i</i></b> where <i>i</i> is the hostId of the crashed host. If it is not restartable, then skip this step.
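<br><br>
For an unclean crash with a twin, steps d-f above might look like this on the command line (the hostname, hostId and working directory here are examples):
<pre>
# on the twin, recursively copy its working directory to the crashed host
rcp -r /var/gigablast/data0/ crashedhost:/var/gigablast/
# then, from host #0's working directory, restart the crashed host (hostId 3 here)
gb start 3
</pre>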
<!--
<br>g. If the crashed host was restarted, wait for it to come back up. Monitor another host's <i>hosts</i> table to see when it is up, or watch the log of the crashed host.<br>h. If the crashed host was restarted, wait a minute for it to absorb all of the data add requests that may still be lingering. Wait for all hosts' <i>spider queues</i> of urls currently being spidered to be empty of urls.<br>i. Perform another <i>all just save</i> command to relegate any new data to disk.<br>j. After the copy completes edit the hosts.conf on host #0 and replace the ip address of the crashed host with that of the spare host.<br>k. Do a <b>gb stop</b> to safely shut down all hosts in the network.<br>l. Do a <b>gb installconf</b> to propagate the hosts.conf file from host #0 to all other hosts in the network (including the spare host, but not the crashed host)<br>m. Do a <b>gb start</b> to bring up all hosts under the new hosts.conf file.<br>n. Monitor all logs for a little bit by doing <i>dsh -c -f all 'tail -f /a/log? /a/log??'</i><br>o. Check the <i>hosts</i> table to ensure all hosts are up and running.
-->
<br><br><br><b>How do I minimize the damage after an unclean crash with no twin?</b><br>You may never be able to get the index 100% back into shape right now, but in the near future there may be some technology that allows Gigablast to easily recover from these situations. For now, though, try to determine the last url that was indexed and *fully* saved to disk. Every time you index a url some data is added to all of these databases: checksumdb, posdb (index), spiderdb, titledb and clusterdb. These databases all have in-memory data that is periodically dumped to disk. So you must determine the last time each of these databases dumped to disk by looking at the timestamp on the corresponding files in the appropriate collection subdirectories contained in the working directory. Whichever database was dumped to disk the longest time ago, use its timestamp to indicate when the last url was successfully added or injected. You might want to subtract thirty minutes from that timestamp to make sure, because it is really the time that that file <b>started</b> being dumped to disk that you are after, and that timestamp represents the time of the last write to that file. Now you can re-add the potentially missing urls from that time forward using the <a href=/admin/addurl>AddUrl page</a> and get a semi-decent recovery.
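<br><br>
One way to eyeball the last dump times, assuming the packaged layout and a collection subdirectory named coll.main.0 (yours will be named after your own collection):
<pre>
# newest files last; the database whose newest file is oldest tells you
# roughly when data was last safely written to disk
ls -lrt /var/gigablast/data0/coll.main.0/
</pre>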
<br>
<!--
<br><br><b>I get different results for the XML feed (raw=X) as compared to the HTML feed. What is going on?</b><br> Try adding the &rt=1 cgi parameter to the search string to tell Gigablast to return real time results.rt is set to 0 by default for the XML feed, but not for the HTML feed. That means Gigablast will only look at the root indexdb file when looking up queries. Any newly added pages will be indexed outside of the root file until a merge is done. This is done for performance reasons. You can enable real time look ups by adding &rt=1 to the search string. Also, in your search controls there are options to enable or disable real time lookups for regular queries and XML feeds, labeled as "restrict indexdb for queries" and "restrict indexdb for xml feed". Make sure both regular queries and xml queries are doing the same thing when comparing results.<br><br>Also, you need to look at the tier sizes at the top of the Search Controls page. The tier sizes (tierStage0, tierStage1, ...) listed for the raw (XML feed) queries needs to match non-raw in order to get exactly the same results. Smaller tier sizes yield better performance but yield less search results.
-->
<!--
<br><br><b>The spider is on but no urls are showing up in the Spider Queue table as being spidered. What is wrong?</b><br><table width=100%><tr><td>1. Set <i>log spidered urls</i> to YES on the <i>log</i> page. Then check the log to see if something is being logged.</td></tr><tr><td>2. Check the <a href=/admin/master>master controls</a> page for the following:<br> &nbsp; a. the <i>spider enabled</i> switch is set to YES.
-->
<!--<br> &nbsp; c. the <i>spider max kbps</i> control is set high enough.</td></tr></td></tr><tr><td>3. Check the <i>spider controls</i> page for the following:-->
<!--<br> &nbsp; c. the <i>spider delay</i> control is not TOO HIGH.-->
<!--
</td></tr></td></tr><tr><td>
3. Check the <a href=/admin/spider>spider controls</a> page for the following:
<br> &nbsp; a. the collection you wish to spider for is selected (in red).
<br> &nbsp; a. the <i>spidering enabled</i> is set to YES.
<br> &nbsp; a. the <i>max spiders</i> is not to LOW.
<br> &nbsp; c. the <i>spider delay in milliseconds</i> control is not TOO HIGH.
<br> &nbsp; b. the appropriate <i>spidering enabled</i> checkboxes in the URL Filters page are checked.
-->
<!--<br> &nbsp; c. the <i>spider start</i> and <i>end times</i> are set appropriately.-->
<!--<br> &nbsp; d. the <i>use current time</i> control is set correctly.-->
<!--</td></tr><tr><td>4. Make sure you have urls to spider by running 'gb dump s <collname>' on the command line to dump out spiderdb. See 'gb -h' for the help menu and more options.-->
<!--
</td></tr>
</table>
-->
<!-- If they are mostly "getting cached web page" and the IP address column is mostly empty, then Gigablast may be bogged down looking up the cached web pages of each url in the spider queue only to discover it is from a domain that was just spidered. This is a wasted lookup, and it can bog things down pretty quickly when you are spidering a lot of old urls from the same domain. Try setting <i>same domain wait</i> and <i>same ip wait</i> both to 0. This will pound those domain's server, though, so be careful. Maybe set it to 1000ms or so instead. We plan to fix this in the future.
-->
<!--
<br><br><a name=security></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>The Security System
</td></tr></table>
<br>
Right now any local IP can adminster Gigablast, so any IP on the same network with a netmask of 255.255.255.0 can get in. There was an accounting system but it was disabled for simplicity. So we need to at least partially re-enable it, but still keep things simple for single administrators on small networks.
Every request sent to the Gigablast server is assumed to come from one of four types of users. A public user, a spam assassin, a collection admin, or a master admin. A collection admin has control over the controls corresponding to a particular collection. A spam assassin has control over even fewer controls over a particular collection in order to remove pages from it. A master admin has control over all aspects and all collections. <br><br>To verify a request is from an admin or spam assassin Gigablast requires that the request contain a password or come from a listed IP. To maintain these lists of passwords and IPs for the master admin, click on the "security" tab. To maintain them for a collection admin or for a spam assassin, click on the "access" tab for that collection. Alternatively, the master passwords and IPs can be edited in the gb.conf file in the working dir and collection admin passwords and IPs can be edited in the coll.conf file in the collections subdirectory in the working dir. <br><br>To add a further layer of security, Gigablast can server all of its pages through the https interface. By changing http:// to https:// and using the SSL port you specified in hosts.conf, all requests and responses will be made secure.-->
<br><br>
<!--
<a name=build></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Building an Index
</td></tr></table>
<br>
<b>1.</b> Determine a collection name for your index. You may just want to use the default, unnamed collection. Gigablast is capable of handling many sub-indexes, known as collections. Each collection is independent of the other collections. You can add a new collection by clicking on the <b>add new collection</b> link on the <a href="/admin/spider">Spider Controls</a> page.<br><br>
<b>2.</b> Goto the <a href=/admin/settings>settings</a> page and add the sites you want. Only add seeds if you want to spider the whole web.
-->
<!--
<b>2.</b> Add rules to the <a href="/admin/filters">URL Filters page</a>. This is like a routing table but for URLs. The first rule that a URL matches will determine what priority queue it is assigned to. You can also use <a href="http://www.phpbuilder.com/columns/dario19990616.php3">regular expressions</a>. The special keywords you can used are described at the bottom of the rule table.
<br><br>
On that page you can tell Gigablast how often to
re-index a URL in order to pick up any changes to that URL's content.
You can assign a spider priority, the maximum number of outstanding spiders for that rule, the re-spider frequency and how long to wait before spidering another url in that same priority. It would be nifty to have an infile:myfile.txt rule that would match if the URL's subdomain were in that file, myfile.txt, however, until that is added you can added your file of subdomains to tagdb and set a tag field, such as <i>ruleset</i> to 3. Then you can say 'tag:ruleset==3' as one of your rules to capture them. This works because tagdb is hiearchical like that.
<br><br>
-->
<!--<b>3.</b> Test your Regular Expressions. Once you've submitted your
regular expressions try entering some URLs in the second pink box, entitled,
<i>URL Filters Test</i> on the <a href="/admin/filters">URL Filters page</a>. This will help you make sure that you've entered your regular expressions correctly. (NOTE: something happened to this box. It is missing and needs to be put back.)
<br><br>-->
<!--
<b>4.</b> Enable "add url". By enabling the add url interface you will be able to tell Gigablast to index some URLs. You must make sure add url is enabled on the <a href="/admin/master">Master Controls</a> page and also on the <a href="/admin/spider">Spider Controls</a> page for your collection. If it is disabled on the Master Controls page then you will not be able to add URLs for *any* collection.
<br><br>
<b>5.</b> Submit some seed URLs. Go to the <a href="/addurl">add url
page</a> for your collection and submit some URLs you'd like to put in your
index. Usually you want these URLs to have a lot of outgoing links that
point to other pages you would like to have in your index as well. Gigablast's
spiders will follow these links and index whatever web pages they point to,
then whatever pages the links on those pages point to, ad inifinitum. But you
must make sure that <b>spider links</b> is enabled on the <a href="/admin/spider">Spider Controls</a> page for your collection.
<br><br>
<b>5.a.</b> Check the spiders. You can go to the <b>Spider Queue</b> page to see what urls are currently being spidered from all collections, as well as see what urls exist in various priority queues, and what urls are cached from various priority queues. If you urls are not being spidered check to see if they are in the various spider queues. Urls added via the add url interface usually go to priority queue 5 by default, but that may have been changed on the Spider Controls page to another priority queue. And it may have been added to any of the hosts' priority queue on the network, so you may have to check each one to find it.<br><br>
If you do not see it on any hosts you can do an <b>all just save</b> in the Master Controls on host #0 and then dump spiderdb using gb's command line dumping function, <b>gb dump s 0 -1 1 -1 5</b> (see gb -h for help) on every host in the cluster and grep out the url you added to see if you can find it in spiderdb.<br><br>Then make sure that your spider start and end time on the Spider Controls encompas, and spidering is enabled, and spidering is enabled for that priority queue. If all these check out the url should be spidered asap.<br><br>
<b>6.</b> Regulate the Spiders. Given enough hardware, Gigablast can index
millions of pages PER HOUR. If you don't want Gigablast to thrash your or
someone else's website
then you should adjust the time Gigablast waits between page requests to the
same web server. To do this go to the
<a href="/admin/spider">Spider Controls</a> page for your collection and set
the <b>spider delay in milliseconds</b> value to how long you want Gigablast to wait in between page requests. This value is in milliseconds (ms). There are 1000 milliseconds in one second. That is, 1000 ms equals 1 second.
You must then click on the
<i>update</i> button at the bottom of that page to submit your new value. Or just press enter.
<br><br>
<b>7.</b> Turn on the new spider. Go to the
<a href="/admin/spider">Spider Controls</a> page for your collection and
turn on <b>spidering enabled</b>. It should be at the top of the
controls table. You may also have to turn on spidering from the
<a href="/admin/master">Master Controls</a> page which is a master switch for all
collections.
<br><br>
<b>8.</b> Monitor the spider's progress. By visiting the
<a href="/admin/spiderdb">Spider Queue</a> page for your collection you can see what
URLs are currently being indexed in real-time. Gigablast.com currently has 32hosts and each host spiders different URLs. You can easily switch between
these hosts by clicking on the host numbers at the top of the page.
-->
<!--<br><br><br>-->
<a name=spider></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>The Spider
</td></tr></table>
<br>
&lt;<i>Last Updated Sep 2014</i>&gt;
<br>
<br>
<b>Robots.txt</b>
<br><br>
The name of Gigablast's spider is Gigabot, but by default it uses GigablastOpenSource as the User-Agent name when downloading web pages.
Gigabot respects the <a href=/spider.html>robots.txt convention</a> (robot exclusion) as well as the noindex, noarchive and nofollow meta tags. You can tell Gigabot to ignore robots.txt files on the <a href="/admin/spider">Spider Controls</a> page.
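<br><br>
For example, a site that wants to keep Gigabot out of a directory and slow it down can use a robots.txt section like the following (the User-Agent name is the default one noted above; the path and delay are just examples):
<pre>
User-agent: GigablastOpenSource
Disallow: /private/
Crawl-delay: 10
</pre>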
<br><br>
<a name="spiderqueue">
<b>Spider Queues</b>
<br><br>
You can tell Gigabot what to spider by using the <i>site list</i> on the <a href=/admin/settings>Settings</a> page. You can have very precise control over the spider by also employing the <a href=/admin/filters>URL Filters</a> page, which allows you to prioritize and schedule the spiders based on the individual URL and many of its associated attributes, such as hop count, language, parent language, whether it is already indexed and the number of inlinks to its site, to name just a smidgen.
<br><br>
<br>
<!--
<b><a name=classifying>Classifying URLs</a></b>
<br><br>
You can specify different indexing and spider parameters on a per URL basis by one or more of the following methods:
<br><br>
<ul>
<li>Using the <a href="/admin/tagdb">tagdb interface</a>, you can assign a <a href=#ruleset>ruleset</a> to a set of sites. All you do is provide Gigablast with a list of sites and the ruleset to use for those sites.
You can enter the sites via the <a href="/admin/tagdb">HTML form</a> or you can provide Gigablast with a file of the sites. Each file must be limited to 1 Megabyte, but you can add hundreds of millions of sites.
Sites can be full URLs, hostnames, domain names or IP addresses.
If you add a site which is just a canonical domain name with no explicit host name, like gigablast.com, then any URL with the same domain name, regardless of its host name will match that site. That is, "hostname.gigablast.com" will match the site "gigablast.com" and therefore be assigned the associated ruleset.
Sites may also use IP addresses instead of domain names. If the least significant byte of an IP address that you submit to tagdb is 0 then any URL with the same top 3 IP bytes as that IP will be considered a match.
<li>You can specify a regular expression to describe a set of URLs using the interface on the <a href="/admin/filters"></a>URL filters</a> page. You can then assign a <a href=#ruleset>ruleset</a> that describes how to spider those URLs and how to index their content. Currently, you can also explicitly assign a spider frequency and spider queue to matching URLs. If these are specified they will override any values in the ruleset.</ul>
If the URL being spidered matches a site in tagdb then Gigablast will use the corresponding ruleset from that and will not bother searching the regular expressions on the <a href="/admin/filters"></a>URL filters</a> page.
-->
<!--
<br><br>
Gigablast uses spider queues to hold and partition URLs. Each spider queue has an associated priority which ranges from 0 to 127.
Furthermore, each queue is either denoted as <i>old</i> or <i>new</i>. Old spider queues hold URLs whose content is currently in the index. New spider queues hold URLs whose content is not in the index. The priority of a URL is the same as the priority of the spider queue to which it belongs. You can explicitly assign the priority of a URL by specifying it in a <a href=#ruleset>ruleset</a> to which that URL has been assigned or by assigning it on the <a href="/admin/filters"></a>URL filters</a> page.
<br><br>
On the <a href="/admin/spider">Spider Controls</a> page you can toggle the spidering of individual spider queues as well as link harvesting. More control on a per queue basis will be available soon, perhaps including the ability to assign a ruleset to a spider queue.
<br><br>
The general idea behind spider queues is that it allows Gigablast to prioritize its spidering. If two URLs are overdue to be spidered, Gigabot will download the one in the spider queue with the highest priority before downloading the other. If the two URLs have the same spider priority then Gigabot will prefer the one in the new spider queue. If they are both in the new queue or both in the old queue, then Gigabot will spider them based on their scheduled spider time.
<br><br>
Another aspect of the spider queues is that they allow Gigabot to perform depth-first spidering. When no priority is explicitly given for a URL then Gigabot will assign the URL the priority of the "linker from which it was found" minus one.
-->
<!--
<br><br>
<b>Custom Filters</b>
<br><br>
You can write your own filters and hook them into Gigablast. A filter is an executable that takes an HTTP reply as input through stdin and makes adjustments to that input before passing it back out through stdout. The HTTP reply is essentially the reply Gigabot received from a web server when requesting a URL. The HTTP reply consists of an HTTP MIME header followed by the content for the URL.
<br><br>
Gigablast also appends <b>Last-Indexed-Date</b>, <b>Collection</b>, <b>Url</b> and <b>DocId</b> fields to the MIME in order to supply your filter with more information. The Last-Indexed-Date is the time that Gigablast last indexed that URL. It is -1 if the URL's content is currently not in the index.
<br><br>
You can specify the name of your filter (an executable program) on the <a href="/admin/spider">Spider Controls</a> page. After Gigabot downloads a web page it will write the HTTP reply into a temporary file stored in the /tmp directory. Then it will pass the filename as the first argument to the first filter by calling the system() function. popen() was used previously but was found to be buggy under Linux 2.4.17. Your program should send the filtered reply back out through stdout.
<br><br>
You can use multiple filters by using the pipe operator and entering a filter like "./filter1 | ./filter2 | ./filter3". In this case, only "filter1" would receive the temporary filename as its argument, the others would read from stdin.
<br><br>
-->
<!--
<a name=quotas></>
<b>Document Quotas</b>
<br><br>
You can limit the number of documents on a per site basis. By default the site is defined to be the full hostname of a url, like, <i>www.ibm.com</i>. However, using tagdb you can define the site as a domain or even a subfolder within the url. By adjusting the &lt;maxDocs&gt; parameter in the <a href=#ruleset>ruleset</a> for a particular url you can control how many documents are allowed into the index from that site. Additionally, the quotaBoost tables in the same ruleset file allow you to influence how a quota is changed based on the quality of the url being indexed and the quality of its root page. Furthermore, the Spider Controls allow you to turn quota checking on and off for old and new documents. <br><br>The quota checking routine quickly obtains a decent approximation of how many documents a particular site has in the index, but this approximation becomes higher than the actual count as the number of big indexdb files increases, so you may want to keep &lt;indexdbMinFilesToMerge&gt; in <a href=#config>gb.conf</a> down to a value of around five or so to ensure a half way decent approximation. Typically you can excpect to be off by about 1000 to 2000 documents for every indexdb file you have.<br><br>
<br><br>
-->
<!--
<a name=injecting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Injecting Documents</td></tr></table>
<br>
&lt;<i>Caution: Old Documentation</i>&gt;
<br>
<br>
<b>Injection Methods</b>
<br><br>
Gigablast allows you to inject documents directly into the index by using the command <b>gb [-c &lt;<a href=#hosts>hosts.conf</a>&gt;] &lt;hostId&gt; --inject &lt;file&gt;</b> where &lt;file&gt; must be a sequence of HTTP requests as described below. They will be sent to the host with id &lt;hostId&gt;.<br><br>
You can also inject your own content a second way, by using the <a href="/admin/inject">Inject URL</a> page. <br><br>
Thirdly you can use your own program to feed the content directly to Gigablast using the same form parameters as the form on the Inject URL page.<br><br>
<br><br><br>
<b>Input Parameters</b>
<br><br>
When sending an injection HTTP request to a Gigablast server, you may optionally supply an HTTP MIME in addition to the content. This MIME is treated as if Gigablast's spider downloaded the page you are injecting and received that MIME. If you do supply this MIME you must make sure it is HTTP compliant, preceeds the actual content and ends with a "
" followed by the content itself. The smallest mime header you can get away with is "HTTP 200
" which is just an "OK" reply from an HTTP server.<br><br>
The cgi parameters accepted by the /inject URL for injecting content are the following: (<b>remember to map spaces to +'s, etc.</b>)<br><br>
<table cellpadding=4>
<tr><td bgcolor=#eeeeee>u=X</b></td>
<td bgcolor=#eeeeee>X is the url you are injecting. This is required.</td></tr>
<tr><td>c=X</b></td>
<td>X is the name of the collection into which you are injecting the content. This is required.</td></tr>
<tr><td bgcolor=#eeeeee>delete=X</b></td>
<td bgcolor=#eeeeee>X is 0 to add the URL/content and 1 to delete the URL/content from the index. Default is 0.</td></tr>
<tr><td>ip=X</b></td>
<td>X is the ip of the URL (i.e. 1.2.3.4). If this is ommitted or invalid then Gigablast will lookup the IP, provided <i>iplookups</i> is true. But if <i>iplookups</i> is false, Gigablast will use the default IP of 1.2.3.4.</td></tr>
<tr><td bgcolor=#eeeeee>iplookups=X</b></td>
<td bgcolor=#eeeeee>If X is 1 and the ip of the URL is not valid or provided then Gigablast will look it up. If X is 0 Gigablast will never look up the IP of the URL. Default is 1.</td></tr>
-->
<!--<tr><td>isnew=X</b></td>
<td>If X is 0 then the URL is presumed to already be in the index. If X is 1 then URL is presumed to not be in the index. Omitting this parameter is ok for now. In the future it may be put to use to help save disk seeks. Default is 1.</td></tr>-->
<!--
<tr><td>dedup=X</b></td>
<td>If X is 1 then Gigablast will not add the URL if another already exists in the index from the same domain with the same content. If X is 0 then Gigablast will not do any deduping. Default is 1.</td></tr>
<tr><td bgcolor=#eeeeee>rs=X</b></td>
<td bgcolor=#eeeeee>X is the number of the <a href=#ruleset>ruleset</a> to use to index the URL and its content. It will be auto-determined if <i>rs</i> is omitted or <i>rs</i> is -1.</td></tr>
<tr><td>quick=X</b></td>
<td>If X is 1 then the reply returned after the content is injected is the reply described directly below this table. If X is 0 then the reply will be the HTML form interface.</td></tr>
<tr><td bgcolor=#eeeeee>hasmime=X</b></td>
<td bgcolor=#eeeeee>X is 1 if the provided content includes a valid HTTP MIME header, 0 otherwise. Default is 0.</td></tr>
<tr><td>content=X</b></td>
<td>X is the content for the provided URL. If <i>hasmime</i> is true then the first part of the content is really an HTTP mime header, followed by "
", and then the actual content.</td></tr>
<tr><td bgcolor=#eeeeee>ucontent=X</b></td>
<td bgcolor=#eeeeee>X is the UNencoded content for the provided URL. Use this one <b>instead</b> of the <i>content</i> cgi parameter if you do not want to encode the content. This breaks the HTTP protocol standard, but is convenient because the caller does not have to convert special characters in the document to their corresponding HTTP code sequences. <b>IMPORTANT</b>: this cgi parameter must be the last one in the list.</td></tr>
</table>
<br><br>
<b>Sample Injection Request</b> (line breaks are \r\n):<br>
<pre>
POST /inject HTTP/1.0
Content-Length: 291
Content-Type: text/html
Connection: Close
u=myurl&c=&delete=0&ip=4.5.6.7&iplookups=0&dedup=1&rs=7&quick=1&hasmime=1&ucontent=HTTP 200
Last-Modified: Sun, 06 Nov 1994 08:49:37 GMT
Connection: Close
Content-Type: text/html
</pre>
<i>ucontent</i> is the unencoded content of the page we are injecting. It allows you to specifiy data without having to url encode it for performance and ease.
<br><br>
<b>The Reply</b>
<br><br>
<a name=ireply></a>The reply is always a typical HTTP reply, but if you defined <i>quick=1</i> then the *content* (the stuff below the returned MIME) of the HTTP reply to the injection request is of the format:<br>
<br>
&lt;X&gt; docId=&lt;Y&gt; hostId=&lt;Z&gt;<br>
<br>
OR<br>
<br>
&lt;X&gt; &lt;error message&gt;<br>
<br>
Where &lt;X&gt; is a string of digits in ASCII, corresponding to the error code. X is 0 on success (no error) in which case it will be followed by a <b>long long</b> docId and a hostId, which corresponds to the host in the <a href=#hosts>hosts.conf</a> file that stored the document. Any twins in its group (shard) will also have copies. If there was an error then X will be greater than 0 and may be followed by a space then the error message itself. If you did not define <i>quick=1</i>, then you will get back a response meant to be viewed on a browser.<br>
<br>
Make sure to read the complete reply before spawning another request, lest Gigablast become flooded with requests.<br>
<br>
Example success reply: <b>0 docId=123543 hostId=0</b><br>
Example error reply: <b>12 Cannot allocate memory</b>
<br>
<br>
See the <a href=#errors>Error Codes</a> for all errors, but the following
errors are most likely:<br>
<table cellpadding=2>
<tr><td><b> 12 Cannot allocate memory</b></td><td>There was a shortage of memory to properly process the request.</td></tr>
<tr><td><b>32771 Record not found</b></td><td>A cached page was not found when it should have been, likely due to corrupt data on disk.</td></tr>
<tr><td><b>32769 Try doing it again</b></td><td>There was a shortage of resources so the request should be repeated.</td></tr>
<tr><td><b>32863 No collection record</b></td><td>The injection was to a collection that does not exist.</td></tr>
</table>
<br>
<br><br>
<a name=deleting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Deleting Documents</td></tr></table>
<br>
&lt;<i>Caution: Old Documentation</i>&gt;
<br>
<br>
You can delete documents from the index two ways:<ul>
<li>Perhaps the most popular is to use the <a href="/admin/reindex">Reindex URLs</a> tool which allows you to delete all documents that match a simple query. Furthermore, that tool allows you to assign rulesets to all the domains of all the matching documents. All documents that match the query will have their docids stored in a spider queue of a user-specified priority. The spider will have to be enabled for that priority queue for the deletion to take place. Deleting documents is very similar to adding documents.<br><br>
<li>To delete a single document you can use the <a href="/admin/inject">Inject URL</a> page.
</ul>
-->
<!--
<br>
<a name=metas></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Indexing User-Defined Meta Tags</td></tr></table>
<br>
&lt;<i>Caution: Old Documentation</i>&gt;
<br>
<br>
Gigablast supports the indexing, searching and displaying of user-defined meta tags. For instance, if you have a tag like <i>&lt;meta name="foo" content="bar baz"&gt;</i> in your document, then you will be able to do a search like <i><a href="/search?q=foo%3Abar&dt=foo">foo:bar</a></i> or <i><a href="/search?q=foo%3A%22bar+baz%22&dt=foo">foo:"bar baz"</a></i> and Gigablast will find your document. <br><br>
You can tell Gigablast to display the contents of arbitrary meta tags in the search results, like <a href="/search?q=gigablast&s=10&dt=author+keywords%3A32">this</a>. Note that you must assign the <i>dt</i> cgi parameter to a space-separated list of the names of the meta tags you want to display. You can limit the number of returned characters of each tag to X characters by appending a <i>:X</i> to the name of the meta tag supplied to the <i>dt</i> parameter. In the link above, I limited the displayed keywords to 32 characters. The content of the meta tags is also provided in the &lt;display&gt; tags in the <a href="#output">XML feed</a>
<br><br>
Gigablast will index the content of all meta tags in this manner. Meta tags with the same <i>name</i> parameter as other meta tags in the same document will be indexed as well.
<br><br>
Why use user-defined metas? Because it is very powerful. It allows you to embed custom data in your documents, search for it and retrieve it.<br>
<br>
You can also explicitly specify how to index certain meta tags by making an &lt;index&gt; tag in the <a href="#ruleset">ruleset</a> as shown <a href="#rsmetas">here</a>. The specified meta tags will be indexed in the user-defined meta tag fashion as described above, in addition to any method described in the ruleset.<br>
<br>
<br>
-->
<!--
<a name=bigdocs></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Indexing Big Documents</td></tr></table>
<br>
&lt;<i>Caution: Old Documentation</i>&gt;
<br>
<br>
When indexing a document you will be bound by the available memory of the machine that is doing the indexing. A document that is dense in words can takes as much as ten times the memory as the size of the document in order to process it for indexing. Therefore you need to make sure that the amount of available memory is adequate to process the document you want to index. You can turn off Spam detection to reduce the processing overhead by a little bit.<br>
<br>
The <b>&lt;maxMem&gt;</b> tag in the <a href=#config>gb.conf</a> file controls the maximum amount of memory that the whole Gigablast process can use. HOWEVER, this memory is shared by databases, thread stacks, protocol stacks and other things that may or may not use most of it. Probably, the best way to see much memory is available to the Gigablast process for processing a big document is to look at the <b>Stats Page</b>. It shows you exactly how much memory is being used at the time you look at it. Hit refresh to see it change.<br>
<br>
You can also check all the tags in the gb.conf file that have the word "mem" in them to see where memory is being allocated. In addition, you will need to check the first 100 lines of the log file for the gigablast process to see how much memory is being used for thread and protocol stacks. These should be displayed on the Stats page, but are currently not.<br>
<br>
After ensuring you have enough extra memory to handle the document size, you will need to make sure the document fits into the tree that is used to hold the documents in memory before they get dumped to disk. The documents are compressed using zlib before being added to the tree so you might expect a 5:1 compression for a typical web page. The memory used to hold document in this tree is controllable from the <b>&lt;titledbMaxTreeMem&gt;</b> parameter in the gb.conf file. Make sure that is big enough to hold the document you would like to add. If the tree could accomodate the big document, but at the time is partially full, Gigablast will automatically dump the tree to disk and keep trying to add the big document.<br>
<br>
Finally, you need to ensure that the <b>max text doc len</b> and <b>max other doc len</b> controls on the <b>Spider Controls</b> page are set to accomodating sizes. Use -1 to indicate no maximum. <i>Other</i> documents are non-text and non-html documents, like PDF, for example. These controls will physically prohibit the spider from downloading more than this many bytes. This causes excessively long documents to be truncated. If the spider is downloading a PDF that gets truncated then it abandons it, because truncated PDFs are useless.<br>
<br>
<br>
-->
<!--
<a name=rolling></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Rolling the New Index</td></tr></table>
<br>
Just because you have indexed a lot of pages does not mean those pages are being searched. If the <b>restrict indexdb for queries</b> switch on the <a href="/admin/spider">Spider Controls</a> page is on for your collection then any query you do may not be searching some of the more recently indexed data. You have two options:<br><br>
<b>1.</b>You can turn this switch off which will tell Gigablast to search all the files in the index which will give you a realtime search, but, if &lt;indexdbMinFilesToMerge&gt; is set to <i>X</i> in the <a href=#config>gb.conf</a> file, then Gigablast may have to search X files for every query term. So if X is 40 this can destroy your performance. But high X values are indeed useful for speeding up the build time. Typically, I set X to 4 on gigablast.com, but for doing initial builds I will set it to 40.<br><br>
<b>2.</b>The second option you have for making the newer data searchable is to do a <i>tight merge</i> of indexdb. This tells Gigablast to combine the X files into one. Tight merges typically take about 2-4 minutes for every gigabyte of data that is merged. So if all of your indexdb* files are about 50 gigabytes, plan on waiting about 150 minutes for the merge to complete.<br><br>
<b>IMPORTANT</b>: Before you do the tight merge you should do a <b>disk dump</b> which tells Gigablast to dump all data in memory to disk so that it can be merged. In this way you ensure your final merged file will contain *all* your data. You may have to wait a while for the disk dump to complete because it may have to do some merging right after the dump to keep the number of files below &lt;indexdbMinFilesToMerge&gt;.<br><br>
Now if you are <a href=#input>interfacing to Gigablast</a> from another program you can use the <b>&rt=[0|1]</b> real time search cgi parameter. If you set this to 0 then Gigablast will only search the first file in the index, otherwise it will search all files.<br><br>
-->
<a name=dmoz></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Building a DMOZ Based Directory</td></tr></table>
<br>
&lt;<i>Last Updated Nov 18, 2014</i>&gt;
<br>
&lt;<i>Procedure tested on 32-bit Gigablast on Ubuntu 14.04 on Nov 18, 2014</i>&gt;
<br>
&lt;<i>Procedure tested on 64-bit Gigablast on Ubuntu 14.04 on Nov 18, 2014</i>&gt;
<br>
<br>
<b>Building the DMOZ Directory:</b>
<br><ul><li>Build the <i>dmozparse</i> program.<br> <b>$ make dmozparse</b><br>
<br>
<li>Download the latest content.rdf.u8 and structure.rdf.u8 files from http://rdf.dmoz.org/rdf into the <i>catdb/</i> directory on host 0, the first host listed in the <a href=/hosts.conf.txt>hosts.conf</a> file.
<b>
<br> $ mkdir catdb
<br> $ cd catdb
<br> $ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
<br> $ gunzip content.rdf.u8.gz
<br> $ wget http://rdf.dmoz.org/rdf/structure.rdf.u8.gz
<br> $ gunzip structure.rdf.u8.gz</b>
<br>
<br>
<li>Execute <i>dmozparse</i> in its directory with the <i>new</i> option to generate the catdb .dat files. These .dat files are in Gigablast's special format so Gigablast can quickly get all the associated DMOZ entries given a url. Seeing several <i>Missing parent for catid ...</i> messages is normal. Ultimately, it should put two files into the <i>catdb/</i> subdirectory: <i>gbdmoz.structure.dat</i> and <i>gbdmoz.content.dat</i>.
<br> <b>$ cd ..</b>
<br> <b>$ ./dmozparse new</b>
<br>
<br>
<li>Execute the installcat script command on host 0 to distribute the catdb files to all the hosts.<br>This just does an scp from host 0 to the other hosts listed in <a href=/hosts.conf.txt>hosts.conf</a>.<br> <b>$ ./gb installcat</b><br>
<br>
<li>Make sure all spiders are stopped and inactive.<br>
<br>
<li>Click the <a href=/admin/catdb>catdb</a> link in the admin section of Gigablast and click "<b>Generate Catdb</b>" (NOT <i>Update Catdb</i>). This will make a huge list of <i>catdb</i> records and then add them to all the hosts in the network in a sharded manner.<br>
<br>
<li>Once the command returns, typically in 3-4 minutes, Catdb will be ready for use and spidering. It will have created some ./catdb/catdb*.dat files which are Gigablast's database of DMOZ entry records. Any documents added that are from a site in DMOZ will show up in the search results with their appropriate DMOZ categories listed beneath. This will affect all collections.<br></ul>
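As a quick sanity check (a sketch; paths are relative to the Gigablast working directory described above, and exact filenames may vary by version), you can confirm the generated files are in place:
<pre>
# on host 0, after "./dmozparse new":
$ ls catdb/gbdmoz.structure.dat catdb/gbdmoz.content.dat

# on every host, after "Generate Catdb" has finished:
$ ls catdb/catdb*.dat catdb/catdb*.map
</pre>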
<br><b>Testing DMOZ:</b>
<ul><li>Go to the <a href=/admin/catdb>catdb</a> page, enter a url into the <i>Lookup Category Url</i> box and hit enter to see the associated DMOZ records for that url. For example, if you enter <a href=/admin/catdb?caturl=http%3A%2F%2Fwww.ibm.com%2F>http://www.ibm.com/</a> you should see a few entries. A command-line version of this lookup is sketched just below.
</ul>
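The same lookup can be done from the command line. This is just a sketch that reuses the default HTTP port 8000 from the examples elsewhere in this document; it returns the same HTML page you would see in the browser:
<pre>
$ curl "http://127.0.0.1:8000/admin/catdb?caturl=http%3A%2F%2Fwww.ibm.com%2F"
</pre>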
<br><b>Searching DMOZ:</b>
<ul>
<li>Gigablast provides the unique ability to search the content of the pages in the DMOZ directory. But in order to search the pages in DMOZ we have to index them. You can't search what is not indexed.
So execute <i>dmozparse</i> with the <i>urldump -s</i> option to create the html/gbdmoz.urls.txt.* files which contain all the URLs in DMOZ (URLs containing a '#' fragment are excluded). It will create several large files. Each file it creates is basically a VERY LARGE page of links and each link is a url in dmoz. Each of these files has a <i>&lt;meta name=spiderlinkslinks content=0&gt;</i> special Gigablast meta tag that says NOT to follow the links OF THE LINKS. So it will just spider the outlinks on this massive page and then stop. Furthermore, the massive page also has a &lt;meta name=noindex content=1&gt; tag that tells Gigablast to not index this massive page itself, but only spider the outlinks.
<br><b>$ ./dmozparse urldump -s</b>
<br>
<br><li>Now tell Gigablast to index each URL listed in each gbdmoz.urls.txt.* file. Make sure you specify the collection you are using for DMOZ; the example below uses <i>main</i>. You can use the <a href=/admin/addurl>add url</a> page to add the gbdmoz.urls.txt.* files or you can use curl (or wget) as shown below (a shell loop that submits all of the files in one go is sketched after this list):
<br>
<b>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.0"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.1"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.2"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.3"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.4"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.5"
<br>
$ curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.6"
<br>
</b>
<br><li>Each gbdmoz.urls.txt.* file contains a special meta tag which instructs Gigablast to index each DMOZ URL even if there was some external error, like a DNS or TCP timeout. If the error is internal, like an Out of Memory error, then the document will, of course, not be indexed, but it should be reported in the log. This is essential for making our version of DMOZ exactly like the official version.
<br>
<br><li>Finally, ensure spiders are enabled for your collection (<i>main</i> in the example above), and ALSO ensure that spiders are enabled in the Master Controls for all collections. Then the URLs you added above should be spidered and indexed. Hit reload on the <a href=/admin/spiderdb>Spider Queue</a> tab to ensure you see some spider activity for your collection.
<!--
<br> <li>Move the gbdmoz.urls.txt.* files to the <i>html</i> directory under the main Gigablast directory of host 0.<br>
<li>Go to "add url" under the admin section of Gigablast.<br>
<li><b>IMPORTANT:</b> Uncheck the strip session ids option.<br>
<li>In the "url of a file of urls to add" box, insert the hostname/ip and http port of host 0 followed by one of the gbdmoz.urls.txt.# files. Example: http://10.0.0.1:8000/gbdmoz.urls.txt.0<br>
<li>Press the "add file" button and allow the urls to be added to the spider.<br>
<li>Repeat for all the gbdmoz.urls.txt.# files.<br>
-->
</ul><br>
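Rather than issuing one curl command per file, a small shell loop can submit them all. This is only a sketch: it reuses the host, port and collection (<i>main</i>) from the examples above and assumes seven gbdmoz.urls.txt.* files were produced; adjust the range to match however many files dmozparse actually wrote.
<pre>
# submit gbdmoz.urls.txt.0 through gbdmoz.urls.txt.6 to the addurl handler
$ for i in $(seq 0 6); do
    curl "http://127.0.0.1:8000/addurl?id=1&spiderlinks=1&c=main&u=http://127.0.0.1:8000/gbdmoz.urls.txt.$i"
  done
</pre>
<br>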
<!--
<b>Updating an Existing Catdb With New DMOZ Data:</b><ul><li>Download the latest content.rdf.u8 and structure.rdf.u8 files from http://rdf.dmoz.org/rdf into the <i>catdb/</i> directory on host 0 with the added extension ".new".
<br> <b>$ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz -O content.rdf.u8.new.gz
<br> $ gunzip content.rdf.u8.new.gz
<br> $ wget http://rdf.dmoz.org/rdf/structure.rdf.u8.gz -O structure.rdf.u8.new.gz
<br> $ gunzip structure.rdf.u8.new.gz</b>
<br> <li>Execute <i>dmozparse</i> in the <i>cat</i> directory with the <i>update</i> option to generate the catdb dat.new and diff files.
<br> <b>$ dmozparse update</b><br> <li><b>NOTE:</b> If you wish to spider the new, changed, and removed urls from this update, execute <i>dmozparse</i> with the <i>diffurldump -s</i> option to generate the gbdmoz.diffurls.txt file (See below).<br> <b>$ dmozparse diffurldump -s</b><br> <li>Execute the installnewcat script command on host 0 to distribute the catdb files to all the hosts.<br> <b>$ gb installnewcat</b><br> <li>Make sure all spiders are stopped and inactive.<br> <li>Go to "catdb" in the admin section of Gigablast and click "Update Catdb."<br> <li>Once the command returns, Catdb will be ready for use and spidering.<br></ul><br><b>Spidering Urls For Updated Catdb:</b><ul><li>Execute <i>dmozparse</i> in the <i>cat</i> directory with the <i>diffurldump -s</i> option to create the gbdmoz.diffurls.txt.# files which contain all the new, changed, or removed urls in DMOZ.<br> <b>$ dmozparse diffurldump -s</b><br> <li>Move the gbdmoz.diffurls.txt.# files to the <i>html</i> directory under the main Gigablast directory of host 0.<br> <li>Go to "add url" under the admin section of Gigablast.<br> <li><b>IMPORTANT:</b> Uncheck the strip session ids option.<br> <li>In the "url of a file of urls to add" box, insert the hostname/ip and http port of host 0 followed by one of the gbdmoz.diffurls.txt.# files. Example: http://10.0.0.1:8000/gbdmoz.diffurls.txt.0<br> <li>Press the "add file" button and allow the urls to be added to the spider.<br> <li>Repeat for all the gbdmoz.diffurls.txt.# files.<br></ul><br>-->
<b>Deleting Catdb:</b>
<ul><li>Shut down Gigablast.<br>
<li>Delete <i>catdb-saved.dat</i> and all <i>catdb/catdb*.dat</i> and <i>catdb/catdb*.map</i> files from all hosts.<br>
<li>Start Gigablast.
<li>You will have to run <i>./dmozparse new</i> again to rebuild catdb.<br></ul><br>
<b>Troubleshooting:</b>
<ul><li><b>Dmozparse prints an error saying it could not open a file:</b><br> Be sure the rdf.u8 files were downloaded into the <i>catdb</i> directory and that the steps above have been followed correctly so that all the necessary files have been downloaded or created.<br>
<li><b>Dmozparse prints an Out of Memory error:</b><br> Some modes of dmozparse can require several hundred megabytes of system memory. Systems with insufficient memory, under heavy load, or lacking a correctly working swap may have problems running dmozparse. Attempt to free up as much memory as possible if this occurs.<br>
<li><b>How to tell if pages are being added with correct directory data:</b><br> All pages with directory data are indexed with special terms consisting of a prefix and a suffix. The prefixes are listed below and represent a specific feature under which the page was indexed. The suffix is always a numerical category ID. To search for one of these terms, simply perform a query of the form "prefix:suffix", e.g. "gbpcat:2" will list all pages under the Top category (or all pages in the entire directory).<br>
<ul><li>gbcatid - The page is listed directly under this base category.<br>
<li>gbpcatid - The page is listed under this category or any child of this category.<br>
<li>gbicatid - The page is listed indirectly under this base category, meaning it is a page found under a site listed in the base category.<br>
<li>gbipcatid - The page is listed indirectly under this category, meaning it is a page found under a site listed under this category or any child of this category.<br>
</ul>
<li><b>Pages are not being indexed with directory data:</b><br> First check that the sites being added by the spiders are actually in DMOZ. Next check whether the sites return category information when looked up under the Catdb admin section. If they come back with directory information, the site may just need to be respidered. If the lookup does not return category information and all hosts are properly running, Catdb may need to be rebuilt from scratch.<br>
<li><b>The Directory shows results but does not show sub-category listings, or a page error is returned and no results are shown:</b><br> Make sure the gbdmoz.structure.dat and structure.rdf.u8 files are in the <i>catdb</i> directory on every host. Also be sure the current dat files were built from the current rdf.u8 files. Check the log to see if Categories was properly loaded from file at startup (grep log# Categories).<br></ul><br>
<a name=logs></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>The Log System</td></tr></table>
<br>
&lt;<i>Last Updated March 2014</i>&gt;
<br>
<br>
<table>
<tr>
<td>Gigablast uses its own format for logging messages, for example,<br>
<pre>
1091228736104 0 Gigablast Version 1.234
1091228736104 0 Allocated 435333 bytes for thread stacks.
1091228736104 0 Failed to alloc 360000 bytes.
1091228736104 0 Failed to intersect lists. Out of memory.
1091228736104 0 Too many words. Query truncated.
1091228736104 0 GET http://hohum.com/foobar.html
1091228736104 0 http://hohum.com/foobar.html ip=4.5.6.7 : Success
1091228736104 0 Skipping xxx.com, would hammer IP.
</pre>
The first field, a large number, is the time in milliseconds since the epoch. This timestamp is useful for evaluating performance.<br>
<br>
The second field, a 0 in the above example, is the hostId (from <a href=/hosts.conf.txt>hosts.conf</a>) of the host that logged the message.<br>
<!--
<br>
The third field, INIT in the first line of the above example, is the type of log message. It can be any of the following:<br>
<br>
<table>
<tr>
<td>INIT</td>
<td>Messages printed at the initilization or shutdown of the Gigablast process.</td>
</tr>
<tr>
<td>WARN</td>
<td>Most messages fall under this category. These messages are usually due to an error condition, like out of memory.</td>
</tr>
<td>INFO</td>
<td>Messages that are given for information purposes only and not indicative of an error condition.</td>
</tr>
<tr>
<td>LIMIT</td>
<td>Messages printed when a document was not indexed because the document quota specified in the ruleset was breeched. Also, urls that were truncated because they were too long. Or a robots.txt file was too big and was truncated.</td>
</tr>
<tr>
<td>TIME</td>
<td>Timestamps, logged for benchmarking various processes.</td>
</tr>
<tr>
<td>DEBUG</td>
<td>Messages used for debugging.</td>
</tr>
<tr>
<td>LOGIC</td>
<td>Programmer sanity check messages. You should never see these, because they signify a problem with the code logic.</td>
</tr>
<tr>
<td>REMND</td>
<td>A reminder to the programmer to do something.</td>
</tr>
</table>
<br>
The fourth field is the resource that is logging the message. The resource can be one of the following:<table>
<tr>
<td>addurls</td>
<td>Messages related to adding urls. Urls could have been added by the spider or by a user via a web interface.</td>
</tr>
<tr>
<td>admin</td>
<td>Messages related to administrative functions and tools like the query-reindex tool and the sync tool.</td>
</tr>
<tr>
<td>build</td>
<td>Messages related to the indexing process.</td>
</tr>
<tr>
<td>conf</td>
<td>Messages related to <a href="#hosts">hosts.conf</a> or <a href="#config">gb.conf</a>.</td>
</tr>
<tr>
<td>disk</td>
<td>Messages related to reading or writing to the disk.</td>
</tr>
<tr>
<td>dns</td>
<td>Messages related to talking with a dns server.</td>
</tr>
<tr>
<td>http</td>
<td>Messages related to the HTTP server.</td>
</tr>
<tr>
<td>loop</td>
<td>Messages related to the main loop that Gigablast uses to process incoming signals for network and file communication.</td>
</tr>
<tr>
<td>merge</td>
<td>Messages related to performing file merges.</td>
</tr>
<tr>
<td>net</td>
<td>Messages related to the network layer above the udp server. Includes the ping and redirect-on-dead functionality.</td>
</tr>
<tr>
<td>query</td>
<td>Messages related to executing a query.</td>
</tr>
<tr>
<td>db</td>
<td>Messages related to a database. Fairly high level.</td>
</tr>
<tr>
<td>spcache</td>
<td>Messages related to the spider cache which is used to efficiently queue urls from disk.</td>
</tr>
<tr>
<td>speller</td>
<td>Messages related to the query spell checker.</td>
</tr>
<tr>
<td>thread</td>
<td>Messages related to the threads class.</td>
</tr>
<tr>
<td>topics</td>
<td>Messages related to related topics generation.</td>
</tr>
<tr>
<td>udp</td>
<td>Messages related to the udp server.</td>
</tr>
</table>
-->
<br>
The last field is the message itself.<br><br>
You can turn many messages on and off by using the <a href="/admin/log">Log Controls</a>.<br><br>
The same parameters on the Log Controls page can be adjusted in the gb.conf file.<br><br>
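Since the first field is a millisecond timestamp, you can convert it to a human-readable date on the command line. A sketch using GNU date (not a Gigablast feature), applied to the timestamp from the example above:
<pre>
# divide by 1000 to get seconds since the epoch, then let date format it
$ date -u -d @$((1091228736104 / 1000))
</pre>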
<a name=optimizing></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Optimizing
</td></tr></table>
<br>
&lt;<i>Last Updated Sep 2014</i>&gt;
<br>
<br>
Gigablast is a fairly sophisticated database that has a few things you can tweak to increase query performance or indexing performance.
<br><br>
<b>General Optimizations:</b>
<ul>
<li> Ensure that all drives can operate at maximum performance at the same time. Nowadays, the PCI-E bus could be the limiting factor, so be aware of that. Ensure you get maximum throughput and lots of disk seeks per second. By doing a <i>cat /proc/scsi/scsi</i> you can get info about your drives.
<li> Prevent Linux from unnecessary swapping. Linux will often swap out
Gigablast pages to make room for its disk cache. Using the swapoff command
to turn off swap can increase performance, but if the computer runs out
of memory the kernel will start killing processes without giving them a chance
to save their data (see the commands sketched after this list).
<li> Consider adding more servers to your architecture. You can add the new servers to the <a href=/hosts.conf.txt>hosts.conf</a> file and Gigablast can rebalance the shards while still serving queries.
<li> Run one gb process per physical core, not counting hyperthreaded cores; hyperthreaded cores do not help much here.
<!--using Rik van Riel's
patch, rc6-rmap15j, applied to kernel 2.4.21, you can add the
/proc/sys/vm/pagecache control file. By doing a
'echo 1 1 > /proc/sys/vm/pagecache' you tell the kernel to only use 1% of
the swap space, so swapping is effectively minimized.-->
</ul>
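A few of the checks above, from the command line. This is a sketch: swapoff requires root and, as noted above, should only be used when you are confident the machine will not run out of memory.
<pre>
$ cat /proc/scsi/scsi              # get info about your drives
$ sudo swapoff -a                  # turn off swap so Gigablast pages are never swapped out
$ lscpu | grep -E 'Core|Socket'    # count physical cores to decide how many gb processes to run
</pre>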
<br>
<b>Query Optimizations:</b>
<ul>
<!--<li> Set <b>restrict indexdb for queries</b> on the
<a href="/admin/spider">Spider Controls</a> page to YES.
This parameter can also be controlled on a per query basis using the
<a href=#rt><b>rt=X</b></a> cgi parm.
This will decrease freshness of results but typically use
3 or 4 times less disk seeks.
-->
<!--<li> If you want to spider at the same time, then you should ensure
that the <b>max spider disk threads</b> parameter on the
<a href="/admin/master">Master Controls</a> page is set to around 1
so the indexing/spidering processes do not hog the disk.
-->
<!--
<li> Set Gigablast to read-only mode to true to prevent Gigablast from using
memory to hold newly indexed data, so that this memory can
be used for caches. Just set the <b>&lt;readOnlyMode&gt;</b> parameter in your config file to 1.
<li> Increase the indexdb cache size. The <b>&lt;indexdbMaxCacheMem&gt;</b>
parameter in
your config file is how many bytes Gigablast uses to store <i>index lists</i>.
Each word has an associated index list which is loaded from disk when that
word is part of a query. The more common the word, the bigger its index list.
By enabling a large indexdb cache you can save some fairly large disk reads.
<li> Increase the clusterdb cache size. The <b>&lt;clusterdbMaxCacheMem&gt;</b>
parameter in
your config file is how many bytes Gigablast uses to store cluster records.
Cluster records are used for site clustering and duplicate removal. Every
URL in the index has a corresponding cluster record. When a url appears as a
search result its cluster record must be loaded from disk. Each cluster
record is about 12 to 16 bytes so by keeping these all in memory you can
save around 10 disk seeks every query.
-->
<li> Disable site clustering and dup removal. By specifying <i>&sc=0&dr=0</i>
in your query's URL you ensure that these two services are skipped and no
cluster records are loaded. You can also turn them off by default via the
<i>site cluster by default</i> and <i>dedup results by default</i> settings on the
<a href="/admin/search">Search Controls</a> page. But if someone explicitly
specifies <i>&sc=1</i> or <i>&dr=1</i> in their query URL then that overrides the default.
<li> Disable gigabit generation. If accessing from the API, set &amp;dsrt=0; otherwise set <i>results to scan for gigabits generation by default</i> to 0 on the Search Controls page for your collection. (A sample request combining these query-time switches is sketched after this list.)
<li> If you see lots of long black lines on the <a href=/admin/perf>Performance graph</a> then your disk is slowing everything down. Make sure that if you are doing realtime queries you do not have too many big posdb (index) files; a tight merge of everything should fix that problem. Otherwise, consider getting a RAID level 0 and faster disks, or SSDs. Perhaps the filesystem is severely fragmented. Or maybe your query traffic is repetitive: if the queries are sorted alphabetically, or you have many duplicate queries, then most of the workload might be falling on one particular host in the network, bottlenecking everything.
<li> If you see long purple lines in the <a href=/admin/perf>Performance graph</a> when Gigablast is handling queries slowly, then Gigablast is likely distributed over a slow network.
<!-- OR your tier sizes, adjustable on the Search Controls page, are way too high so that too much data is clogging the network. If your tier sizes are at the default values or lower, then the problem may be that the bandwidth between one gigablast host and another is below the required 1000Mbps. Try doing a 'dmesg | grep Mbps' to see what speed your card is operating at. Also try testing the bandwidth between hosts using the thunder program or try copying a large file using rcp and timing it. Do not use scp since it is often bottlenecked on the CPU due to the encryption that it does. If your gigabit card is operating at 100Mbps that can sometimes be fixed by rebooting. I've found that there is about a 20% chance that the reboot will make the card come back to 1000Mbps.-->
</ul>
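Putting the query-time switches above together, a request with site clustering, dup removal and gigabit generation all disabled might look like the sketch below; the host, port and collection parameter (<i>c</i>) are assumed to work as in the earlier addurl examples.
<pre>
$ curl "http://127.0.0.1:8000/search?q=test&sc=0&dr=0&dsrt=0&c=main"
</pre>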
<br>
<b>Spidering and Indexing Optimizations:</b>
<ul>
<!--<li> Set <b>restrict indexdb for spidering</b> on the -->
<li> Disable <b>link voting</b> or <b>link spam checking</b> in the spider controls if you do not care about them. This is also useful when doing millions of injections; you can run an indexdb rebuild with the rebuild tool later to pick up the link text.
<li> Disable dup checking. Gigablast will not allow any duplicate pages
from the same domain into the index when this is enabled. This means that
Gigablast must do about one disk seek for every URL indexed to verify it is
not a duplicate.
See the <i>deduping enabled</i> parm on the <a href=/admin/spider>Spider Controls</a> page.
<!--<li> Disable <b>link voting</b>. Gigablast performs at least one disk seek
to determine who link to the URL being indexed. If it does have some linkers
then the Cached Copy of each linker (up to 200) is loaded and the corresponding
link text is extracted. Most pages do not have many linkers so the disk
load is not too bad. Furthermore, if you do enable link voting, you can
restrict it to the first file of indexdb, <b>restrict indexdb for
spidering</b>, to ensure that about one seek is used to determine the linkers.
-->
<li> Enable <b>use IfModifiedSince</b>. This tells the spider not to do
anything if it finds that a page being reindexed is unchanged since the last
time it was indexed. Some web servers do not support the If-Modified-Since header,
so Gigablast will compare the old page with the new one to see if anything
changed. This fallback method is not quite as efficient as the first,
but it can still save ample disk resources.
<li> Disable <i>make image thumbnails</i> in the <a href=/admin/spider>spider controls</a> if you have not already.
<li> Check the status of each url in the current <a href=/admin/spiderdb>spider queue</a> (a command-line peek is sketched after this list). If the spider is always bottlenecking on <i>adding links</i>,
that is because it does a dns lookup on each link whose subdomain it has not previously encountered. Once a subdomain has been seen, its IP is stored in tagdb in the <i>firstIp</i> field. You might try using more DNS servers (add more in the <a href=/admin/master>Master Controls</a>).
<li> Ensure that the maximum spiders settings in the <a href=/admin/master>Master Controls</a> and in your collection's <a href=/admin/spider>Spider Controls</a> are set high enough.
</ul>
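To peek at the spider queue from the command line (a sketch; this just fetches the same HTML page as the browser, and the collection parameter <i>c</i> is an assumption based on the rest of the API):
<pre>
$ curl "http://127.0.0.1:8000/admin/spiderdb?c=main"
</pre>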
<!--<li> Don't let Linux's bdflush flush the write buffer to disk whenever it
wants. Gigablast needs to control this so it won't perform a lot of reads
when a write is going on. Try performing a 'echo 1 > /proc/sys/vm/bdflush'
to make bdflush more bursty. More information about bdflush is available
in the Linux kernel source Documentation directory in the proc.txt file.-->
<br>
<!--
<a name=stopwords></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Stop Words</center>
</td></tr></table>
<br>
<pre>
at be by of on
or do he if is
it in me my re
so to us vs we
the and are can did
per for had has her
him its not our she
you also been from have
here hers ours that them
then they this were will
with your about above ain
could isn their there these
those would yours theirs aren
hadn didn hasn ll ve
should shouldn
</pre>
<br>
<br>
<a name=phrasebreaks></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Phrase Breaks</center>
</td></tr></table>
<br>
Certain punctuation breaks up a phrase. All single character punctuation marks can be phrased across, with the exception of the following:
<table border=1 cellpadding=6><tr><td colspan=11><b>Breaking Punctuation (1 char)</b></td></tr>
<tr>
<td>?</td><td>!</td><td>;</td><td>{</td><td>}</td><td>&lt;</td><td>&gt;</td><td>171</td><td>187</td><td>191</td><td>161</td></tr></table>
<br><br>
The following 2 character punctuation sequences break phrases:
<table border=1 cellpadding=6><tr><td colspan=12><b>Breaking Punctuation (2 chars)( _ = whitespace = \t, \n, \r or \0x20)</b></td></tr>
<tr><td>?_</td><td>!_</td><td>;_</td><td>{_</td><td>}_</td><td>&lt;_</td><td>&gt;_</td><td>171_</td><td>187_</td><td>191_</td><td>161_</td><td>_.</td></tr>
<tr><td>_?</td><td>_!</td><td>_;</td><td>_{</td><td>_}</td><td>_&lt;</td><td>_&gt;</td><td>_171</td><td>_187</td><td>_191</td><td>_161</td><td>_.</td></tr>
<tr><td colspan=12>Any 2 character combination with NO whitespaces with the exception of "<b>/~</b>"</td></tr>
</table>
<br><br>
All 3 character sequences of punctuation break phrases with the following exceptions:
<table border=1 cellpadding=6><tr><td colspan=12><b><u>NON</u>-Breaking Punctuation (3 chars)( _ = whitespace = \t, \n, \r or \0x20)</b></td></tr>
<tr><td>://</td><td>___</td><td>_,_</td><td>_-_</td><td>_+_</td><td>_&amp;_</td></tr>
</table>
<br><br>
All sequences of punctuation greater than 3 characters break phrases with the sole exception being a sequence of strictly whitespaces.
-->
</div>