#include "gb-include.h" #include "TcpSocket.h" #include "HttpRequest.h" #include "Pages.h" #include "Spider.h" // MAX_SPIDERS #include "Users.h" bool sendPageOverview ( TcpSocket *s , HttpRequest *r ) { //char buf [ 256*1024 ]; //char *p = buf; //char *pend = buf + 256*1024; // . print standard header // . do not print big links if only an assassin, just print host ids SafeBuf sb; g_pages.printAdminTop ( &sb , s , r ); //int32_t user = g_pages.getUserType ( s , r ); //sprintf ( p , //" \n" //"
Admin Overview |
n=X | " "returns X search results. Default is 10. Max is 50. |
s=X | \n" "returns results starting at result #X. The first result is result #0. Default is 0. Max is 499. |
ns=X | " "returns X summary excerpts in the summary of each search result. Default is defined on a per collection basis in the Display Controls. |
site=X | \n" "returned results will have URLs from the site, X. |
plus=X | " "returned results will have all words in X. Like a default AND. |
minus=X | \n" "returned results will not have any words in X. |
rat=1 | " "returned results will have ALL query terms. This is also known as a default AND search. rat means Require All Terms. |
sc=X | \n" "X can be 0 or 1 to respectively disable or enable site clustering. Default is 1, but 0 if the raw parameter is used. |
dr=X | " "X can be 0 or 1 to respectively disable or enable duplicate result removal. Default is 1, but 0 if the raw parameter is used. |
raw=X | \n" "X ranges from 0 to 8 to specify the format of the search results. raw=8 requests the XML feed. |
raw=2 | " "Just display a list of docids between <pre> tags. Will display one more docid than requested if possible, so you know if you have more docids available or not. Does not have to generate summaries so it is a bit faster, especially if you do not perform site clustering or dup removal. |
qh=X | " "X can be 0 or 1 to respectively disable or enable highlighting of query terms in the titles and summaries. Default is 1, but 0 if the raw parameter is used. |
usecache=X | \n" "X can be 0 or 1 to respectively disable or enable caching of the search results pages. Default is 1. |
rcache=X | " "X can be 0 or 1 to respectively disable or enable reading from the search results page cache. Default is 1. |
wcache=X | \n" "X can be 0 or 1 to respectively disable or enable writing to the search results page cache. Default is 1. |
bq=X | " "X can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect. Default is 2. |
rt=X | \n" "X can be 0 or 1 to respectively disable or enable real time searches. If enabled, query response time will suffer because Gigablast will have to read from multiple files, usually 3 or 4, of varying ages, to satisfy a query. Default value of rt is 1, but 0 if the raw parameter is used. |
dt=X | " "X is a space-separated string of meta tag names. Do not forget to url-encode the spaces to +'s or %%20's. Gigablast will extract the contents of these specified meta tags out of the pages listed in the search results and display that content after each summary. i.e. &dt=description will display the meta description of each search result. &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result and limit the fields to 32 and 64 characters respectively. When receiving the XML feed from gigablast, the <display name=\"meta_tag_name\">meta_tag_content</display> XML tag will be used to convey each requested meta tag's content. |
spell=X | \n" "X can be 0 or 1 to respectively disable or enable spell checking. If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML |
topics=NUM+MAX+SCAN+ MIN+MAXW+META+ DEL+IDF+DEDUP | \n"
"\n"
"\n"
"\n"
"NUM is how many related topics you want returned. \n"
" \n" "MAX is the maximum number of topics to generate and store in cache, so if TW is increased, but still below MT, it will result in a fast cache hit.\n" " \n" "SCAN is how many documents to scan for related topics. If this is 30, for example, then Gigablast will scan the first 30 search results for related topics.\n" " \n" "MIN is the minimum score of returned topics. Ranges from 0%% to over 100%%. 50%% is considered pretty good. BUG: This must be at least 1 to get any topics back.\n" " \n" "MAXW is the maximum number of words per topic.\n" " \n" "META is the meta tag name to which Gigablast will restrict the content used to generate the topics. Do not specify thie field to restrict the content to the body of each document, that is the default.\n" " \n" "\n" "DEL is a single character delimeter which defines the topic candidates. All candidates must be separated from the other candidates with the delimeter. So <meta name=test content=\" cat dog ; pig rabbit horse\"> when using the ; as a delimeter would only have two topic candidates: \"cat dog\" and \"pig rabbit horse\". If no delimeter is provided, default funcationality is assumed.\n" " \n" "" "IDF is 1, the default, if you want Gigablast to weight topic candidates by their idf, 0 otherwise." " \n" "" "DEDUP is 1, the default, if the topics should be deduped. This involves removing topics that are substrings or superstrings of other higher-scoring topics." " \n" "" "" "Example: topics=49+100+30+1+6+author+%%3B+0+0" " \n" "The default values for those parameters with unspecifed defaults can be defined on the \"Search Controls\" page. " " \n" "" "XML feeds will contain the generated topics like: <topic><name><![CDATA[some topic]]></name><score>13</score><from>metaTagName</from></topic>" " \n" "Even though somewhat nonstandard, you can specify multiple &topic= parameters to get back multiple topic groups." " \n" "Performance will decrease if you increase the MAX, SCAN or MAXW." " | \n"
"
rdc=X | \n" "\n" "\n" "X is 1 if you want Gigablast to return the number of documents that " "contained each topic." " | \n" "
rd=X | \n" "\n" "\n" "X is 1 if you want Gigablast to return the list of docIds that " "contained each topic." " | \n" "
rp=X | \n" "\n" "\n" "X is 1 if you want Gigablast to return the popularity of each topic." " | \n" "
mdc=X | \n" "\n" "\n" "Gigablast will not display topics that are not contained in at least X " "documents. The default is configurable in the Search Controls page on a per " "collection basis." " | \n" "
t0=X | \n" "\n"
"\n"
"Gigablast will use at least X docids from each termlist. Used to get more accurate hit counts."
" \n" "For performance reasons, most large search engines nowadays only return a rough estimate of the number of search results, but you may desire to get a better approximation or even an exact count. Gigablast allows you to do this, but it may be at the expense of query resonse time." " \n" "By using the t0 variable you can tell Gigablast to use a minimum number of docids from each termlist. Typically, t0 defaults to something of around 10,000 docids. Often more docids than that are used, but this is just the minimum. So if Gigablast is forced to use more docids it will take longer to compute the search results on average, but it will give you a more precise hit count. By setting t0 to the truncation limit or higher you will max out the hit count precision." " \n" "Example: http://www.gigablast.com/search?q=test&t0=5000000\n" "" " |
d=X | \n" "X is the docId of the page you want returned. DocIds are 64-bit, so you'll need 8 bytes to hold one. |
ih=X | \n" "X is 1 to include the Gigablast header in the returned page, and 0 to exclude it. |
ibh=X | \n" "X is 1 to include the Gigablast BASE HREF tag in the cached page. The default is 1. |
q=X | \n" "X is the query that, when present, will cause Gigablast to highlight the query terms on the returned page. |
cas=X | \n" "" "X can be 0 or 1 to respectively disable or enable click and scroll. Default is 1. |
strip=X | \n" "" "X can be 0, 1 or 2. If X is 0 then no stripping is performed. If X is 1 then image and other tags are removed. An X of 2 is another form of removing tags. Default is 0. |
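\n" "Example: several of the above parameters combined into one request url (the host and values here are hypothetical, in the same style as the t0 example above): http://127.0.0.1:8000/search?q=test&n=20&s=0&raw=8&sc=1&dr=1\n"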
\n" "# The XML reply uses the Latin-1 Character Set (ISO 8859-1) when using raw=8\n" "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n" "# OR when using raw=9\n" "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n" "\n" "# It consists of one, and only one, response.\n" "<response>\n" "\n" " # If any error was received in processing the request, it will be here.\n" " <error>Out of memory</error>\n" " # The numeric code of the error, if any, goes here.\n" " # See all the Error Codes, but the " " # following errors are most likely:\n" " # %5li - A cached page was not found when it should have been.\n" " # %5li - There was a int16_tage of memory to properly process the request.\n" " # %5li - Queried collection does not exist.\n", (int32_t)ENOTFOUND, (int32_t)ENOMEM, (int32_t)ENOCOLLREC); sprintf( p , " <errno>32790</errno>\n" " # Total number of documents in the collection being searched.\n" " <docsInCollection>2060245584</docsInCollection>\n" " # An APPROXIMATION of the total number of search results for the query.\n" " <hits>4838158</hits>\n" " # This is \"1\" if more results are available after these, \"0\" if not.\n" " <moreResultsFollow>1</moreResultsFollow>\n" " # If present and value is 1, some words in the query were censored for content.\n" " <queryCensored>1</queryCensored>\n" " # If present, the value is the number of results that were censored for content.\n" " <resultsCensored>3</resultsCensored>\n" " # If this tag is present, it will hold an alternate spelling recommendation \n" " # for the query. The &spell=1 parameter must be present in the query url,\n" " # however, for you to get a spelling recommendation back.\n" " <spell>nose</spell>\n" " # If this tag is present, it contains the list of query words that were \n" " # ignored as individual words, but not necessarily as part of a phrase\n" " <ignoredWords>the in of</ignoredWords>\n" " # This is how many of the search results contain ALL of the query terms.\n" " # It is only used for printing the \"blue bar\" for doing SuperRecall\n" " <minNumExactMatches>300</minNumExactMatches>\n" "\n" " # The list of related topics, each enclosed by <topic> tags. \n" " # You must provide a topics parameter to the query url to get " "topics.\n" " <topic>\n" " # Each topic has a score. A score of 50%% or more is considered pretty good.\n" " <score>63</score>\n" " # Out of the documents scanned, how many contain this topic.\n" " <docCount>4</docCount>\n" " # The topic popularity. A measure of how popular the word or phrase is\n" " # based on how many web pages contain it overall. Ranges from 0 to 1000.\n" " # 1000 being the most popular.\n" " <popularity>16</popularity>\n" " # The docIds of the documents scanned that contain this topic.\n" " <docId>9030668134</docId>\n" " <docId>265962215563</docId>\n" " <docId>43940265200</docId>\n" " <docId>264861015824</docId>\n" " # The topic name.\n" " <name><![CDATA[Race Cars]]></name>\n" " # And OPTIONALLY the name of the meta tag it was derived from.\n" " <from>keywords</from>\n" " </topic>\n" "\n" " # The list of reference pages for the search results. Each reference is\n" " # enclosed in <reference> tags.\n" " <reference>\n" " # Each reference has a score based on its relevance to the query.\n" " <score>93</score>\n" " # Title of the reference page\n" " <title></title>\n" " # Url of the reference page\n" " <url><![CDATA[http://www.greatreference.com/]]></url>\n" " </reference>\n" "\n" " # The list of related pages for the search results. 
Each related page is\n" " # enclosed in <related> tags.\n" " <related>\n" " # Each related page has a score based on its relevance to the query.\n" " <score>91</score>\n" " # Title of the related page.\n" " <title></title>\n" " # Url of the related page.\n" " <url><![CDATA[http://www.similar.com/]]></url>\n" " # Summary of the related page.\n" " <sum><![CDATA[This page is similar to the results]]></sum>\n" " </related>\n" "\n" " # The list of search results, each enclosed in <result> tags.\n" " <result>\n" " # Each result has a title. This may be empty if none was found on the page.\n" " <title><![CDATA[My Homepage]]></title>\n" " # Each result has a summary. This may be empty. The summary is generated \n" " # so as to contain the query terms if possible.\n" " <sum><![CDATA[All about my interests and hobbies]]></sum>\n" " # If this result is categorized under the DMOZ Directory, data about each\n" " # category it is in will be enclosed in a <dmoz> tag.\n" " <dmoz>\n" " # The category ID number of this category.\n" " <dmozCatId>172</dmozCatId>\n" " # The path of this category in the directory.\n" " <dmozCat><![CDATA[Health: Dentistry]]></dmozCat>\n" " # Title of this result as listed in the directory.\n" " <dmozTitle><![CDATA[My Homepage]]></dmozTitle>\n" " # Description of this page as listed in the directory.\n" " <dmozDesc><![CDATA[A Dentist's Home Page]]></dmozDesc>\n" " </dmoz>\n" " # If the directory is being given along with the results, this is the number of\n" " # stars given to this page based on its quality.\n" " <stars>3</stars>\n" " # Each result may have a sequence of <display> tags if the feed input\n" " # contained a dt parameter. This allows you to extract\n" " # information contained in meta tags in the content of each search result.\n" " # To obtain the contents of the author meta tag, you would need to pass in\n" " # dt=author.\n" " <display name=\"author\"><![CDATA[Contents of the meta author tag]]></display>\n" " # Each result has a URL. This should never be empty.\n" " <url><![CDATA[http://www.mydomain.com/mypage.html]]></url>\n" " # The size of the page in kilobytes. Accurate to the tenth of a kilobyte.\n" " <size>5.6</size>\n" " # The time the page was last INDEXED. It may not have been indexed in a \n" " # long time if the page's content has not changed. The time is expressed \n" " # in seconds since the epoch. (Jan 1, 1970)\n" " <spidered>1064367311</spidered> \n" " # The time the page was last modified. This is taken from the HTTP reply \n" " # of the web server when downloading the page. It is 0 if unknown. The time\n" " # is expressed in seconds since the epoch. (Jan 1, 1970)\n" " <lastMod>1058477041</lastMod>\n" " # The assigned docid for this page. This number is unique and used \n" " # internally by Gigablast to identify this page. It is used to retrieve the\n" " # \"cached copy\" of the page.\n" " <docId>65990704587</docId>\n" " # When doing site clustering, this tag will be present if the result is \n" " # from the same hostname as a previous result for the same query. It \n" " # indicates that you might want to indent the result. Any further results \n" " # from this same hostname will be stripped from the feed.\n" " <clustered>1</clustered>\n" " # When Topic Clustering is being used, these will display results which \n" " # are considered similar to this result and have been clustered under it. \n" " # Each similar result is enclosed in a <similar> tag. 
\n" " <similar>\n" " # The url for the similar result.\n" " <url><![CDATA[http://www.similar.com/]]></url>\n" " # The title of the similar result.\n" " <title><![CDATA[A similar topic]]></title>\n" " </similar>\n" " # If this is present and set to 1, there are more similar results beyond \n" " # those given here. \n" " <moreSimilar>1</moreSimilar>\n" " # This is a standard HTTP MIME content classification of the result. It is \n" " # not present if the page is text/html. Otherwise, it will be one of the\n" " # following: text/plain\n" " # text/xml\n" " # application/pdf\n" " # application/msword\n" " # application/vnd.ms-excel\n" " # application/mspowerpoint\n" " # application/postscript\n" " <contentType>text/plain</contentType>\n" " # The documents are all sorted by this score. This score is a generally a\n" " # product of the WEIGHT of the query term and the COUNT of the query term\n" " # in this document. The WEIGHT is usually influenced by them term frequency\n" " # of the query term (rarer terms get more WEIGHT), by the additional weight\n" " # received by phrases which can be adjusted in the Master Controls, and,\n" " # possibly, by any user-defined weight in the query (See Weighting Query Terms).\n" " # This score is normalized by dividing by the maximum\n" " # score for all documents in the search results and then making it into a\n" " # percentage, so the score ranges from 0 to 100, and the first result\n" " # should always have score 100.\n" " <score>100</score>\n" " # This is the absolute score. Useful for merging results from other\n" " # collections or other search engines.\n" " <absScore>5132</absScore>\n" " # This is the language the page was detected as.\n" " <language><![CDATA[English]]></language>\n" " # The character set this page was originally encoded in. \n" " <charset><![CDATA[utf-8]]></charset>\n" " </result>\n" "\n" " <result>\n" " ...\n" " </result>\n" "\n" " ...\n" "\n" " # If the directory has been requested, this node will include the directory\n" " # structure for the requested category. Typically this is above the results.\n" " <directory>\n" " # Category ID for the displayed directory structure.\n" " <dirId>172</dirId>\n" " # Directory path of this category listing.\n" " <dirName>Health: Dentistry</dirName>\n" " # Specifies if the directory listing is displayed in a Right-To-Left format.\n" " <dirIsRTL>1</dirIsRTL>\n" " # Sub-Categories listed as letters meant to be displayed as a letter bar.\n" " # Each sub-category will be enclosed in a <letterbar> tag.\n" " <letterbar><![CDATA[Health/Dentistry/A]]>" " # Every sub category will include a count of how many urls are listed under it.\n" " <urlcount>5<urlcount>\n" " </letterbar>\n" " # Normal sub-categories listed in groups. These are listed in order of group\n" " # and alphabetically within each group. Each sub-category is enclosed in a\n" " # <narrow2>, <narrow1>, or <narrow> tag.\n" " <narrow2><![CDATA[Health/Dentistry/Regional]]>\n" " <urlcount>0<urlcount>\n" " </narrow2>\n" " <narrow1><![CDATA[Health/Dentistry/Association]]>\n" " <urlcount>122<urlcount>\n" " </narrow1>\n" " <narrow><![CDATA[Health/Dentistry/Children]]>\n" " <urlcount>24<urlcount>\n" " </narrow>\n" " # Symbolically linked sub-categories physically under a different category.\n" " # These will be interwoven alphabetically within the respective narrow groups.\n" " # The name listed before the path is the symbolic name. 
Each symbolically linked\n" " # sub-category is enclosed in a <symbolic2>, <symbolic1>, or \n" " # <symbolic> tag.\n" " <symbolic2><![CDATA[Dentophobia:Health/Mental_Health/Disorders/Anxiety/Phobias/Dentophobia]]>\n" " <urlcount>2</urlcount>\n" " </symbolic2>\n" " <symbolic1><![CDATA[Dental_Laboratories:Business/Healthcare/Products_and_Services/Dentistry/Dental_Laboratories]]>\n" " <urlcount>71</urlcount>\n" " </symbolic1>\n" " <symbolic><![CDATA[Products:Shopping/Health/Dental]]>\n" " <urlcount>71</urlcount>\n" " </symbolic>\n" " # Separate categories in the directory which are related to this one.\n" " <related><![CDATA[Society/Issues/Health/Dentistry]]>\n" " <urlcount>4</urlcount>\n" " </related>\n" " # This category in other languages in the directory.\n" " <altlang><![CDATA[Basque:World/Euskara/Osasuna/Odontologia]]>\n" " <urlcount>7</urlcount>\n" " </altlang>\n" " </directory>\n" "\n" "</response>\n" "\n" "\n" "" "\n" "
Key | |
a | " "Error used by an add or delete collection " "operation." " |
i | " "Error used by an inject (or delete) operation." " |
s | " "Error used by a search operation." " |
C error codes | ||
%"INT32" | " "%s | ", c,i,strerror(i)); char *s = p; // is it for injector, search results or addcoll interface? // use 'i','s','a' switch ( i ) { case EPERM : p += sprintf(p,"a - Did not have permission in the " "working dir to create/delete the " "collection subdir."); break; case ENOENT: p += sprintf(p,"a - When creating the subdir for the " "collection in the working dir, a " "directory component in pathname " "does not exist or is a dangling " "symbolic link."); break; case EIO : p += sprintf(p,"a,i,s - There was an error writing or " "reading data to or from the disk, most " "likely due to a hardware failure."); break; case EACCES: p += sprintf(p,"a,i - The working directory, or its " "parent does not allow write " "permission."); break; case EEXIST: p += sprintf(p,"a - The collection subdir already " "exists in the working dir."); break; case ENOSPC: p += sprintf(p,"a,i - There is no room on the drive " "to write data because the drive is " "full, or the user's disk quota is " "exhausted."); break; case EBADF: p += sprintf(p,"a,i,s - Read or write on a bad file " "descriptor. This should not happen."); break; case ENOBUFS : p += sprintf(p,"a - Collection name limit of %"INT32" is " "exceeded.",(int32_t)MAX_COLL_LEN); break; case ENOMEM: p += sprintf(p,"a,i,s - Out of memory."); break; } // don't print if not used! if ( s == p ) { p = b; continue; } if ( c[0] == 'e' ) c = "ffffff"; else c = "eeeeee"; p += sprintf(p," |
Gigablast error codes" " | ||
%"INT32" | " "%s | ",
c,i,mstrerror(i));
char *s = p;
// is it for injector, search results or addcoll interface?
// use 'i','s','a'
switch ( i ) {
case ETRYAGAIN:
p += sprintf(p,"a,i,s - Resources temporarily "
"unavailable.");
break;
case ENOCOLLREC:
p += sprintf(p,"a,i,s - Referenced collection does "
"not exist.");
break;
case EBADENGINEER :
p += sprintf(p,"a - Collection name being added "
"contains an illegal character, or an "
"empty name was provided, or the name "
"is more than %"INT32" characters. ", (int32_t)MAX_COLL_LEN); // SpiderLoop.cpp Msg7.cpp PageInject.cpp p += sprintf(p,"i - No URL was provided, or URL " "has no hostname. Or provided URL is " "currently being injected. Or %"INT32" " "injects are currently in progress.", (int32_t)MAX_SPIDERS); break; //case EURLTOOLONG : //p += sprintf(p,"i - Injected URL was longer than " // "%"INT32" characters.",(int32_t)MAX_URL_LEN); //break; case EBADREPLY: p += sprintf(p,"i - Received bad internal reply. You " "should never see this error."); break; case EEXIST: p += sprintf(p,"a - Adding a collection name that " "already exists."); break; case ENOTFOUND: p += sprintf(p,"i - When looking up old document " "for injected URL it was not found when " "it should have been. This is due to " "data corruption."); break; case ENODOCID: p += sprintf(p,"i - No docids were available to " "inject the URL. The database has " "reached its limit."); break; case EBUFTOOSMALL: p += sprintf(p,"i - Injected URL was longer than " "%"INT32" characters. Or the injected " "document was too big to fit in memory, " "so consider increasing " " |
gb | The Gigablast executable. Contains the web server, the database and the spider. This file is required to run gb. |
hosts.conf | This file describes each host (gb process) in the Gigablast network. Every gb process uses the same hosts.conf file. This file is required to run gb." "" " |
gb.conf | Each gb process is called a host and each gb process has its own gb.conf file. This file is required to run gb." " |
coll.XXX.YYY/ | For every collection there is a subdirectory of this form, where XXX is the name of the collection and YYY is the collection's unique id. Contained in each of these subdirectories is the data associated with that collection. |
coll.XXX.YYY/coll.conf | Each collection contains a configuration file called coll.conf. This file allows you to configure collection specific parameters. Every parameter in this file is also controllable via the administrative web pages. |
trash/ | Deleted collections are moved into this subdirectory. A timestamp in milliseconds since the epoch is appended to the name of the deleted collection's subdirectory after it is moved into the trash subdirectory. Gigablast doesn't physically delete collections in case the deletion was a mistake. |
tagdbN.xml | Several files where N is an integer. The files must be contiguous, starting with an N of 0. Each one of these files is a ruleset file. This file is required for indexing and deleting documents. |
html/ | A subdirectory that holds all the html files and images used by Gigablast. Includes Logos and help files. |
dict/ | A subdirectory that holds files used by the spell checker and the GigaBits generator. Each file in dict/ holds all the words and phrases starting with a particular letter. The words and phrases in each file are sorted by a popularity score. |
antiword | Executable called by gbfilter to convert Microsoft Word files to html for indexing. |
.antiword/ | A subdirectory that contains information needed by antiword. |
pdftohtml | Executable called by gbfilter to convert PDF files to html for indexing. |
pstotext | Executable called by gbfilter to convert PostScript files to text for indexing. |
ppthtml | Executable called by gbfilter to convert PowerPoint files to html for indexing. |
xlhtml | Executable called by gbfilter to convert Microsoft Excel files to html for indexing. |
gbfilter | Simple executable called by Gigablast with document HTTP MIME header and document content as input. Output is an HTTP MIME and html or text that can be indexed by Gigablast. |
gbstart | An optional simple script used to start up the gb process(es) on each computer in the network. Otherwise, if you have passwordless ssh capability then you can just use './gb start' and it will spawn an ssh command to start up a gb process for each host listed in hosts.conf. |
1. Set log spidered urls to YES on the log page. Then " "check the log to see if something is being logged." " |
2. Check the master controls page for the following: " " a. the spider enabled switch is set to YES. " " b. the spider max pages per second control is set " "high enough. " " c. the spider max kbps control is set high enough. |
3. Check the spider controls page for the following: " " a. the collection you wish to spider for is selected (in red). " " b. the old or new spidering is set to YES. " " c. the appropriate old and new spider priority " "checkboxes are checked. " " d. the spider start and end times are set " "appropriately. " " e. the use current time control is set correctly. " " f. the spider max pages per second control is set " "high enough. " " g. the spider max kbps control is set high enough. |
4. If you have urls from only a few domains then the same domain " "wait or same ip wait controls could be limiting the spidering " "of the urls such that you do not see any in the Spider Queue table. If the " "indexed document count on the home page is increasing then this may be the " "case. Even if the count is not increasing, it may still be the case if the " "documents all have errors, like 404 not found. |
"
"4. Make sure you have urls to spider by running 'gb dump s |
In the current spider queue, what are the statuses of each url? If " "they are mostly \"getting cached web page\" and the IP address column is " "mostly empty, then Gigablast may be bogged down looking up the cached web " "pages of each url in the spider queue only to discover it is from a domain " "that was just spidered. This is a wasted lookup, and it can bog things down " "pretty quickly when you are spidering a lot of old urls from the same " "domain. " "Try setting same domain wait and same ip wait both to 0. This " "will pound those domains' servers, though, so be careful. Maybe set it to " "1000ms or so instead. We plan to fix this in the future." " |
Try increasing the <tfndbMaxPageCacheMem> in the gb.conf for all hosts in the cluster to minimize the disk seeks into tfndb as seen on the Stats page. Stop all gb processes then use ./gb installconf to distribute the gb.conf to all hosts in the cluster. You might also try decreasing the size of the url filters table; every regular expression in that table is consulted for every link added and it can really block the cpu.\n" " |
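\n" "For example, the corresponding line in gb.conf would look something like the following (the value shown here is illustrative only, not a recommendation): <tfndbMaxPageCacheMem> 300000000</>\n"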
u=X | \n" "X is the url you are injecting. This is required. |
c=X | \n" "X is the name of the collection into which you are injecting the content. This is required. |
delete=X | \n" "X is 0 to add the URL/content and 1 to delete the URL/content from the index. Default is 0. |
ip=X | \n" "X is the ip of the URL (i.e. 1.2.3.4). If this is omitted or invalid then Gigablast will look up the IP, provided iplookups is true. But if iplookups is false, Gigablast will use the default IP of 1.2.3.4. |
iplookups=X | \n" "If X is 1 and the ip of the URL is not valid or provided then Gigablast will look it up. If X is 0 Gigablast will never look up the IP of the URL. Default is 1. |
dedup=X | \n" "If X is 1 then Gigablast will not add the URL if another already exists in the index from the same domain with the same content. If X is 0 then Gigablast will not do any deduping. Default is 1. |
rs=X | \n" "X is the number of the ruleset to use to index the URL and its content. It will be auto-determined if rs is omitted or rs is -1. |
quick=X | \n" "If X is 1 then the reply returned after the content is injected is the reply described directly below this table. If X is 0 then the reply will be the HTML form interface. |
hasmime=X | \n" "X is 1 if the provided content includes a valid HTTP MIME header, 0 otherwise. Default is 0. |
content=X | \n" "X is the content for the provided URL. If hasmime is true then the first part of the content is really an HTTP mime header, followed by \"\r\n\r\n\", and then the actual content. |
ucontent=X | \n" "X is the UNencoded content for the provided URL. Use this one instead of the content cgi parameter if you do not want to encode the content. This breaks the HTTP protocol standard, but is convenient because the caller does not have to convert special characters in the document to their corresponding HTTP code sequences. IMPORTANT: this cgi parameter must be the last one in the list. |
\n" "POST /inject HTTP/1.0\r\n\n" "Content-Length: 291\r\n\n" "Content-Type: text/html\r\n\n" "Connection: Close\r\n\n" "\r\n\n" "u=myurl&c=&delete=0&ip=4.5.6.7&iplookups=0&dedup=1&rs=7&quick=1&hasmime=1&ucontent=HTTP 200\r\n\n" "Last-Modified: Sun, 06 Nov 1994 08:49:37 GMT\r\n\n" "Connection: Close\r\nContent-Type: text/html\r\n\r\n\n" "Overview \n" "This is the unencoded content of the page we are injecting.\n" "
%5li %s | " "There was a shortage of memory to properly " "process the request. |
%05"INT32" %s | " "A cached page was not found when it should have " "been, likely due to corrupt data on disk. |
%5li %s | " "There was a shortage of resources so the " "request should be repeated. |
%5li %s | " "The injection was to a collection that does " "not exist. |
\n" "<name>X</>" " | \n" "This tag tells Gigablast what part of the document to index. You can have multiple <name> tags in the same index rule. X can be one of the following values: \n" "" "
| \n"
"||||||||||||||||
\n" "<prefix>X</>" " | \n" "\n" "If present, Gigablast will index the words and phrases with the specified prefix, X. Fielded searches can then be performed. Example: <prefix>title</>" " | \n" "||||||||||||||||
\n" "<maxQualityForSpamDetect>X</>" " | \n" "\n" "Spam detection will be performed on the words and phrases if the document's quality is X or lower. Spam detection generally lowers the scores of repeated words and phrases based on the degree of repetition." " | \n" "||||||||||||||||
\n" "<minQualityToIndex>X</>" " | \n" "\n" "If the document's quality is below X, then do not index the words and phrases for this index rule." " | \n" "||||||||||||||||
\n" "<filterHtmlEntities>X</>" " | \n" "\n" "If X is yes then convert HTML entities, like &gt;, into their represented characters before indexing." " | \n" "||||||||||||||||
\n" "<indexIfUniqueOnly>X</>" " | \n" "\n" "If X is yes then each word or phrase will only be indexed if not already indexed by a previous index rule in the ruleset, and only the first occurence of the word or phrase will be indexed, subsequent occurences will not count towards the score." " | \n" "||||||||||||||||
\n" "<indexSingletons>X</>" " | \n" "\n" "If X is yes then index the words, otherwise do not." " | \n" "||||||||||||||||
\n" "<indexPhrases>X</>" " | \n" "\n" "If X is yes then index the phrases, otherwise do not." " | \n" "||||||||||||||||
\n" "<indexAsWhole>X</>" " | \n" "\n" "If X is yes then index the whole sequence of indexable words as a checksum." " | \n" "||||||||||||||||
\n" "<useStopWords>X</>" " | \n" "\n" "If X is yes then use stop words when forming phrases." " | \n" "||||||||||||||||
\n" "<useStems>X</>" " | \n" "\n" "If X is yes then index stems. Currently unsupported." " | \n" "||||||||||||||||
\n"
""
"<quality11> X1 </> \n" "<quality12> X2 </>... \n" "<quality1N> XN </> \n" "<maxLen11> Y1 </> \n" "<maxLen12> Y2 </>... \n" "<maxLen1N> YN </> \n" "" " | \n"
"\n" "This maps the quality of the document to a maximum number of CHARACTERS to index. The (Xn,Yn) points form a piecewise function which is linearly interpolated between points. The edges are horizontal, meaning, if X is 0 Y will be Y1, or if X is infinite, Y will be YN.\n" " | \n" "||||||||||||||||
\n"
"\n"
"<quality21> X1 </> \n" "<quality22> X2 </>... \n" "<quality2N> XN </> \n" "<maxScore21> Y1 </> \n" "<maxScore22> Y2 </>... \n" "<maxScore2N> YN </> \n" "" " | \n"
"\n" "This maps the quality of the document to a percentage of the absolute max score a word or phrase can have. This is the QUALITY_WEIGHT_MAX value in the formula." " | \n" "||||||||||||||||
\n"
"\n"
"<quality31> X1 </> \n" "<quality32> X2 </>... \n" "<quality3N> XN </> \n" "<scoreWeight31> Y1 </> \n" "<scoreWeight32> Y2 </>... \n" "<scoreWeight3N> YN </> \n" "" " | \n"
"\n" "This maps the quality of the document to a percentage weight on the base score of the words and phrases being indexed. This is the QUALITY_WEIGHT value in the formula." " | \n" "||||||||||||||||
\n"
"\n"
"<len41> X1 </> \n" "<len42> X2 </>... \n" "<len4N> XN </> \n" "<scoreWeight41> Y1 </> \n" "<scoreWeight42> Y2 </>... \n" "<scoreWeight4N> YN </> \n" "" " | \n"
"\n" "This maps the length (in characters) of the what is being indexed to a percentage weight on the base score of the words and phrases being indexed. This is the LENGTH_WEIGHT value in the formula." " | \n" "||||||||||||||||
\n"
"\n"
"<len51> X1 </> \n" "<len52> X2 </>... \n" "<len5N> XN </> \n" "<maxScore51> Y1 </> \n" "<maxScore52> Y2 </>... \n" "<maxScore5N> YN </> \n" " | \n"
"\n" "This maps the length (in characters) of the what is being indexed to a percentage of the absolute maximum score a word or phrase can have. This is the LENGTH_WEIGHT_MAX value in the formula." " | \n" "
\n" "BASE_SCORE = min { (256 * QUALITY_WEIGHT * LENGTH_WEIGHT ) / 10000 + BOOST ,\n" " (0xffffffffLL * QUALITY_WEIGHT_MAX * LENGTH_WEIGHT_MAX ) / 10000 }\n" "\n" "
Item | \n" "Ruleset Tag | \n" "Desription | \n" "\n" "<meta name=foo content=bar>" " | \n" "\n" "--" " | \n" "\n" "User-defined meta tags use the quality of the document multiplied by 256 as their score. If this product is 0 it is upped to 1. This score is then mapped to an 8-bit final score an indexed. Furthermore, when indexing user-defined meta tags, only one occurence of each word or phrase is counted. In the future, these meta tags may have their own index rule." " | \n" "\n" "" "
\n" "http://www.xxx.com/abc" " | \n" "\n" "<indexUrl>X</>" " | \n" "If X is yes then the entire url is indexed as one word with a BASE_SCORE of 1 and with a url: prefix so a search for url:http://www.xxx.com/ will bring up the document." " | \n" "
\n" "http://www.xxx.com/abc" " | \n" "\n" "<indexSubUrl>X</>" " | \n" "If X is yes then the url is indexed as if it occured in the document, but with a random BASE_SCORE (based on url hash) and a suburl: prefix so a search for suburl:\"com/abc\" will bring up the document." " | \n" "
\n" "http://www.xxx.com/abc" " | \n" "\n" "<indexIp>X</>" " | \n" "If X is yes then the IP of the url will be indexed as if it were one word but with a random BASE_SCORE (based on url hash). Furthermore, the last number of the IP address is replaced with a zero and that IP address is indexed in order to provide an IP domain search ability. So if a url has the IP address 1.2.3.4 then a search for ip:1.2.3.4 or for ip:1.2.3 should bring it up." " | \n" "
\n" "http://www.xxx.com/abc?q=hi" " | \n" "\n" "<indexSite>X</>" " | \n" "If X is yes then the following terms would be indexed with a base score of BASE_SCORE (but multiplied by 3 if the url is a root url): "
"
| \n"
"
\n" "http://www.xxx.com/form.php" " | \n" "\n" "<indexExt>X</>" " | \n" "If X is yes then the file extension, if any, of the url would be indexed with the ext: prefix and a score of BASE_SCORE. So a query of ext:php would bring up the document in this example case." " | \n" "
\n" "links" " | \n" "\n" "<indexLinks>X</>" " | \n" "If X is yes then the various links in the document will be indexed with a link: prefix. Scores are special in this case." " | \n" "
\n" "collection name" " | \n" "\n" "--" " | \n" "The collection name of the document is indexed with the coll: prefix and a BASE_SCORE of 1. " " | \n" "
\n" "content type" " | \n" "\n" "--" " | \n" "The content type of the document is indexed with the type: (or filetype:) prefix and a BASE_SCORE of 1. If the content type is not one of these supported content types, then nothing will be indexed: "
"
| \n"
"
" "Char Decimal Hex Entity Char Decimal Hex Entity\n" " Reference Reference\n" "NUL 0 0 SOH 1 1\n" "STX 2 2 ETX 3 3\n" "EOT 4 4 ENQ 5 5\n" "ACK 6 6 BEL 7 7\n" "BS 8 8 HT 9 9\n" "NL 10 a VT 11 b\n" "NP 12 c CR 13 d\n" "SO 14 e SI 15 f\n" "DLE 16 10 DC1 17 11\n" "DC2 18 12 DC3 19 13\ DC4 20 14 NAK 21 15\ SYN 22 16 ETB 23 17\ CAN 24 18 EM 25 19\ SUB 26 1a ESC 27 1b\ FS 28 1c GS 29 1d\ RS 30 1e US 31 1f\ SP 32 20 ! 33 21\ \" 34 22 " # 35 23\ $ 36 24 %% 37 25\ & 38 26 & ' 39 27\ ( 40 28 ) 41 29\ * 42 2a + 43 2b\ , 44 2c - 45 2d\ . 46 2e / 47 2f\ 0 48 30 1 49 31\ 2 50 32 3 51 33\ 4 52 34 5 53 35\ 6 54 36 7 55 37\ 8 56 38 9 57 39\ : 58 3a ; 59 3b\ < 60 3c < = 61 3d\ > 62 3e > ? 63 3f\ @ 64 40 A 65 41\ B 66 42 C 67 43\ D 68 44 E 69 45\ F 70 46 G 71 47\ H 72 48 I 73 49\ J 74 4a K 75 4b\ L 76 4c M 77 4d\ N 78 4e O 79 4f\ P 80 50 Q 81 51\ R 82 52 S 83 53\ T 84 54 U 85 55\ V 86 56 W 87 57\ X 88 58 Y 89 59\ Z 90 5a [ 91 5b\ \\ 92 5c ] 93 5d\ ^ 94 5e _ 95 5f\ ` 96 60 a 97 61\ b 98 62 c 99 63\ d 100 64 e 101 65\ f 102 66 g 103 67\ h 104 68 i 105 69\ j 106 6a k 107 6b\ l 108 6c m 109 6d\ n 110 6e o 111 6f\ p 112 70 q 113 71\ r 114 72 s 115 73\ t 116 74 u 117 75\ v 118 76 w 119 77\ x 120 78 y 121 79\ z 122 7a { 123 7b\ | 124 7c } 125 7d\ ~ 126 7e DEL 127 7f\ -- 128 80 -- 129 81\ -- 130 82 -- 131 83\ -- 132 84 -- 133 85\ -- 134 86 -- 135 87\ -- 136 88 -- 137 89\ -- 138 8a -- 139 8b\ -- 140 8c -- 141 8d\ -- 142 8e -- 143 8f\ -- 144 90 -- 145 91\ -- 146 92 -- 147 93\ -- 148 94 -- 149 95\ -- 150 96 -- 151 97\ -- 152 98 -- 153 99\ -- 154 9a -- 155 9b\ -- 156 9c -- 157 9d\ -- 158 9e -- 159 9f\ 160 a0 ¡ 161 a1 ¡\ ¢ 162 a2 ¢ £ 163 a3 £\ ¤ 164 a4 ¤ ¥ 165 a5 ¥\ ¦ 166 a6 ¦ § 167 a7 §\ ¨ 168 a8 ¨ © 169 a9 ©\ ª 170 aa ª « 171 ab «\ ¬ 172 ac ¬ 173 ad ­\ ® 174 ae ® ¯ 175 af ¯\ ° 176 b0 ° ± 177 b1 ±\ ² 178 b2 ² ³ 179 b3 ³\ ´ 180 b4 ´ µ 181 b5 µ\ ¶ 182 b6 ¶ · 183 b7 ·\ ¸ 184 b8 ¸ ¹ 185 b9 ¹\ º 186 ba º » 187 bb »\ ¼ 188 bc ¼ ½ 189 bd ½\ ¾ 190 be ¾ ¿ 191 bf ¿\ À 192 c0 À Á 193 c1 Á\ Â 194 c2 Â Ã 195 c3 Ã\ Ä 196 c4 Ä Å 197 c5 Å\ Æ 198 c6 Æ Ç 199 c7 Ç\ È 200 c8 È É 201 c9 É\ Ê 202 ca Ê Ë 203 cb Ë\ Ì 204 cc Ì Í 205 cd Í\ Î 206 ce Î Ï 207 cf Ï\ Ð 208 d0 Ð Ñ 209 d1 Ñ\ Ò 210 d2 Ò Ó 211 d3 Ó\ Ô 212 d4 Ô Õ 213 d5 Õ\ Ö 214 d6 Ö × 215 d7 ×\ Ø 216 d8 Ø Ù 217 d9 Ù\ Ú 218 da Ú Û 219 db Û\ Ü 220 dc Ü Ý 221 dd Ý\ Þ 222 de Þ ß 223 df ß\ à 224 e0 à á 225 e1 á\ â 226 e2 â ã 227 e3 ã\ ä 228 e4 ä å 229 e5 å\ æ 230 e6 æ ç 231 e7 ç\ è 232 e8 è é 233 e9 é\ ê 234 ea ê ë 235 eb ë\ ì 236 ec ì í 237 ed í\ î 238 ee î ï 239 ef ï\ ð 240 f0 ð ñ 241 f1 ñ\ ò 242 f2 ò ó 243 f3 ó\ ô 244 f4 ô õ 245 f5 õ\ ö 246 f6 ö ÷ 247 f7 ÷\ ø 248 f8 ø ù 249 f9 ù\ ú 250 fa ú û 251 fb û\ ü 252 fc ü ý 253 fd ý\ þ 254 fe þ ÿ 255 ff ÿ\" "
Gigablast uses its own format for logging messages, for example, \n" " \n" "1091228736104 0 INIT Gigablast Version 1.234\n" "1091228736104 0 INIT thread Allocated 435333 bytes for thread stacks.\n" "1091228736104 0 WARN mem Failed to alloc 360000 bytes.\n" "1091228736104 0 WARN query Failed to intersect lists. Out of memory.\n" "1091228736104 0 WARN query Too many words. Query truncated.\n" "1091228736104 0 INFO build GET http://hohum.com/foobar.html\n" "1091228736104 0 INFO build http://hohum.com/foobar.html ip=4.5.6.7 : Success\n" "1091228736104 0 DEBUG build Skipping xxx.com, would hammer IP.\n" "\n" " \n" "The first field, a large number, is the time in milliseconds since the epoch. This timestamp is useful for evaluating performance. \n" " \n" "The second field, a 0 in the above example, is the hostId (from hosts.conf) of the host that logged the message. \n" " \n" "The third field, INIT in the first line of the above example, is the type of log message. It can be any of the following: \n" " \n" "
\n" "The fourth field is the resource that is logging the message. The resource can be one of the following:" "" "
\n" "Finally, the last field, is the message itself." " \n" "You can turn many messages on and off by using the Log Controls." " \n" "The same parameters on the Log Controls page can be adjusted in the gb.conf file." " \n" "\n" "\n" "\n" "" "" "" "" "\n" " \n" "Gigablast is a fairly sophisticated database that has a few things you can tweak to increase query performance or indexing performance.\n" " \n" "\n" "Query Optimizations:\n" "\n" "
\n" "\n" "Build Optimizations:\n" "
\n" "\n" "General Optimizations:\n" "
\n" "\n" "\n" "\n" "\n" ">\n" " \n" " \n" //"## This is the IP and port that a user connects to in order to search this\n" //"## Gigablast network. This should be the same for all gb processes\n" //"<mainExternalIp> 68.35.105.199</>\n" //"<mainExternalPort> 8000</>\n" //"\n" "## Mem available to this process. May be exceeded due to fragmentation.\n" "<maxMem> 445000000</>\n" "\n" "## Max incoming bandwith to use for spidering, for all hosts combined." "<maxIncomingKbps> 3000.0</>\n" "\n" "## The maximum number of pages to spider per second, for all hosts combined." "<maxPagesPerSecond> 20.00</>\n" "\n" "## Max threads for reading spider-related information on disk.\n" "<spiderMaxDiskThreads> 1</>\n" "\n" "## Max threads for reading big/med/small chunks of spider-related info on disk\n" "<spiderMaxBigDiskThreads> 1</>\n" "<spiderMaxMedDiskThreads> 1</>\n" "<spiderMaxSmaDiskThreads> 5</>\n" "\n" "## Max threads for reading query-related information on disk.\n" "<queryMaxDiskThreads> 20</>\n" "\n" "## Max threads for reading big/med/small chunks of query-related info on disk\n" "<queryMaxBigDiskThreads> 1</>\n" "<queryMaxMedDiskThreads> 3</>\n" "<queryMaxSmaDiskThreads> 10</>\n" "\n" "## What are the IP addresses and ports of the DNS servers? Accessed randomly.\n" "<dns><ip>68.35.172.5</><port>53</></>\n" "<dns><ip>68.35.172.6</><port>53</></>\n" "\n" "## How many bytes should we use for caching DNS replies?\n" "<dnsMaxCacheMem> 13000</>\n" "\n" "## Should we save/load the DNS reply cache when we exit/start? 1=YES 0=NO\n" "<dnsSaveCache> 0</>\n" "\n" "## Below the various Gigablast databases are configured.\n" "## <*dbMaxTreeMem> - mem used for holding new recs\n" "## <*dbMaxPageCacheMem> - disk page cache mem for this db\n" "## <*dbMaxCacheMem> - cache mem for holding single recs\n" "## <*dbMinFilesToMerge> - required # files to trigger merge\n" "## <*dbSaveCache> - save the rec cache on exit?\n" "## <*dbMaxCacheAge> - max age for recs in rec cache\n" "## See that Stats page for a record counts and stats.\n" "\n" "## Sitedb holds site-based parsing info. A tagdb record assigns a url or site\n" "## to a ruleset. 
Each tagdb record is about 100 bytes or so.\n" "<tagdbMaxTreeMem> 1200000</>\n" "<tagdbMaxPageCacheMem> 200000</>\n" "<tagdbMaxCacheMem> 131072</>\n" "<tagdbMinFilesToMerge> 2</>\n" "\n" "## Titledb holds the compressed documents that we've indexed.\n" "<titledbMaxTreeMem> 1000000</>\n" "<titledbMaxCacheMem> 10485760</>\n" "<titledbMinFilesToMerge> 3</>\n" "<titledbMaxCacheAge> 86400</>\n" "<titledbSaveCache> 0</>\n" "\n" "## Clusterdb caches small records for site clustering and deduping.\n" "<clusterdbMaxCacheMem> 131072</>\n" "<clusterdbSaveCache> 0</>\n" "\n" "## Checksumdb is used for deduping same-site urls at index time.\n" "<checksumdbMaxTreeMem> 1048576</>\n" "<checksumdbMaxCacheMem> 2097152</>\n" "<checksumdbMaxPageCacheMem> 2097152</>\n" "<checksumdbMinFilesToMerge> 2</>\n" "\n" "## Tfndb holds small records for each url in Titledb.\n" "<tfndbMaxTreeMem> 5000000</>\n" "<tfndbMaxPageCacheMem> 155000000</>\n" "<tfndbMinFilesToMerge> 2</>\n" "\n" "## Spiderdb holds urls to be spidered\n" "<spiderdbMaxTreeMem> 1200000</>\n" "<spiderdbMaxCacheMem> 131072</>\n" "<spiderdbMaxPageCacheMem> 256000</>\n" "<spiderdbMinFilesToMerge> 2</>\n" "\n" "## Robotdb caches robots.txt files.\n" "<robotdbMaxCacheMem> 131072</>\n" "<robotdbSaveCache> 0</>\n" "\n" "## Indexdb holds the terms extracted from spidered documents.\n" "<indexdbMaxTreeMem> 8000000</>\n" "<indexdbMaxCacheMem> 500000</>\n" "<indexdbMinFilesToMerge> 4</>\n" "<indexdbMaxIndexListAge> 86400</>\n" "<indexdbTruncationLimit> 100000</>\n" "<indexdbSaveCache> 0</>\n" "<onlyAddUnchangedTermIds> 1</>" "\n" "## The HTTP server info\n" "## Maximum simultaneous connections. Excess will be closed.\n" "<httpMaxSockets> 500</>\n" "<httpMaxSendBufSize> 32768</>\n" "\n" "## Bytes to use for caching search result pages.\n" "<maxPageCacheMem> 1000000</>\n" "## Maximum age in seconds.\n" "<maxPageCacheAge> 14400</>\n" "<resultsSaveCache> 0</>\n" "\n" "## Max linkers to a doc we sample to determine quality.\n" "<maxIncomingLinksToSample> 100</>\n" "\n" "## Percent more to weight phrases than single words.\n" "<queryPhraseWeight> 100</>\n" "\n" "## Maximum weight one query term can have relative to another in the query.\n" "<queryMaxMultiplier> 10.0</>\n" "\n" "## Sync info\n" "<syncIndexdb> 1</>\n" "<syncTitledb> 1</>\n" "<syncSpiderdb> 1</>\n" "<syncChecksumdb> 1</>\n" "<syncSitedb> 1</>\n" "<syncDoUnion> 1</>\n" "<syncDryRun> 0</>\n" "<syncBytesPerSecond> 100000000</>\n" "\n" "## Is spidering enabled for this host? 1=YES 0=NO\n" "<spideringEnabled> 0</>\n" "\n" "## Is injection enabled for this host? 1=YES 0=NO\n" "<injectionEnabled> 1</>\n" "\n" "## Can others add urls to a collection? 1=YES 0=NO\n" "<addUrlEnabled> 0</>\n" "\n" "## Serve ads from ah-ha? 1=YES 0=NO\n" "<adFeedEnabled> 0</>\n" "\n" "## Can non-admins connect to this webserver? 1=YES 0=NO\n" "<httpServerEnabled> 1</>\n" "\n" "## Send an email when a host is detected as dead? 1=YES 0=NO\n" "<sendEmailAlerts> 0</>\n" "\n" "## Allow software interrupts? 1=YES 0=NO\n" "<allowAsyncSignals> 0</>\n" "\n" "## Read only mode does not allow spidering. 1=YES 0=NO\n" "<readOnlyMode> 0</>\n" "\n" "## Use /etc/hosts file to resolve hostnames? 1=YES 0=NO\n" "<useEtcHosts> 0</>\n" "\n" "## Restrict merging to one host per token group? Hosts that use the same\n" "## disk and mirror hosts are generally in the same token group so that only one\n" "## host in the group can be doing a merge at a time. This prevents query\n" "## response time from suffering too much. 
1=YES 0=NO\n" "<useMergeToken> 0</>\n" "\n" "## If this is true we do not retrieve data from the network if we have it\n" "## local. Useful if network is slow or drives are fast. 1=YES 0=NO\n" "<preferLocalReads> 0</>\n" "\n" "## If this is true all writes are synchronous. 1=YES 0=NO\n" "<flushWrites> 1</>\n" "\n" "## Spell checking requires considerably more memory, so only a few hosts should\n" "## have this enabled if possible. 1=YES 0=NO\n" "<doSpellChecking> 1</>\n" "" "## The User-Agent field used by the Gigablast spider.\n" "<spiderUserAgent> Gigabot/1.0</>\n" "" "## Try to save unsaved in-memory data to disk every X minutes.\n" "<autoSaveFrequency> 15</>\n" "> \n" "## Log Controls\n" "<logHttpRequests> 1</>\n" "<logSpideredUrls> 1</>\n" "<logInfo> 1</>\n" "<logNetCongestion> 0</>\n" "<logLimits> 0</>\n" "<logDebugAddurl> 0</>\n" "<logDebugAdmin> 0</>\n" "<logDebugBuild> 0</>\n" "<logDebugDb> 0</>\n" "<logDebugDisk> 0</>\n" "<logDebugHttp> 0</>\n" "<logDebugLoop> 0</>\n" "<logDebugNet> 0</>\n" "<logDebugQuery> 0</>\n" "<logDebugSpeller> 0</>\n" "<logDebugTcp> 0</>\n" "<logDebugThread> 0</>\n" "<logDebugTopics> 0</>\n" "<logDebugUdp> 0</>\n" "<logTimingBuild> 0</>\n" "<logTimingDb> 0</>\n" "<logTimingNet> 0</>\n" "<logTimingQuery> 0</>\n" "<logTimingTopics> 0</>\n" "<logReminders> 0</>\n" "\n" "\n" "\n" "\n" ">\n" " \n" "Every gb process uses the same hosts.conf file. The hosts.conf file describes the hosts (gb processes) participating in the network.\n" "Each line in this file is a host entry. The number of participating hosts must be a power of 2. Each host entry uses the following fields: \n" "
\n" "IMPORTANT: The group IDS in the hosts.conf must be strictly " "increasing, at least up until it hits a host in group #0 again." " \n" "Here is a sample hosts.conf file for a network of 8 hosts running on 8 computers: \n" "\n" " \n" "#ID IP LINKIP UDP1 UDP2 DNS HTTP IDE GRP DIR\n" "\n" "0 64.62.142.231 64.62.142.231 9000 10000 6000 8000 0 0 /a\n" "1 64.62.142.233 64.62.142.233 9000 10000 6000 8000 0 1 /a\n" "2 64.62.142.235 64.62.142.235 9000 10000 6000 8000 0 2 /a\n" "3 64.62.142.237 64.62.142.237 9000 10000 6000 8000 0 3 /a\n" "4 64.62.142.239 64.62.142.239 9000 10000 6000 8000 0 0 /a\n" "5 64.62.142.241 64.62.142.241 9000 10000 6000 8000 0 1 /a\n" "6 64.62.142.244 64.62.142.244 9000 10000 6000 8000 0 2 /a\n" "7 64.62.142.246 64.62.142.246 9000 10000 6000 8000 0 3 /a\n" "\n" " \n" "\n" "\n" "" "" /* "\n" " \n" "A ruleset is a set of rules used for spidering and indexing the content of a URL. This section talks about how to assign a ruleset to a URL. Each ruleset is a file in Gigablast's working directory with a file name like tagdb*.xml, where '*' is a number.\n" " \n" "IMPORTANT: Do not change the indexing section or the <linksUnbanned>, <linksClean> or <linksDirty> tags of a ruleset file if some documents in the index were indexed with that ruleset file. To do so might create some unrepairable data corruption.\n" " \n" "The following is an example ruleset for a particular URL (\"the URL\"):\n" "\n" " \n" "\n" "# This is the unique name of the ruleset which is used for \n" "# display in drop-down menus in administrative, web-based GUIs.\n" "<name>default</>\n" "\n" "\n" "# This is the accompanying description displayed on the Sitedb tool and\n" "# URL Filters pages.\n" "<description>This is the default ruleset used for most urls.</>\n" "\n" "# If a ruleset is no longer actively used, it is not deleted, but retired.\n" "# Retired rulesets are not displayed to spam assassins on the Sitedb tool \n" "# and URL Filters pages.\n" "<retired>no</>\n" "\n" "##############################################################################\n" "# \n" "# The Quality Section. This section of the ruleset is used to determine the \n" "# QUALITY of the URL. The quality ranges from 0%% to over 100%% and is used to \n" "# influence many other things in this file. A quality of 30%% is considered to \n" "# be the quality of the average web page.\n" "#\n" "##############################################################################\n" "\n" "# The quality of the URL will not be allowed to exceed this value.\n" "<maxQuality>100</> (default 100%%)\n" "\n" "# This is the unadjusted quality of the URL. The maps below may modify it to\n" "# get the final quality of the URL.\n" "<baseQuality>30</> (default 30%%)\n" "\n" "# Now for some maps. Each map is a graph that maps one thing to another.\n" "# The first thing listed is the X component. All X components are listed first\n" "# followed by their corresponding Y components. Taken together they create a\n" "# set of points on the Cartesian graph. In this way Gigablast can map an\n" "# arbitrary value in the domain (X axis) to its corresponding value in the\n" "# image (Y axis). The X components must be in ascending order.\n" "#\n" "# The tag name of each map component, usually something like 'numLinks13',\n" "# always contains a number, in the case of this example it is 13. 
These numbers\n" "# are just used to ensure that the tag name is unique, nothing more.\n" "#\n" "# Gigablast linearly interpolates between the supplied points in the graph in \n" "# order to map X values that are not explicitly given in the graph. The \n" "# interpolation function extends horizontally from the first/last points with \n" "# the same image value of the first/last point.\n" "#\n" "# A map can have up to 32 defined points, but typically just 5 are used.\n" "\n" "# In this map the number of incoming links is mapped to a quality BOOST for the\n" "# URL. Only one incoming link is counted per top 2 bytes of the ip address\n" "# (most significant 2 bytes of the IP) if \"restrict link voting\" is\n" "# turned on in the Spider Controls. This helps prevent spam. This boost\n" "# is added to the baseQuality, not multiplied.\n" "<numLinks11> 0 </>\n" "<numLinks12> 5 </>\n" "<numLinks13> 10 </>\n" "<numLinks14> 20 </>\n" "<numLinks15> 50 </>\n" "<qualityBoost11> 0 </>\n" "<qualityBoost12> 5 </>\n" "<qualityBoost13> 10 </>\n" "<qualityBoost14> 15 </>\n" "<qualityBoost15> 20 </>\n" "\n" "# This map is like the above map, but the SUM of the baseQuality of all \n" "# linkers is mapped to a baseQuality boost for the URL. The boost is added to \n" "# the baseQuality, not multiplied.\n" "<linkQualitySum21> 0 </>\n" "<linkQualitySum22> 50 </>\n" "<linkQualitySum23> 100 </>\n" "<linkQualitySum24> 150 </>\n" "<linkQualitySum25> 200 </>\n" "<qualityBoost21> 0 </>\n" "<qualityBoost22> 5 </>\n" "<qualityBoost23> 10 </>\n" "<qualityBoost24> 15 </>\n" "<qualityBoost25> 20 </>\n" "\n" "# This map is like the above map, but the quality of the root page of the URL\n" "# is mapped to a baseQuality boost for the URL. The boost is added to the \n" "# baseQuality, not multiplied. If the URL is a root URL then the rootQuality\n" "# for purposes of just this map is assumed to be 30%% to prevent explosive\n" "# feedback.\n" "<rootQuality31> 0 </>\n" "<rootQuality32> 50 </>\n" "<rootQuality33> 100 </>\n" "<rootQuality34> 200 </>\n" "<rootQuality35> 500 </>\n" "<qualityBoost31> 0 </>\n" "<qualityBoost32> 5 </>\n" "<qualityBoost33> 10 </>\n" "<qualityBoost34> 15 </>\n" "<qualityBoost35> 20 </>\n" "\n" "\n" "##############################################################################\n" "#\n" "# The Quota Section. How many documents should we index from the site of the \n" "# URL? Quotas can be turned on/off for old/new URLs via the \"Spider Controls\" \n" "# page.\n" "#\n" "##############################################################################\n" "\n" "# How many docs from the site of the URL should we allow into the index?\n" "# A site is typically just the hostname of the URL, but, if a record for\n" "# the URL exists in tagdb, then the site of that record will be the site.\n" "# Use -1 for no max. \n" "<maxDocs>20000</> (default -1)\n" "\n" "# This map maps the quality of the root page of the URL to a quota boost.\n" "# The boost can be negative. 
A boost of -100%% makes the quota 0.\n" "# The base quota is given by the <maxDocs> field above.\n" "<rootQuality71> 0 </>\n" "<rootQuality72> 30 </>\n" "<rootQuality73> 50 </>\n" "<rootQuality74> 60 </>\n" "<rootQuality75> 70 </>\n" "<quotaBoost71> -100 </>\n" "<quotaBoost72> 0 </>\n" "<quotaBoost73> 100 </>\n" "<quotaBoost74> 200 </>\n" "<quotaBoost75> 300 </>\n" "\n" "# Like the above map, but the quality of the URL is mapped to a quota boost.\n" "# The quota boost is multiplied by the <maxDocs> number and then added\n" "# to it.\n" "<quality81> 0 </>\n" "<quality82> 30 </>\n" "<quality83> 50 </>\n" "<quality84> 60 </>\n" "<quality85> 70 </>\n" "<quotaBoost81> -100 </>\n" "<quotaBoost82> 0 </>\n" "<quotaBoost83> 100 </>\n" "<quotaBoost84> 200 </>\n" "<quotaBoost85> 300 </>\n" "\n" "##############################################################################\n" "#\n" "# The Spider Section. The following parameters control how the URL is spidered.\n" "# Spidering can be turned on/off as a whole or for various spider priority \n" "# queues via the Spider Controls page. Many other parameters exist there as \n" "# well.\n" "#\n" "##############################################################################\n" "\n" "# How long to wait to respider for the first time.\n" "# This is in DAYS. This tag overrides Spider Controls if present.\n" "<firstRespiderWait>3600</> (default is to omit this tag)\n" "\n" //"# How long to wait to respider if there was an error.\n" //"# This is in DAYS. This tag overrides Spider Controls if present.\n" //"<errorRespiderWait>3600</> (default is to omit this tag)\n" //"\n" "# What is the minimum amount of time we should wait before re-spidering a URL?\n" "# Re-spider frequency is usually intelligently determined using a bisection\n" "# method based on the update frequency of the URL.\n" "# This is in seconds. default = 1 day = 24*60*60 = 86400.\n" "<minRespiderWait>86400</> (default 86400)\n" "\n" "# What is the maximum amount of time we should wait before re-spidering a URL?\n" "# Re-spider frequency is usually intelligently determined using a bisection\n" "# method based on the update frequency of the URL.\n" "# This is in seconds. default = 90 days = 90*24*60*60 = 7776000.\n" "<maxRespiderWait>7776000</> (default 7776000)\n" "\n" "# What spider frequency in days should this URL be assigned?\n" "# If this is -1 then the re-spider frequency is intelligently determined using \n" "# a bisection method based on the update frequency of the URL.\n" "# This is not yet supported.\n" "# <spiderFrequency>-1</>\n" "\n" "# What spider priority should this URL be assigned?\n" "# Use -1, the default, to leave unspecified. If not assigned by a matching\n" "# regular expression, it may be determined by the spider priority of the\n" "# page from which it was harvested as a link, minus one.\n" "# This is not yet supported.\n" "# <spiderPriority>-1</> (default -1)\n" "\n" "# What is the min/max spider priority the URL should be assigned.\n" "# Priorities range from 0 up to 7. 
"##############################################################################\n"
"#\n"
"# The Spider Section. The following parameters control how the URL is spidered.\n"
"# Spidering can be turned on/off as a whole or for various spider priority\n"
"# queues via the Spider Controls page. Many other parameters exist there as\n"
"# well.\n"
"#\n"
"##############################################################################\n"
"\n"
"# How long to wait to respider for the first time.\n"
"# This is in DAYS. This tag overrides Spider Controls if present.\n"
"<firstRespiderWait>3600</> (default is to omit this tag)\n"
"\n"
//"# How long to wait to respider if there was an error.\n"
//"# This is in DAYS. This tag overrides Spider Controls if present.\n"
//"<errorRespiderWait>3600</> (default is to omit this tag)\n"
//"\n"
"# What is the minimum amount of time we should wait before re-spidering a URL?\n"
"# Re-spider frequency is usually intelligently determined using a bisection\n"
"# method based on the update frequency of the URL.\n"
"# This is in seconds. default = 1 day = 24*60*60 = 86400.\n"
"<minRespiderWait>86400</> (default 86400)\n"
"\n"
"# What is the maximum amount of time we should wait before re-spidering a URL?\n"
"# Re-spider frequency is usually intelligently determined using a bisection\n"
"# method based on the update frequency of the URL.\n"
"# This is in seconds. default = 90 days = 90*24*60*60 = 7776000.\n"
"<maxRespiderWait>7776000</> (default 7776000)\n"
"\n"
"# What spider frequency in days should this URL be assigned?\n"
"# If this is -1 then the re-spider frequency is intelligently determined using\n"
"# a bisection method based on the update frequency of the URL.\n"
"# This is not yet supported.\n"
"# <spiderFrequency>-1</>\n"
"\n"
"# What spider priority should this URL be assigned?\n"
"# Use -1, the default, to leave unspecified. If not assigned by a matching\n"
"# regular expression, it may be determined by the spider priority of the\n"
"# page from which it was harvested as a link, minus one.\n"
"# This is not yet supported.\n"
"# <spiderPriority>-1</> (default -1)\n"
"\n"
"# What is the min/max spider priority the URL should be assigned?\n"
"# Priorities range from 0 up to 7. (See the <spiderPriority> tag above.)\n"
"# This is not yet supported.\n"
"#<spiderMinPriority>0</> (default 0)\n"
"#<spiderMaxPriority>5</> (default 7)\n"
"\n"
"# What spider priority should links harvested on the URL's page be assigned?\n"
"# Priorities range from 0 up to 7.\n"
"# -1, the default, means to use the spider priority of the URL minus one.\n"
"# This results in a breadth first spidering algorithm until the URL is\n"
"# from the priority 0 spider queue, in which case, the harvested links will\n"
"# also be assigned to the priority 0 queue.\n"
"<spiderLinkPriority>-1</> (default -1)\n"
"\n"
"# Should we spider links for the URL? If \"spider links\" is toggled off on the\n"
"# Spider Controls page then this will *not* override it.\n"
"<spiderLinks>yes</> (default yes)\n"
"\n"
"# Should we only harvest links from the same host as the URL?\n"
"# If the url is just a domain, then the www hostname is allowed as well.\n"
"# This overrides the same control on the Spider Controls page, so leave it\n"
"# out if you do not want to override that control. This is primarily used for\n"
"# good directory sites that have the power to unban soft banned sites, and\n"
"# such unbanned sites are then only permitted to harvest internal links.\n"
"#<spiderLinksFromSameHostOnly>no</> (default is to omit this tag)\n"
"\n"
"##############################################################################\n"
"#\n"
"# The Classification Section. How is the URL classified?\n"
"#\n"
"##############################################################################\n"
"\n"
"\n"
"# If the URL's quality is at or below this, then it will be checked for adult\n"
"# content.\n"
"<maxQualityForAdultDetect>0</> (default 0%%)\n"
"\n"
"# Do links from the URL point to clean pages?\n"
"<linksClean>no</> (default no)\n"
"\n"
"# Do links from the URL point to dirty (adult) pages?\n"
"<linksDirty>no</> (default no)\n"
"\n"
"# Is the URL adult-oriented?\n"
"<isAdult>no</> (default no)\n"
"\n"
"# Is the URL banned from the index? The default is no.\n"
"# If it is banned it will not be indexed. If it is already indexed then it\n"
"# will be removed from the index the next time it is respidered/reinjected.\n"
"<isBanned>no</> (default no)\n"
"\n"
"# Can the URL be unbanned? If the URL's <isBanned> tag is set to yes,\n"
"# and this tag is set to yes, then the URL is said to be \"soft banned\".\n"
"# If another URL links to the soft banned URL and that\n"
"# other URL is indexed with <linksUnbanned>yes</> in its ruleset then\n"
"# it will UNban the URL. This is useful for doing liberal banning but relying\n"
"# on a directory site like dmoz.org to unban URLs that should not have been\n"
"# banned.\n"
"<canBeUnbanned>no</> (default yes)\n"
"\n"
"# See above description for <canBeUnbanned> tag for how this works.\n"
"<linksUnbanned>no</> (default no)\n"
"\n"
"# Should we ban the DOMAINS of the links in the URL's content? The ban\n"
"# from the URL expires if the URL is removed from the index.\n"
"<linksBanned>no</> (default no)\n"
"\n"
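"# A hypothetical sketch of how soft banning fits together (the tag values\n"
"# here are examples only): a liberal \"banned\" ruleset might set\n"
"# <isBanned>yes</> and <canBeUnbanned>yes</>, while the ruleset used for a\n"
"# trusted directory site sets <linksUnbanned>yes</>. Any soft banned URL\n"
"# that the directory links to is then unbanned the next time it is indexed.\n"
"\n"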
"# What ruleset should those URLs that the URL links to use?\n"
"# Specify it by name. This is a useful way of assigning a URL to a ruleset.\n"
"# This is not yet supported.\n"
"# <rulesetOfLinks>special</>\n"
"\n"
"##############################################################################\n"
"#\n"
"# The Filter Section tells Gigablast what to allow into the index.\n"
"#\n"
"##############################################################################\n"
"\n"
"# If the URL's quality is LESS THAN this it will not be indexed. If the URL is\n"
"# being reindexed then it will be removed from the index.\n"
"<minQualityToIndex>0</> (default 0%%)\n"
"\n"
"# Allow URLs ending in .cgi or URLs containing ?'s into the index?\n"
"<allowCgiUrls>yes</> (default yes)\n"
"\n"
"# Allow URLs with no canonical domain name into the index?\n"
"<allowIpUrls>yes</> (default yes)\n"
"\n"
"# Delete 404'ed documents from the index?\n"
"# If you are making a historical index, you may want to set this to no.\n"
"<delete404s>yes</> (default yes)\n"
"\n"
"# Should the URL be indexed if it is adult-oriented?\n"
"<allowAdultContent>yes</> (default yes)\n"
"\n"
"# Index the URL even if it is a duplicate of another page from the same site?\n"
"# This overrides the \"deduping enabled\" switch in the Spider Controls,\n"
"# so omit this tag to rely solely on that Spider Controls switch.\n"
"<indexDupContent>no</> (default is to omit this tag)\n"
"\n"
"# Should the checksum hash be computed just from the indexed words? If this\n"
"# is true then pages from the same site will be detected as dups more\n"
"# often. Useful for newspaper articles where we only index the content of\n"
"# the article. Also, it is independent of the order of the words. This\n"
"# checksum is also used to see if the content of the page has changed in\n"
"# order to set the next respider date for intelligent respidering.\n"
"<useLooseChecksums>no</> (default is no)\n"
"\n"
"# Index document for sort or constrain by date. Almost doubles disk space.\n"
"<indexDate>yes</> (default yes)\n"
"\n"
"# If the url does not get indexed should we still keep it scheduled to be\n"
"# spidered again later in spiderdb? Handy for seed pages, like good\n"
"# directory pages that link to the stuff you want to index.\n"
"<keepUnindexedUrls>no</> (default no)\n"
"\n"
"# Only index documents that contain a dollar sign? Special case for a\n"
"# shopping index.\n"
"<needDollarSign>no</> (default no)\n"
//"\n"
//"# Does the url need to contain back-to-back digits in its path in order to\n"
//"# be indexed?\n"
//"<needNumbersInUrl>no</> (default no)\n"
"\n"
"# If the date on the page is older than this many days, do not index it.\n"
"# Omit this tag to default to the value in the Spider Controls page.\n"
"# 0.0 says to index all documents regardless of their extracted date.\n"
"# Good directory sites usually have this set to 0.0 for the news collection.\n"
"<daysBeforeNowToIndex>0.0</> (default is to omit this tag)\n"
"\n"
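"# For illustration only (a hypothetical combination of the tags above, not a\n"
"# recommendation): a historical-archive collection might keep 404'ed pages\n"
"# and unindexed seed URLs around by setting <delete404s>no</> and\n"
"# <keepUnindexedUrls>yes</>, while a shopping collection might set\n"
"# <needDollarSign>yes</> to index only pages that mention prices.\n"
"\n"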
"##############################################################################\n"
"#\n"
"# The Link Text Section. When a URL is indexed, Gigablast will determine what\n"
"# other URLs link to it and harvest the relevant link text from each of those\n"
"# URLs. That link text is then indexed as if it occurred on the URL's page\n"
"# itself, but it is not subject to spam detection. See the section on link\n"
"# text for more about how link text is indexed and what controls are\n"
"# available in the administrative interface.\n"
"#\n"
"##############################################################################\n"
"\n"
"# Should we index the URL's incoming link text as if it were on the page?\n"
"<indexIncomingLinkText>yes</> (default yes)\n"
"\n"
"# This maps the URL's quality to a weight on the score of its OUTGOING link\n"
"# text. The score of the terms in the link text is multiplied by this weight.\n"
"# If the URL links to nothing then this is useless. Currently we limit\n"
"# link text to up to 256 chars in LinkInfo.cpp.\n"
"<quality41> 0 </>\n"
"<quality42> 30 </>\n"
"<quality43> 50 </>\n"
"<quality44> 70 </>\n"
"<quality45> 85 </>\n"
"<linkTextScoreWeight41> 25 </>\n"
"<linkTextScoreWeight42> 200 </>\n"
"<linkTextScoreWeight43> 250 </>\n"
"<linkTextScoreWeight44> 275 </>\n"
"<linkTextScoreWeight45> 300 </>\n"
"\n"
"# This maps the number of words in the link text of a link to a boost on the\n"
"# score weight of that link text. The score of the terms in the link text is\n"
"# multiplied by this weight. Currently we limit link text to 256 chars in\n"
"# LinkInfo.cpp.\n"
"<linkTextNumWords61> 3 </>\n"
"<linkTextNumWords62> 6 </>\n"
"<linkTextNumWords63> 9 </>\n"
"<linkTextNumWords64> 12 </>\n"
"<linkTextScoreWeight61> 150 </>\n"
"<linkTextScoreWeight62> 80 </>\n"
"<linkTextScoreWeight63> 50 </>\n"
"<linkTextScoreWeight64> 25 </>\n"
"\n"
"# This maps the URL's quality to a maximum score for the terms in the link\n"
"# text. 100%% is the maximum 'maximum score'.\n"
"<quality51> 0 </>\n"
"<quality52> 15 </>\n"
"<quality53> 25 </>\n"
"<quality54> 45 </>\n"
"<quality55> 75 </>\n"
"<linkTextMaxScore51> 100 </>\n"
"<linkTextMaxScore52> 100 </>\n"
"<linkTextMaxScore53> 100 </>\n"
"<linkTextMaxScore54> 100 </>\n"
"<linkTextMaxScore55> 100 </>\n"
"\n"
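"# A worked illustration (assuming the two weights above compound\n"
"# multiplicatively, since each is described as being multiplied into the\n"
"# term score): 6-word link text from a quality-50 page gets a 250%% quality\n"
"# weight and an 80%% length weight, so a term's score is multiplied by\n"
"# 2.50 * 0.80 = 2.0. Link text of 4 words would interpolate the length\n"
"# weight to 150 + (4-3)/(6-3)*(80-150) = ~127%%.\n"
"\n"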
"##############################################################################\n"
"#\n"
"# The Indexing Section. What parts of the document should be indexed and how?\n"
"# IMPORTANT: Do not change this section if some documents in the index\n"
"# were indexed with this ruleset file. To do so might create some unrepairable\n"
"# data corruption.\n"
"#\n"
"##############################################################################\n"
"\n"
"# Should Gigablast index site:, subsite:, url:, suburl:, ip: or link: terms\n"
"# of the URL respectively?\n"
"<indexSite> yes</> (default yes) site: terms\n"
"<indexUrl> yes</> (default yes) url: terms\n"
"<indexSubUrl> yes</> (default yes) suburl: terms\n"
"<indexIp> yes</> (default yes) ip: terms\n"
"<indexLinks> yes</> (default yes) link:/href: terms\n"
"\n"
"# This is used only for news collections for doing automatic categorization.\n"
"<indexNewsTopic> yes</> (default no) newstopic: terms\n"
"\n"
"# This maps the URL's quality to a spam threshold, X. If more than X%% of\n"
"# the words in the document are spammed (repeated in a pattern) to some\n"
"# degree then all of the words will be indexed with a minimum score.\n"
"<quality61> 30 </>\n"
"<quality62> 40 </>\n"
"<quality63> 50 </>\n"
"<quality64> 70 </>\n"
"<quality65> 90 </>\n"
"<maxPercentSpammed1> 6 </>\n"
"<maxPercentSpammed2> 8 </>\n"
"<maxPercentSpammed3> 10 </>\n"
"<maxPercentSpammed4> 20 </>\n"
"<maxPercentSpammed5> 30 </>\n"
"\n"
"# Gigablast can index the various parts of a document differently. Each\n"
"# part of the document can have its own set of indexing and scoring rules.\n"
"# Each such part can be represented with an <index> tag. The index tags\n"
"# are processed in the order you give them in this ruleset file. Tags that\n"
"# are specialized for the <index> tag which contains them are highlighted\n"
"# in red.\n"
"\n"
"# The following <index> tag block tells Gigablast how to index the words\n"
"# in the HTML <title> tag. The words in the title tag are indexed before\n"
"# the words in the body because we don't want words in the body to count\n"
"# towards the <maxScore> limit placed on the words in the title.\n"
"<index>\n"
"\n"
" # The part of the document to which this <index> tag applies.\n"
" # This particular one says to index the terms in the <title>\n"
" # tag. This could just as easily be an <h1> tag or even a non-HTML\n"
" # tag like <foobar>. Omit this tag or leave the value of the tag blank\n"
" # to index the whole body of the document.\n"
" <name> title </>\n"
"\n"
" # Spam detection will be performed on these terms if the URL's quality is\n"
" # this or lower. It is mostly disabled for these title terms because they\n"
" # are restricted in score by other means below. Spam detection may lower the\n"
" # scores of repeated terms.\n"
" <maxQualityForSpamDetect> 0 </>\n"
"\n"
" # If the URL's quality is below this, then do not index the terms in the\n"
" # title tag.\n"
" <minQualityToIndex> 0 </>\n"
"\n"
" # If this is 'yes' then convert HTML entities in the title, like &gt;,\n"
" # into their represented characters before indexing.\n"
" <filterHtmlEntities> yes </>\n"
"\n"
" # Should each term in the title only be indexed if it has not already been\n"
" # indexed? You can affect this by changing the order of the <index> tags.\n"
" <indexIfUniqueOnly> no </>\n"
"\n"
" # Should single words in the title be indexed?\n"
" <indexSingletons> yes </>\n"
"\n"
" # Should phrases in the title be indexed?\n"
" <indexPhrases> yes </>\n"
"\n"
" # Should the whole title be indexed as one \"word\"?\n"
" <indexAsWhole> no </>\n"
"\n"
" # Should stop words be used when indexing phrases in the title?\n"
" <useStopWords> yes </>\n"
"\n"
" # Should we also index the stem of each word indexed? If\n"
" # <indexSingletons> is false this is ignored.\n"
" <useStems> no </>\n"
"\n"
" # Map the URL's quality to a maximum length (in characters) of the title.\n"
" # Words whose characters occur past the maximum length will not be\n"
" # indexed. Read more about quality or maps.\n"
" # This keeps the indexed portion of the title down to 200 characters for\n"
" # all qualities.\n"
" <quality11> 15 </>\n"
" <maxLen11> 200 </>\n"
"\n"
" # Map the URL's quality to a maximum score for indexing the terms in the\n"
" # title. 100%% is the maximum 'maximum score'. You cannot exceed 100%% ever.\n"
" <quality21> 15 </>\n"
" <quality22> 30 </>\n"
" <quality23> 45 </>\n"
" <quality24> 60 </>\n"
" <quality25> 80 </>\n"
" <maxScore21> 30 </>\n"
" <maxScore22> 45 </>\n"
" <maxScore23> 60 </>\n"
" <maxScore24> 80 </>\n"
" <maxScore25> 100 </>\n"
"\n"
" # Map the URL's quality to a percentage score boost for the terms in the\n"
" # title. This boost is multiplied by the score of each term indexed.\n"
" <quality31> 15 </>\n"
" <quality32> 30 </>\n"
" <quality33> 45 </>\n"
" <quality34> 60 </>\n"
" <quality35> 80 </>\n"
" <scoreWeight31> 60 </>\n"
" <scoreWeight32> 100 </>\n"
" <scoreWeight33> 150 </>\n"
" <scoreWeight34> 200 </>\n"
" <scoreWeight35> 250 </>\n"
"\n"
" # Map the URL's title length (in characters) to a percentage score boost for\n"
" # the terms in the title. This boost is multiplied by the score of each\n"
" # term indexed.\n"
" <len41> 10 </>\n"
" <len42> 50 </>\n"
" <len43> 100 </>\n"
" <len44> 200 </>\n"
" <len45> 500 </>\n"
" <scoreWeight41> 200 </>\n"
" <scoreWeight42> 150 </>\n"
" <scoreWeight43> 100 </>\n"
" <scoreWeight44> 75 </>\n"
" <scoreWeight45> 50 </>\n"
"\n"
" # Map the URL's title length (in characters) to a maximum score for the\n"
" # terms in the title. This maximum is expressed as a percentage of the\n"
" # maximum score physically possible.\n"
" <len51> 100 </>\n"
" <maxScore51> 30 </>\n"
"\n"
"</index>\n"
"\n"
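"# A minimal hypothetical <index> block (the values are invented for\n"
"# illustration, using only tags documented above): this would index single\n"
"# words and phrases found in <h1> tags, capping the indexed portion at 100\n"
"# characters for every quality.\n"
"# <index>\n"
"#  <name> h1 </>\n"
"#  <indexSingletons> yes </>\n"
"#  <indexPhrases> yes </>\n"
"#  <quality11> 15 </>\n"
"#  <maxLen11> 100 </>\n"
"# </index>\n"
"\n"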
"# The following <index> block tells Gigablast how to index the body.\n"
"# This will index words in the title tag, too, because that is considered\n"
"# part of the body. The body is essentially everything not in a meta tag,\n"
"# comment or javascript tag.\n"
"<index>\n"
"\n"
" # Should gigablast break the document into sections and score the\n"
" # words in sections with mostly link text lower than words in sections\n"
" # without much link text? This helps to reduce the effects of menu spam.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <scoreBySection> no </> (default is yes)\n"
"\n"
" # Should gigablast attempt to isolate just the single most-relevant\n"
" # content section from the document and not index anything else?\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <indexContentSectionOnly> no </> (default is no)\n"
"\n"
" # The minimum score an entire section of the document needs to have its\n"
" # words indexed. Each word in a section counts as 128 points, but a\n"
" # word in a hyperlink counts as -256 points.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <minSectionScore> -1000000000 </> (default is -1000000000)\n"
"\n"
" # For this metric, count words in links as 21 points and words not in links\n"
" # as 128. The average score of each word is its score plus the scores of\n"
" # its 8 left and its 7 right neighbors, divided by 16. If that\n"
" # average score is below this value, the word is not indexed and its\n"
" # average score is set to 0. Only valid if scoreBySection is true.\n"
" <minAvgWordScore> 0 </> (default is 0)\n"
"\n"
" # If the number of indexable words that have a positive average score\n"
" # is below this value, then no words will be indexed. Used\n"
" # to just index beefy news articles. -1 means to ignore this constraint.\n"
" <minIndexableWords> -1 </> (default is -1)\n"
"\n"
" # Weight the first X words higher.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <numTopWords> 0 </> (default is 0)\n"
"\n"
" # Weight the first X words by this much, a rational number.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <topWordsWeight> 1.0 </> (default is 1.0)\n"
"\n"
" # Weight the first sentence by this much, a rational number.\n"
" # Only applies to documents that support western punctuation.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <topSentenceWeight> 1.0 </> (default is 1.0)\n"
"\n"
" # Do not weight more than this many words in the first sentence.\n"
" # Used for news articles.\n"
" # This only applies to the body of the document.\n"
" <maxWordsInSentence> 0 </> (default is 0)\n"
"\n"
" # For the body, we turn spam detection on for all URLs, regardless of\n"
" # their quality. This will demote the scores of terms that are repetitious.\n"
" <maxQualityForSpamDetect> 100 </>\n"
"\n"
" # These are all the same as the <index> tag above this one.\n"
" <minQualityToIndex> 0 </>\n"
" <filterHtmlEntities> yes </>\n"
" <indexIfUniqueOnly> no </>\n"
" <indexSingletons> yes </>\n"
" <indexPhrases> yes </>\n"
" <indexAsWhole> no </>\n"
" <useStopWords> yes </>\n"
" <useStems> no </>\n"
"\n"
" # Map the URL's quality to a maximum length (in characters) of the body.\n"
" # This length does not include tags. Some tags, like <br>, are\n"
" # converted into \\n\\n, but most are not. Words whose characters occur\n"
" # past the maximum length will not be indexed. Read more about quality or\n"
" # maps.\n"
//" # You will still be limited by the \"#define MAX_WORDS 10000\"\n"
//" # statement, but this is slated to disappear soon.\n"
" <quality11> 15 </>\n"
" <quality12> 30 </>\n"
" <quality13> 45 </>\n"
" <quality14> 60 </>\n"
" <quality15> 80 </>\n"
" <maxLen11> 80000 </>\n"
" <maxLen12> 100000 </>\n"
" <maxLen13> 100000 </>\n"
" <maxLen14> 100000 </>\n"
" <maxLen15> 100000 </>\n"
"\n"
" # Map the URL's quality to a maximum score for indexing the terms in the\n"
" # body. 100%% is the maximum 'maximum score'. You cannot exceed 100%% ever.\n"
" <quality21> 15 </>\n"
" <quality22> 30 </>\n"
" <quality23> 45 </>\n"
" <quality24> 60 </>\n"
" <quality25> 80 </>\n"
" <maxScore21> 30 </>\n"
" <maxScore22> 45 </>\n"
" <maxScore23> 60 </>\n"
" <maxScore24> 80 </>\n"
" <maxScore25> 100 </>\n"
"\n"
" # Map the URL's quality to a percentage score boost for the terms in the\n"
" # body. This boost is multiplied by the score of each term indexed.\n"
" <quality31> 15 </>\n"
" <quality32> 30 </>\n"
" <quality33> 45 </>\n"
" <quality34> 60 </>\n"
" <quality35> 80 </>\n"
" <scoreWeight31> 60 </>\n"
" <scoreWeight32> 100 </>\n"
" <scoreWeight33> 150 </>\n"
" <scoreWeight34> 200 </>\n"
" <scoreWeight35> 250 </>\n"
"\n"
" # Map the length of the body (in characters) to a percentage score boost for\n"
" # the terms in the body. This boost is multiplied by the score of each term\n"
" # indexed. This length does not include tags. Some tags, like <br>, are\n"
" # converted into \\n\\n, but most are not.\n"
" # This is now obsolete for newer documents. Please use the numWords map\n"
" # immediately following. It supports unicode better, too.\n"
" #<len41> 100 </>\n"
" #<len42> 500 </>\n"
" #<len43> 1000 </>\n"
" #<len44> 2000 </>\n"
" #<len45> 5000 </>\n"
" #<len46> 10000 </>\n"
" #<len47> 20000 </>\n"
" #<len48> 50000 </>\n"
" #<scoreWeight41> 300 </>\n"
" #<scoreWeight42> 250 </>\n"
" #<scoreWeight43> 200 </>\n"
" #<scoreWeight44> 150 </>\n"
" #<scoreWeight45> 100 </>\n"
" #<scoreWeight46> 80 </>\n"
" #<scoreWeight47> 60 </>\n"
" #<scoreWeight48> 40 </>\n"
"\n"
" # Map the number of words to a percentage score boost for the terms in\n"
" # the body. This boost is multiplied by the score of each term\n"
" # indexed.\n"
" <numWords41> 20 </>\n"
" <numWords42> 100 </>\n"
" <numWords43> 200 </>\n"
" <numWords44> 400 </>\n"
" <numWords45> 1000 </>\n"
" <numWords46> 2000 </>\n"
" <numWords47> 4000 </>\n"
" <numWords48> 10000 </>\n"
" <scoreWeight41> 300 </>\n"
" <scoreWeight42> 250 </>\n"
" <scoreWeight43> 200 </>\n"
" <scoreWeight44> 150 </>\n"
" <scoreWeight45> 100 </>\n"
" <scoreWeight46> 80 </>\n"
" <scoreWeight47> 60 </>\n"
" <scoreWeight48> 40 </>\n"
"\n"
" # Map the length of the body (in characters) to a maximum score for the\n"
" # terms in the body. 100%% is the maximum 'maximum score'.\n"
" <len51> 100 </>\n"
" <len52> 500 </>\n"
" <len53> 1000 </>\n"
" <len54> 2000 </>\n"
" <len55> 5000 </>\n"
" <maxScore51> 30 </>\n"
" <maxScore52> 45 </>\n"
" <maxScore53> 60 </>\n"
" <maxScore54> 80 </>\n"
" <maxScore55> 100 </>\n"
"\n"
"</index>\n"
"\n"
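"# A worked illustration of the word-score averaging described in the body\n"
"# block above (numbers hypothetical): a non-link word scores 128 points, and\n"
"# if its 15 neighbors are all link words at 21 points each, its average is\n"
"# (128 + 15*21)/16 = ~27.7. With <scoreBySection> enabled and\n"
"# <minAvgWordScore> set above 27, that word would not be indexed, which is\n"
"# how menu-heavy regions get suppressed.\n"
"\n"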
"# This one is similar to the above, but we're indexing \"title:\" terms.\n"
"# The major differences are in red.\n"
"<index>\n"
" <name> title </>\n"
" <prefix> title </> # prepend a \"title:\" to the term before indexing\n"
" <maxQualityForSpamDetect> 0 </>\n"
" <minQualityToIndex> 0 </>\n"
" <filterHtmlEntities> yes </>\n"
"\n"
" # This tells Gigablast not to index a word or phrase if it has already been\n"
" # indexed. This means that repeating terms in the title will have no effect.\n"
" <indexIfUniqueOnly> yes </>\n"
"\n"
" <indexSingletons> yes </>\n"
" <indexPhrases> yes </>\n"
" <indexAsWhole> no </>\n"
" <useStopWords> yes </>\n"
" <useStems> no </>\n"
"\n"
" # Map URL's quality to a maximum length for this field.\n"
" <quality11> 15 </>\n"
" <quality12> 30 </>\n"
" <quality13> 45 </>\n"
" <quality14> 60 </>\n"
" <quality15> 80 </>\n"
" <maxLen11> 80000 </>\n"
" <maxLen12> 100000 </>\n"
" <maxLen13> 150000 </>\n"
" <maxLen14> 200000 </>\n"
" <maxLen15> 250000 </>\n"
"\n"
" # Map URL's quality to a maximum score for terms in this field.\n"
" <quality21> 15 </>\n"
" <quality22> 30 </>\n"
" <quality23> 45 </>\n"
" <quality24> 60 </>\n"
" <quality25> 80 </>\n"
" <maxScore21> 30 </>\n"
" <maxScore22> 45 </>\n"
" <maxScore23> 60 </>\n"
" <maxScore24> 80 </>\n"
" <maxScore25> 100 </>\n"
"\n"
" # Map URL's quality to a percentage score boost for terms in this field.\n"
" <quality31> 15 </>\n"
" <quality32> 30 </>\n"
" <quality33> 45 </>\n"
" <quality34> 60 </>\n"
" <quality35> 80 </>\n"
" <scoreWeight31> 60 </>\n"
" <scoreWeight32> 100 </>\n"
" <scoreWeight33> 150 </>\n"
" <scoreWeight34> 200 </>\n"
" <scoreWeight35> 250 </>\n"
"\n"
" # Map the field's length to a percentage score boost for terms in this\n"
" # field.\n"
" <len41> 100 </>\n"
" <len42> 500 </>\n"
" <len43> 1000 </>\n"
" <len44> 2000 </>\n"
" <len45> 5000 </>\n"
" <scoreWeight41> 300 </>\n"
" <scoreWeight42> 200 </>\n"
" <scoreWeight43> 150 </>\n"
" <scoreWeight44> 100 </>\n"
" <scoreWeight45> 50 </>\n"
"\n"
" # Map the field's length to a maximum score for terms in this field.\n"
" <len51> 100 </>\n"
" <len52> 500 </>\n"
" <len53> 1000 </>\n"
" <len54> 2000 </>\n"
" <len55> 5000 </>\n"
" <maxScore51> 30 </>\n"
" <maxScore52> 45 </>\n"
" <maxScore53> 60 </>\n"
" <maxScore54> 80 </>\n"
" <maxScore55> 100 </>\n"
"\n"
"</index>\n"
"\n"
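"# A hypothetical variation on the <prefix> pattern above (the tag values are\n"
"# invented for illustration): the same mechanism could field-index an author\n"
"# meta tag so queries like author:smith match it.\n"
"# <index>\n"
"#  <name> meta.author </>\n"
"#  <prefix> author </>\n"
"#  <indexIfUniqueOnly> yes </>\n"
"# </index>\n"
"\n"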
"# Now this one is for all the standard, supported meta tags.\n"
"# Terms in these tags have not been indexed yet, but we do that here.\n"
"<index>\n"
"\n"
" # Gigablast allows multiple fields/parts to be specified for indexing\n"
" # under the same parameters. In this case, we treat the meta summary,\n"
" # meta description and meta keywords tags all equally.\n"
" <name> meta.summary </>\n"
" <name> meta.description </>\n"
" <name> meta.keywords </>\n"
"\n"
" <maxQualityForSpamDetect> 0 </>\n"
" <minQualityToIndex> 0 </>\n"
" <filterHtmlEntities> yes </>\n"
"\n"
" # This tells Gigablast not to index a word or phrase if it has already been\n"
" # indexed. This means that repeating terms in these meta tags will have no\n"
" # effect.\n"
" <indexIfUniqueOnly> yes </>\n"
"\n"
" <indexSingletons> yes </>\n"
" <indexPhrases> yes </>\n"
" <indexAsWhole> no </>\n"
" <useStopWords> yes </>\n"
" <useStems> no </>\n"
"\n"
" # Map URL's quality to a maximum length for this field.\n"
" <quality11> 15 </>\n"
" <maxLen11> 200 </>\n"
"\n"
" # Map URL's quality to a maximum score for terms in this field.\n"
" <quality21> 15 </>\n"
" <maxScore21> 100 </>\n"
"\n"
" # Map URL's quality to a percentage score boost for terms in this field.\n"
" <quality31> 15 </>\n"
" <scoreWeight31> 100 </>\n"
"\n"
" # Map the field's length to a percentage score boost for terms in this\n"
" # field.\n"
" <len41> 100 </>\n"
" <scoreWeight41> 100 </>\n"
"\n"
" # Map the field's length to a maximum score for terms in this field.\n"
" <len51> 100 </>\n"
" <maxScore51> 100 </>\n"
"\n"
"</index>\n"
"\n"
*/
/*
" \n"
"\n"
"This simple script is used to start up all the gb hosts (processes) native to a particular computer. It also redirects the gb program's standard error to a log file. Notice that the gb executable takes the config filename as the argument to its -c option."
" \n"
" \n"
"#!/bin/bash\n"
"# move the old log file\n"
"mv /workdir/loga /workdir/loga-`date '+%%Y_%%m_%%d-%%H:%%M:%%S'`.log\n"
"# start up gb\n"
"/workdir/gb -c /workdir/hosts.conf >& /workdir/loga &\n"
"\n"
"\n"
" \n"
*/
" \n"
"\n"
" \n"
" \n"
"at be by of on\n"
"or do he if is\n"
"it in me my re\n"
"so to us vs we\n"
"the and are can did\n"
"per for had has her\n"
"him its not our she\n"
"you also been from have\n"
"here hers ours that them\n"
"then they this were will\n"
"with your about above ain\n"
"could isn their there these\n"
"those would yours theirs aren\n"
"hadn didn hasn ll ve\n"
"should shouldn\n"
"\n"
" \n"
" \n"
"\n"
"\n"
" \n"
"Certain punctuation breaks up a phrase. All single-character punctuation marks can be phrased across, with the exception of the following:\n"
"
\n" "The following 2 character punctuation sequences break phrases:\n" "
\n" "\n" "All 3 character sequences of punctuation break phrases with the following exceptions:\n" "
\n" "\n" "All sequences of punctuation greater than 3 characters break phrases with the sole exception being a sequence of strictly whitespaces.\n" "\n" " \n" " \n" " |