# All <, >, " and # characters that are values for a field contained herein # must be represented as <, >, " and # respectively. # Mem available to this process. May be exceeded due to fragmentation. 4000000000 # Below the various Gigablast databases are configured. # <*dbMaxTreeMem> - mem used for holding new recs # <*dbMaxDiskPageCacheMem> - disk page cache mem for this db # <*dbMaxCacheMem> - cache mem for holding single recs # <*dbSaveCache> - save the rec cache on exit? # <*dbMaxCacheAge> - max age (seconds) for recs in rec cache # See that Stats page for record counts and stats. # How many bytes should be used for caching DNS replies? 128000 # A tagdb record assigns a url or site to a ruleset. Each tagdb record is # about 100 bytes or so. 1028000 200000 # A catdb record assigns a url or site to DMOZ categories. Each catdb record # is about 100 bytes. 1000000 25000000 0 # Clusterdb caches small records for site clustering and deduping. 1000000 0 # Max memory for dup vector cache. 10000000 # Robotdb caches robot.txt files. 128000 0 0 5000000 0 1000000 # Maximum bytes of a doc that can be sent before having to read more from disk 128000 # Bytes to use for caching search result pages. 100000 # Read only mode does not allow spidering. 0 # Controls all spidering for all collections 1 # What is the maximum number of web pages the spider is allowed to download # simultaneously for ALL collections PER HOST? 100 # Can people use the add url interface to add urls to the index? 1 # Save data in memory to disk after this many minutes have passed without the # data having been dumped or saved to disk. Use 0 to disable. 5 # Maximum sockets available to serve incoming HTTP requests. Too many # outstanding requests will increase query latency. Excess requests will # simply have their sockets closed. 100 # Maximum sockets available to serve incoming HTTPS requests. Like max http # sockets, but for secure sockets. 100 # Identification seen by web servers when the Gigablast spider downloads their # web pages. It is polite to insert a contact email address here so webmasters # that experience problems from the Gigablast spider have somewhere to vent. # If this is true, gb will send Accept-Encoding: gzip to web servers when # doing http downloads. 0 # How many seconds should we cache a search results page for? 10800 # Keep track of ips which do queries, disallow non-customers from hitting us # too hard. 0 # If a call to a message callback or message handler in the udp server takes # more than this many milliseconds, then log it. Logs 'udp: Took %lli ms to # call callback for msgType=0x%hhx niceness=%li'. Use -1 or less to disable # the logging. -1 # Sends emails to admin if a host goes down. 0 # Do not send email alerts about dead hosts to anyone except # sysadmin@gigablast.com between the times given below unless all the twins of # the dead host are also dead. Instead, wait till after if the host is still # dead. 0 # Email alerts will include the cluster name # Send an email after a host has not responded to successive pings for this # many milliseconds. 62000 # Send email alerts when query success rate goes below this threshold. # (percent rate between 0.0 and 1.0) 0.850000 # Send email alerts when average query latency goes above this threshold. (in # seconds) 2.000000 # Record this number of query times before calculating average query latency. 300 # At what temperature in Celsius should we send an email alert if a hard drive # reaches it? 
# At what temperature in Celsius should we send an email alert if a hard
# drive reaches it?
45

# Look for this string in the kernel buffer for sending email alert. Useful
# for detecting some strange hard drive failures that really slow
# performance.

# Look for this string in the kernel buffer for sending email alert. Useful
# for detecting some strange hard drive failures that really slow
# performance.

# Look for this string in the kernel buffer for sending email alert. Useful
# for detecting some strange hard drive failures that really slow
# performance.

# Sends to email address 1 through email server 1.
0

# Sends to email address 1 through email server 1 if any parm is changed.
0

# Connects to this IP or hostname directly when sending email 1. Use
# 'apt-get install sendmail' to install sendmail on that IP or hostname.
# Add 'From:10.5 RELAY' to /etc/mail/access to allow sendmail to forward
# email it receives from gigablast if gigablast hosts are on the 10.5.*.*
# IPs. Then run '/etc/init.d/sendmail restart' as root to pick up those
# changes so sendmail will forward Gigablast's mail to the address you give
# below.

# Sends to this address when sending email 1.

# The from field when sending email 1.

# Sends to email address 2 through email server 2.
0

# Sends to email address 2 through email server 2 if any parm is changed.
0

# Connects to this server directly when sending email 2.

# Sends to this address when sending email 2.

# The from field when sending email 2.

# Sends to email address 3 through email server 3.
0

# Sends to email address 3 through email server 3 if any parm is changed.
0

# Connects to this server directly when sending email 3.

# Sends to this address when sending email 3.

# The from field when sending email 3.

# IP address of the primary DNS server. Assumes UDP port 53. REQUIRED FOR
# SPIDERING! Use Google's public DNS 8.8.8.8 as default.
8.8.8.8

# IP address of the secondary DNS server. Assumes UDP port 53. Will be
# accessed in conjunction with the primary dns, so make sure this is always
# up. An ip of 0 means disabled. Google's secondary public DNS is 8.8.4.4.
8.8.4.4

# All hosts send to these DNSes based on hash of the subdomain to try to
# split DNS load evenly. (An illustrative sketch of this selection appears
# after the logging settings below.)
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0
0.0.0.0

# Add IPs here to bar them from accessing this gigablast server.

# Add IPs here to give them an infinite query quota.

# Don't try to autoban queries that have one of these codes. Also, the code
# must be valid for us to use &uip=IPADDRESS as the IP address of the
# submitter for purposes of autoban AND purposes of addurl daily quotas.

# Append extra default parms to queries that match certain substrings.
# Format: text to match in url, followed by a space, then the list of extra
# parms as they would appear appended to the url. One match per line.

# Ban any query that matches this list of substrings. Must match all
# comma-separated strings on the same line. ('\n' = OR, ',' = AND)

# Any matching password will have administrative access to Gigablast and all
# collections. Use the <masterPassword> tag.

# Any IPs in this list will have administrative access to Gigablast and all
# collections. Use the <masterIp> tag.

# Log GET and POST requests received from the http server?
1

# Should we log queries that are autobanned? They can really fill up the log.
1

# If query took this many milliseconds or longer, then log the query and the
# time it took to process.
5000

# Log query reply in proxy, but only for those queries above the time
# threshold above.
0
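# For illustration only, not part of this config: a Python sketch of the
# subdomain-hash DNS splitting described above. The hash function is a
# hypothetical stand-in; gb's real hash and server-selection code differ.
#
#   def hash_subdomain(subdomain: str) -> int:
#       h = 0
#       for c in subdomain:
#           h = (h * 31 + ord(c)) & 0xFFFFFFFF
#       return h
#
#   def pick_dns(subdomain: str, servers: list[str]) -> str:
#       # A given subdomain always hashes to the same server, so lookups
#       # spread evenly across servers without splitting one domain's load.
#       active = [ip for ip in servers if ip != "0.0.0.0"]  # 0 = unused slot
#       if not active:
#           return "8.8.8.8"   # fall back to the primary DNS set above
#       return active[hash_subdomain(subdomain) % len(active)]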
# Log status of spidered or injected urls?
1

# Log messages if Gigablast runs out of udp sockets?
0

# Log messages not related to an error condition, but meant more to give an
# idea of the state of the gigablast process. These can be useful when
# diagnosing problems.
1

# Log it when a document is not added due to quota breach. Log it when a url
# is too long and it gets truncated.
0

# Log various debug messages.
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

# Log various timing related messages.
0

# Log various timing related messages.
0
0
0
0

# Log various timing related messages.
0
0

# Log reminders to the programmer. You do not need this.
0

# If enabled, gigablast will repair the rdbs as specified by the parameters
# below. When a particular collection is in repair mode, it can not spider or
# merge titledb files.
0

# Comma- or space-separated list of the collections to repair or rebuild.

# Memory to use for repair. In bytes.
300000000

# Maximum number of outstanding inject spiders for repair.
32

# If enabled, gigablast will reinject the content of all title recs into a
# secondary rdb system. That will become the primary rdb system when
# complete. (An illustrative sketch of this flow appears at the end of this
# section.)
0

# If enabled, gigablast will keep the new spiderdb records when doing the
# full rebuild or the spiderdb rebuild.
1

# If enabled, gigablast will recycle the link info when rebuilding titledb.
0

# If enabled, gigablast will rebuild this rdb.
1

# If enabled, gigablast will rebuild this rdb.
0

# If enabled, gigablast will rebuild this rdb.
0

# If enabled, gigablast will rebuild this rdb.
0

# If enabled, gigablast will rebuild this rdb.
0

# If disabled, gigablast will skip root urls.
1

# If disabled, gigablast will skip non-root urls.
1

# When rebuilding spiderdb and scanning it for new spiderdb records, should a
# tagdb lookup be performed? Runs much faster without it. Will also keep the
# original doc quality and spider priority intact.
0
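# For illustration only, not part of this config: a Python sketch of the
# two-phase rebuild described above. All names here are hypothetical; the
# real repair code lives inside gb itself.
#
#   def rebuild(titledb, secondary_rdbs, rebuild_flags):
#       # Spidering and titledb merging stay paused for the collection
#       # while repair mode runs.
#       for titlerec in titledb.scan():
#           for name, rdb in secondary_rdbs.items():
#               if rebuild_flags.get(name):   # only rdbs flagged above
#                   rdb.inject(titlerec)      # reinject into secondary rdb
#       # Once every title rec has been reinjected, the secondary rdb
#       # system is swapped in as the primary.
#       return secondary_rdbs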