<html>
<!--TODO: make interactive comparison-->
<title>Comparison of Gigablast vs SOLR Open Source Search Engine</title>
<h2>Comparing Gigablast to SOLR</h2>
<table cellspacing=10 border=1>
<tr>
<td style=max-width:100px;min-width:10%;></td>
<td style=min-width:30%><b><a href=http://www.gigablast.com/>Gigablast</a></b></td>
<td style=min-width:30%><b><a href=http://lucene.apache.org/solr/>Solr</a></b></td>
<!--
<td><b><a href=http://www.elasticsearch.org/>ElasticSearch</a></b></td>-->
</tr>
<tr valign=top>
<td><b>Package Installation</b></td>
<!-- gb install -->
<td>
<a href=/admin.html#quickstart>Download packages for Ubuntu or RedHat</a>
</td>
<!-- solr install-->
<td>
<a href=http://wiki.apache.org/solr/SolrInstall>Instructions</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr valign=top>
<td><b>Source Language</b></td>
<!-- gb install -->
<td>
<font color=green><b>
C/C++
</b></font>
</td>
<!-- solr install-->
<td>
Java
</td>
</tr>
<tr valign=top>
<td><b>Runs on Linux</b></td>
<!-- gb install -->
<td>
Yes.
</td>
<!-- solr install-->
<td>
Yes.
</td>
</tr>
<tr valign=top>
<td><b>Runs on Windows</b></td>
<!-- gb install -->
<td>
Yes, with VirtualBox. Native support coming soon.
</td>
<!-- solr install-->
<td>
Yes.
</td>
</tr>
<tr valign=top>
<td><b>License</b></td>
<!-- gb install -->
<td>
Apache License 2.0
</td>
<!-- solr install-->
<td>
Apache License 2.0
</td>
</tr>
<tr valign=top>
<td><b>Release Date</b></td>
<!-- gb install -->
<td>
2000
</td>
<!-- solr install-->
<td>
2007
</td>
</tr>
<tr valign=top>
<td><b>Scalability</b></td>
<!-- gb install -->
<td>
<font color=green><b>
Has scaled to over 12 billion unique web pages.
Can scale to over 100 billion pages in a single collection.
</b></font>
</td>
<!-- solr install-->
<td>
Not demonstrated at anywhere near this scale, to our knowledge.
</td>
</tr>
<tr valign=top>
<td><b>HTTP API</b></td>
<!-- gb install -->
<td>
<a href=/admin/api>here</a>
</td>
<!-- solr install-->
<td>
<a href=http://wiki.apache.org/solr/SchemaRESTAPI>here</a>
</td>
</tr>
<tr valign=top>
<td><b>Search Results</b></td>
<!-- gb install -->
<td>
<a href=http://www.google.com/search?q=gigablast>here</a>
</td>
<!-- solr install-->
<td>
<a href=http://www.google.com/search?q=solr>here</a>
</td>
</tr>
<tr valign=top>
<td><b>Source Repository</b></td>
<!-- gb install -->
<td>
<a href=https://github.com/gigablast/open-source-search-engine>github</a>
</td>
<!-- solr install-->
<td>
<a href=https://github.com/apache/lucene-solr>github</a>
</td>
</tr>
<tr valign=top>
<td><b>GitHub Stars</b></td>
<!-- gb install -->
<td>
<a href=https://github.com/gigablast/open-source-search-engine>326</a> (8/2/2014)
</td>
<!-- solr install-->
<td>
<a href=https://github.com/apache/lucene-solr>767</a> (8/2/2014)
</td>
</tr>
<tr valign=top>
<td><b>Source Installation</b></td>
<!-- gb install -->
<td>
<font color=green><b>
Just a <a href=/admin.html#src>few simple steps</a>
</b></font>
</td>
<!-- solr install-->
<td>
<a href=http://lucene.apache.org/solr/downloads.html>Source download instructions</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr>
<td>
<b>Complete Web GUI</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
Yes.
</b></font>
</td>
<!--solr-->
<td>
Yes. Solr includes a web-based admin UI.
</td>
</tr>
<tr>
<td>
<b>Operating Layout</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
A single binary containing web server, database, admin tools, spider logic, etc.
</b></font>
</td>
<!--solr-->
<td>
Many different packages stitched together: Apache, MySQL, Lucene, Tika, ZooKeeper, Solr, Nutch, ...
</td>
</tr>
<tr>
<td>
<b>Indexing a Single File Containing Multiple Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
Use curl with the args (including <i>delim</i>) listed <a href=/admin/api#/admin/inject>here</a>
</b></font>
<br>
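A rough sketch (the host, port, collection name and delimiter below are illustrative, not confirmed defaults; the API page linked above is authoritative):<br>
<b>curl "http://127.0.0.1:8000/admin/inject" --data-urlencode "c=main" --data-urlencode "delim=++++" --data-urlencode "content@docs.txt"</b>
<br>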
</td>
<!--solr-->
<td>
Not supported for arbitrary delimited files; Solr's own XML/CSV update formats can carry multiple documents per file.
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual File via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to post the content of the file with args listed
<a href=/admin/api#/admin/inject>here</a>
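<br>For example (host, port and the <i>c</i>, <i>url</i> and <i>content</i> parameter names are our assumptions; see the API page linked above):<br>
<b>curl "http://127.0.0.1:8000/admin/inject" --data-urlencode "c=main" --data-urlencode "url=myfile.html" --data-urlencode "content@myfile.html"</b>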
</td>
<!--solr-->
<td>
You can index individual local files like this:
<b>curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html'</b>
though in our tests it only worked when the HTML met fairly strict formatting requirements.
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual URL via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to inject the URL with the args listed
<a href=/admin/api#/admin/inject>here</a>
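<br>For example (host, port and the <i>c</i> collection parameter are illustrative):<br>
<b>curl "http://127.0.0.1:8000/admin/inject?c=main&amp;url=http://www.example.com/"</b>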
</td>
<!--solr-->
<td>
No built-in URL fetching from the command line; Solr is typically paired with a crawler like Nutch (see Spidering below).
</td>
</tr>
<tr>
<td>
<b>Indexing a File of URLs via cmdline</b>
</td>
<!--gigablast-->
<td>
Use one curl command for each URL, using the interface described
<a href=/admin/api#/admin/inject>here</a>
</td>
<!--solr-->
<td>
Not supported; Solr does not fetch URLs itself (see Spidering below).
</td>
</tr>
<tr>
<td>
<b>Deleting Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
Use a curl command to delete a URL, using the interface described
<a href=/admin/api#/admin/inject>here</a>
</td>
<!--solr-->
<td>
You can delete individual documents by specifying queries that match just those documents:
<b>java -Dcommit....</b>
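<br>A delete-by-query over HTTP also works; this sketch assumes the same base URL as the indexing example above and an <i>id</i> field:<br>
<b>curl "http://127.0.0.1:8080/solr/update?commit=true" -H 'Content-type: text/xml' --data-binary '&lt;delete&gt;&lt;query&gt;id:42&lt;/query&gt;&lt;/delete&gt;'</b>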
</td>
</tr>
<tr>
<td><b>Getting Results via cmdline</b></td>
<td>
Use a curl command to do a search, using the interface described
<a href=/admin/api#/search>here</a>
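<br>For example (host, port and parameter names are as we recall them; the search API page linked above is authoritative):<br>
<b>curl "http://127.0.0.1:8000/search?c=main&amp;q=test&amp;format=xml"</b>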
</td>
<td>
Use curl against Solr's <i>select</i> request handler.
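For example, assuming the same deployment as the file-indexing example above (add the core name to the path if your deployment requires it):<br>
<b>curl "http://127.0.0.1:8080/solr/select?q=test&amp;wt=json"</b>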
</td>
</tr>
<tr>
<td><b>Facets</b></td>
<td>
Yes. Basic support. See gbfacet operators in the <a href=/help.html>help file</a>.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Restricting Search Results by Facet Value Counts</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Numeric Fields</b></td>
<td>
You can forward/reverse sort by and constrain by numeric fields.
</td>
<td>
You can forward/reverse sort by and constrain by numeric fields.
</td>
</tr>
<tr>
<td><b>Boolean Search</b></td>
<td>
Fully nested boolean search with AND, OR and NOT.
</td>
<td>
Fully nested boolean search with AND, OR and NOT.
</td>
</tr>
<!-- title: inurl: -->
<tr>
<td><b>Searchable Fields</b></td>
<td>
Yes. Any meta tag can be indexed as a searchable field, as can fields in JSON or XML documents.
</td>
<td>
Yes. Fields defined in the schema can be searched with <i>field:value</i> syntax.
</td>
</tr>
<!-- CTS -->
<tr>
<td><b>Site Restricted Searches</b></td>
<td>
Yes. Using the site: query operator. Or use &sites=... to constrain your search to up to 500 sites.
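For example, the query <i>laptops site:ibm.com</i> (domain used here only as an illustration) restricts results to pages from ibm.com.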
</td>
<td>
No built-in <i>site:</i> operator; you would need to index a site field and filter on it.
</td>
</tr>
<tr>
<td><b>Spell Checker</b></td>
<td>
Yes, but currently disabled pending improvements.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Language Identification</b></td>
<td>
<font color=green><b>
Yes, at a per-word level for search purposes.
</b></font>
</td>
<td>
Yes, but not at a per-word level for search purposes.
</td>
</tr>
<tr>
<td><b>Index Multiple Languages</b></td>
<td>
Yes. Can expand words in many languages to all their different forms, with more forms coming soon.
</td>
<td>
Yes, but stemming/expansion may be limited.
</td>
</tr>
<tr>
<td><b>Show Images in Search Results</b></td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
<td>
No.
</td>
</tr>
<!-- gigabits -->
<tr>
<td><b>Related Concepts</b></td>
<td>
<font color=green><b>
Yes. Called <i>Gigabits</i>.
</b></font>
</td>
<td>
No.
</td>
</tr>
<tr>
<td><b>Query Expansion (Synonyms)</b></td>
<td>
Yes. You can also add your own expansion terms via the mysynonyms.txt file.
</td>
<td>
Yes. Synonyms can be configured via a synonyms.txt file and the synonym filter.
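A line in synonyms.txt lists comma-separated equivalent terms, e.g. <b>ipod, i-pod, i pod</b>.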
</td>
</tr>
<tr>
<td><b>Cached Pages</b></td>
<td>
Yes.
</td>
<td>
No built-in cached-page view; you would need to keep the original content in a stored field.
</td>
</tr>
<tr>
<td><b>RESTful/XML/JSON APIs</b></td>
<td>
Yes, XML. JSON coming soon.
</td>
<td>
Yes. HTTP APIs with XML and JSON response formats.
</td>
</tr>
<tr>
<td><b>Schemas</b></td>
<td>
<font color=green><b>
You do not need to define schemas to begin indexing files and urls.
</b></font>
</td>
<td>
You have to define schemas before indexing.
</td>
</tr>
<tr>
<td><b>Spidering</b></td>
<td>
<font color=green><b>
Gigablast has a complete distributed web spider with powerful controls.
</b></font>
</td>
<td>
SOLR has no spider. You can try to integrate Nutch.
</td>
</tr>
<tr>
<td><b>Document Filters</b></td>
<td>
antiword (for Microsoft Word)<br>
pdftohtml (for PDF)<br>
xlstohtml (for Excel)<br>
ppthtml (for PowerPoint)<br>
pstotext (for PostScript)
</td>
<td>
Uses Apache Tika for many formats.
</td>
</tr>
<tr>
<td><b>Scalability</b></td>
<td>
<font color=green><b>
Highly scalable. Has scaled to over
12 billion pages while serving millions
of queries per day. Can easily add new servers to the
hosts.conf file and click <i>rebalance shards</i> to
rebalance the data.
</b></font>
</td>
<td>
Has not scaled nearly as high to our knowledge. Not originally built for more than one server.
</td>
</tr>
<tr>
<td><b>Cluster Administration</b></td>
<td>
<font color=green><b>
Built into the web GUI.
</b></font>
</td>
<td>
Requires a separate ZooKeeper installation.
</td>
</tr>
<tr>
<td><b>Performance</b></td>
<td>
<font color=green><b>
High performance. Written in C/C++.
</b></font>
</td>
<td>
Slower. Written in Java. Has garbage collection, etc.
</td>
</tr>
<!--
<tr>
<td><b>Configuration Files and Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Duplicate Content</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Duplicate Sections</b></td>
<td>
Can remove duplicate content at spider time
or query time.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Section Classification</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Phrases</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Query Weighting</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Index Layout</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Ranking Algorithm</b></td>
<td>
<font color=green><b>
A custom algorithm based on query term proximity. Superior to plain TF/IDF or cosine-similarity methods.
</b></font>
</td>
<td>
Old school TF/IDF based on simple statistics.
</td>
</tr>
<tr>
<td><b>Scoring Explanations</b></td>
<td>
Complete scoring information provided.
</td>
<td>
Complete scoring information provided.
</td>
</tr>
<tr>
<td><b>Inlink Text</b></td>
<td>
<font color=green><b>
Indexes incoming link text and compensates for link spam.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>Page Rank</b></td>
<td>
<font color=green><b>
Uses <i>Site Rank</i>, based on the number of incoming links to a site
from other sites. Detects link spam and compensates accordingly.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>On-Page Spam</b></td>
<td>
<font color=green><b>
Demotes terms deemed spammy on a page.
</b></font>
</td>
<td>
None.
</td>
</tr>
<tr>
<td><b>Reliability</b></td>
<td>
Pretty good.
</td>
<td>
Pretty good.
</td>
</tr>
<!--
<tr>
<td><b>Administration</b></td>
<td>
Simple web-based GUI and API.
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>File Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Developer Documentation</b></td>
<td>
Yes. <a href=/developer.html>Here</a>.
</td>
<td>
Yes. Lots of documentation.
</td>
</tr>
<tr>
<td><b>Graphing</b></td>
<td>
Graphs performance of various subroutines and query times.
</td>
<td>
Unknown.
</td>
</tr>
<tr>
<td><b>Monitoring</b></td>
<td>
<font color=green><b>
Monitors drive temperature, disk space, query latency and shard uptime. Sends email alerts.
</b></font>
</td>
<td>
None known.
</td>
</tr>
<tr>
<td><b>Geospatial</b></td>
<td>
Can be done with the numeric gbminint: and gbmaxint: query operators on lat/lon fields.
See <a href=/help.html>help file</a> for examples using these operators.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Dynamic Summaries</b></td>
<td>
Yes. Contain query terms.
</td>
<td>
Yes. Contain query terms.
</td>
</tr>
<tr>
<td><b>Site Clustering</b></td>
<td>
Yes.
</td>
<td>
Possible via result grouping on a site field, though not provided out of the box.
</td>
</tr>
<tr>
<td><b>More Like This</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Sort by Date</b></td>
<td>
<i>gbsortbyint:gbspiderdate</i><br>
<i>gbsortbyint:gbindexdate</i><br>
<i>gbrevsortbyint:gbspiderdate</i><br>
<i>gbrevsortbyint:gbindexdate</i><br>
See <a href=/help.html>help file</a> for examples using these operators.
</td>
<td>
Yes. Results can be sorted on any date field with the <i>sort</i> parameter.
</td>
</tr>
<tr>
<td><b>Query Completion</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Available with an additional module.
</b></font>
</td>
</tr>
<tr>
<td><b>Document Collections</b></td>
<td>
<font color=green><b>
Supports tens of thousands of separate collections,
and federated search across them.
</b></font>
</td>
<td>
Yes. Multiple cores/collections are supported.
</td>
</tr>
</table>
<br><br><br>