Open source search engine in information retrieval
A search engine is a software program that helps people find the information they are looking for online using search queries containing keywords or phrases.
Search engines can return results quickly, even across millions of records, because they index every data record they find. In this blog post, I will list five popular open-source search engines that can be used to build search functionality into your website.
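To make the role of indexing concrete, here is a minimal sketch of an inverted index, the general data structure behind most full-text engines. It is a toy illustration, not how any of the engines below are actually implemented; the function names and sample documents are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample data.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Typesense is a typo tolerant search engine",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search lucene")))  # [1, 2]
```

A lookup touches only the entries for the query terms instead of scanning every record, which is why query time stays low as the collection grows.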
Apache Lucene is a free and open-source search engine software library, originally written completely in Java. It is supported by the Apache Software Foundation and is released under the Apache Software License. It is suitable for nearly any application that requires full-text search, especially cross-platform applications.
Below are some of the key features of Apache Lucene.
- Data Indexing
- Keyword Highlighting
- Advanced analysis/tokenization capabilities.
Apache Solr is the popular, blazing-fast, open-source enterprise search platform built on Apache Lucene. Solr is a standalone search server with a REST-like API. You can put documents in it (called “indexing”) via JSON, XML, CSV, or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV, or binary results.
Below are some of the key features of Solr.
- Advanced Full-Text Search Capabilities: Solr enables powerful matching capabilities including phrases, wildcards, joins, grouping, and much more across any data type.
- Optimized for High Volume Traffic: Solr is proven at extremely large scales the world over.
- Comprehensive Administration Interfaces: Solr provides a responsive administrative user interface to make it easy to control your Solr instances.
- Easy Monitoring: Solr publishes loads of metric data via JMX, which helps you get more insight into your Solr instances.
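As a rough illustration of Solr's HTTP interface, the sketch below builds (but does not send) an indexing request and a query URL. It assumes a Solr instance on localhost at the default port 8983 and a collection named testcollection with a title field; those names are assumptions for this example only.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_URL = "http://localhost:8983/solr"  # assumption: local Solr, default port

def build_index_request(collection, docs):
    """Build (without sending) a POST that adds JSON documents to a collection."""
    url = f"{SOLR_URL}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")

def build_query_url(collection, query, rows=10):
    """Build the GET URL for a search against the collection's /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_URL}/{collection}/select?{params}"

# Hypothetical document and query.
request = build_index_request("testcollection", [{"id": "1", "title": "Apache Lucene"}])
print(request.get_method(), request.full_url)
print(build_query_url("testcollection", "title:lucene"))
```

Against a running server, the request object could be passed to urllib.request.urlopen, and the query URL fetched with any HTTP client.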
Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. It is built based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Elasticsearch provides advanced full-text search capabilities such as data indexing and searching with phrases, wildcards, auto-suggestions, filters, and facets. Elasticsearch can also be used for other use cases, such as:
- Logs – Storing logs via ELK (Elasticsearch, Logstash, Kibana)
- Metrics – Monitor and Visualise your system metrics
- APM – Get insights into your application performance
- App Search – Search across your documents, geodata, and more.
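To show what "schema-free JSON documents over HTTP" looks like in practice, here is a small sketch that describes two Elasticsearch requests: storing a document and running a match query. It only builds the method, path, and body; the index name articles and its fields are hypothetical, and sending the requests would need a running cluster and an HTTP client.

```python
import json

def index_request(index, doc_id, document):
    """Describe the HTTP call that stores one JSON document under an explicit id."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(document)

def match_search_request(index, field, text):
    """Describe a full-text search that uses a Query DSL match query."""
    body = {"query": {"match": {field: text}}}
    return "POST", f"/{index}/_search", json.dumps(body)

# "articles" and its fields are hypothetical names used only for illustration.
method, path, body = index_request("articles", 1, {"title": "Apache Lucene", "year": 2020})
print(method, path, body)
print(*match_search_request("articles", "title", "lucene"))
```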
MeiliSearch is an open-source, blazingly fast and hyper-relevant search engine that will improve your search experience. It provides an extensive toolset for customization. It works out-of-the-box with a preset that easily answers the needs of most applications. Communication is done with a RESTful API because most developers are already familiar with its norms.
Below are a few key features of MeiliSearch.
- Synonyms: Ability to create synonyms for a better search experience.
- Highlight: With highlight, users understand their search results and act upon them.
- Custom Relevancy: It gives you the possibility to add new sorting rules. You can order results by date, likes, whatever suits your dataset.
- Filters: Improve your search query by adding custom filters.
- Faceting: Empower users to drill down on search results and find what they need faster.
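Filters and facets are easier to picture with a tiny example. The sketch below is engine-agnostic plain Python, not the MeiliSearch API: it narrows a hypothetical product list with exact-match filters and then counts the values of one field, which is what a facet panel displays.

```python
from collections import Counter

def apply_filters(records, **filters):
    """Keep only the records whose fields equal every given filter value."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    return Counter(r[field] for r in records if field in r)

# Hypothetical catalogue.
products = [
    {"name": "Trail shoe", "brand": "Acme", "color": "red"},
    {"name": "Road shoe", "brand": "Acme", "color": "blue"},
    {"name": "Sandal", "brand": "Zen", "color": "red"},
]
red_products = apply_filters(products, color="red")
print([p["name"] for p in red_products])
print(dict(facet_counts(products, "brand")))
```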
Typesense is a fast, typo-tolerant search engine for building delightful search experiences. It claims that it is an Easier-to-Use ElasticSearch Alternative & an Open Source Algolia Alternative.
Below are a few key features of Typesense.
- Typo Tolerance: Handles typographical errors elegantly, out-of-the-box.
- Simple and Delightful: Simple to set up, integrate with, operate, and scale.
- Blazing Fast: Built in C++. Meticulously architected from the ground up for low-latency (<50ms) instant searches.
- Tunable Ranking: Easy to tailor your search results to perfection.
- Sorting: Sort results based on a particular field at query time (helpful for features like “Sort by Price”).
- Faceting & Filtering: Drill down and refine results.
- Grouping & Distinct: Group similar results together to show more variety.
- Federated Search: Search across multiple collections (indices) in a single HTTP request.
- Synonyms: Define words as equivalents of each other, so searching for a word will also return results for the synonyms defined.
- Geo Search: Search and sort results around a geographic location.
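Typo tolerance is commonly explained in terms of edit distance, so here is a minimal sketch of that idea. It is a generic Levenshtein implementation with a hypothetical vocabulary, offered as an illustration only; it does not claim to reproduce Typesense's internal algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def fuzzy_match(query, vocabulary, max_typos=1):
    """Return vocabulary words within max_typos edits of the query."""
    return [w for w in vocabulary if edit_distance(query.lower(), w.lower()) <= max_typos]

# Hypothetical vocabulary.
print(fuzzy_match("serch", ["search", "solr", "merge"]))  # ['search']
```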
Other Popular Enterprise Search Engines
Below are a few other popular Enterprise Search Engines that are not free
Search is an integral part of any application. Performing searches on terabytes and petabytes of data can be challenging when speed, performance, and high availability are core requirements. This blog post will pit Solr against Elasticsearch, two of the most popular open source search engines whose fortunes over the years have gone in different directions.
Both of them are built on top of Apache Lucene, so the features they support are very similar. However, they differ significantly in terms of deployment, scalability, query language, and many other functionalities.
About Apache Solr
Apache Solr is an open-source search server built on top of Lucene that provides all of Lucene’s search capabilities through HTTP requests. It has been around for almost a decade and a half, making it a mature product with a broad user community.
Solr offers powerful features such as distributed full-text search, faceting, near real-time indexing, high availability, NoSQL features, integrations with big data tools such as Hadoop, and the ability to handle rich-text documents such as Word and PDF.
Elasticsearch is also an open-source search engine built on top of Apache Lucene, and it forms part of the ELK Stack together with Logstash and Kibana. It extends Lucene’s powerful indexing and search functionalities using RESTful APIs, and it achieves the distribution of data across multiple servers using the concepts of indices and shards. Elasticsearch is completely based on JSON and is suitable for time series and NoSQL data.
This tool is much younger than Solr, but it has gained a lot of popularity because of its feature-rich use cases. Some of its primary features include distributed full-text search, high availability, a powerful query DSL, multitenancy, Geo Search, and horizontal scaling.
According to DB-Engines, which ranks database management systems and search engines according to their popularity, Elasticsearch is ranked number one, and Solr is ranked number three.
Solr gained popularity in the first ten years of its existence, but Elasticsearch has been the most popular search engine since 2016.
Figure 1: DB-Engines Ranking—Elasticsearch vs. Solr Popularity (Source: DB-Engines)
Installation and Configuration
Java is the primary prerequisite for installing both of these engines. The default Elasticsearch configuration requires 1GB of heap memory. This can be changed in the jvm.options file inside the config directory.
By default, Solr allocates 512MB of heap memory to an instance. This setting can be changed in either the solr script file or the solr.in.cmd file. Both files are located inside the bin directory of the Solr installation.
Elasticsearch is easy to install and configure, but it’s quite a bit heavier than Solr. The latest version of Elasticsearch (version 7.7.1, released in June 2020) has a compressed size of 314.5MB, whereas Solr (version 8.5.2, released in May 2020) ships at 191.7MB.
Configuration files in Elasticsearch are written in YAML format. Solr uses XML-based configuration files.
Indexing and Searching
Both Solr and Elasticsearch write indexes in Lucene. But, since differences exist in sharding and replication (among other features), there are also differences in their files and architectures. Additionally, Elasticsearch has native DSL support while Solr has a robust Standard Query Parser that aligns to Lucene syntax.
Both tools support a wide range of data sources.
Solr uses request handlers to ingest data from XML files, CSV files, databases, Microsoft Word documents, and PDFs. With native support for the Apache Tika library, it supports extraction and indexing from over one thousand file types. Solr ships with a simple command-line tool, post. To ingest CSV-based data into a collection named testcollection, for example, you just need to use the following command:
bin/post -c testcollection *.csv
Elasticsearch, on the other hand, is completely JSON-based. It supports data ingestion from multiple sources using the Beats family (lightweight data shippers available in the ELK Stack) and Logstash.
While both products are document-oriented search engines, Solr has always been more focused on enterprise-directed text searches with advanced information retrieval (IR). Consequently, it’s more suited for search applications that use massive amounts of static data. Solr fits better into enterprise applications that already implement big data ecosystem tools, such as Hadoop and Spark. Additionally, Solr stands out in handling Rich Text Format (RTF) documents. To compete with Elasticsearch, recent Solr releases have offered new features such as Parallel SQL Interface and streaming expressions.
Elasticsearch is focused more on scaling, data analytics, and processing time series data to obtain meaningful insights and patterns. Its large-scale log analytics performance makes it quite popular. Elasticsearch is more suited to modern web applications where data is carried in and out in JSON format. Elasticsearch has also put a lot of development effort into making its tool more resilient, which makes it viable as a primary data store.
Both Solr and Elasticsearch support NRT (near real-time) searches and take advantage of all of Lucene’s search capabilities. Both also support a JSON-based Query DSL, and each adds its own search-related feature set, described below.
Earlier Solr versions had to rely on the Standard Query Parser, but Solr now also supports a JSON-based Query DSL. While Solr’s Standard Query Parser allows users to create a variety of structured queries, the chances of making syntax errors while writing these queries are much higher. Nevertheless, you can write very complex search queries in Solr that are unavailable in Elasticsearch. Solr includes a sample search UI, called Velocity Search, that offers powerful features such as searching, faceting, highlighting, autocomplete, and Geo Search.
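One common source of syntax errors in hand-written Standard Query Parser strings is unescaped special characters in user input. The sketch below shows one way to escape a value and then express the same clause as a JSON request body. The helper names are hypothetical, and the escaping rule is a simplified approximation of the Lucene query syntax rather than an official utility.

```python
import re

# Characters and operators with special meaning in the Lucene/Solr query syntax,
# plus whitespace, which would otherwise split one value into several clauses.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/\s]|&&|\|\|)')

def escape_term(value):
    """Backslash-escape query syntax characters in a user-supplied value."""
    return _SPECIAL.sub(r"\\\1", value)

def field_query(field, value):
    """Build a field:value clause for the Standard Query Parser."""
    return f"{field}:{escape_term(value)}"

def json_request(field, value, limit=10):
    """Wrap the same clause in a JSON Request API style body."""
    return {"query": field_query(field, value), "limit": limit}

# Hypothetical field name and user input.
print(field_query("title", "C++ guide"))
print(json_request("title", "C++ guide"))
```

The JSON form does not remove the need for escaping inside the query string, but it keeps paging and other parameters out of the query text itself.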
Elasticsearch’s DSL is native. Its aggregation framework is powerful, exposing aggregation queries through its APIs with better caching. More recent releases of the tool also manage their memory footprint better.
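For a sense of what an aggregation query looks like, here is a minimal sketch that builds the JSON body of a terms aggregation, which buckets documents by the values of one field. The field name status and the bucket name by_status are hypothetical, and the body would be sent to an index's _search endpoint.

```python
import json

def terms_aggregation_body(field, bucket_count=10):
    """Build a search body that returns only bucket counts for the given field."""
    return {
        "size": 0,  # skip the matching documents, return aggregations only
        "aggs": {
            f"by_{field}": {"terms": {"field": field, "size": bucket_count}}
        },
    }

# "status" is a hypothetical keyword field.
body = terms_aggregation_body("status")
print(json.dumps(body, indent=2))
```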
Because Elasticsearch is schemaless, it is easy to index unstructured data and dynamic fields without defining the schema of the index in advance. Earlier Solr versions required a defined schema before indexing data. However, Solr now supports a schemaless mode.
Both search engines support custom analyzers, synonym-based indexing, stemming, and various tokenization options.
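The analysis steps mentioned above can be sketched as a small pipeline. This is a deliberately naive illustration in plain Python, with a made-up stopword list, synonym map, and suffix-stripping rule; the analyzers in Solr and Elasticsearch are far more sophisticated.

```python
import re

STOPWORDS = {"a", "an", "and", "are", "is", "of", "the"}  # hypothetical list
SYNONYMS = {"laptop": ["notebook"]}                        # hypothetical map

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Very naive suffix stripping; real stemmers handle many more cases."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, drop stopwords, stem, and expand synonyms of each stem."""
    terms = []
    for token in tokenize(text):
        if token in STOPWORDS:
            continue
        stemmed = stem(token)
        terms.append(stemmed)
        terms.extend(SYNONYMS.get(stemmed, []))
    return terms

print(analyze("The laptops are searching indexes"))
# ['laptop', 'notebook', 'search', 'index']
```

Running the same analysis at index time and at query time is what lets a search for "notebook" match a document that only mentions "laptops".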
Scalability and Distribution
Search engines have to quickly process large amounts of data and complex queries on sets of hundreds of millions of records. Sometimes these queries can be so resource-intensive that they can take the whole system down—especially if you haven’t planned for the load in advance and can’t scale quickly. For this reason, a search engine must be scalable and fault-tolerant in nature.
Clusters, Sharding, and Rebalancing
Both Elasticsearch and SolrCloud support sharding. But, since Elasticsearch was designed with horizontal scaling in mind, it has better support for scaling and cluster management. Its disadvantage is that the number of shards in an index cannot be increased once the index has been created, although you can use the shrink API to reduce it. SolrCloud supports further splitting of an existing shard but not the shrinking of shards.
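A short sketch helps explain why shard counts are hard to change after the fact. It assumes documents are routed with a formula of the general shape hash(key) modulo number of shards; the CRC32 hash and the document ids here are stand-ins chosen for illustration, not the hash function either engine actually uses.

```python
import zlib

def shard_for(routing_key, num_shards):
    """Pick a shard from a stable hash of the routing key."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards

# Hypothetical document ids.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Count documents whose shard would change if the index grew from 5 to 6 shards.
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents map to a different shard")
```

Under this assumption, changing the shard count remaps most documents, so growing an index tends to require reindexing or a structured split rather than simply adding a shard.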
Elasticsearch’s built-in Zen Discovery module handles cluster coordination. SolrCloud requires Apache ZooKeeper, an additional service.
If a shard or node fails, Elasticsearch rebalances the cluster itself and rarely requires manual intervention. In SolrCloud, rebalancing is complex and hard to manage.
Solr has long had a broad open source community. Anyone can still contribute to Solr, and new Solr developers or code committers are elected based on merit only. Elasticsearch is technically open source, but not fully. All contributors have access to the source code, and users can make changes and contribute them, but final changes must be approved by employees of Elastic (the company that runs Elasticsearch and other software). Elasticsearch is therefore driven more by a single company than by a whole community. This is not to mention the number of non-open, premium features that Elasticsearch (and the Elastic/ELK Stack in general) offers.
Going back to the mid-2010s, Solr contributors and committers spanned multiple organizations, while Elasticsearch committers came from Elastic only. Solr’s strong community had a healthy project pipeline, and many well-known companies took part. These members also invested in the platform by contributing throughout the entire development and engineering process.
This has changed drastically in the last five years. Elasticsearch’s community of contributors and its user base have grown immensely. It is by far the most popular open source time-series database and search engine in DevOps at the beginning of the 2020s.
Historically, both have had great user bases as well as rich developer communities, but Elasticsearch has overtaken Solr. Solr has been around for much longer, but its ecosystem has stagnated despite having been well developed and having had a larger user base.
On documentation, Elasticsearch wins. Not only does Elasticsearch’s official website offer well-organized, high-quality documentation with clear examples, but the internet is also flush with books and guides, thanks to the tool’s popularity. Over the last four years, Elasticsearch has improved its documentation beyond mere organization, adding good examples and clear configuration instructions.
In comparison, Solr documentation is lacking. The overall coverage of Solr’s APIs is minimal, and it’s hard to find good technical examples and tutorials. It used to be the other way around: Solr was a very well-documented product with clear examples and contexts for API use cases. However, its documentation maintenance has fallen behind, with gaps noted by many users.
Summary: Solr vs Elasticsearch
Selecting a clear winner between these two technologies requires a complete understanding of the use cases they support, their feature sets, the scaling options they offer, and their ease of maintenance.
Here’s a summary of each tool’s attributes:
| | Solr | Elasticsearch |
|---|---|---|
| Installation and Configuration | Easy to get up and running, with very supportive documentation | Easy to get up and running, with very supportive documentation. Several packages are available for various platforms. |
| Searching and Indexing | Optimal for text search and enterprise applications close to the big data ecosystem | Useful as both a text search and an analytical engine because of its powerful aggregation module |
| Scalability and Clustering | Supported through SolrCloud; depends on Apache ZooKeeper for cluster coordination | Better inherent scalability; design optimal for cloud deployments |
| Community | A historically large ecosystem | A thriving ecosystem for the FOSS version of Elasticsearch and the ELK Stack |
| Documentation | Patchy, out-of-date | Well-documented |
Both of these technologies are quite easy to begin working with. Solr offers great functionalities in the field of information retrieval, but Elasticsearch is much easier to take into production and scale. When choosing your tool, make sure to look at your requirements and make the best selection for your specific use case.