Currently, we are upgrading a site from Drupal 7 to Drupal 9. During this upgrade, we have the opportunity to review the site's search functionality and select the tool best suited to the website's requirements.
Search is integral to a site's functionality. We are looking at Solr and Elasticsearch as the primary candidates, since they are two of the most popular open-source search engines. To begin with, both are built on top of Apache Lucene, so the features they support are very similar. Yet what they deliver differs significantly in terms of deployment, scalability, query language and the like.
Solr is a completely open-source search engine, whereas Elasticsearch, though open source, is still managed by Elastic’s employees. Solr is geared toward text search, while Elasticsearch is mainly used for analytical querying, filtering, and grouping.
Begin by asking yourself:
- Has either been used previously?
- Is the data analytical or static?
Elasticsearch is an open-source search engine built on top of the Apache Lucene library. It provides a distributed, multi-tenant full-text search engine with an HTTP (REST) web interface and schema-free JSON documents. Because it is based on JSON, it is well suited to time series and NoSQL data.
As a distributed search engine, its indexes can be divided into shards, and each shard can have multiple copies (replicas). Each node can also act as a coordinator, delegating operations to the correct shards.
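To make the idea of fragments (shards) and copies (replicas) more concrete, here is a minimal Python sketch of how a coordinator might pick the shard for a document. It assumes a simple hash-modulo scheme. The function names, the MD5 hash, the copy naming, and the shard counts are all illustrative assumptions, not Elasticsearch's actual implementation.

```python
import hashlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a primary shard number (illustrative hash-modulo routing)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % num_primary_shards


def copies_for_shard(shard: int, num_replicas: int) -> list[str]:
    """List the primary and its replica copies for a shard (hypothetical naming)."""
    replicas = [f"shard-{shard}-replica-{r}" for r in range(1, num_replicas + 1)]
    return [f"shard-{shard}-primary"] + replicas


if __name__ == "__main__":
    for doc_id in ("node/1", "node/2", "node/3"):
        shard = pick_shard(doc_id, num_primary_shards=3)
        print(doc_id, "->", copies_for_shard(shard, num_replicas=1))
```

The same document ID always lands on the same shard, which is what lets a coordinator send an operation to the right place without consulting every shard.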
Reasons to consider Elasticsearch:
- You don't have a search engine yet and haven't invested a lot of time, money and energy in integrating one;
- You have large volumes of data and need to shard and replicate data (search indices) more easily, and to shrink or grow your search cluster;
- Elasticsearch is easier to get going out of the box, without much understanding of how things work. This is good for getting started, but be aware of the dangers as the data and cluster grow;
- Elasticsearch lends itself to scaling, and attracts use cases demanding larger clusters with more data and more nodes;
- Elasticsearch is more dynamic: you can move data around the cluster more easily as its nodes come and go, although this can impact the stability and performance of the cluster;
- While Solr has traditionally been more geared toward text search, Elasticsearch aims to handle analytical types of queries.
A list of key features:
- Distributed search;
- Multi-tenancy;
- An analyzer chain;
- Analytical search;
- Grouping and aggregation.
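Grouping and aggregation is easiest to see with a small example. The sketch below builds a request body for a terms aggregation and then reproduces the same kind of bucket counts locally in plain Python, so the result can be inspected without a running cluster. The field name `content_type`, the aggregation name, and the sample documents are hypothetical.

```python
from collections import Counter


def terms_aggregation_body(field: str, size: int = 10) -> dict:
    """Build a search body that asks only for bucket counts on `field`."""
    return {
        "size": 0,  # return no hits, only the aggregation results
        "aggs": {f"by_{field}": {"terms": {"field": field, "size": size}}},
    }


def local_terms_buckets(docs: list[dict], field: str) -> list[dict]:
    """Emulate the bucket counts locally: one bucket per distinct value, largest first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


if __name__ == "__main__":
    docs = [
        {"content_type": "article"},
        {"content_type": "page"},
        {"content_type": "article"},
    ]
    print(terms_aggregation_body("content_type"))
    print(local_terms_buckets(docs, "content_type"))
```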
As noted earlier, Solr is an open-source search platform built on a Java library called Lucene, and it exposes Apache Lucene's search functions in an easy-to-use way. Some of the benefits that Solr offers include automatic load balancing, distributed reindexing, and query failover and recovery. Like Elasticsearch, Solr can grow and shrink clusters easily, create indices dynamically, shard them on the fly, and route documents and queries. If implemented correctly and managed well, it can become a highly reliable, scalable, fault-tolerant search engine.
A list of key features:
- Full-text search;
- Faceted search;
- Real-time indexing;
- Database integration;
- NoSQL functionality and rich document handling (e.g. Word and PDF files).
Solr vs Elasticsearch: Community and Open Source
To begin, both have really active communities, which matters when judging the longevity of each project. Solr is fully open source: anyone can help and contribute. With Elasticsearch, by contrast, employees of Elastic need to accept changes to the search code.
Is this good or bad? With Solr, it means that if you need a feature and contribute it to the community with sufficient quality, it can be accepted. With Elasticsearch, it depends on whether Elastic decides to accept the contribution. As a result, contributions to Elasticsearch go through more quality controls.
Essentially, if your project or site requires adding missing functionality to either Solr or Elasticsearch, you may have more luck with Solr.
Installation and Configuration
The main prerequisite for installing both of these engines is Java. The default Elasticsearch configuration requires 1GB of heap memory, which you can modify in the jvm.options file inside the config directory. Solr, by contrast, needs at least 512MB of heap memory; this setting is changed in either the solr script file or the solr.in.cmd file. Generally, I've found that if there is a lot of data, allowing for 2GB of heap memory is key.
Elasticsearch is easy to install and configure. The latest version of Elasticsearch (version 7.4.1) has a compressed size of 350MB, whereas Solr (version 8.11.2) ships at 134MB compressed. Note that I used Solr 8.11.2 because, at this point, it has been better suited to Drupal 9 than 9.0.0. In Drupal, it is best to integrate Solr using the Search API contrib module.
A segment is a piece of a Lucene index made up of several files, mostly immutable, that contain the data. When indexing data, Lucene generates segments and can also merge several smaller existing segments into larger ones in a process called segment merging.
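Segment merging can be illustrated with a toy model. The sketch below treats a segment as a small inverted index (term to document IDs) that is never edited after it is built, and merges several of them into one larger segment. This is a deliberate simplification under assumed names (`Segment`, `build_segment`, `merge_segments`); real Lucene segments are on-disk file sets with far more structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Toy segment: maps each term to a sorted tuple of document IDs."""
    postings: dict


def build_segment(docs: dict[int, str]) -> Segment:
    """Index a batch of documents into a new segment."""
    postings: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings.setdefault(term, set()).add(doc_id)
    return Segment({term: tuple(sorted(ids)) for term, ids in postings.items()})


def merge_segments(segments: list[Segment]) -> Segment:
    """Combine several segments into one new segment; the inputs are left untouched."""
    merged: dict[str, set[int]] = {}
    for segment in segments:
        for term, ids in segment.postings.items():
            merged.setdefault(term, set()).update(ids)
    return Segment({term: tuple(sorted(ids)) for term, ids in merged.items()})


if __name__ == "__main__":
    first = build_segment({1: "Drupal search", 2: "Solr search"})
    second = build_segment({3: "Elasticsearch search"})
    print(merge_segments([first, second]).postings)
```

Merging produces a new segment rather than changing the old ones, which mirrors the "mostly immutable" property described above.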
Solr uses global caches, meaning there is a single cache instance of a given type per shard, covering all of its segments. As a consequence, if a single segment is modified, the entire cache needs to be invalidated and refreshed.
In Elasticsearch, by contrast, caches are per segment, so only a small portion of the cached data needs to be invalidated and refreshed when a segment changes. Per-segment caching is therefore better for dynamically changing data.
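The difference between the two invalidation strategies can be sketched in a few lines of Python. The classes below are hypothetical toy models, not the caches of either engine: one clears everything when any segment changes, the other drops only the entries belonging to the changed segment.

```python
class GlobalCache:
    """One cache for the whole shard: any segment change clears every entry."""

    def __init__(self):
        self.entries = {}  # (segment_id, key) -> cached value

    def put(self, segment_id, key, value):
        self.entries[(segment_id, key)] = value

    def on_segment_changed(self, segment_id):
        self.entries.clear()  # everything must be rebuilt

    def size(self):
        return len(self.entries)


class PerSegmentCache:
    """One small cache per segment: a change only drops that segment's entries."""

    def __init__(self):
        self.entries = {}  # segment_id -> {key: cached value}

    def put(self, segment_id, key, value):
        self.entries.setdefault(segment_id, {})[key] = value

    def on_segment_changed(self, segment_id):
        self.entries.pop(segment_id, None)  # other segments stay warm

    def size(self):
        return sum(len(cached) for cached in self.entries.values())


if __name__ == "__main__":
    for cache in (GlobalCache(), PerSegmentCache()):
        for segment_id in ("s1", "s2", "s3"):
            cache.put(segment_id, "type:article", [1, 2, 3])
        cache.on_segment_changed("s2")
        print(type(cache).__name__, "entries left:", cache.size())
```

With three cached segments and one change, the global model is left with no entries while the per-segment model keeps two, which is why the latter tends to suit frequently changing data.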
Indexing and Searching
Solr and Elasticsearch differ in sharding and replication, as well as in their files and architectures. Additionally, Elasticsearch has a native query DSL, while Solr has a robust Standard Query Parser that aligns with Lucene syntax.
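To show how the two query styles differ in practice, the sketch below expresses roughly the same search ("title contains a word, restricted to one content type") both as an Elasticsearch Query DSL body and as Solr request parameters in Lucene-style syntax. The field names `title` and `type` are hypothetical, and the helper functions only build the request payloads; they do not contact a server.

```python
import json
from urllib.parse import urlencode


def elasticsearch_query(word: str, content_type: str) -> str:
    """Build a JSON Query DSL body: full-text match plus an exact-value filter."""
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"title": word}}],
                "filter": [{"term": {"type": content_type}}],
            }
        }
    }
    return json.dumps(body)


def solr_query(word: str, content_type: str) -> str:
    """Build a Solr query string: `q` for the search, `fq` for the filter.

    Assumes single-word inputs; phrases and special characters would need
    quoting or escaping before being placed in Lucene syntax.
    """
    params = {"q": f"title:{word}", "fq": f"type:{content_type}", "rows": 10}
    return urlencode(params)


if __name__ == "__main__":
    print(elasticsearch_query("drupal", "article"))
    print(solr_query("drupal", "article"))
```

The Elasticsearch request is a nested JSON document, while the Solr request is a flat set of parameters whose values carry the query syntax.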
Solr uses request handlers to ingest data from a broad array of sources, including XML files, CSV files, databases, Microsoft Word documents, and PDFs. Solr is capable of supporting extraction and indexing from over one thousand file types. Solr also ships with a simple command-line post tool.
Elasticsearch, on the other hand, is JSON-based. It supports data ingestion from multiple sources using the Beats family (lightweight data shippers available in the ELK Stack) and Logstash.
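As a small illustration of what JSON-based ingestion looks like, the sketch below builds a newline-delimited JSON body of the kind Elasticsearch's bulk indexing endpoint accepts: one action line followed by one document line per item. The index name `content` and the sample documents are hypothetical, and the function only produces the payload as a string rather than sending it anywhere.

```python
import json


def build_bulk_body(index_name: str, docs: dict[str, dict]) -> str:
    """Build a newline-delimited JSON payload: an action line, then the document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The payload is terminated with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = {
        "1": {"title": "Upgrading to Drupal 9", "type": "article"},
        "2": {"title": "Contact us", "type": "page"},
    }
    print(build_bulk_body("content", docs), end="")
```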
Solr is more focused on enterprise-oriented text search with advanced information retrieval (IR). It is therefore better suited to search applications that use massive amounts of static data. Additionally, Solr stands out in handling rich text documents.
Elasticsearch is focused more on scaling, data analytics, and processing time series data to obtain meaningful insights and patterns. Elasticsearch is better applied to applications where data is transferred in and out in JSON format.
Both support near real-time searches and take advantage of all of Lucene’s search capabilities. They both have additional search-related feature sets.
You can write very complex search queries in Solr that at this stage can't be achieved in Elasticsearch. Solr offers powerful features such as searching, faceting, highlighting, autocomplete and Geo Search.
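Faceting and highlighting are switched on in Solr through request parameters. The sketch below assembles such a parameter string; `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr parameters, while the field names `type` and `body` and the helper function itself are hypothetical. It builds the query string only and does not send a request.

```python
from urllib.parse import urlencode


def solr_facet_highlight_params(query: str, facet_field: str, highlight_field: str) -> str:
    """Build Solr parameters for a search with one facet and highlighted snippets."""
    params = [
        ("q", query),
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # count results per value of this field
        ("hl", "true"),                # enable highlighting
        ("hl.fl", highlight_field),    # field to take highlighted snippets from
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(solr_facet_highlight_params("drupal", "type", "body"))
```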