Launched in 2010, Elasticsearch was one of the first distributed search engines built for fast querying of data for analytics and big data applications. At the time, enterprise businesses were accumulating massive amounts of data, but traditional database engines weren't able to keep up. Elasticsearch was introduced as a distributed, document-oriented search and analytics engine capable of storing structured and unstructured data. It was a big leap forward in big data indexing and query speed at a time when enterprise analytics datasets exceeded terabytes and caused performance issues.
What Is Elasticsearch?
Elasticsearch is a datastore that pulls data together and makes it searchable via an API. It's built on Apache Lucene, an indexing and search library. Elasticsearch splits each index into shards and distributes those shards across nodes; each shard holds its own slice of the data. Elasticsearch pools all shards together and provides an API for developers to query data. Through the API, administrators can assign permissions to specific users to further secure data and give access to specific data only to authorized users.
Developers aren't limited to either structured or unstructured data. Elasticsearch lets users query both, and it queries its distributed shards as if the storage were one large database. The way Elasticsearch handles data makes it much faster than a standard database engine for search-heavy workloads, so it's best suited for analytics applications, search across large datasets, or network traffic analysis.
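As a simple illustration, the sketch below indexes a document and runs a full-text search over Elasticsearch's REST API using Python's requests library. It assumes a local, unsecured cluster at http://localhost:9200 and a hypothetical products index; adjust the URL and add authentication for your own deployment.

```python
# Minimal sketch: index a document, then run a full-text search over the REST API.
# Assumes a local, unsecured cluster at http://localhost:9200 and a hypothetical
# "products" index; adjust the URL and add authentication for real deployments.
import requests

BASE = "http://localhost:9200"

# Index a document mixing structured fields (price) and unstructured text.
doc = {
    "name": "flash array",
    "description": "High-speed storage for analytics workloads",
    "price": 1999,
}
requests.post(f"{BASE}/products/_doc", json=doc, params={"refresh": "true"})

# Full-text search; Elasticsearch queries its distributed shards as one datastore.
query = {"query": {"match": {"description": "analytics storage"}}}
resp = requests.post(f"{BASE}/products/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])
```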
Core Components of Elasticsearch Architecture
The first core component of Elasticsearch is the node. A node is a server or device where data is stored. Clusters are made up of a collection of nodes. Nodes and clusters can be distributed across data centers for redundancy, and distributing data across nodes is also what improves query performance. Elasticsearch is a document-oriented datastore: data is stored as documents, which can hold unstructured text as well as structured fields.
Data is sharded across nodes. A shard is a portion of an index; sharding segments large datastores into smaller pieces that are easier to distribute and query simultaneously, with the results pulled together into a single result set for the frontend application. Logstash is commonly used as the data pipeline, taking data in its raw form and transforming it into a usable form before it's indexed.
Elasticsearch also has an API, which is the gateway to the data. Developers authenticate to the API, for example with an API key, and send their queries to it. The API controls access to the data and the ways developers can query it. Having an API also obscures and secures the backend architecture, making Elasticsearch accessible to developers unfamiliar with how Apache Lucene and the other components function.
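Below is a minimal sketch of what an authenticated API call might look like, assuming security is enabled and an API key has already been created for the client. The URL and key value are placeholders.

```python
# Sketch of an authenticated request, assuming security is enabled and an API key
# was already created (for example, via the POST /_security/api_key endpoint).
# The URL and key value below are placeholders, not real credentials.
import requests

ES_URL = "https://es.example.com:9200"   # placeholder endpoint
API_KEY = "<base64-encoded id:api_key>"  # placeholder; never hard-code real keys

headers = {"Authorization": f"ApiKey {API_KEY}"}

# Only authorized callers reach the data; the cluster internals stay hidden
# behind the REST API.
resp = requests.get(f"{ES_URL}/_cluster/health", headers=headers)
print(resp.json()["status"])  # "green", "yellow", or "red"
```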
Nodes and Clusters
A cluster is a group of nodes, and each node has a specific role in the cluster. The master node, generally speaking, controls the cluster: it can create or delete indexes and tracks the other nodes that participate in the cluster. Every cluster has a master node.
A data node stores the data. Any manipulation of or changes to data are the responsibility of data nodes, and as you accumulate data, it's the data nodes that grow. Search execution also happens on the data nodes.
Think of coordinating nodes as gateways that route traffic to the right node. A coordinating node forwards requests to the master node or to data nodes, depending on their destination. For example, when a search is sent to the cluster, a coordinating node manages that request.
Elasticsearch has a pipeline for transforming and moving data. The ingest node is responsible for pre-processing documents and transforming them before indexing. Elastic recommends running dedicated ingest nodes, separate from the master and data nodes, in environments with heavy data ingestion.
Remote-eligible nodes send requests to other clusters in an Elasticsearch deployment. Search queries can find data across clusters (cross-cluster search) through a remote-eligible node, and cross-cluster replication also relies on remote-eligible nodes.
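To see how these roles are laid out in a running cluster, you can ask the _cat/nodes API to list each node and its roles. The sketch below assumes the same local cluster as the earlier examples.

```python
# Sketch: list each node and its roles with the _cat/nodes API.
# Assumes the same local cluster as the earlier examples; add an Authorization
# header if security is enabled.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,node.role,master"},
)
# Columns: node name, abbreviated roles (for example, "dim" for data + ingest +
# master-eligible), and "*" marking the elected master node.
print(resp.text)
```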
Shards and Replicas
Fast searches and queries require indexes. An index is a datastore's way of organizing data so that searches run faster. In Elasticsearch, each index is made up of one or more shards, which Elasticsearch distributes across the nodes in the cluster for faster processing. Each shard holds one copy of its portion of the data, and Elasticsearch can search multiple shards simultaneously.
Redundancy is necessary for failover and fault tolerance, so replicas hold copies of shards. Replicas are stored on different nodes from their primaries so that data isn't lost when one node fails. If a node fails, Elasticsearch can still access the data from a replica on a different node.
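The sketch below shows how shard and replica counts are typically set when an index is created, using a hypothetical logs index with three primary shards and one replica of each.

```python
# Sketch: create a hypothetical "logs" index with three primary shards and one
# replica of each, so a copy of every shard can live on a different node.
# Assumes the same local cluster as the earlier examples.
import requests

settings = {
    "settings": {
        "number_of_shards": 3,    # primary shards the index is split into
        "number_of_replicas": 1,  # replica copies kept per primary shard
    }
}
resp = requests.put("http://localhost:9200/logs", json=settings)
print(resp.json())  # {"acknowledged": true, ...} on success
```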
Data Flow in Elasticsearch
Elasticsearch provides an API for developers to send their queries. The API shields developers from the complexity of the backend, which, as discussed, is made up of several components, including shards, nodes, indexes, and replicas. Instead of forcing developers to manage those components directly, data flow begins with a query to the API.
The API sends the query first to a coordinating node, which routes it to the appropriate nodes where the relevant shards are located. A query can also be sent to multiple shards, each holding its own portion of the data. The coordinating node performs this routing and determines the right shards for the query.
In the query phase, each shard identifies its matching documents and sends their IDs and sort values back to the coordinating node. When several shards respond, the coordinating node merges and sorts those results into a single ordered list. In the fetch phase, the coordinating node then retrieves the actual documents from the shards that hold them. Once the data is retrieved, it's returned to the application via the API.
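You can see this multi-shard coordination reflected in a search response, which reports how many shards were queried to build the result set. The sketch below assumes the hypothetical logs index from the earlier example already holds data.

```python
# Sketch: run a search and inspect how many shards the coordinating node queried
# to assemble the result set. Assumes the hypothetical "logs" index from the
# earlier sketch already contains documents.
import requests

query = {"query": {"match_all": {}}, "size": 5}
resp = requests.post("http://localhost:9200/logs/_search", json=query).json()

# The coordinating node merges and sorts results from every shard it queried.
print(resp["_shards"])        # e.g., {"total": 3, "successful": 3, ...}
print(resp["hits"]["total"])  # count of matching documents
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"])
```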
Best Practices for Optimizing Elasticsearch Architecture
Elasticsearch is far more complex than the average database engine, so optimizing it takes a different approach. First, make sure the backend has enough resources. If queries are too slow, consider increasing CPU, memory, or storage capacity on the nodes.
Indexes are necessary for searching data, but some indexes are used far less often than others. Elasticsearch lets you freeze rarely used indexes (or, in newer versions, move them to a frozen data tier), making them read-only and releasing the memory they would otherwise consume. Active queries then compete with fewer hot shards for resources, improving performance.
Thread pools in Elasticsearch control how many search and indexing requests each node can run and queue at once, and they should be sized to match your query workload. Thread pool settings also tie into node resources: both the nodes and their thread pools need enough capacity, or you could see rejected requests and degraded performance during high-volume queries.
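One way to spot thread pool pressure is the _cat/thread_pool API, which reports active threads, queued requests, and rejections per node. The sketch below checks the search and write pools on the same local cluster assumed earlier.

```python
# Sketch: check for thread pool pressure with the _cat/thread_pool API.
# A growing "queue" or a nonzero "rejected" count suggests the pool, or the node
# behind it, needs more capacity. Assumes the same local cluster as above.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search,write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)
```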
Conclusion
Elasticsearch is much more complex than a standard database, so having the right architecture and computing resources is necessary for optimal performance. Use best practices when you configure Elasticsearch, but it’s also important to have the right computing resources to support the queries and data storage necessary for the backend.
One way to ensure that you have enough storage resources is to leverage Pure Storage® FlashBlade®. FlashBlade supports exponential growth for small to large enterprise businesses, so your applications can scale as more users store data and as you support any number of customers.