Elasticsearch Architecture

1. Distributed Nature

Elasticsearch is inherently distributed, meaning it can run on a cluster of interconnected nodes to distribute data and workload across multiple machines. This distributed architecture allows Elasticsearch to scale horizontally, enabling it to handle large amounts of data and support high query loads.

Cluster

A cluster in Elasticsearch consists of one or more nodes working together to provide the search and indexing functionality.
Each node is an instance of Elasticsearch running on a server, and multiple nodes form a cluster.
Nodes communicate with each other to share data, coordinate operations and ensure fault tolerance.

Node

A node is a single instance of Elasticsearch running on a machine within a cluster.
By default, each node stores part of the data and participates in the cluster’s indexing and search operations.
Nodes can also be assigned specific roles, such as master-eligible, data, and coordinating-only, in which case only the data nodes store shards.

2. Indexing and Data Model

Elasticsearch organizes and stores data in the form of documents within indices. Documents are JSON objects that contain data and metadata associated with the data.

Index

An index is a grouping of documents that share common characteristics.
An index is loosely analogous to a database in a relational system (or, now that mapping types have been removed, to a single table).
Each document within an index has a unique identifier (_id) and is stored in a structured format using JSON.

Document

A document is a basic unit of information in Elasticsearch.
Documents are represented as JSON objects and contain data fields and their corresponding values.
By default, Elasticsearch indexes each field within a document, allowing for efficient searching and retrieval.
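
To give a feel for why indexing every field makes lookups fast, here is a minimal sketch of an inverted index in Python. It is a toy model, not how Elasticsearch or Lucene is implemented: the analyze function is a rough stand-in for a real analyzer, every value (including numbers) is treated as text, and the second document is a made-up example.

```python
import re
from collections import defaultdict


def analyze(value):
    """Lowercase the value and split it on non-alphanumeric characters.

    This is a simplified stand-in for an analyzer, not the standard analyzer.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", str(value).lower()) if tok]


def build_inverted_index(docs):
    """Map (field, term) -> set of ids of the documents containing that term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for term in analyze(value):
                index[(field, term)].add(doc_id)
    return index


docs = {
    "1": {"name": "John Doe", "age": 30, "email": "john.doe@example.com"},
    "2": {"name": "Jane Doe", "age": 25, "email": "jane.doe@example.com"},  # hypothetical
}
index = build_inverted_index(docs)

print(sorted(index[("name", "doe")]))   # ['1', '2']
print(sorted(index[("name", "john")]))  # ['1']
```

Looking up a term is a single dictionary access instead of a scan over every document, which is the basic idea behind fast full-text search.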

Example:

Consider an example of indexing a document in Elasticsearch:

POST /my_index/_doc/1
{
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com"
}

In this example, we’re indexing a document with three fields (name, age, email) into the my_index index.
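
The same request can be issued from any HTTP client. The sketch below builds it with Python's standard library, assuming a local, unsecured node at http://localhost:9200 (an assumption, not something stated above). The helper name build_index_request is hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, doc):
    """Build (but do not send) the HTTP request that indexes `doc` under `doc_id`."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request(
    "http://localhost:9200",  # assumed address of a local test node
    "my_index",
    "1",
    {"name": "John Doe", "age": 30, "email": "john.doe@example.com"},
)
print(request.get_method(), request.full_url)

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```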

3. Sharding and Replication

Elasticsearch uses sharding and replication to distribute data across nodes and ensure high availability and fault tolerance.

Shards

A shard is a subset of an index that contains a portion of the index’s data.
Each shard lives on exactly one node at a time, but a single node can hold many shards, including shards from different indices.
Sharding enables Elasticsearch to horizontally partition data and distribute it across multiple nodes for scalability and parallel processing of queries.
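
As a rough illustration of how a document ends up on a particular shard, the sketch below hashes a routing value (by default the document's _id) and takes it modulo the number of primary shards. This is a simplified form of the routing rule. Elasticsearch uses a Murmur3 hash internally, so zlib.crc32 is only a stand-in here and the shard numbers printed will not match a real cluster. The function name pick_shard is hypothetical.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Return the shard number for a routing value.

    Simplified rule: hash(routing) % number_of_primary_shards.
    crc32 is a stand-in for the hash Elasticsearch actually uses.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


for doc_id in ["1", "2", "3", "4"]:
    print(f"document {doc_id} -> shard {pick_shard(doc_id, 5)}")
```

Because the result depends on the number of primary shards, that number is fixed when the index is created. Changing it later requires reindexing or the split and shrink APIs.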

Replicas

Replicas are copies of index shards that provide redundancy and high availability.
Replicas are used to improve search performance and handle node failures gracefully.
Elasticsearch automatically allocates replicas across nodes and never places a replica on the same node as its primary, so a single node failure cannot take out every copy of a shard.
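
The placement rule can be sketched with a toy round-robin allocator. This is not Elasticsearch's allocation algorithm, which also weighs disk usage, balance, and awareness settings. It only demonstrates that copies of the same shard land on different nodes. The node names and the place_shards function are hypothetical.

```python
def place_shards(nodes, num_primaries, num_replicas):
    """Assign every shard copy to a node so no node holds two copies of one shard."""
    if num_replicas + 1 > len(nodes):
        # A real cluster would not fail here; it would leave the extra
        # replicas unassigned. The toy model simply refuses.
        raise ValueError("not enough nodes to hold every copy on a distinct node")

    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            # Offsetting by the copy number puts each copy on a different node.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement


placement = place_shards(["node-a", "node-b", "node-c"], num_primaries=5, num_replicas=1)
for node, copies in placement.items():
    print(node, copies)
```

With 5 primaries and 1 replica each, the three nodes share 10 shard copies, and losing any single node still leaves at least one copy of every shard.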

Example:

When creating an index, we can specify the number of primary shards and replica shards:

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

In this example, we’re creating an index named my_index with 5 primary shards and 1 replica of each, for 10 shards in total.

4. Querying and Search

Elasticsearch provides a powerful query DSL (Domain-Specific Language) for searching and retrieving data from indices.

Query DSL

The Elasticsearch Query DSL allows us to construct complex queries as JSON.
Search requests can combine full-text search, filtering, sorting, aggregations, and more.
The node that receives a request acts as coordinator: it forwards the query to the relevant shards, each shard runs it locally, and the coordinator merges the per-shard results into a single response.
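
The fan-out and merge step can be sketched as follows. This is a deliberately small model: scoring is plain term frequency rather than Elasticsearch's relevance scoring, shards are queried one after another instead of in parallel, and the sample documents and function names are hypothetical.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Score the documents on one shard and return that shard's top hits."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term.lower())  # toy term-frequency score
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def coordinate_search(shards, term, size=10):
    """Send the query to every shard, then merge the local results into a global top list."""
    per_shard = [search_shard(docs, term, size) for docs in shards]
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return [doc_id for _score, doc_id in merged]


shards = [
    {"1": "John Doe", "4": "Jane Roe"},
    {"2": "John John Smith"},
    {"3": "Johnny Appleseed"},
]
print(coordinate_search(shards, "john"))  # ['2', '1']
```

Each shard only needs to return its own best hits, so the coordinator merges a handful of small lists rather than the full data set.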

Example:

Performing a simple match query to search for documents containing a specific term:

GET /my_index/_search
{
  "query": {
    "match": {
      "name": "John"
    }
  }
}

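This query matches documents in the my_index index whose name field contains the term “John” after analysis, so a name such as “John Doe” matches. By default, only the 10 highest-scoring hits are returned.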