Shape-based Methods
Shape-based approaches are a class of similarity measures in which time-series data sets are transformed into a new representation, such as the Fourier transform or Symbolic Aggregate approXimation (SAX), and then compared based on their shape. These approaches are good at capturing shape-based similarities and are commonly used in pattern recognition, clustering, and anomaly detection. Nevertheless, the success of shape-based approaches depends on the transformation used and the amount of noise and outliers in the data.
The popular shape-based time-series analysis method Symbolic Aggregate approXimation (SAX) can be described mathematically as follows:
Given a time series ‘ts’ of length ‘n’ and a number of segments ‘m’, the SAX method produces a symbolic representation by splitting ‘ts’ into ‘m’ segments of equal length ‘w = n/m’ and converting each segment into a symbol based on its mean value. This may be stated as follows:
1. Using a lookup table for the standard normal distribution, calculate the breakpoints for the required alphabet size (generally denoted by ‘a’). The breakpoints are the values that divide the standard normal distribution into a set of equiprobable regions, and they are given by the ‘a-1’ quantiles ‘q1, q2, …, q(a-1)’.
2. Compute the mean values of each ‘w-length’ segment in ‘ts’, designated by ‘v1, v2,…, vm’.
3. Based on its position relative to the breakpoints, convert each segment mean value ‘v_i’ to a symbol. Let ‘alpha’ be an ‘m-length’ string made up of the symbols corresponding to each segment mean value, such that:
alpha[i] = j if q(j-1) <= v_i < q(j) (for 1 <= i <= m)
where ‘j’ is an integer between ‘1’ and ‘a’, with the conventions q(0) = -infinity and q(a) = +infinity.
4. The string ‘alpha’ is the resultant symbolic representation of ‘ts’.
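The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions: the function name ‘sax_symbols’ is our own, SciPy’s ‘norm.ppf’ is used to build the Gaussian breakpoint table, and the series is z-normalized before segmentation (as SAX assumes).

```python
import numpy as np
from scipy.stats import norm


def sax_symbols(ts, m, a):
    # Step 1: breakpoints are the (a-1) quantiles of the standard
    # normal distribution, giving 'a' equiprobable regions.
    breakpoints = norm.ppf(np.arange(1, a) / a)

    # z-normalize the series to mean 0 and standard deviation 1.
    ts = (ts - ts.mean()) / ts.std()

    # Step 2: mean value of each of the m equal-length segments (w = n/m).
    w = len(ts) // m
    means = ts[: w * m].reshape(m, w).mean(axis=1)

    # Step 3: map each segment mean to a symbol by its position
    # among the breakpoints.
    idx = np.searchsorted(breakpoints, means)

    # Step 4: concatenate the symbols into the string alpha.
    return "".join(chr(97 + i) for i in idx)
```

For example, an increasing series splits into segments whose means climb through the breakpoint regions, so ‘sax_symbols(np.arange(12.0), 4, 4)’ yields the string "abcd".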
It is worth noting that the SAX distance between two time-series data sets is determined by comparing their SAX representations symbol by symbol and summing the squared gaps between the breakpoints of non-adjacent symbols; identical or adjacent symbols contribute zero distance.
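This distance (commonly called MINDIST in the SAX literature) can be sketched as follows. The helper name ‘mindist’ is ours; the zero-for-adjacent-symbols rule and the sqrt(n/m) scaling follow the standard SAX distance definition.

```python
import numpy as np
from scipy.stats import norm


def mindist(s1, s2, n, a):
    # Breakpoint lookup table for an a-letter alphabet.
    beta = norm.ppf(np.arange(1, a) / a)

    def cell(r, c):
        # Identical or adjacent symbols contribute zero distance;
        # otherwise, the gap between the enclosing breakpoints.
        if abs(r - c) <= 1:
            return 0.0
        return beta[max(r, c) - 1] - beta[min(r, c)]

    # Compare the two SAX strings symbol by symbol ('a' -> 0, 'b' -> 1, ...).
    m = len(s1)
    squared = [cell(ord(x) - 97, ord(y) - 97) ** 2 for x, y in zip(s1, s2)]
    return np.sqrt(n / m) * np.sqrt(sum(squared))
```

Note that ‘mindist("ab", "ba", 8, 4)’ is 0, because ‘a’ and ‘b’ are adjacent symbols, while ‘mindist("a", "c", 4, 4)’ is positive.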
The advantages of Shape-based methods are as follows:
- They transform time-series data sets into a new representation, such as the Fourier transform or Symbolic Aggregate approXimation (SAX), and then compare them based on their shape;
- They can handle time series with diverse shapes and magnitudes.
The limitations of Shape-based methods are as follows:
- The choice of transformation may influence the similarity measure;
- The effectiveness of the method may depend on the specific application.
In Python, here’s an example of computing the Symbolic Aggregate approXimation (SAX) representation of time-series data:
Python3

import numpy as np
from sklearn.preprocessing import StandardScaler


def to_sax(time_series, window_size, alphabet_size):
    # Normalize the time series to have a mean of 0
    # and a standard deviation of 1.
    scaler = StandardScaler()
    normalized_ts = scaler.fit_transform(
        time_series.reshape(-1, 1)).flatten()

    # Define the breakpoints for the bins (an evenly spaced
    # approximation rather than the Gaussian quantile table).
    breakpoints = np.linspace(-np.sqrt(2), np.sqrt(2), alphabet_size - 1)

    # Slide a window over the normalized series to form the segments.
    segments = np.array([normalized_ts[i:i + window_size]
                         for i in range(0, len(normalized_ts) - window_size + 1)])

    # Convert each segment to a single letter or symbol
    # based on its mean value.
    symbols = []
    for segment in segments:
        bin_id = np.searchsorted(breakpoints, np.mean(segment))
        symbols.append(chr(97 + bin_id))
    return ''.join(symbols)


time_series = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1])
window_size = 3
alphabet_size = 4

sax_representation = to_sax(time_series, window_size, alphabet_size)
print(sax_representation)
Output:
The output for the above code: bcccccb
Similarity Search for Time-Series Data
Time-series analysis is a statistical approach for analyzing data that is ordered in time. It entails examining past data to detect patterns, trends, and anomalies, then applying this knowledge to forecast future trends. Time-series analysis has many uses, including in finance, economics, engineering, and healthcare.
Time-series datasets are collections of data points that are recorded over time, such as stock prices, weather patterns, or sensor readings. In many real-world applications, it is often necessary to compare multiple time-series datasets to find similarities or differences between them.
Similarity search, which involves determining the degree of similarity between two or more time-series data sets, is a fundamental task in time-series analysis. It is an essential step in a variety of applications, including anomaly detection, clustering, and forecasting. In anomaly detection, for example, we may wish to find data points that deviate considerably from the expected trend. In clustering, we may wish to group time-series data sets that share similar patterns, while in forecasting, we might want to find the most comparable past data to reliably anticipate future trends.
In time-series analysis, there are numerous approaches for searching for similarities, including the Euclidean distance, dynamic time warping (DTW), and shape-based methods like the Fourier transform and Symbolic Aggregate ApproXimation (SAX). The approach chosen is determined by the individual purpose, the scope and complexity of the data collection, and the amount of noise and outliers in the data.
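To illustrate how warping changes the comparison relative to a point-by-point Euclidean distance, here is a basic dynamic time warping (DTW) distance. This is a textbook dynamic-programming sketch, not an optimized library implementation, and the function name ‘dtw_distance’ is our own.

```python
import numpy as np


def dtw_distance(x, y):
    # D[i, j] holds the minimal cost of aligning x[:i] with y[:j].
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # A point may match, repeat, or skip: take the cheapest
            # of the three predecessor cells.
            D[i, j] = cost + min(D[i - 1, j],     # x[i-1] repeats
                                 D[i, j - 1],     # y[j-1] repeats
                                 D[i - 1, j - 1]) # one-to-one match
    return D[n, m]
```

For instance, the series [0, 0, 1, 2, 1, 0] and [0, 1, 2, 1, 0, 0] are the same bump shifted by one step: their point-by-point absolute differences sum to 4, but ‘dtw_distance’ returns 0 because the warping path realigns the peaks.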
Although time-series analysis and similarity search are powerful tools, they are not without drawbacks. Handling missing data, dealing with large and complex data sets, and selecting appropriate similarity metrics can all be challenging. Still, these obstacles can be addressed with thorough data preparation and the selection of suitable procedures.
Types of similarity measures
Time-series analysis is the process of reviewing past data to detect patterns, trends, and anomalies, and then utilizing this knowledge to forecast future trends. Similarity search, which involves determining the degree of similarity between two or more time-series data sets, is an essential problem in time-series analysis.
Similarity metrics, which quantify the degree of similarity or dissimilarity between two time-series data sets, are critical in this endeavor. This article will go through the several types of similarity metrics that are often employed in time-series analysis.