Finding Trend-Following Assets with Dynamic Time Warping (DTW) and Machine Learning

Matin Karbasioun
28 min read · Sep 12, 2024


The fluctuating nature of stock prices generates a wealth of data that, despite its richness, often proves challenging to interpret. Yet, discernible patterns persistently emerge amidst this complexity, offering valuable insights into future price movements. One effective method for identifying such patterns is through a technique known as Dynamic Time Warping (DTW).

Dynamic Time Warping is a distance measure used to compare the similarity between two different time series, even if they differ in length. This characteristic makes it ideal for detecting patterns in asset price data, where prices can fluctuate and move at varying speeds. By identifying similar patterns or those that have occurred in the past, we can gain insights into future price movements and transform historical data into a valuable predictive tool.

In this article, we explore the use of Dynamic Time Warping (DTW) for the recognition of co-trending stocks within stock price data, focusing on those with the highest similarity to one another. By finding similar time series, we also aim to identify suitable alternatives for stocks in our portfolio to enhance liquidity. This becomes crucial as the number of users and copy trades on the platform increases. We need alternative stocks to handle situations where liquidity constraints or trading queues arise due to increased demand or supply, ensuring that other copy traders can benefit from similar trends.

A key consideration in applying DTW is determining a threshold to identify whether two assets are co-trending. Since establishing such a criterion requires identifying and analyzing a significant number of similar real-world examples, DTW alone is not sufficient. Therefore, this method should be complemented by an additional tool to simulate and identify co-trending assets effectively.

Our approach involves generating synthetic data for both co-trending and non-co-trending assets and leveraging a machine-learning model. We employ DTW combined with Principal Component Analysis (PCA) to obtain reduced and effective features. This combined method extracts significant features from time series data by integrating DTW for distance computation and PCA for dimensionality reduction. This approach not only identifies patterns related to price but also incorporates technical indicators to provide a comprehensive solution for data analysis.

The goal of this method is to identify patterns within a specific time window of price data for two assets and compare them with numerous windows in other assets to find similar alternatives. These windows cover a certain number of days or hours, with a focus on common time windows between both assets.
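
As a minimal sketch of this windowing idea, the helper below slides a fixed-length window (seven days of hourly candles by default, an assumption) over the shared index of two closing-price series and computes a DTW distance per window; the fastdtw call mirrors the DTWCalculator shown later in the article.

import pandas as pd
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean


def window_distances(target: pd.Series, candidate: pd.Series, window: int = 7 * 24) -> pd.Series:
    """Slide a fixed-length window over both series (aligned on their shared index)
    and compute the DTW distance of each window pair."""
    common = target.index.intersection(candidate.index)
    target, candidate = target.loc[common], candidate.loc[common]

    distances = {}
    for start in range(0, len(common) - window + 1, window):
        a = target.iloc[start:start + window].to_numpy().reshape(-1, 1)
        b = candidate.iloc[start:start + window].to_numpy().reshape(-1, 1)
        distance, _ = fastdtw(a, b, dist=euclidean)
        distances[common[start]] = distance
    return pd.Series(distances, name='dtw_distance')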

This study involves selecting a target stock and comparing it with synthetic data generated based on the target stock using DTW. Two types of comparisons are conducted: one between the closing price trends of the generated samples and the target stock, and another between technical indicator data. We use a combined distance measure for the trends from price data and technical indicators to compare different samples and then train a machine learning model based on these distances.

To manage computational complexity and feature extraction for various stocks efficiently, we use PCA to reduce the dimensions of technical indicator data to a manageable set of uncorrelated variables, eliminating redundant and ineffective data. We then train our model based on the data obtained from this process for each stock or sample.

Implementing the Trend-Following Detector

The following sections provide a detailed explanation of the techniques, implementation methods, and result interpretations in stock data analysis using DTW, including data normalization, DTW distance computation, pattern identification, and data preprocessing.

The process of identifying trends begins with the data received from a database or from other services, such as an MDP that provides market data. Before initiating the process, it is essential to generate the supplementary data needed from the received information. For example, since the goal of this study is to identify similar profitable and loss-making trends between two different time series, the first step is to calculate the percentage changes in closing prices.

The reason for using closing prices to identify trends is that they reflect the trading volume and the concentration of trades executed around the close. Essentially, closing prices represent where trading activity is concentrated and therefore provide more accurate and insightful information about the stock’s trend.
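
As a minimal sketch, this is the same calculation that TechnicalFeatures.get_price_change performs later in the article: the percentage change of the closing price, which serves as the base series for trend comparison.

import pandas as pd


def close_price_changes(candles: pd.DataFrame) -> pd.Series:
    # Percentage change of the closing price; the first NaN row is dropped
    return candles['Close'].pct_change().dropna()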

The next step is data preprocessing. Data preprocessing is a crucial step in data analysis and machine learning algorithms. It involves various techniques essential for preparing data to maximize the accuracy and efficiency of subsequent models. Key aspects of data preprocessing include:

  • Error and Noise Removal: Raw data often contains errors, missing values, or noise, which can negatively impact model performance. Preprocessing helps in correcting or removing these issues to improve result accuracy and reliability.
  • Data Transformation: Converting categorical variables to numerical forms or applying logarithmic transformations can make data suitable for specific algorithms. This ensures that data is in a usable format for analysis.
import pandas as pd


class DateModifier:

    def modified(self, series_1: pd.DataFrame, series_2: pd.DataFrame):
        min_datetime_1, max_datetime_1 = series_1.index.min(), series_1.index.max()
        min_datetime_2, max_datetime_2 = series_2.index.min(), series_2.index.max()

        start_datetime = max(min_datetime_1, min_datetime_2)
        end_datetime = min(max_datetime_1, max_datetime_2)

        modified_series_1 = series_1[(start_datetime <= series_1.index) &
                                     (series_1.index <= end_datetime)]
        modified_series_2 = series_2[(start_datetime <= series_2.index) &
                                     (series_2.index <= end_datetime)]

        return self.__time_modification(modified_series_1, modified_series_2)

    @classmethod
    def __time_modification(cls, df_1: pd.DataFrame, df_2: pd.DataFrame):
        common_index = pd.date_range(start=min(df_1.index.min(), df_2.index.min()),
                                     end=max(df_1.index.max(), df_2.index.max()),
                                     freq='h')

        # Reindex and interpolate the series to align them on the common index
        series1_aligned = df_1.reindex(common_index).infer_objects(copy=False).interpolate(method='time')
        series2_aligned = df_2.reindex(common_index).infer_objects(copy=False).interpolate(method='time')
        series1_daily = series1_aligned.ffill().bfill()
        series2_daily = series2_aligned.ffill().bfill()
        return series1_daily, series2_daily
  • Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) help reduce the complexity of data by eliminating redundant features. This not only speeds up model training but also helps in avoiding overfitting.
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA


class PCAFeatureReducer:

    @classmethod
    def reduce_feature(cls, data, n_components=3):
        pca = PCA(n_components=n_components)
        reduced_features = pca.fit_transform(data)
        return reduced_features

    @classmethod
    def plot_pca_and_close(cls, data, reduced_features):
        plt.figure(figsize=(14, 10))

        # Plotting the Close Price
        plt.plot(data.index, data['Close'], label='Close Price', color='blue')

        # Plotting the PCA reduced features
        for i in range(reduced_features.shape[1]):
            plt.plot(data.index, reduced_features[:, i], label=f'PCA Component {i + 1}')

        plt.title('Close Price with PCA-Reduced Features')
        plt.xlabel('Date')
        plt.ylabel('Value')
        plt.legend()
        plt.show()
  • Standardization and Normalization: Ensuring that data is on a consistent scale is critical, especially when comparing time series data. Normalization helps to maintain uniformity in data scales, making comparisons more meaningful.
import pandas as pd


class DataNormalizer:

    @classmethod
    def normalize(cls, data_series: pd.Series) -> pd.Series:
        return (data_series - data_series.min()) / (data_series.max() - data_series.min())
  • Handling Missing Data: Tools such as scikit-learn’s SimpleImputer help fill in missing values using strategies such as the mean, median, or nearest neighbors. This step is vital for ensuring complete datasets for accurate analysis.
from sklearn.impute import SimpleImputer


class Impute:
    def __init__(self):
        self.__impute_strategy = 'mean'

    def fill_missing(self, data):
        # Impute missing values
        impute = SimpleImputer(strategy=self.__impute_strategy)
        imputed_ta_features = impute.fit_transform(data)
        return imputed_ta_features
  • Noise Reduction: Financial data often contain random noise that can skew analysis. Applying Gaussian filters smooths the data, removes unnecessary fluctuations, and improves the accuracy of time series comparisons using methods like Dynamic Time Warping (DTW).
import pandas as pd
from scipy.ndimage import gaussian_filter1d


class NoiseReducer:
    def __init__(self):
        self.__sigma = 2  # Standard deviation for Gaussian kernel

    def reduce(self, candle) -> pd.DataFrame:
        new_candle = candle.copy()  # work on a copy so the caller's DataFrame is not mutated
        new_candle['Open'] = gaussian_filter1d(candle['Open'], sigma=self.__sigma)
        new_candle['High'] = gaussian_filter1d(candle['High'], sigma=self.__sigma)
        new_candle['Low'] = gaussian_filter1d(candle['Low'], sigma=self.__sigma)
        new_candle['Close'] = gaussian_filter1d(candle['Close'], sigma=self.__sigma)
        return new_candle

In this study, preprocessing steps include identifying and removing inactive symbols, standardizing data, managing missing values, and applying Gaussian filters to reduce noise. These processes ensure that the data is clean, consistent, and ready for detailed analysis. We gather all of these preprocessing procedures in a class of the same name, PreProcessor.

class PreProcessor:
    def __init__(self, plot: bool = False, time_windows_day: int = 7):
        self.__windows_hours: int = time_windows_day * 24
        self.__date_modifier = DateModifier()
        self.__impute = Impute()
        self.__noise_reducer = NoiseReducer()
        self.__pca_feature_reducer = PCAFeatureReducer()
        self.__plot = plot
        self.__plotter = StockTrendPlotter(num_columns=3)

    def pre_process(self, df_1, df_2):
        time_modified_series_1, time_modified_series_2 = self.__date_modifier.modified(df_1, df_2)
        noise_reduced_series_1 = self.__noise_reducer.reduce(time_modified_series_1)
        noise_reduced_series_2 = self.__noise_reducer.reduce(time_modified_series_2)

        if self.__plot:
            self.__plot_preprocessing(df_1, time_modified_series_1, noise_reduced_series_1)
            self.__plot_preprocessing(df_2, time_modified_series_2, noise_reduced_series_2)

        return noise_reduced_series_1, noise_reduced_series_2

    @classmethod
    def normalize(cls, series):
        return DataNormalizer.normalize(series)

    def fill_na(self, data):
        return self.__impute.fill_missing(data)

    def reduced_features(self, data_array, pca_component=3):
        return self.__pca_feature_reducer.reduce_feature(data_array, pca_component)

    def __plot_preprocessing(self, series, time_modified_series, noise_reduced_series):
        series = {'Original Series': series,
                  'time_modified_series': time_modified_series,
                  'noise_reduced_series': noise_reduced_series}
        self.__plotter.plot(candle_series=series,
                            column_name='Close',
                            x_label='Time',
                            y_label='Price',
                            title='PreProcessing')

Feature Extraction

The next important step is feature extraction from stock data and technical indicators. Dynamic Time Warping (DTW) is widely used for identifying similar patterns in stock price data with minimal error. It helps in recognizing repetitive price patterns and predicting time series trends. However, for detecting stocks with similar trends to enhance liquidity in our asset portfolio, price data alone might not provide a comprehensive view of current market conditions. This is because price data, influenced by human behavior and market emotions, can exhibit noisy variations.

In this project, the pattern detection framework can be improved by incorporating financial indicators to filter out noise and better represent time series dynamics. While the core of the pattern detection method remains DTW, it now includes various financial indicators.

The method assumes that distinct time series (or stocks) can be considered substitutes if their behavior aligns in terms of both financial indicators and price trends. Technical indicators widely used in financial market analysis will be included. These indicators, such as moving averages, Bollinger Bands, and Ichimoku Cloud, help in forecasting future price movements based on historical data. Here are some key indicators:

  • Ichimoku Cloud:
  • Conversion Line: Shows short-term trends and helps identify key price turning points.
  • Senkou Span A & B: These intersecting lines help detect market trends and determine support and resistance zones.
  • Kijun-sen: Represents the average price range over a period and aids in trend detection and trade entry/exit points.
  • Moving Averages:
  • Simple Moving Average (SMA): Includes fast (short-term) and slow (long-term) moving averages to identify price trends.
  • Exponential Moving Average (EMA): Uses a weighted average to detect rapid and intermediate price changes.
  • KAMA (Kaufman’s Adaptive Moving Average): Adjusts for price volatility and direction, helping identify quick price changes.
  • Parabolic Stop and Reverse (PSAR): Identifies potential price direction reversals through SAR points.
  • Volume Weighted Average Price (VWAP): Determines fair price levels based on trading volume.
  • Volatility Indicators:
  • Keltner Channel: Includes lower, middle, and upper lines to measure volatility.
  • Donchian Channel: Comprises middle and upper lines for volatility measurement.
  • Bollinger Bands: Features lower and middle lines to assess market volatility.
  • Performance Metrics:
  • Cumulative Return: Measures the overall return of a stock over a series of periods.
  • Adjusted Close: Provides the final price of an asset for technical analysis and various computations.

All of these indicators can be calculated with the ta module.

import pandas as pd
from matplotlib import pyplot as plt
from ta import trend, momentum, volatility, volume, others


class TechnicalFeatures:

    @classmethod
    def add_ta_features(cls, data):
        data['trend_ichimoku_conv'] = trend.ichimoku_a(data['High'], data['Low'])
        data['trend_ema_slow'] = trend.ema_indicator(data['Close'], 50)
        data['momentum_kama'] = momentum.kama(data['Close'])
        data['trend_psar_up'] = trend.psar_up(data['High'], data['Low'], data['Close'])
        data['volume_vwap'] = volume.VolumeWeightedAveragePrice(data['High'], data['Low'], data['Close'],
                                                                data['Volume']).volume_weighted_average_price()
        data['trend_ichimoku_a'] = trend.ichimoku_a(data['High'], data['Low'])
        data['volatility_kcl'] = volatility.KeltnerChannel(data['High'], data['Low'],
                                                           data['Close']).keltner_channel_lband()
        data['trend_ichimoku_b'] = trend.ichimoku_b(data['High'], data['Low'])
        data['trend_ichimoku_base'] = trend.ichimoku_base_line(data['High'], data['Low'])
        data['trend_sma_fast'] = trend.sma_indicator(data['Close'], 20)
        data['volatility_dcm'] = volatility.DonchianChannel(data['High'], data['Low'],
                                                            data['Close']).donchian_channel_mband()
        data['volatility_bbl'] = volatility.BollingerBands(data['Close']).bollinger_lband()
        data['volatility_bbm'] = volatility.BollingerBands(data['Close']).bollinger_mavg()
        data['volatility_kcc'] = volatility.KeltnerChannel(data['High'], data['Low'],
                                                           data['Close']).keltner_channel_mband()
        data['volatility_kch'] = volatility.KeltnerChannel(data['High'], data['Low'],
                                                           data['Close']).keltner_channel_hband()
        data['trend_sma_slow'] = trend.sma_indicator(data['Close'], 200)
        data['trend_ema_fast'] = trend.ema_indicator(data['Close'], 20)
        data['volatility_dch'] = volatility.DonchianChannel(data['High'], data['Low'],
                                                            data['Close']).donchian_channel_hband()
        data['others_cr'] = others.cumulative_return(data['Close'])
        data['Adj Close'] = data['Close']
        return data

    @classmethod
    def get_price_change(cls, candlesticks) -> pd.DataFrame:
        return candlesticks['Close'].pct_change().dropna()

    @classmethod
    def plot_indicators(cls, data):
        # Adding technical indicators
        data = cls.add_ta_features(data)

        # Plotting
        plt.figure(figsize=(14, 10))
        plt.plot(data.index, data['Close'], label='Close Price', color='blue')
        plt.plot(data.index, data['trend_ema_slow'], label='EMA 50', linestyle='--', color='orange')
        plt.plot(data.index, data['trend_sma_fast'], label='SMA 20', linestyle='--', color='green')
        plt.plot(data.index, data['trend_sma_slow'], label='SMA 200', linestyle='--', color='red')
        plt.plot(data.index, data['momentum_kama'], label='KAMA', linestyle='--', color='purple')
        plt.plot(data.index, data['volume_vwap'], label='VWAP', linestyle='--', color='brown')
        plt.plot(data.index, data['volatility_bbm'], label='Bollinger Bands Middle', linestyle='--', color='magenta')
        plt.plot(data.index, data['volatility_bbl'], label='Bollinger Bands Lower', linestyle='--', color='cyan')
        plt.plot(data.index, data['volatility_kcc'], label='Keltner Channel Middle', linestyle='--', color='gray')
        plt.plot(data.index, data['trend_ichimoku_base'], label='Ichimoku Base Line', linestyle='--', color='black')

        plt.title('Close Price with Technical Indicators')
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.legend()
        plt.show()

By utilizing these indicators, analysts can better assess market trends, identify entry and exit points, and make more informed trading decisions. When two stocks show similarities in financial indicators, it suggests they may exhibit concurrent movements, enhancing the accuracy of analysis.

Dynamic Time Warping (DTW) in Stock Pattern Recognition

Dynamic Time Warping (DTW) is a method used to measure the similarity between two time series that may vary in speed or length. Unlike traditional methods such as Euclidean distance, which may not accurately reflect similarities due to alignment issues, DTW aligns sequences in a time-warped manner to allow for more accurate comparisons.

Originally developed for speech recognition, DTW has found applications in various fields, including finance. For instance, it can compare two time series, A and B, where traditional Euclidean distance might fail to capture their true similarity due to differences in time alignment.

DTW calculates similarity by minimizing the cumulative distance between aligned points in the time series, offering a flexible and effective way to measure the similarity between sequences that may not be perfectly synchronized.
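
To make the idea of minimizing the cumulative distance concrete, here is a minimal dynamic-programming DTW in plain NumPy; it is only an illustration of the recurrence, while the project itself uses the FastDTW approximation for speed.

import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) DTW: accumulate the cheapest alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # each cell extends the cheapest of: match, insertion, deletion
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])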

In stock analysis, where price patterns are driven by multiple factors, DTW can identify similar patterns across different stocks, even if they vary in timing or magnitude. By aligning patterns optimally, DTW helps detect stocks with similar trends or movements.
Using the FastDTW library, we can calculate the DTW distance between two different time series.

import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

from src.Domain.SimilarityDetector.PreProcessing import PreProcessor


class DTWCalculator:

    @classmethod
    def calculate(cls, time_series_1, time_series_2):
        ts1_normalized = PreProcessor.normalize(time_series_1)
        ts2_normalized = PreProcessor.normalize(time_series_2)
        # np.asarray accepts both pandas Series and NumPy arrays before reshaping to column vectors
        distance, _ = fastdtw(np.asarray(ts1_normalized).reshape(-1, 1),
                              np.asarray(ts2_normalized).reshape(-1, 1),
                              dist=euclidean)
        return distance

Principal Component Analysis (PCA) for Dimensionality Reduction

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data, making it valuable for navigating high-dimensional spaces such as technical indicators in finance. PCA identifies “principal components” that capture the most variance in the data, allowing for a reduced-dimensional representation.

PCA involves several key steps:

  1. Standardization: Data is standardized to have a mean of zero and a variance of one.
  2. Covariance Matrix Calculation: A covariance matrix is computed to capture relationships between variables.
  3. Eigenvalue and Eigenvector Calculation: Eigenvalues and eigenvectors are derived from the covariance matrix, representing the principal components.
  4. Component Selection: Principal components explaining the most variance are selected.
  5. Data Transformation: Data is projected into the space defined by the principal components.
  6. Result Interpretation: Analysis of principal components helps identify patterns and significant features.
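
To make steps 1 through 5 concrete, here is a compact NumPy sketch of the procedure; in practice, the article relies on scikit-learn's PCA, shown below.

import numpy as np


def pca_project(data: np.ndarray, n_components: int = 3) -> np.ndarray:
    # 1. standardize to zero mean and unit variance
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # 2. covariance matrix of the features
    cov = np.cov(standardized, rowvar=False)
    # 3. eigenvalues / eigenvectors of the (symmetric) covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. keep the components explaining the most variance
    top = np.argsort(eigenvalues)[::-1][:n_components]
    # 5. project the data onto the selected components
    return standardized @ eigenvectors[:, top]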

Before applying DTW, it’s essential to reduce the dimensionality of technical indicators using PCA due to their high interdependence. This process eliminates redundant data and trends, improving the effectiveness of pattern recognition.

from matplotlib import pyplot as plt
from sklearn.decomposition import PCA


class PCAFeatureReducer:

    @classmethod
    def reduce_feature(cls, data, n_components=3):
        pca = PCA(n_components=n_components)
        reduced_features = pca.fit_transform(data)
        return reduced_features

    @classmethod
    def plot_pca_and_close(cls, data, reduced_features):
        plt.figure(figsize=(14, 10))

        # Plotting the Close Price
        plt.plot(data.index, data['Close'], label='Close Price', color='blue')

        # Plotting the PCA reduced features
        for i in range(reduced_features.shape[1]):
            plt.plot(data.index, reduced_features[:, i], label=f'PCA Component {i + 1}')

        plt.title('Close Price with PCA-Reduced Features')
        plt.xlabel('Date')
        plt.ylabel('Value')
        plt.legend()
        plt.show()

In summary, combining DTW and PCA enables more accurate and efficient analysis of stock price patterns by aligning sequences and reducing dimensionality, thus facilitating better detection of similar trends and movements.

Calculating Composite Index

One way to measure the distance between two different stocks’ time series, following the composite-index idea proposed in prior work, is to define a weighting coefficient and then calculate a composite distance between the two series. This coefficient can be applied as follows:

# w is a weighting coefficient
D_composite = D_time_series + w * D_technical
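
As a minimal sketch, the composite index can be expressed as a small helper; the default weight of 0.5 is only an illustrative assumption, not a value prescribed by the article.

def composite_distance(series_distance: float, ta_distance: float, weight: float = 0.5) -> float:
    """Composite index: raw price-series DTW distance plus a weighted
    DTW distance of the PCA-reduced technical indicators."""
    return series_distance + weight * ta_distance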

Data Sorting

The output of the previous steps is a matrix of composite distances, combining the DTW distances of the PCA-reduced technical-indicator trends with those of the time series themselves. To use these results and select stocks with alternative trends, the data are sorted by increasing distance, because in the DTW method a smaller distance indicates greater similarity between two time series.
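
A minimal sketch of this sorting step, assuming a hypothetical dictionary that maps candidate symbols to their composite distances:

def rank_candidates(composite_distances: dict[str, float]) -> list[tuple[str, float]]:
    # Smaller composite distance = more similar trend, so sort ascending
    return sorted(composite_distances.items(), key=lambda item: item[1])


# usage with a hypothetical distance map
ranked = rank_candidates({'AAPL': 12.4, 'MSFT': 9.1, 'GOOG': 15.8})
# -> [('MSFT', 9.1), ('AAPL', 12.4), ('GOOG', 15.8)]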

Limitations of Composite Index Method

Identifying stocks with similar price trends is crucial for enhancing liquidity and flexibility in a liquid portfolio, particularly in inefficient markets, and for preserving investment freedom. While the application of DTW to identifying stock price patterns, as demonstrated in this study, can be effective, it also has several weaknesses.

The method presented in this study is a starting point for using a combination of DTW and PCA to create a composite index for pattern recognition in stock price data. Therefore, there are limitations and areas for improvement that should be considered. Despite these limitations, there are ways to improve this approach, many of which have been addressed in this study.

  • Parameter Optimization Through Cross-Validation: The performance of DTW and PCA techniques heavily depends on parameter selection, such as the weighting coefficient in distance calculation, considered time intervals, and selected variables in PCA. Currently, there is no systematic method for parameter tuning, but adding techniques like grid search or evolutionary optimization algorithms could help find an optimal set of parameters that maximizes the accuracy and reliability of the pattern recognition and prediction method.
  • Assumption of Repeating Patterns: One of the key assumptions in the DTW method is that patterns in time series will reappear in the future. This assumption may not always hold in the highly volatile and multifaceted stock market, influenced by numerous, sometimes unprecedented, variables. This assumption, that past price patterns will repeat in the future, is a common simplification underlying technical analysis in financial markets. However, it should be noted that market dynamics evolve, and historical patterns do not necessarily repeat or lead to accurate predictions.
  • Univariate Analysis: The current implementation of DTW focuses mainly on stock prices and ignores other influential factors such as trading volume or volatility, which can provide critical insights into market behavior and enhance prediction capability. Including these additional aspects can offer a more comprehensive understanding of stock price movements. Engineering features and expanding the feature set by adding more diverse predictive indicators or alternative data sources could also improve the accuracy of the pattern recognition method.
  • Computational Load: Especially with larger datasets, DTW can create a high computational burden due to its quadratic time complexity. When applied to price data, the extensive range of stock prices and time intervals may result in long processing times and significant resource use. One way to address this is to use optimization methods such as FastDTW, which reduces the time complexity by approximating DTW distances and can be crucial for managing computational needs.
  • Sensitivity to Noise: DTW’s sensitivity to noise in data can affect pattern recognition accuracy, where temporary fluctuations may disproportionately impact the DTW distance calculation. Although technical variables are used in this study to identify price movement trends, this issue might introduce errors in comparing the price trends of two stock time series. Designing mechanisms to identify and adjust for anomalous prices or extreme events could improve the accuracy and robustness of the pattern recognition and prediction method. One approach to smoothing anomalies is using moving average methods, which have been considered as an indicator in this study. However, in raw time series distance calculation, implementing noise reduction methods can help separate important patterns by reducing the impact of temporary data anomalies.
  • Integration with Other Machine Learning Models: Combining DTW with other machine learning models can enhance predictive and pattern recognition capabilities by integrating precise pattern identification with predictive validation.
  • Multi-Asset Pattern Recognition: Expanding the scope to include diverse assets or even different markets could provide richer analysis. By examining patterns across assets or related sectors, broader market trends or recurring patterns not limited to the primary analyzed asset might be discovered. For example, similar trends can be observed between the behavior of sectoral funds and physical gold in the sample market, which can expand liquidity.

Using Machine Learning Models

Despite the strengths of the DTW + PCA method, finding a suitable threshold for detecting similarities between two stocks and trends requires studying both similar and dissimilar samples. To address this, the output from the DTW + PCA method can be used to train a machine learning model with a large number of similar and dissimilar samples, creating a model capable of identifying co-trending and non-co-trending stocks.

In this study, since the problem involves identifying similar and dissimilar stocks relative to the target stock, it is categorized as a classification problem. Based on the problem type, the amount of available data, and other studies discussed further, the Random Forest Classification model has been selected for this research.

Synthetic Data Generation in Machine Learning

Synthetic data generation refers to the process of creating data that are produced by computer algorithms and resemble real data, considering the specific conditions of the research. This technique is particularly useful when real data are limited, sensitive, or expensive. Synthetic data can help machine learning models learn better and perform more effectively.

In this study, the use of synthetic data is essential due to the lack of sufficient real data for specific stocks and the limited number of stocks in the stock markets. Additionally, the business approach to co-movement data is another important aspect. For example, a consumer might define a stock with a profitability trend and cumulative profit difference ranging from -5% to +5% as a co-movement stock, while others might have different limitations in this regard.

Methods of Synthetic Data Generation

1 - Generative Adversarial Networks (GANs)
GANs consist of two neural networks: a generator and a discriminator. These two networks are trained simultaneously, where the generator creates synthetic data that are indistinguishable from real data, and the discriminator tries to differentiate between real and synthetic data. The generator produces random data samples, and the discriminator attempts to identify whether these samples are real or artificial. Over time and through repeated iterations, the generator learns to produce high-quality synthetic data.

2 — Variational Autoencoders (VAEs)
VAEs are a type of autoencoder designed to generate synthetic data using probabilistic distributions. These models use a neural network to compress data into a latent space and then reconstruct the data from this latent space. Initially, input data are mapped to a latent space, and then new data are generated from this space. This method helps create synthetic data that have a similar distribution to real data.

3 — Data Augmentation
Data augmentation is a technique where random modifications are applied to existing data to create a larger and more diverse training dataset. These modifications can include rotation, scaling, cropping, adding noise, etc. By applying various transformations to real data, new samples are generated, which helps the model better respond to changes and fluctuations in the data.
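
As a hedged illustration of this idea for price series, the sketch below rescales a closing-price series and adds multiplicative Gaussian noise; the parameter values are assumptions, and the project's own generators (shown later) implement a more elaborate candle-by-candle scheme.

import numpy as np
import pandas as pd


def augment_prices(close: pd.Series, noise_std: float = 0.005,
                   scale_range: tuple = (0.95, 1.05)) -> pd.Series:
    """Create one augmented copy of a closing-price series by
    rescaling it and adding multiplicative Gaussian noise."""
    scale = np.random.uniform(*scale_range)
    noise = np.random.normal(0.0, noise_std, size=len(close))
    return close * scale * (1.0 + noise)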

4 - Simulations
Simulations are a method for generating synthetic data using mathematical and physical models that mimic real-world processes. This method is particularly used in engineering, physics, and economics. By employing simulation models, synthetic data with similar characteristics to real data can be produced. These models are usually designed based on scientific principles and laws.

Synthetic Data Generation in This Study
In the present research, due to the limited amount of relevant data on co-movement stocks and the limited number of symbols, training a machine-learning model is considerably challenging and complex. Additionally, a machine learning model does not have any pre-training evaluation of the similarity or dissimilarity of input data. Therefore, synthetic data need to be generated randomly with some controls.

The benefits of using synthetic data in this research include:

  • Increasing Data Volume: Generating synthetic data helps increase the dataset size, which leads to improved accuracy and performance of machine learning models.
  • Cost and Time Reduction: Collecting and labeling real data can be time-consuming and costly. Synthetic data generation offers an economical and rapid solution for creating large datasets.
  • Privacy and Security: Real data may contain sensitive and personal information. Synthetic data can help preserve privacy and data security.
  • Improving Model Generalization: Diverse synthetic data help the model adapt better to variations and changes in data, improving its generalization.

However, there are considerations when generating synthetic data:

  • Data Quality: One of the main challenges is ensuring the quality and accuracy of the generated data. Synthetic data must closely resemble real data for the model to learn effectively.
  • Avoiding Overfitting: Synthetic data might lead to overfitting, especially if the generated data lacks sufficient diversity. Care must be taken to ensure that synthetic data help the model learn new patterns effectively.
  • Complexity of Generative Models: Generative models like GANs and VAEs are often complex and require significant computation and fine-tuning, which can increase the time and resources needed for synthetic data generation.

Initially, the similarity conditions of the two stocks should be considered based on price candles. According to studies (Cristian Velasquez) and (Mengxia et al.) and other sources, the following conditions can be considered for co-movement:

  • The stock trends, including rises, falls, and direction changes, should be as similar as possible.
  • The price changes within a specific time range for both stocks should not exceed an error margin.
  • The main trend should be evaluated on the closing price data.
  • Maximum and minimum prices and volume data should be random but consistent with the initial data.
  • Stocks can have multiple fluctuations in different periods.
  • Stocks can have Gaussian noise.

Based on these conditions, two functions are used: one for generating synthetic data similar to the main data and trend, and another for generating synthetic data with a trend different from the main stock. Additionally, some functions and conditions are created to meet the requirements for these data. After this, machine learning can be applied using the synthetic data before examining the similarity of the stocks.

To generate a similar trend, we define a simple class as shown below and ensure that the generated time series meets the specifications introduced above for our learning purpose.

import random
from multiprocessing import Pool

import pandas as pd
import numpy as np


class SimilarCandleGeneratorNew:
    def __init__(self, base_df):
        self.__base_df: pd.DataFrame = base_df
        self.__calculate_indicators(self.__base_df)
        self.__base_df['Date'] = self.__base_df.index.date
        self.__counter = 0

    @staticmethod
    def generate_multiprocessing_candlestick(args):
        self, price_range, volatility_range, fluctuation_range, volume_range, return_window, main_candle = args
        return self.generate_sample(price_range, volatility_range, fluctuation_range, volume_range,
                                    return_window, main_candle)

    @classmethod
    def __calculate_indicators(cls, candle_df):
        candle_df['EMA_12'] = candle_df['Close'].ewm(span=12, adjust=False).mean().bfill()
        candle_df['EMA_26'] = candle_df['Close'].ewm(span=26, adjust=False).mean().bfill()
        candle_df['SMA_12'] = candle_df['Close'].rolling(window=12).mean().bfill()
        candle_df['SMA_26'] = candle_df['Close'].rolling(window=26).mean().bfill()
        candle_df['Returns'] = candle_df['Close'].ffill().pct_change()
        candle_df['Cumulative_Returns'] = (1 + candle_df['Returns']).cumprod() - 1

    def __generate_hourly_candles(self, open_price, high_constraint, low_constraint, volume_range, date, main_candle):
        hourly_candles = []
        previous_close = open_price

        for hour in range(24):
            volume = random.randint(*volume_range)

            # Ensure similar trend and return
            main_candle_return = (main_candle['Close'] - main_candle['Open']) / main_candle['Open']
            price_change = main_candle_return * previous_close + np.random.uniform(-0.001, 0.001) * previous_close

            fluctuation = np.random.uniform(0.001, 0.01) * previous_close

            open_price = previous_close
            low_price = max(low_constraint, open_price - fluctuation)
            high_price = min(high_constraint, open_price + fluctuation)
            close_price = open_price + price_change

            # Ensure close price is within bounds
            close_price = max(low_price, min(close_price, high_price))

            generated_candle = {
                'OpeningTime': f'{date} {hour:02d}:00:00',
                'Open': open_price,
                'High': high_price,
                'Low': low_price,
                'Close': close_price,
                'Volume': volume,
            }

            hourly_candles.append(generated_candle)
            previous_close = close_price

        return hourly_candles

    def __generate_daily_candles(self, previous_close, volume_range, main_candle):
        all_candlesticks = []

        for date, group in self.__base_df.groupby('Date'):
            daily_open = previous_close
            daily_high_constraint = daily_open * 1.05
            daily_low_constraint = daily_open * 0.95

            daily_candles = self.__generate_hourly_candles(daily_open, daily_high_constraint, daily_low_constraint,
                                                           volume_range, date, main_candle)
            all_candlesticks.extend(daily_candles)

            previous_close = np.mean([candle['Close'] for candle in daily_candles])

        return all_candlesticks

    def __generate_sample(self, price_range, volatility_range=(0.001, 0.99), fluctuation_range=(0.001, 0.99),
                          volume_range=(1000, 10000), return_window=0.05, main_candle=None):

        initial_close = self.__base_df.iloc[0]['Close']
        all_candlesticks = self.__generate_daily_candles(initial_close, volume_range, main_candle)

        new_df = pd.DataFrame(all_candlesticks)
        new_df['OpeningTime'] = pd.to_datetime(new_df['OpeningTime'])
        new_df.set_index('OpeningTime', inplace=True)
        return new_df

    def __ensure_similarity(self, sample_df, main_candle):
        base_returns = self.__base_df['Returns'].dropna()
        sample_returns = sample_df['Returns'].dropna()

        # Align the indices of the base and sample returns
        aligned_base_returns, aligned_sample_returns = base_returns.align(sample_returns, join='inner')

        # Check if the sample returns are similar to the main candle returns
        main_candle_return = (main_candle['Close'] - main_candle['Open']) / main_candle['Open']
        sample_returns_mean = sample_returns.mean()

        return np.isclose(sample_returns_mean, main_candle_return, atol=0.05)

    def generate_sample(self, price_range, volatility_range=(0.1, 0.5), fluctuation_range=(0.1, 0.5),
                        volume_range=(1000, 10000), return_window=0.05, main_candle=None):
        print(f'Generating similar sample number {self.__counter}')
        try:
            sample_df = self.__generate_sample(price_range,
                                               volatility_range=volatility_range,
                                               fluctuation_range=fluctuation_range,
                                               volume_range=volume_range,
                                               return_window=return_window,
                                               main_candle=main_candle)
            # the generated sample needs its own Returns column before the similarity check
            self.__calculate_indicators(sample_df)
            if self.__ensure_similarity(sample_df, main_candle):
                return sample_df
            else:
                return None
        except Exception as e:
            print(e)

    def generate_samples(self, price_range,
                         sampling_count: int = 100,
                         volatility_range=(0.001, 0.5),
                         fluctuation_range=(0.001, 0.5),
                         volume_range=(1000, 10000), return_window=0.05, main_candle=None):
        self.__counter = 0
        print('Similar sample generator start to generate samples')
        with Pool() as pool:
            tasks = [(self, price_range, volatility_range, fluctuation_range, volume_range, return_window, main_candle)
                     for _ in range(sampling_count)]
            generated_candles = pool.map(SimilarCandleGeneratorNew.generate_multiprocessing_candlestick, tasks)

        return [candle for candle in generated_candles if candle is not None]

Some examples of the generated similar datasets are shown below:

Generated similar stocks

On the other hand, we define another class to generate dissimilar time series, whose properties differ from those of the similar ones.

import random
from multiprocessing import Pool

import pandas as pd
import numpy as np


class DissimilarCandleGenerator:
    def __init__(self):
        self.__base_df: pd.DataFrame | None = None

    @staticmethod
    def generate_multiprocessing_candlestick(args):
        self, price_range, volatility_range, fluctuation_range, volume_range, return_window = args
        return self.generate_sample(price_range, volatility_range, fluctuation_range, volume_range, return_window)

    @classmethod
    def __calculate_indicators(cls, candle_df):
        candle_df['EMA_12'] = candle_df['Close'].ewm(span=12, adjust=False).mean().bfill()
        candle_df['EMA_26'] = candle_df['Close'].ewm(span=26, adjust=False).mean().bfill()
        candle_df['SMA_12'] = candle_df['Close'].rolling(window=12).mean().bfill()
        candle_df['SMA_26'] = candle_df['Close'].rolling(window=26).mean().bfill()
        candle_df['Returns'] = candle_df['Close'].ffill().pct_change()
        candle_df['Cumulative_Returns'] = (1 + candle_df['Returns']).cumprod() - 1

    def __generate_hourly_candles(self, open_price, high_constraint, low_constraint, volume_range, date):
        hourly_candles = []
        previous_close = open_price

        for hour in range(24):
            volume = random.randint(*volume_range)

            # Calculate random fluctuation and price change
            fluctuation = np.random.uniform(0.001, 0.01) * previous_close
            price_change = np.random.uniform(-0.05, 0.05) * previous_close

            open_price = previous_close
            low_price = max(low_constraint, open_price - fluctuation)
            high_price = min(high_constraint, open_price + fluctuation)
            close_price = open_price + price_change

            # Ensure close price is within bounds
            close_price = max(low_price, min(close_price, high_price))

            generated_candle = {
                'OpeningTime': f'{date} {hour:02d}:00:00',
                'Open': open_price,
                'High': high_price,
                'Low': low_price,
                'Close': close_price,
                'Volume': volume,
            }

            hourly_candles.append(generated_candle)
            previous_close = close_price

        return hourly_candles

    def __generate_daily_candles(self, previous_close, volume_range):
        all_candlesticks = []

        for date, group in self.__base_df.groupby('Date'):
            daily_open = previous_close
            daily_high_constraint = daily_open * 1.05
            daily_low_constraint = daily_open * 0.95

            daily_candles = self.__generate_hourly_candles(daily_open, daily_high_constraint, daily_low_constraint,
                                                           volume_range, date)
            all_candlesticks.extend(daily_candles)

            previous_close = np.mean([candle['Close'] for candle in daily_candles])

        return all_candlesticks

    def __generate_sample(self, price_range, volatility_range=(0.001, 0.99), fluctuation_range=(0.001, 0.99),
                          volume_range=(1000, 10000), return_window=0.05):

        initial_close = self.__base_df.iloc[0]['Close']
        all_candlesticks = self.__generate_daily_candles(initial_close, volume_range)

        new_df = pd.DataFrame(all_candlesticks)
        new_df['OpeningTime'] = pd.to_datetime(new_df['OpeningTime'])
        new_df.set_index('OpeningTime', inplace=True)
        return new_df

    def __ensure_non_similarity(self, sample_df):
        base_returns = self.__base_df['Returns'].dropna()
        sample_returns = sample_df['Returns'].dropna()

        # Align the indices of the base and sample returns
        aligned_base_returns, aligned_sample_returns = base_returns.align(sample_returns, join='inner')

        return not np.allclose(aligned_base_returns, aligned_sample_returns, atol=0.05)

    def generate_sample(self, price_range, volatility_range=(0.1, 0.5), fluctuation_range=(0.1, 0.5),
                        volume_range=(1000, 10000), return_window=0.05):

        try:
            return self.__generate_sample(price_range,
                                          volatility_range=volatility_range,
                                          fluctuation_range=fluctuation_range,
                                          volume_range=volume_range,
                                          return_window=return_window)

        except Exception as e:
            pass

    def generate_samples(self, goal_candle: pd.DataFrame,
                         price_range,
                         sampling_count: int = 100,
                         volatility_range=(0.001, 0.5),
                         fluctuation_range=(0.001, 0.5),
                         volume_range=(1000, 10000), return_window=0.05):
        self.__base_df = goal_candle.copy()
        self.__calculate_indicators(self.__base_df)
        self.__base_df['Date'] = self.__base_df.index.date
        print('Dissimilar sample generator start to generate samples')
        with Pool() as pool:
            tasks = [(self, price_range, volatility_range, fluctuation_range, volume_range, return_window)
                     for _ in range(sampling_count)]
            generated_candles = pool.map(DissimilarCandleGenerator.generate_multiprocessing_candlestick, tasks)

        return [candle for candle in generated_candles if candle is not None]

Some of the generated series that are dissimilar to the target stock are shown as well:

Generated dissimilar stocks

Model Selection and Training in Machine Learning

Classification Random Forest Model

The Random Forest Classification model is one of the most powerful and widely used ensemble learning algorithms for classification and regression problems. This algorithm operates by creating a collection of decision trees and combining their results to provide a final prediction.

The structure and operation of Random Forest can be outlined in several main steps:

  • Bootstrap Sampling: Training data is randomly divided into multiple subsets with replacement. Each subset may contain duplicate data points. This method is known as Bootstrap Sampling.
  • Decision Tree Creation: For each subset, a decision tree is created independently. At each node of the tree, a random subset of features is selected, and the best split is made based on criteria such as the Gini Index or Entropy.
  • Result Aggregation: For classification problems, the final result is obtained by majority voting among the decision trees. For regression problems, the mean of the tree results is used as the final prediction.

Advantages and reasons for using the Random Forest Classification model in various problems include:

  • High Accuracy: By aggregating results from multiple decision trees, the model typically achieves high accuracy in predictions, making it a suitable choice for complex problems.
  • Resistance to Overfitting: Since each decision tree operates independently and is trained on a subset of the data, Random Forest is less prone to overfitting.
  • Flexibility: This learning model can be used for both classification and regression tasks, making it a versatile and useful tool.
  • Feature Importance Computation: The model can determine the importance of each feature for prediction, which is valuable for subsequent analyses and feature selection.
  • Resistance to Imbalanced Data: The algorithm performs well with imbalanced datasets and can provide good performance even when some classes have significantly fewer samples than others.

In the context of identifying trending stocks and assets, Random Forest is particularly useful and efficient for the following reasons:

  • Combining Different Data Types: This machine learning model can effectively combine and analyze various data types, including price candlesticks and PCA-reduced technical data, which helps in identifying complex patterns and similar trends among stocks.
  • Stability and Accuracy: Given the volatility and complexities of financial markets, an algorithm with high stability and accuracy is crucial. Random Forest provides stable and accurate predictions by reducing noise and combining results from multiple trees.
  • Ability to Handle Multiple Features: In technical analysis, there are numerous indicators and features. Random Forest handles large feature sets well and can identify important features.
  • Resistance to Imbalanced Data: Financial market data can often be imbalanced. Random Forest adapts well to such data and can deliver good performance even in imbalanced conditions.

Model Optimization in Machine Learning

In machine learning, models need to be trained to strike a proper balance between bias and variance. These two phenomena can lead to poor model performance. Overfitting and high bias are common issues that must be managed carefully.

  1. Overfitting

Overfitting occurs when a model becomes excessively faithful to the training data, learning its details and noise, which impairs its performance on new, unseen data. Key considerations for preventing overfitting include:

  • Reduced Generalizability: An overfitted model cannot effectively learn general patterns in the data and performs poorly on new data.
  • Increased Complexity: Overfitting increases model complexity, which can lead to longer training and prediction times.

2. High Bias

High bias occurs when a model, due to excessive simplification, fails to learn complex relationships between features and outputs, resulting in high training and testing error. Key considerations include:

  • Reduced Accuracy: A high-bias model cannot identify complex patterns, leading to lower accuracy.
  • Inability to Learn: High bias can prevent the model from learning from the training data.

3. High Variance

High variance occurs when a model is sensitive to small fluctuations in the training data, leading to unstable results on new data. Key considerations include:

  • Prediction Fluctuations: A model with high variance will produce unstable results on new data, making reliable predictions difficult.
  • Reduced Reliability: High variance can diminish model reliability in real-world scenarios.

Strategies to Prevent Model Training Issues

Various strategies exist to address these training problems, one of the most important being the use of cross-validation. Tools like GridSearchCV in the Scikit-Learn library are useful for this purpose.

This tool optimizes and tunes model parameters to find the best settings for a model. It effectively helps in preventing overfitting, high bias, and high variance.

Important parameters in decision tree models include the number of trees, maximum tree depth, minimum samples required to split a node, and minimum samples required at a leaf. GridSearchCV helps in selecting optimal parameters by generating all possible combinations and evaluating their results. For instance, in the Random Forest model used in this research, the following parameters were evaluated:

  • Number of trees: [50, 100, 200]
  • Maximum tree depth: [None, 10, 20, 30]
  • Minimum samples required to split a node: [2, 5, 10]
  • Minimum samples at a leaf: [1, 2, 4]

This tool assesses each parameter combination using Cross-Validation to identify the optimal settings for better model performance.

Training and Validating the Machine Learning Model

At this step, we define a class to train, validate, and save our model to a defined path, using the scikit-learn library for training and validation.

First, we label our sampled data. To do this, we create a pandas DataFrame that stores the extracted distances between the target time series and the generated similar and dissimilar series.
We then partition the dataset into training and test sets, tune the model’s hyperparameters with cross-validation, and train the model.
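
As a minimal sketch of this labelling step (the helper name and the example values are hypothetical), the column names below match those consumed by the Classifier shown later: the three distances as features and a binary result label, 1 for similar samples and 0 for dissimilar ones.

import pandas as pd


def build_dataset(similar_rows: list, dissimilar_rows: list) -> pd.DataFrame:
    """Each row holds the three distances computed for one synthetic sample;
    similar samples are labelled 1, dissimilar samples 0."""
    similar = pd.DataFrame(similar_rows).assign(result=1)
    dissimilar = pd.DataFrame(dissimilar_rows).assign(result=0)
    return pd.concat([similar, dissimilar], ignore_index=True)


# hypothetical usage: each row contains series_distance, ta_distance and return_distance
data_set = build_dataset(
    similar_rows=[{'series_distance': 0.8, 'ta_distance': 1.2, 'return_distance': 0.01}],
    dissimilar_rows=[{'series_distance': 7.5, 'ta_distance': 9.3, 'return_distance': 0.20}],
)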

We create a class to handle this entire procedure.

import logging
import time

import pandas as pd

from src.Domain.Contract.Sampling import SamplingConf
from src.Domain.SimilarityDetector.Classification import Classifier
from src.Domain.SimilarityDetector.Dataset import DatasetMaker


class MLBasedSimilarityDetector:
    def __init__(self, model_name, file_path, model_path, sampling_count, sampling_conf: SamplingConf = SamplingConf()):
        self.__model_path = model_path
        self.__datasets_maker = DatasetMaker(file_path + model_name, sampling_count, sampling_conf)
        self.__classifier = Classifier(model_name)
        self.__file_path = file_path
        self.__logger = logging.getLogger(__name__)

    def update_conf(self, sampling_count: int = None, sampling_conf: SamplingConf = None):
        self.__datasets_maker.update_conf(sampling_count, sampling_conf)

    def start(self, goal_candles: dict[str, pd.DataFrame]):
        if not self.__classifier.load_model(self.__model_path):
            self.train(goal_candles)

    def train(self, goal_candles: dict[str, pd.DataFrame]):
        print('Similarity Detector start to Running')
        t_1 = time.time()
        self.__prepare_datasets(goal_candles)
        t_2 = time.time()
        print(f'Data Set Generated in {t_2 - t_1} seconds')
        self.__train()
        t_3 = time.time()
        print(f'Model Trained in {t_3 - t_2} seconds')

    def __prepare_datasets(self, goal_candles):
        self.__datasets_maker.generate_dataset(goal_candles)

    def __train(self):
        self.__classifier.train(self.__datasets_maker.data_set)
        self.__classifier.validate()
        self.__classifier.save_model(self.__model_path)

    def is_similar(self, pct_distance: float, ta_distance: float, cumulative_return: float):
        feature = pd.DataFrame({'series_distance': [pct_distance],
                                'ta_distance': [ta_distance],
                                'return_distance': [cumulative_return]}, dtype=float)
        return self.__classifier.is_similar(feature)

and define another module to handle training and validating the classification model:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, classification_report,
                             confusion_matrix)
import joblib

from src.Common.File.FileChecking import file_exists


class Classifier:
    def __init__(self, model_name: str):
        self.__model_name = model_name
        self.__estimator = None
        self.__data_sets = {
            'X_train': None,
            'Y_train': None,
            'X_test': None,
            'Y_test': None
        }

    def train(self, data_sets: pd.DataFrame):
        x_series = data_sets[['series_distance', 'ta_distance', 'return_distance']]
        y_series = data_sets['result']

        x_train_series, x_test_series, y_train_series, y_test_series = train_test_split(x_series, y_series,
                                                                                         test_size=0.3,
                                                                                         random_state=42)
        self.__data_sets.update({
            'X_train': x_train_series,
            'Y_train': y_train_series,
            'X_test': x_test_series,
            'Y_test': y_test_series
        })
        self.__estimator = RandomForestClassifier(n_estimators=100, random_state=42)
        # self.__estimator = self.__cross_validation(x_train_series, y_train_series)
        self.__estimator.fit(x_train_series, y_train_series)

    def validate(self):
        x_test_series, y_test_series = self.__data_sets.get('X_test'), self.__data_sets.get('Y_test')
        y_pred = self.__estimator.predict(x_test_series)

        # Evaluate the model
        accuracy = accuracy_score(y_test_series, y_pred)
        precision = precision_score(y_test_series, y_pred)
        recall = recall_score(y_test_series, y_pred)
        f1 = f1_score(y_test_series, y_pred)
        conf_matrix = confusion_matrix(y_test_series, y_pred)
        class_report = classification_report(y_test_series, y_pred)

        # Print the results
        print(f'Accuracy: {accuracy}')
        print(f'Precision: {precision}')
        print(f'Recall: {recall}')
        print(f'F1 Score: {f1}')
        print('Confusion Matrix:')
        print(conf_matrix)
        print('Classification Report:')
        print(class_report)

    def is_similar(self, feature: pd.DataFrame):
        y_pred = self.__estimator.predict(feature)
        return bool(y_pred)

    def save_model(self, path):
        joblib_file = path + f"{self.__model_name}.pkl"
        joblib.dump(self.__estimator, joblib_file)
        print(f"Model saved to {joblib_file}")
        return self.__estimator

    def load_model(self, file_path):
        joblib_file = f"{self.__model_name}.pkl"
        if file_exists(file_path, joblib_file):
            self.__estimator = joblib.load(file_path + joblib_file)
            return True

        else:
            return False

    @classmethod
    def __cross_validation(cls, x_train_series, y_train_series):
        # Define the parameter grid for GridSearchCV
        param_grid = {
            'n_estimators': [100, 200, 500, 1000],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

        # Initialize the Random Forest Classifier
        classifier = RandomForestClassifier(random_state=42)

        # Initialize GridSearchCV
        grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid,
                                   cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

        # Fit GridSearchCV
        grid_search.fit(x_train_series, y_train_series)

        # Print the best parameters and the best score
        print(f'Best Parameters: {grid_search.best_params_}')
        print(f'Best Cross-Validation Accuracy: {grid_search.best_score_}')

        return grid_search.best_estimator_

    @property
    def model(self):
        return self.__estimator

After validating our model, its output is shown below:

Metric              Value   Details (class 0 / class 1)
Accuracy            0.933
Precision           0.928   0: 0.93, 1: 0.94
Recall              0.89    0: 0.94, 1: 0.93
F1 Score            0.91    0: 0.94, 1: 0.93
Confusion Matrix            [[237, 30], [20, 265]]
Support             552     0: 257, 1: 295

Identifying Trend-Following Stocks

Once the model is ready, the next step is to identify trend-following stocks. To achieve this, you need to repeat the steps outlined for sample data using the stored model and available data for each symbol. The process is as follows:

  • Data Preprocessing: First, preprocess the data for the target stocks and the test stocks.
  • Feature Calculation: Compute all features, including the distance between the two stocks, the distance of their reduced technical indicators, and their cumulative returns.
  • Similarity Assessment: Input this information into the stored model to evaluate the similarity or dissimilarity between the two stocks.

To ensure the accuracy of the results, it is essential to consider control conditions. Conduct a preliminary test using the available cumulative returns. If the cumulative return obtained from this test falls within the expected range considering the predefined error margin, the stock is selected as similar.
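
Below is a hedged sketch of how these three steps and the control test could be wired together for one target/candidate pair, reusing the PreProcessor, TechnicalFeatures, DTWCalculator, and MLBasedSimilarityDetector classes shown earlier; the exact call sequence, the per-component summation of indicator distances, and the 5% error margin are assumptions for illustration, not the project's verbatim pipeline.

def check_pair(detector, pre_processor, target_candles, candidate_candles, return_margin=0.05):
    # 1. align, de-noise and prepare the two candle series
    target, candidate = pre_processor.pre_process(target_candles, candidate_candles)

    # 2. distances between closing-price trends and PCA-reduced indicators
    pct_distance = DTWCalculator.calculate(
        TechnicalFeatures.get_price_change(target).to_numpy(),
        TechnicalFeatures.get_price_change(candidate).to_numpy())
    ta_target = pre_processor.reduced_features(pre_processor.fill_na(TechnicalFeatures.add_ta_features(target)))
    ta_candidate = pre_processor.reduced_features(pre_processor.fill_na(TechnicalFeatures.add_ta_features(candidate)))
    ta_distance = sum(DTWCalculator.calculate(ta_target[:, i], ta_candidate[:, i])
                      for i in range(ta_target.shape[1]))

    # 3. preliminary control test on cumulative returns, then the stored model
    return_gap = abs(target['Close'].pct_change().add(1).prod() -
                     candidate['Close'].pct_change().add(1).prod())
    if return_gap > return_margin:
        return False
    return detector.is_similar(pct_distance, ta_distance, return_gap)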

Storing the Obtained Data

After identifying the trend-following stocks, we categorize all similar stocks by their cumulative-return ranking for use in our portfolio management services, and store this information in a database for the service designed to provide these stocks.
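
A minimal sketch of this storage step, using SQLite purely as an illustrative backend; the table name, the cumulative_return column, and the schema are assumptions.

import sqlite3

import pandas as pd


def store_similar_stocks(similar: pd.DataFrame, db_path: str = 'similar_stocks.db') -> None:
    """Rank the detected co-trending stocks by cumulative return and persist them
    for the portfolio-management service (SQLite used here only as an example)."""
    ranked = similar.sort_values('cumulative_return', ascending=False).reset_index(drop=True)
    ranked['rank'] = ranked.index + 1
    with sqlite3.connect(db_path) as connection:
        ranked.to_sql('similar_stocks', connection, if_exists='replace', index=False)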

Technologies Used in Implementing Trend-Following Stock Identification

The implementation of this method was carried out using Python programming language and its developed libraries. The libraries used for data manipulation were Pandas and Numpy. Additionally, to handle the computational load of calculating distances and to manage the speed and volume of stock price data in the financial market, the FastDTW library was employed for more efficient time series distance calculation. This library offers various distance calculation methods, and the Euclidean distance method was utilized for point-to-point distance calculations, implemented through the Scipy library within FastDTW.

For data preprocessing and handling missing values, the impute module from the Scikit-learn library was used. To perform Principal Component Analysis (PCA) for dimensionality reduction, the Decomposition module from Scikit-learn was applied.

Furthermore, for modeling, model training, and cross-validation, the chosen model was the Random Forest classifier, and GridSearchCV from the Scikit-learn library was used for parameter optimization.

You can access the full code of this project on GitHub

Written by Matin Karbasioun

A machine learning engineer passionate about using AI to solve real-world problems. My work focuses on developing intelligent agents and digital twins.