data_processor module
Time Series Data Processing and Transformation Module.
This module handles the preparation and transformation of time series data for statistical modeling. It provides tools for handling missing values, scaling data, testing for stationarity, and transforming data to achieve stationarity.
Key Components:
- MissingDataHandler: Strategies for handling missing data
- DataScaler: Methods to standardize or normalize data
- StationaryReturnsProcessor: Transforms data to achieve stationarity
- Factory classes: Create appropriate handlers based on strategies
Key Functions:
- fill_data: Handle missing values with various strategies
- scale_data: Normalize or standardize data
- stationarize_data: Transform data to achieve stationarity
- test_stationarity: Test if data is stationary using statistical tests
- prepare_timeseries_data: Comprehensive data preparation
- calculate_ewma_covariance / calculate_ewma_volatility: Calculate EWMA metrics
Typical Usage Flow:
1. Prepare data (handle missing values, convert dates)
2. Test for stationarity
3. Transform to achieve stationarity if needed
4. Scale data for modeling
5. Proceed to stats_model.py for modeling
This module is designed to work with data generated by data_generator.py or real-world financial/economic time series data.
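The usage flow might look roughly like the sketch below. It assumes the package is importable as `timeseries_compute` and that the input holds a date column plus numeric series; the wrapper name `run_pipeline` and the 0.05 cut-off used for the branching are illustrative choices, not part of the module.

```python
import pandas as pd


def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper that chains the documented functions in order."""
    # Imported lazily so the sketch can be defined without the package installed.
    from timeseries_compute import data_processor as dp

    # Step 1: set a datetime index, keep numeric columns, fill gaps.
    df = dp.prepare_timeseries_data(raw)
    df = dp.fill_data(df, strategy="forward_fill")

    # Step 2: run the ADF test on every column and log the interpretation.
    adf_results = dp.test_stationarity(df, method="adf")
    dp.log_stationarity(adf_results, p_value_threshold=0.05)

    # Step 3: difference the data if any column fails to reject a unit root.
    if any(stats["p-value"] >= 0.05 for stats in adf_results.values()):
        df = dp.stationarize_data(df, method="difference")

    # Step 4: scale for modeling.
    return dp.scale_data(df, method="standardize")
```

The result would then be handed to stats_model.py, as the last step of the flow describes.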
- class timeseries_compute.data_processor.DataScaler
Bases: object
Provides methods to scale numeric data in a pandas DataFrame.
- scale_data_standardize(data: pd.DataFrame) -> pd.DataFrame: Standardizes all numeric columns except the index by subtracting the mean and dividing by the standard deviation.
- scale_data_minmax(data: pd.DataFrame) -> pd.DataFrame: Scales all numeric columns using MinMaxScaler, which subtracts the column minimum from each value and divides by the column range (max - min).
- scale_data_minmax(data: DataFrame) -> DataFrame
Scales the numeric columns of the given DataFrame using Min-Max scaling.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing the data to be scaled.
- Returns:
The DataFrame with scaled numeric columns.
- Return type:
pd.DataFrame
Notes
This function scales each numeric column to a range between 0 and 1.
Non-numeric columns are not affected by this scaling.
The function logs the scaling process and the first 5 rows of the scaled DataFrame.
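As an illustration of the arithmetic, here is a minimal pandas-only sketch of min-max scaling. The helper name `minmax_scale` is hypothetical, and the module itself may delegate to scikit-learn's MinMaxScaler rather than compute this by hand.

```python
import pandas as pd


def minmax_scale(data: pd.DataFrame) -> pd.DataFrame:
    """Rescale each numeric column to [0, 1]; leave other columns untouched."""
    scaled = data.copy()
    numeric_cols = scaled.select_dtypes(include="number").columns
    col_min = scaled[numeric_cols].min()
    col_range = scaled[numeric_cols].max() - col_min
    # (x - min) / (max - min), applied column by column
    scaled[numeric_cols] = (scaled[numeric_cols] - col_min) / col_range
    return scaled


df = pd.DataFrame({"price": [10.0, 15.0, 20.0], "label": ["a", "b", "c"]})
result = minmax_scale(df)
print(result)
```

Note that a constant column has a range of zero, so this sketch would produce NaN for it; how the module treats that case is not stated here.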
- scale_data_standardize(data: DataFrame) -> DataFrame
Standardizes all numeric columns in the given DataFrame except the index.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing numeric columns to be standardized.
- Returns:
The DataFrame with standardized numeric columns.
- Return type:
pd.DataFrame
Notes
The standardization is performed by subtracting the mean and dividing by the standard deviation for each numeric column.
The index of the DataFrame is not modified.
Logs the process of scaling and displays the first 5 rows of the scaled DataFrame.
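A minimal sketch of the standardization step follows. The helper name `standardize` and the `ddof` parameter are assumptions for illustration: the documentation does not say whether the population (ddof=0) or sample (ddof=1) standard deviation is used, so the sketch exposes the choice.

```python
import pandas as pd


def standardize(data: pd.DataFrame, ddof: int = 0) -> pd.DataFrame:
    """Subtract the mean and divide by the standard deviation, per numeric column."""
    scaled = data.copy()
    numeric_cols = scaled.select_dtypes(include="number").columns
    mean = scaled[numeric_cols].mean()
    std = scaled[numeric_cols].std(ddof=ddof)  # ddof is an assumption, see above
    scaled[numeric_cols] = (scaled[numeric_cols] - mean) / std
    return scaled


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2024-01-01", periods=4),
)
z = standardize(df)
print(z)
```

The index is carried through unchanged, and each scaled column ends up with mean 0 and unit standard deviation under the chosen `ddof`.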
- class timeseries_compute.data_processor.DataScalerFactory
Bases: object
Factory class for creating data scaling handlers based on the specified strategy.
- create_handler(strategy: str) -> Callable[[pd.DataFrame], pd.DataFrame]: Returns the appropriate scaling function based on the provided strategy.
- static create_handler(strategy: str) -> Callable[[DataFrame], DataFrame]
Returns the appropriate scaling function based on the provided strategy.
- Parameters:
strategy (str) – The scaling strategy to use. Supported values are “standardize” and “minmax”.
- Returns:
The scaling function corresponding to the specified strategy.
- Return type:
Callable[[pd.DataFrame], pd.DataFrame]
- Raises:
ValueError – If the provided strategy is not recognized.
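The factory behavior described above (a strategy string mapped to a callable, with ValueError for unknown names) can be sketched with a plain dictionary lookup. All names below (`_standardize`, `_minmax`, `_SCALERS`, `create_scaler`) are hypothetical stand-ins, not the module's internals.

```python
from typing import Callable, Dict

import pandas as pd

Scaler = Callable[[pd.DataFrame], pd.DataFrame]


def _standardize(data: pd.DataFrame) -> pd.DataFrame:
    return (data - data.mean()) / data.std(ddof=0)


def _minmax(data: pd.DataFrame) -> pd.DataFrame:
    return (data - data.min()) / (data.max() - data.min())


# Registry of supported strategies; adding a scaler means adding one entry.
_SCALERS: Dict[str, Scaler] = {
    "standardize": _standardize,
    "minmax": _minmax,
}


def create_scaler(strategy: str) -> Scaler:
    """Return the scaling function for `strategy`, or raise ValueError."""
    try:
        return _SCALERS[strategy]
    except KeyError:
        raise ValueError(f"Unknown scaling strategy: {strategy!r}") from None


scaler = create_scaler("minmax")
scaled = scaler(pd.DataFrame({"x": [0.0, 5.0, 10.0]}))
print(scaled)
```

Returning a function rather than applying it immediately lets the caller pick a strategy once and reuse it across several DataFrames.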
- class timeseries_compute.data_processor.MissingDataHandler
Bases: object
Handles missing data through various strategies such as dropping or forward filling.
- drop_na(data: DataFrame) -> DataFrame
Drops rows with missing values from the given DataFrame.
- Parameters:
data (pd.DataFrame) – The DataFrame from which to drop rows with missing values.
- Returns:
A DataFrame with rows containing missing values removed.
- Return type:
pd.DataFrame
- forward_fill(data: DataFrame) -> DataFrame
Fills missing values in the DataFrame using the forward fill method.
- Parameters:
data (pd.DataFrame) – The DataFrame containing missing values to be filled.
- Returns:
The DataFrame with missing values filled using forward fill.
- Return type:
pd.DataFrame
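The difference between the two strategies is easiest to see on a small frame. This sketch uses the pandas methods `dropna` and `ffill` directly, which is presumably what the handlers wrap; the sample values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"price": [100.0, np.nan, 102.0, np.nan]},
    index=pd.date_range("2024-01-01", periods=4),
)

dropped = df.dropna()  # removes every row that contains a NaN
filled = df.ffill()    # carries the last valid observation forward

print(dropped)
print(filled)
```

Dropping shortens the series and leaves gaps in the date index, while forward filling keeps every timestamp. A NaN in the very first row has nothing to carry forward and would remain after `ffill`.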
- class timeseries_compute.data_processor.MissingDataHandlerFactory
Bases: object
Factory for creating missing data handlers based on a specified strategy.
- static create_handler(strategy: str) -> Callable[[DataFrame], DataFrame]
Creates a handler function based on the specified strategy.
- Parameters:
strategy (str) – The strategy to handle missing data. Options are “drop” or “forward_fill”.
- Returns:
A function that handles missing data accordingly.
- Return type:
Callable[[pd.DataFrame], pd.DataFrame]
- Raises:
ValueError – If an unknown strategy is provided.
- class timeseries_compute.data_processor.StationaryReturnsProcessor
Bases: object
A class to process and test the stationarity of time series data.
- make_stationary(data: pd.DataFrame, method: str) -> pd.DataFrame: Apply the chosen method to make the data stationary.
- test_stationarity(data: pd.DataFrame, test: str) -> Dict[str, Dict[str, float]]: Perform the Augmented Dickey-Fuller test to check for stationarity.
- log_adf_results(data: Dict[str, Dict[str, float]], p_value_threshold: float) -> None: Log the interpreted results of the ADF test.
- log_adf_results(data: Dict[str, Dict[str, float]], p_value_threshold: float = 0.05) -> None
Logs interpreted Augmented Dickey-Fuller (ADF) test results.
- Parameters:
data (Dict[str, Dict[str, float]]) – A dictionary where keys are series names and values are dictionaries containing ADF test results. Each value dictionary should have the keys “ADF Statistic” and “p-value”.
p_value_threshold (float, optional) – The threshold for the p-value to determine if the series is stationary. Defaults to 0.05.
- Returns:
None
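How such a results dictionary might be interpreted can be sketched without any statistics library. The helper name `interpret_adf`, the log message wording, and the numbers in `example` are all made up for illustration; only the keys "ADF Statistic" and "p-value" come from the parameter description above.

```python
import logging
from typing import Dict

logger = logging.getLogger(__name__)


def interpret_adf(
    results: Dict[str, Dict[str, float]],
    p_value_threshold: float = 0.05,
) -> Dict[str, bool]:
    """Log a verdict per series and return True where the series looks stationary."""
    verdicts = {}
    for name, stats in results.items():
        # A small p-value rejects the unit-root null, i.e. suggests stationarity.
        stationary = stats["p-value"] < p_value_threshold
        verdicts[name] = stationary
        logger.info(
            "%s: ADF=%.4f, p=%.4f -> %s",
            name,
            stats["ADF Statistic"],
            stats["p-value"],
            "stationary" if stationary else "non-stationary",
        )
    return verdicts


# Illustrative numbers only, not output from a real test run.
example = {
    "prices": {"ADF Statistic": -1.2, "p-value": 0.67},
    "returns": {"ADF Statistic": -9.8, "p-value": 0.0001},
}
verdicts = interpret_adf(example)
print(verdicts)
```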
- make_stationary(data: DataFrame, method: str = 'difference') -> DataFrame
Apply the chosen method to make the data stationary.
- Parameters:
data (pd.DataFrame) – The input data to be made stationary.
method (str, optional) – The method to use for making the data stationary. Currently supported method is “difference”. Defaults to “difference”.
- Returns:
The transformed data with the applied stationarity method.
- Return type:
pd.DataFrame
- Raises:
ValueError – If an unknown method is provided.
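First-order differencing, the one supported method, replaces each value with its change from the previous observation. The sketch below uses `DataFrame.diff` and drops the leading NaN row; whether the module also drops that row is an assumption.

```python
import pandas as pd

prices = pd.DataFrame(
    {"price": [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range("2024-01-01", periods=4),
)

# x_t - x_{t-1}; the first row has no predecessor and becomes NaN.
differenced = prices.diff().dropna()
print(differenced)
```

The differenced frame is one row shorter than the input, which matters when aligning it with other series later.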
- test_stationarity(data: DataFrame, test: str = 'adf') -> Dict[str, Dict[str, float]]
Perform the Augmented Dickey-Fuller (ADF) test for stationarity on the given data.
The null hypothesis (H0) is that the series is non-stationary (has a unit root). The alternative hypothesis (H1) is that the series is stationary.
- Parameters:
data (pd.DataFrame) – The input data containing time series to be tested.
test (str, optional) – The type of stationarity test to perform. Currently, only “adf” is supported. Defaults to “adf”.
- Returns:
A dictionary where keys are column names and values are dictionaries containing the ADF Statistic and p-value for each numeric column in the input data.
- Return type:
Dict[str, Dict[str, float]]
- Raises:
ValueError – If an unsupported stationarity test is specified.
- class timeseries_compute.data_processor.StationaryReturnsProcessorFactory
Bases: object
Factory class for creating handlers for stationary returns processing strategies.
- create_handler(strategy: str) -> Callable: Returns the appropriate processing function based on the provided strategy.
- static create_handler(strategy: str) -> StationaryReturnsProcessor
Returns the appropriate processing function based on the provided strategy.
- Parameters:
strategy (str) – The name of the strategy for which the processing function is to be created. Supported strategies are:
- “transform_to_stationary_returns”
- “test_stationarity”
- “log_stationarity”
- Returns:
A processor instance for the specified strategy.
- Return type:
StationaryReturnsProcessor
- Raises:
ValueError – If an unknown strategy is provided.
- timeseries_compute.data_processor.calculate_ewma_covariance(series1: Series, series2: Series, lambda_val: float = 0.95) -> Series
Calculate Exponentially Weighted Moving Average covariance between two series.
- Parameters:
series1 – First time series
series2 – Second time series
lambda_val – Decay factor; the thesis uses 0.95 or 0.97. Defaults to 0.95.
- Returns:
Series of EWMA covariances
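One common form of the EWMA covariance is the RiskMetrics-style recursion cov_t = lambda * cov_{t-1} + (1 - lambda) * x_t * y_t, which treats both series as zero-mean. The sketch below implements that recursion; the helper name `ewma_covariance`, the seeding with the first cross-product, and the exact timing convention are assumptions, and the module's implementation may differ.

```python
import pandas as pd


def ewma_covariance(
    s1: pd.Series, s2: pd.Series, lambda_val: float = 0.95
) -> pd.Series:
    """RiskMetrics-style EWMA covariance of two non-empty, zero-mean series."""
    products = (s1 * s2).to_numpy(dtype=float)
    cov = [products[0]]  # assumption: seed the recursion with the first cross-product
    for p in products[1:]:
        cov.append(lambda_val * cov[-1] + (1.0 - lambda_val) * p)
    return pd.Series(cov, index=s1.index)


a = pd.Series([0.01, -0.02, 0.03])
b = pd.Series([0.02, 0.01, -0.01])
cov = ewma_covariance(a, b, lambda_val=0.5)
print(cov)
```

A larger `lambda_val` gives older observations more weight, so the estimate reacts more slowly to new data.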
- timeseries_compute.data_processor.calculate_ewma_volatility(series: Series, lambda_val: float = 0.95) -> Series
Calculate Exponentially Weighted Moving Average volatility for a series.
- Parameters:
series – Time series
lambda_val – Decay factor
- Returns:
Series of EWMA volatilities
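EWMA volatility is typically the square root of an exponentially weighted average of squared observations. A minimal sketch using pandas' `ewm` with `adjust=False`, which yields the recursion var_t = lambda * var_{t-1} + (1 - lambda) * x_t^2, is shown below. The helper name `ewma_volatility` and the zero-mean assumption are illustrative, not taken from the module.

```python
import numpy as np
import pandas as pd


def ewma_volatility(series: pd.Series, lambda_val: float = 0.95) -> pd.Series:
    """Square root of the EWMA of squared values (zero-mean assumption)."""
    # alpha = 1 - lambda maps the decay factor onto pandas' smoothing factor.
    variance = series.pow(2).ewm(alpha=1.0 - lambda_val, adjust=False).mean()
    return np.sqrt(variance)


returns = pd.Series([0.03, 0.04])
vol = ewma_volatility(returns, lambda_val=0.5)
print(vol)
```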
- timeseries_compute.data_processor.fill_data(df: DataFrame, strategy: str = 'forward_fill') -> DataFrame
Fills missing data in the given DataFrame according to the specified strategy.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the data to be processed.
strategy (str, optional) – Strategy for handling missing values. Options are “drop” or “forward_fill”. Defaults to “forward_fill”.
- Returns:
The DataFrame with missing values handled according to the specified strategy.
- Return type:
pd.DataFrame
- timeseries_compute.data_processor.log_stationarity(adf_results: Dict[str, Dict[str, float]], p_value_threshold: float = 0.05) -> None
Logs the interpreted results of an Augmented Dickey-Fuller (ADF) test, given the output of the test_stationarity function.
- Parameters:
adf_results (Dict[str, Dict[str, float]]) – Results from test_stationarity function.
p_value_threshold (float, optional) – The p-value threshold for the ADF test. Defaults to 0.05.
- Returns:
None
- timeseries_compute.data_processor.prepare_timeseries_data(df: DataFrame) -> DataFrame
Prepares time series data for analysis by:
1. Converting date column to datetime and setting as index (if not already)
2. Ensuring numeric columns are properly typed
3. Removing non-numeric columns
- Parameters:
df (pd.DataFrame) – Input DataFrame with time series data
- Returns:
Properly formatted DataFrame for time series analysis
- Return type:
pd.DataFrame
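The three preparation steps can be sketched as follows. The helper name `prepare` and the default column name "Date" are assumptions; the documentation does not say how the module locates the date column.

```python
import pandas as pd


def prepare(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Set a datetime index, coerce numeric-looking columns, drop the rest."""
    out = df.copy()

    # 1. Convert the date column and use it as the index, if not already done.
    if not isinstance(out.index, pd.DatetimeIndex) and date_col in out.columns:
        out[date_col] = pd.to_datetime(out[date_col])
        out = out.set_index(date_col)

    # 2. Convert columns that hold numbers stored as strings.
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # not numeric; removed in the next step

    # 3. Keep only numeric columns.
    return out.select_dtypes(include="number")


raw = pd.DataFrame(
    {
        "Date": ["2024-01-01", "2024-01-02"],
        "price": ["100.5", "101.0"],
        "ticker": ["AAA", "AAA"],
    }
)
clean = prepare(raw)
print(clean)
```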
- timeseries_compute.data_processor.price_to_returns(prices: DataFrame) -> DataFrame
Convert prices to log returns, similar to MATLAB’s price2ret function.
- Parameters:
prices – DataFrame of price series
- Returns:
DataFrame of log returns with Date as index
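Log returns are the differences of log prices, r_t = ln(P_t / P_{t-1}). A minimal sketch, assuming the prices are already indexed by date and that the first row (which has no predecessor) is dropped:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"A": [100.0, 110.0, 121.0]},
    index=pd.date_range("2024-01-01", periods=3, name="Date"),
)

# ln(P_t / P_{t-1}); the first row is NaN and is removed.
log_returns = np.log(prices / prices.shift(1)).dropna()
print(log_returns)
```

Both returns here equal ln(1.1), because each price is 10% above the previous one. Log returns add across time, which is one reason they are preferred over simple returns for modeling.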
- timeseries_compute.data_processor.scale_data(df: DataFrame, method: str = 'standardize') -> DataFrame
Scales the input DataFrame according to the specified method.
- Parameters:
df (pd.DataFrame) – The input data to be scaled.
method (str, optional) – Scaling method to use. Options are “standardize” or “minmax”. Defaults to “standardize”.
- Returns:
The scaled DataFrame.
- Return type:
pd.DataFrame
- timeseries_compute.data_processor.scale_for_garch(df: DataFrame, target_scale: float = 10.0) -> DataFrame
Scale data to appropriate range for GARCH modeling.
Adaptively scales data to bring it into the optimal range (1-1000) for GARCH parameter estimation.
This is crucial because GARCH models can be sensitive to the scale of input data.
- Parameters:
df (pd.DataFrame) – Input data
target_scale (float) – Target scale to achieve
- Returns:
Scaled data
- Return type:
pd.DataFrame
- timeseries_compute.data_processor.stationarize_data(df: DataFrame, method: str = 'difference') -> DataFrame
Processes the given DataFrame to make the data stationary.
- Parameters:
df (pd.DataFrame) – The input data to be made stationary.
method (str, optional) – Method to use for making data stationary. Currently only “difference” is supported. Defaults to “difference”.
- Returns:
The stationary version of the input data.
- Return type:
pd.DataFrame
- timeseries_compute.data_processor.test_stationarity(df: DataFrame, method: str = 'adf') -> Dict[str, Dict[str, float]]
Tests the stationarity of a given DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame containing the data to be tested for stationarity.
method (str, optional) – Method to use for testing stationarity. Currently only “adf” is supported. Defaults to “adf”.
- Returns:
Results of the stationarity test.
- Return type:
Dict[str, Dict[str, float]]