data_generator module
Time Series Data Generation Module.
This module provides functionality for generating synthetic price series data with controlled statistical properties. It’s designed as the first step in a typical time series analysis pipeline, creating test data with known characteristics.
Key Components:
- PriceSeriesGenerator: Class for generating correlated price series
- generate_price_series: Convenience function with simplified interface
- set_random_seed: Function to ensure reproducible results
Typical Usage Flow:
1. Create a PriceSeriesGenerator instance with the desired date range
2. Generate price series with specific initial values and correlations
3. Proceed with the generated data to data_processor.py for preparation
The generated price series follow a random walk with drift, with options to control cross-series correlations.
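To make the "random walk with drift" description concrete, here is a minimal sketch of one common way to produce correlated walks: draw independent normal shocks, mix them with a Cholesky factor of the correlation matrix, and accumulate. This is not the module's actual implementation. The helper name `correlated_random_walks` and its `drift`, `scale`, and `seed` parameters are assumptions made for illustration.

```python
import numpy as np


def correlated_random_walks(initial_prices, corr, n_steps, drift=0.0, scale=1.0, seed=None):
    """Hypothetical helper: additive random walks with drift and correlated steps."""
    rng = np.random.default_rng(seed)
    corr = np.asarray(corr, dtype=float)
    # Cholesky factor L satisfies L @ L.T == corr (corr must be positive definite).
    chol = np.linalg.cholesky(corr)
    # Independent standard-normal shocks: one row per step, one column per series.
    shocks = rng.standard_normal((n_steps, len(initial_prices)))
    # Multiplying by L.T gives each row of steps the target correlation structure.
    steps = drift + scale * shocks @ chol.T
    start = np.asarray(initial_prices, dtype=float)
    # Prepend a zero row so the first price equals the initial price.
    path = np.vstack([np.zeros(len(start)), np.cumsum(steps, axis=0)])
    return start + path


prices = correlated_random_walks(
    [150.0, 250.0], [[1.0, 0.7], [0.7, 1.0]], n_steps=5000, drift=0.05, seed=0
)
```

The correlation applies to the step-to-step changes, not to the price levels themselves, so it is best checked on `np.diff(prices, axis=0)`.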
- class timeseries_compute.data_generator.PriceSeriesGenerator(start_date: str, end_date: str)
Bases: object
Generates a series of prices for given tickers over a specified date range.
- start_date
The start date of the price series in YYYY-MM-DD format.
- Type:
str
- end_date
The end date of the price series in YYYY-MM-DD format.
- Type:
str
- dates
A range of dates from start_date to end_date, including only weekdays.
- Type:
pd.DatetimeIndex
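As an illustration of a weekday-only date range, the sketch below builds one with pandas' business-day frequency. Whether the class constructs `dates` this way is an assumption. Note that `freq="B"` skips weekends but does not exclude public holidays.

```python
import pandas as pd

# Business-day frequency ("B") keeps Monday through Friday only.
dates = pd.date_range(start="2023-01-01", end="2023-01-31", freq="B")

# dayofweek is 0 for Monday through 6 for Sunday, so weekdays are all below 5.
all_weekdays = bool((dates.dayofweek < 5).all())
```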
- __init__(start_date: str, end_date: str): Initializes the PriceSeriesGenerator with the given date range.
- generate_correlated_prices(anchor_prices: dict, correlation_matrix: Optional[Dict[Tuple[str, str], float]] = None) -> Dict[str, list]: Generates a series of correlated prices for the given tickers with initial prices.
Create price series for given tickers with initial prices and correlations.
- Parameters:
anchor_prices (Dict[str, float]) – Dictionary where keys are ticker symbols (e.g., 'AAPL', 'MSFT') and values are their respective initial prices.
correlation_matrix (Dict[Tuple[str, str], float], optional) – Dictionary specifying correlations between ticker pairs. Each key should be a tuple of two ticker symbols (e.g., ('AAPL', 'MSFT')), and each value should be the desired correlation coefficient between -1.0 and 1.0. For example: {('AAPL', 'MSFT'): 0.7, ('AAPL', 'GOOG'): 0.5, ('MSFT', 'GOOG'): 0.6}. If None, a default correlation of 0.6 will be used for all pairs.
- Returns:
Dictionary where keys are ticker symbols and values are lists containing the generated price series for each ticker.
- Return type:
Dict[str, list]
Example
>>> generator = PriceSeriesGenerator(start_date="2023-01-01", end_date="2023-01-31")
>>> anchor_prices = {"AAA": 150.0, "BBB": 250.0}
>>> correlations = {("AAA", "BBB"): 0.7}
>>> prices = generator.generate_correlated_prices(anchor_prices, correlations)
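The pair-keyed correlation dictionary can be expanded into a full symmetric matrix before any numerical work. The sketch below shows one way to do that. The helper `build_correlation_matrix` is hypothetical, and filling unlisted pairs with the 0.6 default even when a partial dictionary is supplied is an assumption, since the documentation only states the default for the None case.

```python
import numpy as np


def build_correlation_matrix(tickers, pairs=None, default=0.6):
    """Hypothetical helper: expand {(a, b): rho} into a symmetric matrix."""
    n = len(tickers)
    index = {ticker: i for i, ticker in enumerate(tickers)}
    # Start from the default for every off-diagonal entry (assumed behavior).
    matrix = np.full((n, n), default, dtype=float)
    # Every series is perfectly correlated with itself.
    np.fill_diagonal(matrix, 1.0)
    for (a, b), rho in (pairs or {}).items():
        if not -1.0 <= rho <= 1.0:
            raise ValueError(f"correlation for {(a, b)} must be in [-1.0, 1.0]")
        i, j = index[a], index[b]
        # Write both triangles so the key order in the tuple does not matter.
        matrix[i, j] = matrix[j, i] = rho
    return matrix


matrix = build_correlation_matrix(["AAA", "BBB", "CCC"], {("AAA", "BBB"): 0.7})
```

A matrix assembled from arbitrary pairwise values is not guaranteed to be positive semi-definite, so it is worth validating before using it to generate data.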
- timeseries_compute.data_generator.generate_price_series(start_date: str = '2023-01-01', end_date: str = '2023-12-31', anchor_prices: Dict[str, float] | None = None, random_seed: int | None = None, correlations: Dict[Tuple[str, str], float] | None = None) -> Tuple[Dict[str, list], DataFrame]
Generates a series of price data based on the provided parameters.
The function returns both a dictionary and a DataFrame, so callers can use whichever form suits them without a separate conversion step.
- Parameters:
start_date (str, optional) – The start date for the price series. Defaults to "2023-01-01".
end_date (str, optional) – The end date for the price series. Defaults to "2023-12-31".
anchor_prices (Dict[str, float], optional) – A dictionary of tickers and their initial prices. Defaults to {"GME": 100.0, "BYND": 200.0} if None.
random_seed (int, optional) – Seed for random number generation. If provided, overrides the module-level seed.
correlations (Dict[Tuple[str, str], float], optional) – Dictionary specifying correlations between ticker pairs.
- Returns:
A tuple containing the generated prices as a dictionary and as a DataFrame.
- Return type:
Tuple[Dict[str, list], pd.DataFrame]
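To show how the two return forms can relate, here is a small sketch that turns a ticker-to-list dictionary into a date-indexed DataFrame. The helper `prices_to_frame` is hypothetical, and the layout it produces (weekday index, one column per ticker) is an assumption, because the documentation does not describe the structure of the returned DataFrame.

```python
import pandas as pd


def prices_to_frame(price_dict, start_date, end_date):
    """Hypothetical helper: one column per ticker, indexed by weekday dates."""
    dates = pd.date_range(start=start_date, end=end_date, freq="B")
    # pandas raises ValueError if a list's length differs from the index length.
    return pd.DataFrame(price_dict, index=dates)


# Hand-written sample values, not output of the real generator.
sample = {
    "GME": [100.0, 100.5, 99.8, 101.2, 102.0],
    "BYND": [200.0, 199.1, 201.4, 202.3, 200.9],
}
frame = prices_to_frame(sample, "2023-01-02", "2023-01-06")
```

The dictionary form is convenient for plain Python loops, while the DataFrame form suits date-based slicing and column-wise operations such as `frame.pct_change()`.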