polygon-io/client-python · Commit 8034ba4
Add Hunting Anomalies in the Stock Market scripts (#780)
* Add Hunting Anomalies in the Stock Market scripts
* Fix lint
* Ignore type and fix typo
* Fix type def from linter
* Removed json dump
1 parent 53808f8 commit 8034ba4

File tree

5 files changed: +506 -0 lines changed
README.md (+49)

@@ -0,0 +1,49 @@
# Hunting Anomalies in the Stock Market

This repository contains all the necessary scripts and data directories used in the [Hunting Anomalies in the Stock Market](https://polygon.io/blog/hunting-anomalies-in-stock-market/) tutorial, hosted on Polygon.io's blog. The tutorial demonstrates how to detect statistical anomalies in historical US stock market data through a comprehensive workflow that involves downloading data, building a lookup table, querying for anomalies, and visualizing them through a web interface.

### Prerequisites

- Python 3.8+
- Access to Polygon.io's historical data via Flat Files
- An active Polygon.io API key, obtainable by signing up for a Stocks paid plan
### Repository Contents

- `README.md`: This file, outlining setup and execution instructions.
- `aggregates_day`: Directory where downloaded CSV data files are stored.
- `build-lookup-table.py`: Python script to build a lookup table from the historical data.
- `query-lookup-table.py`: Python script to query the lookup table for anomalies.
- `gui-lookup-table.py`: Python script for a browser-based interface to explore anomalies visually.
### Running the Tutorial

1. **Ensure Python 3.8+ is installed:** Check your Python version and ensure the third-party libraries (`polygon-api-client` and `pandas`) are installed; `pickle` and `argparse` ship with the Python standard library. A quick check is sketched below.
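A minimal way to run that check from Python, if you prefer it over checking by hand (the snippet itself is not part of the tutorial scripts):
```python
import sys

# The tutorial requires Python 3.8+.
if sys.version_info < (3, 8):
    raise SystemExit(f"Python 3.8+ required, found {sys.version.split()[0]}")

# Third-party dependencies; install with: pip install polygon-api-client pandas
import pandas
import polygon

# Standard-library modules used by the scripts; nothing to install.
import argparse
import pickle

print("Environment looks good.")
```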
2. **Set up your API key:** Make sure you have an active paid Polygon.io Stocks subscription for accessing Flat Files. Set up your API key in your environment or directly in the scripts where required, for example as sketched below.
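A minimal sketch of reading the key from the environment; the variable name `POLYGON_API_KEY` is an assumption, and the client itself is only needed by scripts that call the REST API:
```python
import os

from polygon import RESTClient  # from the polygon-api-client package

# POLYGON_API_KEY is an assumed variable name; adjust it to match
# however you export the key in your shell.
api_key = os.environ.get("POLYGON_API_KEY")
if api_key is None:
    raise SystemExit("POLYGON_API_KEY is not set")

client = RESTClient(api_key)
```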
3. **Download Historical Data:** Use the MinIO client to download historical stock market data. Adjust the commands and paths based on the data you are interested in. A quick sanity check of the downloaded files is sketched after the commands.
```bash
mc alias set s3polygon https://files.polygon.io YOUR_ACCESS_KEY YOUR_SECRET_KEY
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/08/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/09/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/10/ ./aggregates_day/
gunzip ./aggregates_day/*.gz
```
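A quick way to confirm a downloaded file has the columns that `build-lookup-table.py` reads; the exact file name under `aggregates_day` is illustrative:
```python
import pandas as pd

# Peek at one gunzipped day file; pick any file you actually downloaded.
df = pd.read_csv("./aggregates_day/2024-10-18.csv")

# These are the columns build-lookup-table.py relies on.
print(df[["ticker", "window_start", "transactions", "close"]].head())
```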
4. **Build the Lookup Table:** This script processes the downloaded data and builds a lookup table, saving it as `lookup_table.pkl`.
```bash
python build-lookup-table.py
```
5. **Query Anomalies:** Replace `2024-10-18` with the date you want to analyze for anomalies.
```bash
python query-lookup-table.py 2024-10-18
```
6. **Run the GUI:** Start the server, then access the web interface at `http://localhost:8888` to explore the anomalies visually.
```bash
python gui-lookup-table.py
```
For a complete step-by-step guide on each phase of the anomaly detection process, including additional configurations and troubleshooting, refer to the detailed [tutorial on our blog](https://polygon.io/blog/hunting-anomalies-in-stock-market/).
aggregates_day/README.md (+1)

@@ -0,0 +1 @@
Download flat files into here.
build-lookup-table.py (+91)

@@ -0,0 +1,91 @@
import os
import pandas as pd  # type: ignore
from collections import defaultdict
import pickle
from typing import DefaultDict, Dict, Any, BinaryIO

# Directory containing the daily CSV files
data_dir = "./aggregates_day/"

# Initialize a dictionary to hold trades data
trades_data = defaultdict(list)

# List all CSV files in the directory
files = sorted([f for f in os.listdir(data_dir) if f.endswith(".csv")])

print("Starting to process files...")
# Process each file (assuming files are named in order)
for file in files:
    print(f"Processing {file}")
    file_path = os.path.join(data_dir, file)
    df = pd.read_csv(file_path)
    # For each stock, store the date and relevant data
    for _, row in df.iterrows():
        ticker = row["ticker"]
        date = pd.to_datetime(row["window_start"], unit="ns").date()
        trades = row["transactions"]
        close_price = row["close"]  # Ensure 'close' column exists in your CSV
        trades_data[ticker].append(
            {"date": date, "trades": trades, "close_price": close_price}
        )

print("Finished processing files.")
print("Building lookup table...")

# Now, build the lookup table with rolling averages and percentage price change
lookup_table: DefaultDict[str, Dict[str, Any]] = defaultdict(
    dict
)  # Nested dict: ticker -> date -> stats

for ticker, records in trades_data.items():
    # Convert records to DataFrame
    df_ticker = pd.DataFrame(records)
    # Sort records by date
    df_ticker.sort_values("date", inplace=True)
    df_ticker.set_index("date", inplace=True)

    # Calculate the percentage change in close_price
    df_ticker["price_diff"] = (
        df_ticker["close_price"].pct_change() * 100
    )  # Multiply by 100 for percentage
    # Shift trades to exclude the current day from rolling calculations
    df_ticker["trades_shifted"] = df_ticker["trades"].shift(1)
    # Calculate rolling average and standard deviation over the previous 5 days
    df_ticker["avg_trades"] = df_ticker["trades_shifted"].rolling(window=5).mean()
    df_ticker["std_trades"] = df_ticker["trades_shifted"].rolling(window=5).std()
    # Store the data in the lookup table
    for date, row in df_ticker.iterrows():
        # Convert date to string for serialization
        date_str = date.strftime("%Y-%m-%d")
        # Ensure rolling stats are available
        if pd.notnull(row["avg_trades"]) and pd.notnull(row["std_trades"]):
            lookup_table[ticker][date_str] = {
                "trades": row["trades"],
                "close_price": row["close_price"],
                "price_diff": row["price_diff"],
                "avg_trades": row["avg_trades"],
                "std_trades": row["std_trades"],
            }
        else:
            # Store data without rolling stats if not enough data points
            lookup_table[ticker][date_str] = {
                "trades": row["trades"],
                "close_price": row["close_price"],
                "price_diff": row["price_diff"],
                "avg_trades": None,
                "std_trades": None,
            }
print("Lookup table built successfully.")
83+
84+
# Convert defaultdict to regular dict for JSON serialization
85+
lookup_table_dict = {k: v for k, v in lookup_table.items()}
86+
87+
# Save the lookup table to a file for later use
88+
with open("lookup_table.pkl", "wb") as f: # type: BinaryIO
89+
pickle.dump(lookup_table_dict, f)
90+
91+
print("Lookup table saved to 'lookup_table.pkl'.")
