Pierian Data – Python for Finance & Algorithmic Trading
Course Notes
Anaconda
Jupyter Notebook system
.ipynb files
nbconvert library for file conversion
No spaces in variable names
Tuples ( , , ) are immutable whereas lists [ , , ] are mutable, so their elements can be reassigned
The set() function returns only the unique elements
!= inequality check
def my_func(): # to define a function
append() adds an element to the end of a list
s.lower() turns string into lower case
Many methods return a modified copy by default; pass inplace=True where supported to change the object itself
Python allows you to create anonymous functions, i.e. functions without names, using the
lambda keyword. Lambda functions are small, usually no more than a line. They can take any
number of arguments, just like a normal function, but the body consists of a single
expression. The value of that expression is the result when the lambda is applied to an
argument, so no return statement is needed.
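A minimal sketch of the equivalence between a normal function and a lambda:

def square(num):
    return num ** 2

square_lam = lambda num: num ** 2                 # same result, no name or return needed

print(square(4), square_lam(4))                   # 16 16
doubled = list(map(lambda x: x * 2, [1, 2, 3]))   # lambdas shine as throwaway arguments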
Numpy
np.linspace returns evenly spaced numbers over a specified interval
np.eye(2): identity matrix (1s on the diagonal, 0s elsewhere)
np.random.rand provides random numbers; the distribution depends on the function chosen (rand: uniform, randn: standard normal)
np.ones(10)*5 uses broadcasting to multiply each element by 5
a.reshape(3,3) errors when a does not hold exactly 9 elements: the total size must match the new shape (see the sketch below)
2D slice notation, e.g. mat[:2, 1:], is required to output a sub-matrix
mat.sum(axis=0) sums down each column
mat.sum(axis=1) sums across each row
np.random.seed(101) fixes the random state so the same numbers repeat; np.random.rand(1) then draws from that seeded sequence
Multiple large variable assignments consume RAM, so assign variables judiciously.
Conditional selection with a Boolean mask, e.g. arr[arr > 0.5], returns only the matching elements
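A short sketch tying the NumPy notes above together (seed, rand, reshape, axis sums, conditional selection):

import numpy as np

np.random.seed(101)        # fixes the random state so the same numbers repeat
arr = np.random.rand(9)    # 9 uniform random numbers in [0, 1)
mat = arr.reshape(3, 3)    # works because 9 elements fit a 3x3 shape
# arr.reshape(3, 4) would raise an error: 9 elements cannot fill 12 slots

print(mat.sum(axis=0))     # column sums
print(mat.sum(axis=1))     # row sums
print(arr[arr > 0.5])      # conditional selection keeps only the matching elements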
Pandas
pandas (from "panel data") is an open-source library created by Wes McKinney
Series are like arrays except they can be given a labelled index, e.g. a datetime index
Pandas series can also hold functions
Axis = 1 refers to columns
df.drop() default axis set to zero
Conditional statements return Boolean data frames
Python's and operator can only process single Boolean values (True/False); use & for element-wise comparisons
For an element-wise or statement use the pipe operator |
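A sketch of element-wise conditional selection (the DataFrame and column names 'A'/'B' are assumptions for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])

df[df['A'] > 0]                       # a Boolean Series filters the rows
df[(df['A'] > 0) & (df['B'] < 0)]     # element-wise and: use &, not the and keyword
df[(df['A'] > 0) | (df['B'] < 0)]     # element-wise or: use the pipe |
# note that each condition needs its own parentheses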
df.set_index() turns a column into the DataFrame's index
Multi-level index calling: df.loc['G1'].loc[1] (nested indexing)
df.loc['G2'].loc[2]['B'] (specifies a particular column B)
df.xs(1, level='Num') returns a cross-section of every row where the index level 'Num' equals 1
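A sketch of multi-level indexing (the index names 'Group'/'Num' are assumptions for illustration):

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)],
                                  names=['Group', 'Num'])
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['A', 'B'])

df.loc['G1'].loc[1]        # nested indexing: outer level, then inner level
df.loc['G2'].loc[2]['B']   # then pick out column B
df.xs(1, level='Num')      # cross-section: every row where Num == 1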
Dealing with missing data:
df.dropna() by default drops any rows with null values; df.dropna(axis=1) will drop any columns
with null values
df.fillna(value='fill value') replaces null values with a given value
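A sketch of the missing-data methods above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan]})

df.dropna()                    # drops any row containing a null value
df.dropna(axis=1)              # drops any column containing a null value
df.fillna(value='fill value')  # replaces nulls with a supplied value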
Groupby allows for aggregation based off a column
Finding unique values in a data frame: df['colname'].unique() returns the unique values as an array
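A sketch of groupby aggregation and the unique-value helpers (data invented for illustration):

import pandas as pd

df = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT'],
                   'Sales': [200, 120, 340]})

df.groupby('Company').mean()   # aggregates Sales per company
df['Company'].unique()         # array of the unique values
df['Company'].nunique()        # count of the unique values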
Data I/O: csv, excel, html, sql files can be linked
to_csv writes a DataFrame to file; similarly to_excel('file name.xlsx', sheet_name='newsheet')
pandas imports only the data from a spreadsheet, not macros or formulas
DataFrame.columns (returns column names)
DataFrame['column name'].nunique() returns the number of unique values in a certain
column; .unique() returns the values themselves
banks.groupby("ST").count().sort_values('Bank Name',ascending=False).iloc[:5]['Bank Name']
o uses a nested iloc call to return a subset of values and the sort_values function to rank
o the .count() method counts the rows in each group
banks['Acquiring Institution'].value_counts()
o the value_counts() method counts the number of times a certain value occurs in a
specified Data Frame column
banks[banks['Acquiring Institution']=='State Bank of Texas']
o nesting a Boolean condition inside the DataFrame selector selectively pulls the entries
that fulfil a criterion, here the entries where the acquiring institution was the State
Bank of Texas
banks[banks['ST']=='CA'].groupby('City').count().sort_values('Bank Name',ascending=False).iloc[:1]
o can count within a particular subset, such as the number of banks per city within a
particular state
sum(banks['Bank Name'].apply(lambda name: 'Bank' not in name))
o counts the bank names that do not contain the word 'Bank'
sum(banks['Bank Name'].apply(lambda name: name[0].upper() == 'S'))
o counts the banks whose names start with a capital S; name[0] returns the first character
of name
sum(banks['Bank Name'].apply(lambda name: len(name.split())==2))
o the split() function splits a string on the specified separator; if none is given it splits on
whitespace
o the len function then checks whether exactly 2 name words were returned
Matplotlib and Pandas – Visualisation
Matplotlib has 2 API structures
o Object oriented structure & Function oriented structure
o Gallery has good examples
Function oriented
o import matplotlib.pyplot as plt
o %matplotlib inline (for jupyter workbook)
o plt.plot(x,y,'b')
o plt.title('my first python chart')
o plt.xlabel('Jerrod')
o plt.ylabel('prpasfqf')
Object oriented
o fig = plt.figure()
o axes = fig.add_axes([0.1,0.1,0.8,0.8])
o axes.plot(x,y,'b')
o axes.set_xlabel('Set X Label') # Notice the use of set_ to begin methods
o axes.set_ylabel('Set y Label')
o axes.set_title('Set Title')
Matplotlib allows the aspect ratio, DPI and figure size to be specified when the Figure object
is created, using the figsize and dpi keyword arguments.
o figsize is a tuple of the width and height of the figure in inches
o dpi is the dots-per-inch (pixels per inch)
For example, a minimal sketch creating a figure with an explicit size and resolution:
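import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4), dpi=100)   # 8x4 inches rendered at 100 dots per inch
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot([1, 2, 3], [4, 5, 6])               # plot values chosen purely for illustration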
ax2.set_xlim(20,22) & ax2.set_ylim(30,50). With these commands you can set the x & y
limits for each axis to zoom in on a section of a plot
axes[0].plot(x,y,color="red",lw=5,ls=":"). Can be used to adjust the style and colour of the
chart
fig,axes=plt.subplots(1,2,figsize=(12,2)). The figsize argument lets the user set the size of the
figure
Pandas Visualisation
import numpy as np; import pandas as pd; %matplotlib inline
import matplotlib.pyplot as plt
There are several plot types built-in to pandas, most of them statistical plots by nature:
df.plot.area
df.plot.barh
df.plot.density
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.bar
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.pie
You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms
shown in the list above (e.g. 'box','barh', etc..)
df.index returns the index of a DataFrame
%matplotlib notebook (makes the plot interactive)
df3['a'].plot.hist(color="blue",bins=100,alpha=0.5)
df3[['a','b']].plot.box()
Data Sources
pandas-datareader (pulls remote data, e.g. from Google's stock API)
Quandl (robust Python API). A user key is needed for more than 50 calls a day
Yahoo and google have changed their APIs and are sometimes unstable. Use the codes "iex" or
"morningstar" instead
Calls to commence pandas-datareader:
import pandas_datareader.data as web
import datetime
datetime.datetime(2015,1,1) creates a datetime object
E.g. facebook = web.DataReader('FB','google',start,end)
Returns the data as a DataFrame
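Putting those calls together (a sketch; 'google' is the course's source, with 'iex' or 'morningstar' as the suggested fallbacks):

import datetime
import pandas_datareader.data as web

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2017, 1, 1)
facebook = web.DataReader('FB', 'google', start, end)   # returned as a DataFrame
facebook.head()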
For Options:
from pandas_datareader.data import Options
fb_options = Options('FB','google')
Quandl
import quandl
mydata = quandl.get('EIA/PET_RWTC_D')
Pandas with Time Series Data
DateTime index, Time Resampling, Time Shifts, Rolling & Expanding
from datetime import datetime (Python's built-in datetime library). Times default to 0 hrs 0 mins
df['name'] = pd.to_datetime(df['name'])
df.resample(rule='A').mean() computes an annual mean of the data set
df.tshift() shifts the datetime index by a given time period frequency
A moving average or rolling mean can be computed using the df.rolling object
(df.rolling(7).mean().head(14))
Plotting a 30 day MA on stock price would use the following
df.rolling(window=30).mean()['Close'].plot()
The expanding() object returns the cumulative average, e.g. df.expanding().mean()
Bollinger Bands
Volatility bands placed above and below the moving average line
20-day means are typically used
.std() returns the standard deviation of data set
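A sketch of classic Bollinger Bands, a 20-day mean with bands two standard deviations above and below (df is a price DataFrame with a 'Close' column, an assumption here):

df['Close 20 Day Mean'] = df['Close'].rolling(20).mean()
df['Upper'] = df['Close 20 Day Mean'] + 2 * df['Close'].rolling(20).std()
df['Lower'] = df['Close 20 Day Mean'] - 2 * df['Close'].rolling(20).std()
df[['Close', 'Close 20 Day Mean', 'Upper', 'Lower']].plot(figsize=(12, 6))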
Capstone Project: Stock Market
Ford['Volume'].idxmax() returns the index value of the maximum value in the 'Volume' column
gm['MA50'] = gm['Open'].rolling(50).mean() and gm['MA200'] = gm['Open'].rolling(200).mean() compute moving averages
df.index = pd.to_datetime(df.index). Converts index to datetime object
car_comp = pd.concat([tesla['Open'],gm['Open'],ford['Open']],axis=1)
Candlestick Chart Code
from matplotlib.finance import candlestick_ohlc
from matplotlib.dates import DateFormatter, date2num, WeekdayLocator, DayLocator, MONDAY
# Reset the index to get a column of January dates
ford_reset = ford.loc['2012-01':'2012-01'].reset_index()
# Create a new column of numerical "date" values for matplotlib to use
ford_reset['date_ax'] = ford_reset['Date'].apply(lambda date: date2num(date))
ford_values = [tuple(vals) for vals in ford_reset[['date_ax', 'Open', 'High', 'Low', 'Close']].values]
mondays = WeekdayLocator(MONDAY) # major ticks on the mondays
alldays = DayLocator() # minor ticks on the days
weekFormatter = DateFormatter('%b %d') # e.g., Jan 12
dayFormatter = DateFormatter('%d') # e.g., 12
#Plot it
fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.2)
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
candlestick_ohlc(ax, ford_values, width=0.6, colorup='g',colordown='r');
gm['returns'] = gm['Close'].pct_change(1). This method computes the percent change from
one value to the next in a given column
gm['returns'].plot(kind="kde",label="GM"). This plots a density distribution (KDE) curve
box_df=pd.concat([tesla['returns'],ford['returns'],gm['returns']],axis=1)
box_df.columns=['Tesla Returns','Ford Returns','GM Returns']
box_df.plot(kind="box",figsize=(8,11)). This plots a box plot of the data set based on 3
columns.
Statsmodels
Statistics
ETS Models: refer to Error Trend Seasonality models that will take each of those terms for
smoothing. It breaks up data into the following components:
o Trend
o Seasonality
o Residual
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(airline['Thousands of Passengers'],model='multiplicative')
result.plot()
EWMA Models (Exponentially Weighted Moving Average): reduce time lag; more weight is
applied to more recent values
EWMA
Exponentially-weighted moving average
We just showed how to calculate the SMA based on some window. However, basic SMA has some
weaknesses:
• Smaller windows will lead to more noise, rather than signal
• It will always lag by the size of the window
• It will never reach the full peak or valley of the data due to the averaging
• It does not really inform you about possible future behaviour; all it really does is describe trends in
your data
• Extreme historical values can skew your SMA significantly
To help fix some of these issues, we can use an EWMA (exponentially-weighted moving average).
EWMA allows us to reduce the lag effect from SMA and puts more weight on values that occurred
more recently (thus the name). The amount of weight applied to the most recent values depends on
the actual parameters used in the EWMA and the number of periods in the window. Full details on
the mathematics can be found in the pandas documentation; here is the shorter version of the
explanation behind EWMA.
The formula for EWMA is:
$$y_t = \frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i}$$
where $x_t$ is the input value, $w_i$ is the applied weight (note how it can change from $i=0$ to $t$), and
$y_t$ is the output.
Now the question is, how do we define the weight term $w_i$? This depends on the adjust parameter
you provide to the .ewm() method. When adjust is True (the default), weighted averages are
calculated using the weights $w_i = (1-\alpha)^i$.
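A sketch applying .ewm() to a price column (df and the span of 20 are assumptions for illustration):

df['EWMA-20'] = df['Close'].ewm(span=20).mean()   # exponentially-weighted moving average
df[['Close', 'EWMA-20']].plot(figsize=(12, 6))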
ARIMA
The general process for ARIMA models is the following:
Visualize the Time Series Data
Make the time series data stationary
Plot the Correlation and Auto Correlation Charts
Construct the ARIMA Model
Use the model to make predictions
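A sketch of the last two steps with statsmodels (df['col'] and the order (1, 1, 1) are placeholders; the import path varies with the statsmodels version):

from statsmodels.tsa.arima.model import ARIMA   # older versions: statsmodels.tsa.arima_model

model = ARIMA(df['col'], order=(1, 1, 1))       # order = (p, d, q)
results = model.fit()
print(results.summary())
forecast = results.forecast(steps=12)           # predict the next 12 periods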
ARIMA is a generalisation of the ARMA (autoregressive moving average) model
o Seasonal & Non-seasonal ARIMA
o Non-seasonal ARIMA takes three terms: p, d, q
o p: autoregression (AR)
o d: integration/differencing (I)
o q: moving average (MA)
o Stationary data has constant mean & variance over time (the covariance should not be
a function of time)
There are tests for stationarity in data. A common one is the Augmented Dickey Fuller Test
Differencing (first order: the change from one period to the next; second order: differencing the
first difference again; each pass sacrifices one row of data)
For seasonal data, you could difference by 12 periods rather than 1 (the .shift(12) method does that)
Stationarity Tests
Uses the Augmented Dickey Fuller Test with Statsmodels
The null hypothesis is that the series is not stationary (it has a unit root)
Unit root test
P value returned dictates whether it’s a stationary or non-stationary data set
In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis
that a unit root is present in a time series sample. The alternative hypothesis is different
depending on which version of the test is used, but is usually stationarity or trend-stationarity.
Basically, we are deciding whether to accept the null hypothesis H0 (that the time series has a
unit root, indicating it is non-stationary) or reject H0 in favour of the alternative hypothesis (that
the time series has no unit root and is stationary).
We end up deciding this based on the p-value returned.
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis,
so you reject the null hypothesis.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail
to reject the null hypothesis.
Let's run the Augmented Dickey-Fuller test on our data:
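A minimal sketch (df['col'] is a placeholder for the series under test):

from statsmodels.tsa.stattools import adfuller

result = adfuller(df['col'])          # df['col'] stands in for the time series being tested
print('ADF Statistic:', result[0])
print('p-value:', result[1])
# p-value <= 0.05: reject H0 -> stationary
# p-value > 0.05: fail to reject H0 -> non-stationary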
Differencing
The first difference of a time series is the series of changes from one period to the next (pandas can
handle this)
Can continue to take the 2nd & 3rd difference and keep going until the data reaches stationarity
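A sketch of first, second and seasonal differencing in pandas (df['col'] again a placeholder):

df['First Difference'] = df['col'] - df['col'].shift(1)
df['Second Difference'] = df['First Difference'] - df['First Difference'].shift(1)
df['Seasonal Difference'] = df['col'] - df['col'].shift(12)   # e.g. 12 periods for monthly data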
ACF & PACF
Autocorrelation plot shows the correlation of the series with itself lagged by x time units
These plots are usually run on the differenced/stationary data
The ACF plot will also determine whether AR or MA terms should be used in the ARIMA
model
o if the ACF plot shows positive autocorrelation at the first lag (lag-1), AR terms are
suggested
o if the ACF plot shows negative autocorrelation at the first lag (lag-1), MA terms are
suggested
Partial Autocorrelation
Partial autocorrelation is a conditional correlation between 2 variables under the
assumption that we know and account for the values of some other set of variables
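A sketch of the ACF/PACF plots in statsmodels, run on the differenced series from above:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(df['First Difference'].dropna())    # autocorrelation at each lag
plot_pacf(df['First Difference'].dropna())   # partial autocorrelation at each lag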
Finance Fundamentals
Portfolio Allocation
Sharpe Ratio
$$S = \frac{R_p - R_f}{\sigma_p}$$
where $R_p$ is the portfolio return, $R_f$ the risk-free rate, and $\sigma_p$ the portfolio standard deviation
If the risk-free rate = 0% then Sharpe Ratio is simplified to Mean Return divided by Std.
Deviation
Can also be applied to annualise the Sharpe ratio of daily, weekly or monthly returns
K-factor based off your sampling rate:
o K = sqrt(252) for daily data
o K = sqrt(52) for weekly data
o K = sqrt(12) for monthly data
ASR = K*SR
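A sketch of a daily Sharpe ratio annualised with K = sqrt(252) (df['Close'] assumed, risk-free rate taken as 0):

import numpy as np

daily_ret = df['Close'].pct_change(1)
SR = daily_ret.mean() / daily_ret.std()   # daily Sharpe ratio, risk-free rate = 0
ASR = np.sqrt(252) * SR                   # annualised Sharpe ratio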
Portfolio Optimisation
Efficient Frontier
CAPM – Capital Asset Pricing Model
Markowitz efficient portfolio optimisation
Monte Carlo Simulation
stocks.pct_change(1).corr() computes the Pearson correlation matrix of the daily returns
Log returns are preferred for more advanced time series work, as they normalise &
de-trend the series
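A sketch of log returns (stocks is a DataFrame of prices, as above):

import numpy as np

log_ret = np.log(stocks / stocks.shift(1))   # log returns normalise & de-trend the series
log_ret.corr()                               # Pearson correlation of the log returns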
Financial Markets Knowledge
Order Book: an electronic list of buy & sell orders for an instrument, organised by price
level; it shows the number of units being bid/offered at each price point (market depth). It
also identifies the market participants, though some can remain anonymous. Order
imbalances may also become apparent & point toward a certain market trend. The order
book does not show the activity/batches of 'dark pools'.