Value          Sq Ft
$ 1,557,000    1500
$ 1,035,000    1000
$ 2,520,000    2500
$ 2,059,000    2000
$ 2,052,000    2000
$ 1,539,000    1500
$ 2,023,000    2000
$ 2,552,000    2500
$ 3,032,000    3000
[Chart "Sq Ft": Sq Ft (0 to 3500) against Value ($- to $4,000,000)]
Nick's Cheat Sheet
A dependent value is the thing you want to predict. In our case, it's housing price.
An independent value is a fact that you know, and that you think you can use to predict your dependent value. In our case, we know the square feet of homes.
A regression allows you to answer the question:
Is Square Foot (aka the fact I know) a good, useful predictor of housing prices? Or is it a really bad, low quality predictor?
If it is a good predictor, I will use it to run predictions!
If it is a bad predictor, I will ignore it, and look for other new facts!
Regression finds the "line of best fit". In overly simplistic terms, it means that it finds the best trendline.
You can then use this trendline to predict different values, by simply plotting your predictions against the trendline.
But….
How do I know if something is a "good" predictor vs a "bad" predictor?
In regression, "good" and "bad" are determined by the R2 value and the P value.
(If you want to understand R scores in detail, I'm happy to explain. But - the key points are below.)
A high R2 value (close to 1) means - this is a good predictor.
A very very very low P value (close to 0) means - this is a very good predictor.
Equation of a line y=a+bx
Y intercept a 51000
Slope b 995
X (Sq Ft) x 7000
Y (Housing Price) y $ 7,016,000
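To see where a and b come from, here is a minimal sketch in plain Python that fits the trendline to the nine homes from the table and then runs the same prediction as above. The names `homes`, `fit_line` and `predict` are made up for this example; the spreadsheet itself gets these numbers from its regression tool.

```python
# Data copied from the table above: (square feet, sale value in dollars).
homes = [
    (1500, 1_557_000),
    (1000, 1_035_000),
    (2500, 2_520_000),
    (2000, 2_059_000),
    (2000, 2_052_000),
    (1500, 1_539_000),
    (2000, 2_023_000),
    (2500, 2_552_000),
    (3000, 3_032_000),
]


def fit_line(points):
    """Least-squares line of best fit for one predictor: returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # How x and y move together, divided by how much x moves on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, sq_ft):
    """Plug a square footage into y = a + b*x."""
    return intercept + slope * sq_ft


a, b = fit_line(homes)
prediction = predict(a, b, 7000)
print(f"a = {a:,.0f}, b = {b:,.0f}, price at 7000 sq ft = ${prediction:,.0f}")
```

With the nine homes above, this should land on the same Y intercept (51000) and slope (995) that the cheat sheet uses.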
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9997252044
R Square             0.9994504844
Adjusted R Square    0.9993719821
Standard Error       15273.693538
Observations         9

ANOVA
              df    SS               MS              F                   Significance F
Regression    1     2970075000000    2.970075E+12    12731.4911206369    1.1322832E-12
Residual      7     1633000000       233285714.29
Total         8     2971708000000

             Coefficients    Standard Error    t Stat          P-value                 Lower 95%        Upper 95%       Lower 95.0%     Upper 95.0%
Intercept    51000           18356.69507205    2.7782778872    0.0273659949575287      7593.31365354    94406.686346    7593.3136535    94406.686346
Sq Ft        995             8.818271075551    112.83390944    1.13228321990998E-12    974.148102358    1015.8518976    974.14810236    1015.8518976
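Most of the numbers in the summary output are simple arithmetic on the ANOVA rows. The sketch below uses the standard textbook formulas (it is not a description of how the spreadsheet computes them internally) to rebuild R Square, Adjusted R Square, F, Standard Error and the Sq Ft t Stat. The variable names are made up for this example.

```python
import math

# Numbers copied from the ANOVA and coefficient rows of the summary output.
ss_regression = 2_970_075_000_000
ss_residual = 1_633_000_000
ss_total = 2_971_708_000_000
df_regression, df_residual = 1, 7
n_observations = 9
slope, slope_std_error = 995, 8.818271075551

# R Square: the share of the total variation that the trendline explains.
r_square = ss_regression / ss_total

# Adjusted R Square: R Square with a penalty for the number of predictors.
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / df_residual

# MS = SS / df, and F is the ratio of the two MS values.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# Standard Error of the regression: the typical size of a miss, in dollars.
standard_error = math.sqrt(ms_residual)

# t Stat: the coefficient divided by its own standard error.
t_stat_slope = slope / slope_std_error

print(round(r_square, 6), round(f_stat, 2), round(t_stat_slope, 2))
```

The P-values themselves need a t or F distribution lookup, which is left out of this sketch.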
Value          Sq Ft    Masonville    Downtown    Sq Ft    Location
$ 1,557,000    1500     1             0           1500     Masonville
$ 1,035,000    1000     0             1           1000     Downtown
$ 2,520,000    2500     0             0           2500     Campus
$ 2,059,000    2000     1             0           2000     Masonville
$ 2,052,000    2000     1             0           2000     Masonville
$ 1,539,000    1500     0             0           1500     Campus
$ 2,023,000    2000     0             0           2000     Campus
$ 2,552,000    2500     1             0           2500     Masonville
$ 3,032,000    3000     0             1           3000     Downtown
Nick's Cheat Sheet
Location is a "Categorical" variable because it is a word, not a number.
Because regression can only understand numbers, we have to turn the word into a number.
We can do this by using '1' and '0' as 'yes' and 'no' - i.e. "yes, it is in Masonville (aka 1)". Or "no, it is not in Masonville (aka 0)".
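Here is a minimal sketch of that word-to-number step in Python. It follows the table above: one 0/1 column for Masonville, one for Downtown, and Campus gets 0 in both (so Campus acts as the baseline). The function name `encode_location` is made up for this example.

```python
# Location column copied from the table above, in row order.
locations = [
    "Masonville", "Downtown", "Campus",
    "Masonville", "Masonville", "Campus",
    "Campus", "Masonville", "Downtown",
]


def encode_location(location):
    """Turn a location word into (masonville, downtown) 0/1 flags.

    Campus has no column of its own: it is the case where both flags are 0.
    """
    masonville = 1 if location == "Masonville" else 0
    downtown = 1 if location == "Downtown" else 0
    return masonville, downtown


encoded = [encode_location(loc) for loc in locations]
for loc, (masonville, downtown) in zip(locations, encoded):
    print(f"{loc:<12} Masonville={masonville} Downtown={downtown}")
```

Three locations only need two yes/no columns, because "no" to both already tells you the house is in Campus.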
Why do Multi-variable regressions?
Often, we have many different facts we can use to predict something.
In the case of housing, we can know many facts about a house. We can know:
Location
Sq Ft
Age of House
Construction Style
Why do "Categorical" multi variable regressions?
Well, most facts are not numbers. So this method simply allows us to "fake it". We can turn non-numerical facts (data points) into numbers.
Equation of a line y=a+b1x1+b2x2
Y intercept a 37333
Slope (Masonville) b1 27666
X1 (Masonville) x1 0 You can pick whether or not the house is in Masonville
Slope (Sq Ft) b2 995
X2 (Sq Ft) x2 5000 You can pick your square footage
Y (Housing Price) y $ 5,012,333 This is the predicted housing price, according to this regression model.
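The prediction above is just the equation with numbers plugged in. Here is a minimal sketch in Python using the rounded values from this cheat sheet (a = 37333, b1 = 27666, b2 = 995). Like the sheet's equation, it only has a Masonville term and a Sq Ft term. The function name `predict_price` is made up for this example.

```python
# Values copied from the cheat sheet above (rounded as shown there).
Y_INTERCEPT = 37333        # a
SLOPE_MASONVILLE = 27666   # b1
SLOPE_SQ_FT = 995          # b2


def predict_price(in_masonville, sq_ft):
    """y = a + b1*x1 + b2*x2, where x1 is 1 (in Masonville) or 0 (not)."""
    return Y_INTERCEPT + SLOPE_MASONVILLE * in_masonville + SLOPE_SQ_FT * sq_ft


# Same inputs as the sheet: not in Masonville, 5000 square feet.
print(f"${predict_price(0, 5000):,}")
# Flip the Masonville switch to see how much the location adds.
print(f"${predict_price(1, 5000):,}")
```

Switching x1 from 0 to 1 adds exactly b1 to the predicted price, which is how to read a 0/1 slope.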
We can have a high level of confidence in our regression model, because the R2 value is high, and the P (randomness) factor is low.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.99997036
R Square             0.99994072
Adjusted R Square    0.99990515
Standard Error       5935.76729
Observations         9

ANOVA
              df    SS            MS            F             Significance F
Regression    3     2.9715E+12    9.9051E+11    28112.8839    5.51207E-11
Residual      5     176166667     35233333.3
Total         8     2.9717E+12

              Coefficients    Standard Error    t Stat        P-value        Lower 95%      Upper 95%     Lower 95.0%    Upper 95.0%
Intercept     37333.3333      7663.04262        4.87186816    0.00458539     17634.8552     57031.8115    17634.8552     57031.8115
Masonville    27666.6667      4533.51715        6.10269372    0.0017109      16012.8898     39320.4435    16012.8898     39320.4435
Downtown      6166.66667      5418.5894         1.13805757    0.30667246     -7762.26083    20095.5942    -7762.26083    20095.5942
Sq Ft         995             3.42701684        290.339979    9.19845E-12    986.190573     1003.80943    986.190573     1003.80943

0.05    This is the highest P value you want.
This P value means… "there is a 5% chance I would see a pattern this strong from pure randomness, even if this fact had nothing to do with price" (..basically).
We don't like that. We want VERY VERY low randomness factors.
A 5% chance of randomness is - in regression terms - already a lot.
A 30% chance of randomness (aka the P value for "Downtown") is TERRIBLE.
It's basically saying - in this data, knowing whether or not the property is in Downtown does NOT reliably help you predict the price.
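The 0.05 rule can be applied to every row of the coefficient table at once. Here is a minimal sketch in Python, using the P-values from the summary output above. The names `P_VALUE_CUTOFF`, `keep` and `drop` are made up for this example.

```python
# The highest P value we are willing to accept, per the note above.
P_VALUE_CUTOFF = 0.05

# P-values copied from the coefficient table in the summary output.
p_values = {
    "Intercept": 0.00458539,
    "Masonville": 0.0017109,
    "Downtown": 0.30667246,
    "Sq Ft": 9.19845e-12,
}

# Sort each predictor into "use it" or "ignore it" by comparing to the cutoff.
keep = [name for name, p in p_values.items() if p <= P_VALUE_CUTOFF]
drop = [name for name, p in p_values.items() if p > P_VALUE_CUTOFF]

print("Use these:", keep)
print("Ignore these:", drop)
```

Only "Downtown" ends up on the ignore list, which matches the note above.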