Term Project
Problem Statement: Predict the match outcomes of the ICC T20 World Cup 2016.
Data Collection
Factors on which the winning probability of a team depends:
a) Toss: The team that wins the toss decides whether to bat first or to chase, and can make the choice that favors it given the pitch, the weather conditions, and its own strengths.
b) Ground: A team playing on its home ground has a higher chance of winning than the opposing team.
c) Team rating: Ratings are calculated from each team's performance in T20 matches over the last 3-4 years. A higher rating signifies better performance.
d) Innings: The team that bats in the second innings (chasing) has a higher probability of winning the match.
e) Average score: The average score made by the team in T20 matches over the last 3-4 years.
Data: Records of all matches played since 2008 by the 10 teams participating in the T20 World Cup 2016 were used to build the model.
Model: A logit (logistic regression) model was fitted to the dataset, and the fitted model was used to predict the results of the T20 World Cup 2016. The model summary shows that only innings and team ratings play a significant role.
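As a rough illustration of the fitting step described above, the sketch below fits a logit model with glm() on simulated data and predicts a win probability for one match. The data frame toy, its values, and the coefficients used to simulate outcomes are hypothetical. Only the column names Inns, Rating_Team, and Result are borrowed from the script at the end of this report.

```r
# Hypothetical toy data: column names mirror the report's script, values are simulated.
set.seed(1)
n <- 200
toy <- data.frame(
  Inns = factor(sample(1:2, n, replace = TRUE)),
  Rating_Team = round(rnorm(n, mean = 115, sd = 10))
)

# Simulate outcomes so that chasing and a higher rating both raise the win odds
# (assumed effect sizes, chosen only for illustration).
lin <- -0.3 + 0.6 * (toy$Inns == "2") + 0.05 * (toy$Rating_Team - 115)
toy$Result <- factor(ifelse(runif(n) < plogis(lin), "Won", "Lost"),
                     levels = c("Lost", "Won"))

# Fit the logit model; glm treats the second factor level ("Won") as success.
fit <- glm(Result ~ Inns + Rating_Team,
           family = binomial(link = "logit"), data = toy)

# Predicted win probability for a hypothetical new match.
new_match <- data.frame(Inns = factor("2", levels = levels(toy$Inns)),
                        Rating_Team = 120)
p_win <- predict(fit, newdata = new_match, type = "response")

print(coef(fit))
print(p_win)
```

Setting the levels of Result explicitly makes it clear which outcome the predicted probability refers to, because glm() models the probability of the second level.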
Output
ANOVA test
Innings, the team's average score, and the team's rating are the most significant factors in predicting the winner of a game.
Performance
The model predicted the match results with 57% accuracy.
Performance measure A (Accuracy of prediction) = 57%
Performance measure B (1 + log(2p(1 - p)/2), where p is the predicted win probability) = -3.7959
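The two measures can be illustrated on a handful of made-up predictions. The vectors p and actual below are hypothetical. Measure B is computed per match exactly as the col2 line in the script at the end of this report does, using the natural logarithm. How the per-match values are combined into the single reported figure is not stated in this report, so the sketch stops at the per-match values.

```r
# Hypothetical predicted win probabilities and actual outcomes for four matches.
p      <- c(0.8, 0.4, 0.6, 0.3)
actual <- c("Won", "Lost", "Lost", "Won")

# Measure A: share of matches where the thresholded prediction matches the outcome.
predicted <- ifelse(p > 0.5, "Won", "Lost")
accuracy  <- mean(predicted == actual)

# Measure B as coded in the script: 1 + natural log of 2p(1 - p)/2, per match.
measure_b <- 1 + log((2 * p) * (1 - p) / 2)

print(accuracy)            # 0.5: two of the four toy predictions are correct
print(round(measure_b, 4))
```

Because 2p(1 - p)/2 simplifies to p(1 - p), which never exceeds 0.25, this score is always negative and is largest for predictions near 0.5.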
Code used in the model
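# Set the working directory and load the training and World Cup test data
setwd('D:/R_AMSM2')
data <- read.csv('Train_Data_3.csv', header = TRUE)
testdata_wc <- read.csv('test_2.csv', header = TRUE)
raw_test <- testdata_wc  # untouched copy for the final output table
summary(data)
str(testdata_wc)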
data$Inns <- factor(data$Inns)
data$Ground <- factor(data$Ground)
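# Optional hold-out split; with TrainvsTest = 1 every row is used for training
M <- nrow(data)
N <- ncol(data)
TrainvsTest <- 1
Train_idx <- ceiling(TrainvsTest * M)
traindata <- data
# Rows after Train_idx; empty when TrainvsTest = 1 ((Train_idx + 1):M would count backwards)
testdata <- data[seq_len(M) > Train_idx, ]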
testdata_wc$Inns <- factor(testdata_wc$Inns)
testdata_wc$Ground <- factor(testdata_wc$Ground)
logit <- glm(Result ~ Toss + Inns + Ground + Average_Team +
               Average_Opposition + Rating_Team + Rating_Opposition,
             family = binomial(link = 'logit'), data = traindata)
summary(logit)
anova(logit,test="Chisq")
plot(logit)
step(logit)
# Reduced model with three predictors; fitted but not used for the predictions below
logit_new <- glm(formula = Result ~ Inns + Average_Team + Rating_Opposition,
                 family = binomial(link = "logit"), data = traindata)
# Predictions come from the full model `logit`
fitted.results <- predict(logit, newdata = testdata_wc, type = 'response')
fitted.results.str <- ifelse(fitted.results > 0.5, "Won", "Lost")
misClasificError <- mean(fitted.results.str != testdata_wc$Result)
print(paste('Accuracy', 1 - misClasificError))
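p <- fitted.results
# Per-match score: 1 + log(2p(1 - p)/2), natural log
col2 <- log((2 * p) * (1 - p) / 2) + 1
Test_logit <- read.csv('test_class.csv', header = TRUE)  # loaded but not used below
# Combine inputs, win/loss probabilities, score, and predicted label, then save
logit_result <- cbind(raw_test, p, 1 - p, col2, fitted.results.str)
logit_result
write.csv(logit_result, file = 'logit.csv')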