Assignment No: 3
Provide the code with outputs and explain everything done in each step.
Theory:-
Statistical Inference:
Statistical inference is the process of generating conclusions about a population from a noisy sample.
Without statistical inference we’re simply living within our data. With statistical inference, we’re trying
to generate new knowledge.
Statistical analysis and probability influence our lives on a daily basis. Statistics is used to predict the
weather, restock retail shelves, estimate the condition of the economy, and much more. Used in a variety
of professional fields, statistics has the power to derive valuable insights and solve complex problems in
business, science, and society. Without hard science, decision making relies on emotions and gut
reactions. Statistics and data override intuition, inform decisions, and minimize risk and uncertainty.
In data science, statistics is at the core of sophisticated machine learning algorithms, capturing and
translating data patterns into actionable evidence. Data scientists use statistics to gather, review, analyze,
and draw conclusions from data, as well as apply quantified mathematical models to appropriate
variables.
Data science knowledge is grouped into three main areas: computer science; statistics and mathematics;
and business or field expertise. These areas separately, and in combination, lead to a variety of careers.
Combining computer science and statistics without business knowledge enables professionals to perform
an array of machine learning functions. Computer science and business expertise lead to software
development skills. Mathematics and statistics combined with business expertise produce some of the
most talented researchers. Only with all three areas combined can data scientists maximize their
performance, interpret data, recommend innovative solutions, and create a mechanism to achieve
improvements.
Statistical functions are used in data science to analyze raw data, build data models, and infer results.
Below is a list of the key statistical terms:
● Population: the entire group or source from which data are collected.
● Sample: a portion of the population.
● Variable: any data item that can be measured or counted.
● Quantitative analysis (statistical): collecting and interpreting data with patterns and data
visualization.
● Qualitative analysis (non-statistical): producing descriptive information from non-numerical sources
such as text and other media.
● Descriptive statistics: summaries of the characteristics of a data set or population.
● Inferential statistics: predictions and conclusions about a population based on a sample.
● Central tendency (measures of the center): mean (average of all values), median (central value of
a data set), and mode (the most recurrent value in a data set).
● Measures of the Dispersion (see the code sketch after this list):
○ Range: the difference between the largest and smallest values in a data set.
○ Variance: the average squared deviation of a variable from its expected value.
○ Standard deviation: the square root of the variance, quantifying the spread of a data set around the mean.
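As a quick illustration of these terms, below is a minimal sketch in Python with pandas, using a small hypothetical sample that is not taken from the assignment dataset:

import pandas as pd

# Hypothetical sample of ages (illustrative values only)
ages = pd.Series([23, 25, 25, 29, 34, 38, 41, 41, 41, 52])

print("Mean:", ages.mean())                # average of all values
print("Median:", ages.median())            # central value of the sorted data
print("Mode:", ages.mode().tolist())       # most recurrent value(s)
print("Range:", ages.max() - ages.min())   # largest minus smallest value
print("Variance:", ages.var())             # average squared deviation from the mean (sample variance)
print("Std deviation:", ages.std())        # square root of the variance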
Statistical techniques for data scientists
There are a number of statistical techniques that data scientists need to master. When just starting out, it
is important to develop a comprehensive understanding of these principles, as any gaps in knowledge will
lead to compromised data or false conclusions.
General statistics: The most basic concepts in statistics include bias, variance, mean, median, mode, and
percentiles.
Probability distributions: Probability is the chance that a given event will occur, expressed as a
percentage. For instance, when a weather report indicates a 30 percent chance of rain, it also means there
is a 70 percent chance it will not rain. A probability distribution assigns a probability to every possible
value or outcome under study. For example, calculating the probability of each possible number of rainy
days over the next two days, given that 30 percent daily chance of rain, produces a probability
distribution.
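As a sketch of this idea, the following treats the number of rainy days over the next two days as a binomial random variable with an independent 30 percent chance of rain each day; the independence and binomial assumptions are ours, not part of the original example:

from scipy.stats import binom

# Assumed: each of the next 2 days independently has a 30% chance of rain
p_rain, n_days = 0.30, 2

# The probability distribution assigns a probability to every possible outcome
for k in range(n_days + 1):
    print(f"P({k} rainy days) = {binom.pmf(k, n_days, p_rain):.2f}")
# Prints P(0)=0.49, P(1)=0.42, P(2)=0.09, which sum to 1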
Dimension reduction: Data scientists reduce the number of random variables under consideration
through feature selection (choosing a subset of relevant features) and feature extraction (creating new
features from functions of the original features). This simplifies data models and streamlines the process
of entering data into algorithms.
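A minimal sketch of both approaches, using scikit-learn's built-in iris data purely for illustration (the dataset and the choice of SelectKBest and PCA are our assumptions, not part of the assignment):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 4 original features

# Feature selection: keep the 2 features most associated with the target
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features from combinations of the originals
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)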
Over and under sampling: Sampling techniques are implemented when data scientists have too many or
too few samples of a class for classification. Depending on the balance between the two classes, data
scientists will either limit the selection of the majority class (undersampling) or create copies of the
minority class (oversampling) in order to maintain a more equal distribution.
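A minimal sketch of both techniques with scikit-learn's resample utility, on a tiny hypothetical class column (the data and sizes are made up for illustration):

import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced labels: 8 majority ("no") rows vs 2 minority ("yes") rows
df = pd.DataFrame({"label": ["no"] * 8 + ["yes"] * 2})
majority = df[df["label"] == "no"]
minority = df[df["label"] == "yes"]

# Oversampling: create copies of minority rows until the classes match
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# Undersampling: limit the majority class to the minority's size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up["label"].value_counts())
print(balanced_down["label"].value_counts())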
Bayesian statistics: Frequentist statistics uses existing data to determine the probability of a future event.
Bayesian statistics takes this concept a step further by also accounting for factors we expect to be true in
the future. For example, imagine trying to predict whether at least 100 customers will visit your coffee
shop each Saturday over the next year. Frequentist statistics will determine the probability by analyzing
data from past Saturday visits. But Bayesian statistics will determine the probability by also factoring in
a nearby art show that will start in the summer and take place every Saturday afternoon. This allows the
Bayesian statistical model to provide a more accurate figure.
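The coffee-shop example can be sketched with a Beta-Binomial update; all of the counts below, including the optimistic prior standing in for the art show, are hypothetical:

from scipy.stats import beta

# Frequentist view: suppose 40 of the last 52 Saturdays had at least 100 customers
successes, trials = 40, 52
print("Frequentist estimate:", successes / trials)  # about 0.77

# Bayesian view: encode the upcoming art show as an optimistic Beta(8, 2) prior,
# then update it with the same observed data (prior counts are assumed)
prior_a, prior_b = 8, 2
posterior = beta(prior_a + successes, prior_b + (trials - successes))
print("Posterior mean:", posterior.mean())  # the prior shifts the estimate upward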
The goals of inference
1. Estimate and quantify the uncertainty of an estimate of a population quantity (e.g., the proportion of
people who will vote for a candidate; see the code sketch after this list).
2. Determine whether a population quantity is a benchmark value (“is the treatment effective?”).
3. Infer a mechanistic relationship when quantities are measured with noise (“What is the slope for
Hooke’s law?”)
4. Determine the impact of a policy (“If we reduce pollution levels, will asthma rates decline?”).
5. Talk about the probability that something occurs.
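As a sketch of goal 1, the following estimates a voting proportion from a hypothetical poll and quantifies its uncertainty with a normal-approximation 95% confidence interval (all numbers are made up):

import math

# Hypothetical poll: 520 of 1000 sampled voters support the candidate
supporters, n = 520, 1000
p_hat = supporters / n

# 95% confidence interval using the normal approximation (z = 1.96)
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"Estimate: {p_hat:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")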
Algorithm:-
Step 1. Import libraries and dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training and test splits of the income dataset
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.shape, test_df.shape
# Tag each row with its origin so the splits can be separated again later
train_df['label'] = 'train'
test_df['label'] = 'test'

# Combine both splits so that cleaning and encoding are applied consistently
combined_data_df = pd.concat([train_df, test_df])
combined_data_df.shape
# The reason for combining the training and test sets is to apply the same
# cleaning and encoding to both. The dataset contains:
#   Categorical features = 10
#   Numerical features = 5
#   Target = 1
# Count missing values per column
combined_data_df.isnull().sum()

# Drop rows with missing values in the selected categorical columns
combined_data_df.dropna(subset=['workclass','occupation','native-country'], axis=0, inplace=True)
combined_data_df.isnull().sum()

# Drop rows with a missing target (the test split carries no target values)
combined_data_df.dropna(subset=['income_>50K'], axis=0, inplace=True)
combined_data_df.isnull().sum()
sns.set_theme(style="darkgrid")

# Distribution of education levels before grouping
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, x="education")

# Group fine-grained education levels into broader categories
combined_data_df['education'] = combined_data_df['education'].replace(['1st-4th', '5th-6th'], 'elementary-school')
combined_data_df['education'] = combined_data_df['education'].replace(['7th-8th'], 'middle-school')
combined_data_df['education'] = combined_data_df['education'].replace(['9th', '10th', '11th', '12th'], 'high-school')
combined_data_df['education'] = combined_data_df['education'].replace(['Doctorate', 'Bachelors', 'Some-college', 'Masters', 'Prof-school', 'Assoc-voc', 'Assoc-acdm'], 'postsecondary-education')

# Distribution of education levels after grouping
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, x="education")
# Distribution of marital status before grouping
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, x="marital-status")

# Group marital-status values into 'single' and 'married'
combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Divorced', 'Never-married', 'Widowed'], 'single')
combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Married-civ-spouse', 'Separated', 'Married-spouse-absent', 'Married-AF-spouse'], 'married')

# Distribution of marital status after grouping
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, x="marital-status")

# Occupation counts (horizontal, since the category labels are long)
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, y="occupation")

# Relationship counts
plt.figure(figsize=(20, 10))
sns.countplot(data=combined_data_df, x="relationship")
Output: Performed statistical analysis on the income prediction dataset and converted categorical data
into numerical data.
Conclusion:-
● Handled the missing values by dropping them from the dataset.
● Based on the data visualizations, combined related categories of the features.
● Using dummy variables, converted categorical variables into numerical variables to build a better model (a minimal sketch of this step follows).
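A minimal sketch of the dummy-variable step mentioned in the last bullet, since it is not shown in the code above; it assumes combined_data_df from the Algorithm section, and the exact list of categorical columns is an assumption about the dataset:

# Assumed categorical columns of the income dataset; adjust to match the real data
categorical_cols = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'native-country']
# One dummy column per category value; drop_first avoids redundant columns
encoded_df = pd.get_dummies(combined_data_df, columns=categorical_cols, drop_first=True)
print(encoded_df.shape)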