4-Week Data Science Internship Report
SUMMER ENTREPRENEURSHIP – II
(100510P)
ON
DATA SCIENCE USING PYTHON INTERNSHIP
Submitted by
VINIT KUMAR
REGISTRATION NUMBER: 22105124013
CLASS ROLL NUMBER: 2022/CSE/26
SEMESTER: Vth
SESSION: 2022-26
CERTIFICATE
This is to certify that the project report entitled “Data Science Using Python Programming Internship”, which is
submitted by Vinit Kumar in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology (B.Tech.) in Computer Science and Engineering at Sershah Engineering College, affiliated
to Bihar Engineering University, Patna, is a bona fide record of the candidate’s own work carried out
under my supervision. The report fulfils the standard requirements for the degree. The matter
embodied in this internship report, in full or in part, is original and has not been submitted for the award of
any other degree or diploma.
Mr. Om Prakash
Head of the Department (In-charge), Computer Science and Engineering, Sershah Engineering College
DECLARATION
I hereby declare that this submission is my own work to the best of my
knowledge and belief. The work presented in this in-plant training report titled
“Data Science Using Python Programming Internship”, submitted by me in partial
fulfilment of the requirements for the award of the degree of Bachelor of
Technology (B.Tech.) in “Computer Science and Engineering”, is an authentic record of
my own work carried out under the guidance of SmartBridge and Salesforce and of Mr. Om
Prakash, Head of the Department (In-charge), Computer Science and Engineering, Sershah
Engineering College.
This report has been made independently by me during my second year at Sershah
Engineering College while pursuing an internship from 2nd June 2025 to
30th June 2025 (02/06/2025 – 30/06/2025). It contains no material previously published
or written by another person, nor material which to a substantial extent has been
accepted for the award of any other degree or diploma of this university or any other
institute of higher learning, except where due acknowledgement has been made in the
text.
Signature
Name: Vinit Kumar
Registration No.: 22105124013
Class Roll No.: 2022/CSE/26
Sershah Engineering College
ACKNOWLEDGEMENT
It is my proud privilege and duty to acknowledge the kind help and guidance received
from several people in the preparation of this report. It would not have been possible to
prepare this report in its present form without their valuable help, cooperation, and guidance.
First and foremost, I wish to record my sincere gratitude to NIELIT Patna, Mr. Om
Prakash, and other faculty members for their constant support and encouragement in
preparation of this report as well as the project.
Last but not least, I would like to express my gratitude to my parents, my family, and all
faculty members of the Computer Science and Engineering Department for providing
academic inputs, guidance, and encouragement throughout the training period. Their
contributions and technical support in preparing this report are gratefully acknowledged.
Table of Contents
Chapter 1: Introduction and Objectives
Chapter 2: Week 1: Python Programming Fundamentals
Chapter 3: Week 2: Python Functions and Object-Oriented Programming
Chapter 4: Week 3: Python Modules and Data Science Packages
Chapter 5: Week 4: Data Preprocessing and Machine Learning
Chapter 6: Mini Project: Customer Churn Prediction
Chapter 7: Learning Outcomes and Reflection
Chapter 8: Conclusion
1. Introduction and Objectives
1.1 Internship Overview
This internship report documents my 4-week journey in Data Science using Python
programming. The internship was designed to provide hands-on experience with Python
programming fundamentals, data manipulation, visualization, and machine learning
techniques. The program was structured to build knowledge progressively from basic
programming concepts to advanced data science applications.
1.2 Objectives
The primary objectives of this internship were:
To learn Python programming fundamentals, including data types, control flow, and functions
To gain experience with object-oriented programming and Python modules
To work with core data science packages such as NumPy, Pandas, and Matplotlib
To understand data preprocessing and basic machine learning techniques
To apply these skills in a practical mini project
1.3 Methodology
The internship followed a structured approach with theoretical learning complemented
by practical exercises. Each week focused on specific topics, building upon previous
knowledge to create a comprehensive understanding of data science workflows.
2. Week 1: Python Programming Fundamentals
2.1 Introduction to Python Programming
Python is a high-level, interpreted programming language known for its simplicity and
readability. During the first week, I learned that Python's design philosophy emphasizes
code readability and a syntax that allows programmers to express concepts in fewer
lines of code compared to other languages.
We primarily used Jupyter Notebook due to its interactive nature and excellent support
for data visualization.
Python supports several built-in data types that form the foundation of programming:
Numeric Types: int, float, complex
Text Type: str
Boolean Type: bool
Python provides various operators for performing operations on variables and values:
Arithmetic Operators: +, -, *, /, //, %, **
Comparison Operators: ==, !=, <, >, <=, >=
Logical Operators: and, or, not
Assignment Operators: =, +=, -=, *=, /=
Understanding operator precedence and how expressions are evaluated was crucial for
writing effective Python code.
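For illustration, a small precedence sketch (example values assumed):
python
# ** binds tighter than *, which binds tighter than +
result = 2 + 3 * 4 ** 2      # evaluated as 2 + (3 * (4 ** 2)) = 50
check = 1 < 2 and 3 > 2      # comparisons are evaluated before 'and'
print(result, check)         # 50 True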
Variables are created by assignment:
python
x = 10                  # integer
y = "Hello World"       # string
z = [1, 2, 3, 4, 5]     # list
A critical concept learned was the distinction between mutable and immutable objects:
Mutable Objects: Can be modified after creation
Lists
Dictionaries
Sets
Immutable Objects: Cannot be modified after creation
Integers, floats, strings, and tuples
This distinction affects how objects are passed to functions and how memory is
managed in Python.
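A brief sketch of this effect on function calls (illustrative, not the original code):
python
def append_item(seq):
    seq.append(99)     # mutates the caller's list in place

def increment(n):
    n = n + 1          # rebinds a local name; the caller's int is unchanged

nums = [1, 2, 3]
append_item(nums)
print(nums)            # [1, 2, 3, 99]

count = 5
increment(count)
print(count)           # 5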
Strings in Python are sequences of characters enclosed in quotes. They are immutable
and provide numerous methods for manipulation:
Creating strings: Single, double, or triple quotes
String methods: upper(), lower(), strip(), replace(), split(), join()
String formatting: Using the format() method and f-strings
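A few of these methods in action (a minimal sketch with assumed values):
python
name = "  Data Science  "
print(name.strip().upper())                    # 'DATA SCIENCE'
print("hello world".replace("world", "Python"))  # 'hello Python'
print("a,b,c".split(","))                      # ['a', 'b', 'c']
print("-".join(["2025", "06", "02"]))          # '2025-06-02'
print("Hi {}".format("Alice"))                 # format() method
print(f"Average: {(85 + 90) / 2:.1f}")         # f-string: 'Average: 87.5'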
2.2.2 Lists
Lists are ordered, mutable collections that can store different data types:
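For example (a brief sketch, not the original notebook code):
python
mixed = [1, "two", 3.0]      # different data types in one list
mixed.append(4)              # mutable: grows in place
mixed[0] = 100               # items can be reassigned
print(mixed[1:3])            # slicing: ['two', 3.0]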
Lists are fundamental in data science for storing and manipulating datasets.
2.2.3 Tuples
Tuples are ordered, immutable collections. They are often used for coordinates, database records, or any grouped data that should not change after creation.
2.2.4 Dictionaries
Dictionaries store key-value pairs. They are essential in data science for representing structured data and mapping relationships.
python
student = {
    "name": "Alice",
    "age": 22,
    "grades": [85, 90, 78]
}
Conditional statements execute different code branches depending on a condition:
python
score = 85
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
else:
    grade = "C"
Loops enable repetitive execution of code blocks:
for loops: Iterate over sequences (lists, strings, ranges)
while loops: Continue execution while a condition is true
Loop control: break and continue statements
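A short sketch of both loop forms with loop control (illustrative):
python
# for loop over a sequence
for fruit in ["apple", "banana", "cherry"]:
    print(fruit)

# while loop with break and continue
n = 0
while n < 10:
    n += 1
    if n % 2 == 0:
        continue   # skip even numbers
    if n > 7:
        break      # stop once n exceeds 7
    print(n)       # prints 1, 3, 5, 7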
3. Week 2: Python Functions and Object-Oriented Programming
3.1 Functions
Functions are reusable blocks of code that perform specific tasks. During week 2, I
learned the importance of functions in creating modular, maintainable code:
Function Definition: Using the def keyword
Parameters and Arguments: Passing data to functions
Return Values: Functions can return results
Local vs Global Scope: Understanding variable accessibility
python
def calculate_average(numbers):
    """Calculate the average of a list of numbers"""
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)
python
def process_data(*args, **kwargs):
    """Function that accepts variable arguments"""
    print(f"Positional args: {args}")
    print(f"Keyword args: {kwargs}")
python
def factorial(n):
    """Calculate factorial using recursion"""
    if n <= 1:
        return 1
    return n * factorial(n - 1)
Recursion is useful for solving problems that can be broken down into smaller, similar
subproblems.
Python provides numerous built-in functions that are essential for data manipulation:
map(): Applies a function to every item in an iterable
filter(): Filters items based on a function's criteria
reduce(): Applies a function cumulatively to items
python
from functools import reduce

numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))         # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
product = reduce(lambda x, y: x * y, numbers)        # 120
3.2 Object-Oriented Programming
Object-oriented programming rests on four pillars:
Encapsulation: Bundling data and methods that operate on that data
Inheritance: Creating new classes based on existing classes
Polymorphism: Objects of different types responding to the same interface
Abstraction: Hiding complex implementation details
Instance Attributes: Unique to each object instance
Class Attributes: Shared among all instances of a class
Instance Methods: Operate on instance data
Class Methods: Operate on class data
Static Methods: Don't access instance or class data
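A compact sketch tying these together (an illustrative class, not from the original report):
python
class Student:
    school = "SEC"                     # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name               # instance attribute, unique per object

    def greet(self):                   # instance method
        return f"Hi, I am {self.name}"

    @classmethod
    def from_roll(cls, roll):          # class method: alternative constructor
        return cls(f"Student-{roll}")

    @staticmethod
    def is_valid_roll(roll):           # static method: no instance/class access
        return roll > 0

s = Student.from_roll(26)
print(s.greet(), Student.is_valid_roll(26))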
3.2.4 Inheritance
Inheritance allows creating new classes that inherit properties and methods from
existing classes.
Inheritance promotes code reusability and establishes hierarchical relationships between
classes.
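A minimal inheritance sketch (illustrative):
python
class Animal:
    def speak(self):
        return "..."

class Dog(Animal):        # Dog inherits everything from Animal
    def speak(self):      # and overrides speak(): polymorphism
        return "Woof"

print(Dog().speak())      # Woof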
4. Week 3: Python Modules and Data Science Packages
The collections module provides specialized container datatypes:
python
from collections import Counter, defaultdict

# Counter example
data = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
counter = Counter(data)
print(counter.most_common(2))  # [('apple', 3), ('banana', 2)]

# defaultdict example: missing keys get a default value
groups = defaultdict(list)
groups['fruit'].append('apple')
NumPy provides fast N-dimensional arrays and vectorized numerical operations:
python
import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4))
arr3 = np.random.randn(2, 3)

# Array operations
result = arr1 * 2
mean_value = np.mean(arr1)
Pandas provides the DataFrame, a two-dimensional labeled data structure for tabular data.
DataFrame Operations:
python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)

# Basic operations
print(df.head())
print(df.describe())
print(df.info())
Basic Plotting:
python
import matplotlib.pyplot as plt
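A minimal plotting sketch (illustrative data, not the original figure):
python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker='o')        # line plot with point markers
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('Basic Line Plot')
plt.show()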
Data can be loaded from a variety of sources:
CSV files (most common)
Excel spreadsheets
JSON files
Databases (SQL)
APIs
Web scraping
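For instance, reading the most common file formats with Pandas looks like this (illustrative file names):
python
import pandas as pd

df_csv = pd.read_csv('data.csv')       # CSV file
df_xlsx = pd.read_excel('data.xlsx')   # Excel spreadsheet (needs openpyxl)
df_json = pd.read_json('data.json')    # JSON file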
python
import pandas as pd
import sqlite3

# Database connection
conn = sqlite3.connect('database.db')
df_db = pd.read_sql_query('SELECT * FROM table_name', conn)
5. Week 4: Data Preprocessing and Machine Learning
Machine learning enables systems to learn patterns from data and automates decision-making processes.
5.2.2 Machine Learning Approaches
Supervised Learning: Uses labeled training data to learn a mapping from inputs to outputs:
Classification: Predicting discrete categories (spam/not spam, disease/healthy)
Regression: Predicting continuous values (house prices, temperature)
Unsupervised Learning: Finds patterns in data without labeled examples:
Clustering: Grouping similar data points
Association rule learning: Finding relationships between variables
Dimensionality reduction: Reducing number of features
Reinforcement Learning: Learns through interaction with environment using rewards
and penalties:
Agent-based learning: Learning optimal actions
Game playing: Chess, Go, video games
Robotics: Navigation, manipulation
5.2.3 Statistics and Probability Basics
Understanding statistics and probability is crucial for machine learning:
Descriptive Statistics:
Measures of central tendency: Mean, median, mode
Measures of dispersion: Variance, standard deviation, range
Distribution shapes: Skewness, kurtosis
Probability Concepts:
Probability distributions: Normal, binomial, Poisson
Bayes' theorem: Updating probabilities with new evidence
Central limit theorem: Foundation for statistical inference
Statistical Inference:
Hypothesis testing: Making decisions based on data
Confidence intervals: Estimating parameter ranges
P-values: Measuring statistical significance
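As a small illustration of the descriptive measures above (a sketch using NumPy; the values are assumed):
python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.mean(data))     # mean: 5.0
print(np.median(data))   # median: 4.5
print(np.std(data))      # population standard deviation: 2.0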
5.4.1 Logistic Regression
Despite its name, logistic regression is a classification algorithm that uses the logistic
function to model probability:
Mathematical Foundation: Uses the sigmoid function to map any real number to a value between 0 and 1:
sigmoid(z) = 1 / (1 + e^(-z))
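A quick numerical check of this function (illustrative):
python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # 0.5: the decision boundary
print(sigmoid(6))     # ~0.998: strongly positive class
print(sigmoid(-6))    # ~0.002: strongly negative class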
Implementation in scikit-learn:
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train the model (X_train, y_train prepared earlier)
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
5.4.2 K-Nearest Neighbors (KNN)
KNN classifies a point by the majority vote of its k nearest neighbors:
python
from sklearn.neighbors import KNeighborsClassifier

# Train the model (X_train, y_train prepared earlier)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
Key Parameters:
k: Number of neighbors to consider
Distance metric: Euclidean, Manhattan, Minkowski
Weight function: Uniform or distance-based
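These parameters map directly onto the scikit-learn constructor (illustrative choices):
python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=7,        # k
                             metric='manhattan',   # distance metric
                             weights='distance')   # closer neighbors count more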
5.4.3 Support Vector Machines (SVM)
SVM finds the optimal hyperplane that separates different classes with maximum
margin:
Key Concepts:
Support vectors: Data points closest to the decision boundary
python
from sklearn.svm import SVC

# Train the model (X_train, y_train prepared earlier)
model = SVC(kernel='rbf')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
Kernel Functions:
Linear: For linearly separable data
RBF (Radial Basis Function): For non-linear data
Polynomial: For polynomial relationships
Sigmoid: Similar to neural networks
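Each kernel is selected through the SVC constructor (illustrative settings):
python
from sklearn.svm import SVC

linear_svm = SVC(kernel='linear')           # linearly separable data
rbf_svm = SVC(kernel='rbf', gamma='scale')  # non-linear boundaries
poly_svm = SVC(kernel='poly', degree=3)     # polynomial relationships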
5.5 Clustering
5.5.1 K-Means Clustering
K-means is an unsupervised learning algorithm that partitions data into k clusters:
Algorithm Steps:
1. Initialize k cluster centroids randomly
2. Assign each data point to the nearest centroid
3. Update centroids by calculating the mean of assigned points
4. Repeat steps 2-3 until convergence
python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit the model (feature matrix X prepared earlier; k = 3 assumed)
kmeans = KMeans(n_clusters=3)
cluster_labels = kmeans.fit_predict(X)

# Visualize clusters
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', s=200, linewidths=3, color='red')
plt.title('K-Means Clustering')
plt.show()
Key Parameters:
n_clusters: Number of clusters (k)
init: Method for initialization ('k-means++', 'random')
max_iter: Maximum number of iterations
tol: Tolerance for convergence
Choosing Optimal k:
Elbow method: Plot within-cluster sum of squares vs k
Silhouette analysis: Measure cluster cohesion and separation
Gap statistic: Compare clustering with random data
Advantages:
Simple and fast algorithm
Works well with spherical clusters
Limitations:
Requires choosing k in advance
Sensitive to centroid initialization and outliers
Assumes roughly spherical, equally sized clusters
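A sketch of the elbow method mentioned above (illustrative data; in practice use your own feature matrix):
python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

X = np.random.rand(200, 2)   # stand-in data for the example

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)             # within-cluster sum of squares

plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()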
6. Mini Project: Customer Churn Prediction
6.1 Project Overview
For the capstone project, I developed a Customer Churn Prediction system using
machine learning. The objective was to predict which customers are likely to churn
based on their usage patterns, demographics, and service history.
6.2 Data Preprocessing and Feature Engineering
python
# Data cleaning: coerce TotalCharges to numeric, fill missing values with the median
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].fillna(df['TotalCharges'].median(), inplace=True)

# Feature engineering
df['TenureGroup'] = df['Tenure'].apply(
    lambda x: 'New' if x <= 12 else 'Medium' if x <= 36 else 'Long')
# service_columns: list of service-related columns defined earlier
df['ServiceCount'] = df[service_columns].apply(
    lambda x: sum(x != 'No'), axis=1)

# Encoding categorical variables
df_encoded = pd.get_dummies(
    df, columns=['Contract', 'PaymentMethod'], drop_first=True)
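A sketch of the modelling step that followed (reconstructed under assumptions: a 'Churn' target column in df_encoded and logistic regression as one of the classifiers tried):
python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 'Churn' is an assumed target column name
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))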
Key Findings:
2. Tenure strongly correlates with retention
3. Payment method significantly impacts churn
7. Learning Outcomes and Reflection
The internship strengthened skills in machine learning, project management, and business applications, including industry-specific applications.
8. Conclusion
8.1 Internship Summary
This 4-week Data Science internship provided comprehensive exposure to Python
programming and machine learning applications. The structured curriculum progressed
from basic programming concepts to advanced data science techniques, culminating in
a practical customer churn prediction project.
Beyond the technical curriculum, the internship also contributed to my professional growth.