Data Science Notes
INTRODUCTION
Data Science is a multi-disciplinary field that extracts insights from massive amounts of data using mathematical and scientific models and algorithms. It is one of the domains of Artificial Intelligence. Analysing this data helps in making machines intelligent: by applying extensive data processing to unstructured data, a meaningful and informative dataset for decision making is generated.
For example – Data Science is used by various companies like Netflix, Google and Amazon for
developing robust recommendation systems for their visitors. Similarly, to predict stock prices,
different financial companies are using multiple predictive analytics and forecasting methods.
➢ Internet Search
Internet search engines like Google, Yahoo, Bing, Ask, etc., use data science technology to enhance the search experience for users. The search results are listed within a fraction of a second.
➢ E-Commerce
To provide an enhanced, personalized experience and accurate recommendations to users, e-commerce companies such as Amazon, Flipkart, Netflix, Google Play, etc., use data science technology. For example, when searching for a product on an e-commerce site, auto-suggestions for similar products pop up thanks to data science technology.
Benefits of data science in the field of e-commerce:
• Identifying a potential customer base.
• Using predictive analytics to forecast demand for products and services.
• Optimizing pricing structures for consumers.
➢ Banking
The banking industry is one of the prime application areas of data science. With the help of data science, it has become easier for banks to manage their resources efficiently.
Data science has benefited the banking industry in the following areas:
• Fraud detection
• Management of customer data
• Customer segmentation
• Risk detection
➢ Healthcare
Data Science plays a crucial role in the healthcare industry. Doctors can detect cancers and tumours at an early stage using image-recognition software. Genetic industries use data science to analyse and classify patterns in genomic sequences. Various virtual assistants also help patients resolve their physical and mental ailments. In the field of disease research, data science techniques provide greater insight into genetic issues.
The field of computer science concerned with building smarter machines capable of performing tasks and making decisions the way a human does is known as Artificial Intelligence.
AI PROJECT CYCLE
The Scenario
Transportation is considered an indispensable part of human life and the backbone of any country's economy. It plays a crucial role in enhancing the lifestyle of ordinary people by providing the facilities and accessibility they require. Rural regions constantly struggle with services and facilities because of their remote and dispersed locations. Effective and efficient transportation can mitigate the regional problems of rural areas by providing access to employment, health, education and services.
The common man's problem in villages when commuting for work, study, etc., is the lack of suitable and convenient public transportation such as buses, autos, or taxis. The objective of the proposed study is to provide accessibility and proper transportation services to these rural regions.
Problem Scoping
a) Who – This block covers all those affected, directly or indirectly, by the problem; they are the stakeholders.
b) What – Under this block, pieces of evidence are gathered to prove that the problem exists. It helps us understand and recognize the nature of the problem.
c) Where – Under this block, the context of the problem is identified: the situations in which it arises and the locations where it occurs.
d) Why – Under this block, the decision about whether the given problem is worth solving is made.
Data Acquisition
After the goal of our project is finalized, the next step is looking at the various data features that affect the problem. As data is the fuel for AI-based projects, the goal is to collect the right kind of data. In the above project, the affecting factors are:
a. Number of students and people using the transportation system
b. Time of usage of the transportation system
c. Cost incurred during transportation
d. Days with heavy traffic load
e. Days with very little traffic
After studying the pattern of the data for a week, the database has to be created. The techniques for collecting the data will be face-to-face interviews, telephonic interviews, pilot surveys, and household surveys.
GPS and GIS technologies are used to collect the travel data. GPS data provide real-time spatial information and reveal travel behaviour, including distance, travel speed, and trip time. The data collected can be used for the development of transportation policy.
Data Exploration
The next step after the creation of the database is to analyse the data and interpret it. The required information extracted from the curated dataset is cleaned so that no errors or missing elements remain in it.
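As an illustration, here is a minimal data-cleaning sketch using Pandas; the file name survey_trips.csv and its column names are hypothetical placeholders for the transport survey data described above.

```python
import pandas as pd

# Load the (hypothetical) transport survey dataset
df = pd.read_csv("survey_trips.csv")

# Inspect the data: dimensions and missing values per column
print(df.shape)
print(df.isnull().sum())

# Drop rows where essential fields are missing
df = df.dropna(subset=["trip_time", "passengers"])

# Fill remaining missing costs with the column median
df["cost"] = df["cost"].fillna(df["cost"].median())

# Remove exact duplicate records
df = df.drop_duplicates()
```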
Modelling
Model selection is the next step after the dataset is ready. A clustering algorithm, an unsupervised machine-learning technique, is chosen for the transportation system. Clustering has proven its efficiency in developing intelligent transportation systems and is applied to various categories of transportation planning, such as trip generation, traffic-zone division, and trip distribution.
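As a rough sketch, clustering could be done with scikit-learn's KMeans on numeric features derived from the survey; the features and the choice of three clusters below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per traffic zone: [avg daily trips, avg trip time in minutes]
X = np.array([
    [120, 35], [130, 40], [45, 15],
    [50, 18], [300, 60], [310, 65],
])

# Group the zones into three clusters (k chosen arbitrarily here)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assigned to each zone
print(kmeans.cluster_centers_)  # centroid of each cluster
```

In practice, the number of clusters would be tuned, for example by comparing inertia across several values of k.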
Evaluation
After the model has been trained on the dataset, the accuracy of the algorithm is checked to see whether it is working correctly or not:
1. The trained algorithm is fed data regarding the number of trips, the number of persons travelling, and the travelling time.
2. It is then fed data regarding the capacity of the public transportation utilized.
3. The algorithm then works on the entries according to the training it received at the modelling stage.
4. The model predicts the number of trips to be run for practical usage of public transportation.
5. The prediction is compared to the testing dataset value.
6. The model is tested on ten testing datasets.
7. Prediction values for the testing datasets are compared to the actual values.
8. If the prediction values are the same as, or very close to, the actual values, the model is accurate. Otherwise, either the model selection is changed, or the model is trained on more data for better accuracy. Once the model achieves optimum efficiency, it is ready to be deployed for real-time usage.
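As a sketch of steps 5–8, predicted trip counts can be compared with actual counts using a simple error metric; all numbers below are made-up placeholders.

```python
import numpy as np

# Hypothetical predicted and actual daily trip counts for ten test cases
predicted = np.array([42, 38, 55, 60, 47, 51, 33, 45, 58, 40])
actual    = np.array([40, 39, 53, 62, 45, 50, 35, 44, 60, 41])

# Mean absolute error: average size of the prediction mistakes
mae = np.mean(np.abs(predicted - actual))

# Mean absolute percentage error, giving an intuitive accuracy figure
mape = np.mean(np.abs(predicted - actual) / actual) * 100

print(f"MAE: {mae:.2f} trips")
print(f"Approximate accuracy: {100 - mape:.1f}%")
```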
DATA COLLECTION
Our society is highly dependent on data. Accurate data collection is necessary to make precise business decisions, safeguard quality assurance, and keep research reliable. Data Collection is the systematic approach of gathering observations and creating a database for data analysis. The process of data collection is largely non-technical and does not require experts.
SOURCES OF DATA
Data collection is the fundamental basis of data analysis in Data Science. Raw facts and figures
are known as data. The sources of data are mainly of two types:
i. Primary data source: Data collected at its origin, i.e. from the reports and records published within the organization, is a primary data source. Primary data is original and hence a more reliable source of data.
ii. Secondary data source: Data collected from an outside agency or organization is a secondary data source. Secondary data is not original; it has already been analysed and has undergone some statistical operations.
While accessing data from any of the data sources, the following points should be kept in mind:
1. Data which is available for public usage only should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone’s privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of data which helps in proper training of
the AI model.
TYPES OF DATA
During the data collection process, the following kinds of errors can appear in the data:
a. Erroneous Data: Erroneous data means inaccurate data, such as wrong or invalid information, and it occurs in two ways:
• Incorrect values: The values in the dataset are not correct. For example, in the employee
table, in the designation post, the employee's name is mentioned. Since the expected
value was a designation, the analysis of data will be incorrect.
• Invalid or Null values: Some of the values in a dataset get corrupted and hence become invalid. Values that become NaN in the dataset are null values. NaN can arise, for example, from taking the square root of a negative number or dividing zero by zero. These values are removed from the database.
b. Missing Data: In some datasets, data is absent, i.e. some of the observations in the dataset are blank. Missing data cannot always be inferred as an error. For example, in surveys, some respondents are unreachable, and hence no data is recorded for them.
c. Outliers: Data that lies outside the range of the other values in the set are outliers, i.e. data that differ significantly from the rest. Reasons for outliers include inappropriately scaled data, errors during data entry, etc.
For example, when finding the average height of the students of a particular grade, if some entries are wrongly entered, the result would be an incorrect average. A small sketch of spotting such values follows.
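As promised above, here is a minimal sketch of handling null values and outliers with Pandas; the height data is made up to echo the average-height example.

```python
import pandas as pd

# Hypothetical heights (cm) of students in one grade; 999 is a
# data-entry mistake and None is a missing observation
heights = pd.Series([150, 152, 148, 155, None, 999, 151])

# Null values: count and drop them
print(heights.isna().sum())   # -> 1 missing value
clean = heights.dropna()

# Outliers via the interquartile-range (IQR) rule
q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
iqr = q3 - q1
mask = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)
print(clean[mask])            # -> the 999 entry is flagged

# Average with and without the outlier
print(clean.mean(), clean[~mask].mean())
```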
DATA ACCESS
Python supports various packages for accessing and processing tabular data. Some of these packages are:
NUMPY
NumPy, which stands for Numerical Python, is an extensive standard Python library that provides a simple and robust data structure, the n-dimensional array, used for data analysis and scientific computing. An array is a group of homogeneous elements stored together under one name.
Note the distinction: a list is one of the core data types of Python itself, whereas an array is a part of the NumPy library.
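A short sketch contrasting a Python list with a NumPy array:

```python
import numpy as np

# A core Python list can mix types; a NumPy array is homogeneous
py_list = [1, 2, 3, 4]
arr = np.array(py_list)

# Element-wise arithmetic works on arrays, not on plain lists
print(arr * 2)      # -> [2 4 6 8]
print(py_list * 2)  # -> [1, 2, 3, 4, 1, 2, 3, 4] (list repetition)

# Arrays can be n-dimensional
matrix = np.array([[1, 2], [3, 4]])
print(matrix.shape)  # -> (2, 2)
```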
PANDAS
Pandas is Python’s library for analysing datasets and drawing conclusions based on statistical theories. The name ‘Pandas’ is derived from “Panel Data System”. As relevant data is essential in data science, Pandas helps clean datasets to make them suitable for analysis.
Pandas supports high-performance, easy-to-use data structures: series (1-dimensional) and data frames (2-dimensional).
Capabilities of Pandas library:
• Handling substantial datasets.
• Supports multiple data file formats for storing data.
• Supports operations on independent groups within the datasets.
• Selection/ filtration of subsets from bulky datasets and even merging numerous datasets.
• Reshaping and pivoting of datasets in many forms.
• Functionality to find and fill missing data.
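A brief sketch of the two core Pandas structures; the column names and values are made up:

```python
import pandas as pd

# Series: a 1-dimensional labelled array
fares = pd.Series([10, 15, 12], index=["bus", "auto", "taxi"])
print(fares["auto"])  # -> 15

# DataFrame: a 2-dimensional table of rows and columns
df = pd.DataFrame({
    "mode": ["bus", "auto", "taxi"],
    "daily_trips": [120, 45, 30],
    "avg_fare": [10.0, 15.0, 12.0],
})

# Selection/filtration of a subset: modes with more than 40 daily trips
print(df[df["daily_trips"] > 40])

# Finding and filling missing data is one of the listed capabilities
df["avg_fare"] = df["avg_fare"].fillna(df["avg_fare"].mean())
```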
MATPLOTLIB
Matplotlib is a library for 2-dimensional plotting in Python and is used for creating static, animated, and interactive 2D plots or figures. Plots aid in understanding data and in deriving decisions from the trends and patterns visible in graphs. Matplotlib supports chart types such as line plots, bar charts, histograms, scatter plots, pie charts, and box plots.
Features of Matplotlib include publication-quality output, fine-grained control over colours, labels, and legends, export to many file formats (such as PNG, PDF, and SVG), and close integration with NumPy and Pandas.
Python is also a prevalent language when it comes to statistics. Python has a built-in module, ‘statistics’, that is used to perform mathematical statistics on numeric data. A few of the functions of the module are mean(), median(), mode(), stdev(), and variance().
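A quick sketch of the built-in statistics module on made-up trip times:

```python
import statistics

trip_times = [12, 15, 11, 14, 15, 13, 40]  # minutes (made-up values)

print(statistics.mean(trip_times))    # arithmetic average
print(statistics.median(trip_times))  # middle value, robust to the 40 outlier
print(statistics.mode(trip_times))    # most frequent value -> 15
print(statistics.stdev(trip_times))   # sample standard deviation
```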
DATA VISUALIZATION
The pictorial or graphical representation of data using graphs, charts, etc., is known as Data Visualisation. Visualization is a great tool for effectively communicating results to the user. Traffic symbols and the speedometer of a vehicle are a few examples of visualization that we encounter in our daily lives. Visualization of data is used effectively in fields like science, health, finance, engineering, etc.
Python supports many visualization libraries, like Matplotlib, Seaborn, and Folium. As we have already discussed, various plots can be drawn with the help of Matplotlib. Pyplot is a Matplotlib module that provides an interface to draw different types of plots. Let us discuss five key plots for basic data visualization (a small sketch follows the list):
• Scatter Plot
• Line Plot
• Bar Chart
• Histogram Plot
• Box Plot
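As a minimal Pyplot sketch, here are two of these plot types drawn from arbitrary made-up data:

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
trips = [30, 42, 38, 50, 45]  # made-up daily trip counts

# Line plot: the trend of trips across days
plt.plot(days, trips, marker="o")
plt.xlabel("Day")
plt.ylabel("Trips")
plt.title("Daily trips (line plot)")
plt.show()

# Bar chart: the same data as categorical bars
plt.bar(days, trips)
plt.xlabel("Day")
plt.ylabel("Trips")
plt.title("Daily trips (bar chart)")
plt.show()
```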