
DATA SCIENCE

INTRODUCTION

Data Science is a multi-disciplinary field used for extracting insights from massive amounts of
data using mathematical & scientific models and algorithms. It is one of the domains of Artificial
Intelligence. The analysis of the data helps in making the machine intelligent. A meaningful and
informative dataset for decision making is generated by applying extensive data processing on
the unstructured data.

For example – Data Science is used by various companies like Netflix, Google and Amazon for
developing robust recommendation systems for their visitors. Similarly, to predict stock prices,
different financial companies are using multiple predictive analytics and forecasting methods.

Why Data Science?

Significant advantages of using Data Science technology:


• It converts data into information with the help of the right tools, technologies and algorithms; the processed information is suitable for business decision-making.
• Data Science can help detect fraud using advanced machine learning algorithms.
• It helps prevent significant monetary losses.
• It helps build intelligent capabilities into machines.
• It enables us to make better and faster decisions.
• It helps in recommending suitable products to customers, which ultimately enhances
business.

APPLICATIONS OF DATA SCIENCE


Industries worldwide rely on data and information. Data science has become the fuel for these industries.

Ø Image recognition and speech recognition:


Data science is extensively used in the field of image and speech recognition. When an
image is uploaded on Facebook, it gives suggestions to tag friends; image recognition
algorithms are used to provide these automatic tagging recommendations.
Also, devices like Alexa, Siri, etc., use speech recognition algorithms to respond to
voice commands.

Ø Internet search:
Various Internet search engines like Google, Yahoo, Bing, Ask, etc., use data
science technology to enhance the search experience for users. The search results are
listed within a fraction of a second.

Ø E-Commerce
To provide an enhanced, personalized experience and accurate recommendations to
users, e-commerce companies such as Amazon, Flipkart, Netflix, Google Play, etc.,
use data science technology. For example, when searching for a product on any
e-commerce site, auto-suggestions for similar products pop up thanks to data science
technology.
Benefits of data science in the field of E-commerce:-
• For identifying a potential customer base.
• Predictive analytics is used for making a forecast about the products and services.
• Companies are optimizing their pricing structures for their consumers.

Ø Banking
The banking industry is one of the prime application areas of data science. With the help of data
science, it has become easier for banks to manage their resources efficiently.
Data science has benefited the Banking industry in the following areas:-
§ Fraud detection
§ Management of customer data
§ Customer segmentation
§ Risk detection

Ø Healthcare
Data Science has a crucial role in the Healthcare Industry. Doctors can detect cancer and
tumours at an early stage using Image Recognition software.
The genetics industry uses data science for analyzing and classifying patterns in genomic
sequences. Various virtual assistants are also helping patients resolve their physical
and mental ailments. In the field of disease research, the techniques of data science
provide greater insight into genetic issues.

Ø Travel, Transportation & Logistics


The transportation industry ensures the efficient and secure movement of people
and goods from one place to another. Data science platforms help make travel,
transportation and logistics safer and more reliable.
Benefits of Data Science in the transportation and logistics industry:-
• Road, Railway and Air Traffic Management
• Ship Monitoring and Route Optimization
• Analyze traffic to optimize network flow and improve travel experiences
• Better predict delays to improve scheduling of support resources
REVISITING THE AI PROJECT CYCLE

The field of computer science dealing with building smarter machines capable of performing
tasks and making decisions the same way a human does is known as Artificial Intelligence.

AI PROJECT CYCLE

The Scenario

Transportation is considered an indispensable part of human life and the backbone of any country's
economy. It plays a crucial role in enhancing the lifestyle of the common man by providing the
facilities and accessibility he requires. Rural regions constantly struggle with services and
facilities due to their remote and dispersed locations. Effective and efficient
transportation can mitigate the regional problems of rural areas by providing access to employment,
health, education and services.

The common man's problem in villages, when commuting for work, study, etc., is the lack of suitable
and convenient public transportation like a bus, auto or taxi. The objective of the proposed study is to
provide accessibility and proper transportation services to these rural regions.

Problem Scoping

a) Who – It involves all those affected directly or indirectly by the problem;
they are the stakeholders.

b) What – Under this block, pieces of evidence are gathered to prove that the problem
exists. It helps us understand and recognize the nature of the problem.

c) Where – It involves the situation and location of the problem.

d) Why – Under this block, the decision about whether the given problem is worth
solving is made.

In our Scenario, the Problem Template:-

Who?    Stakeholders: Students, employees, business owners, etc., commuting for education, employment and other purposes.
What?   Problem: Non-availability of suitable and convenient public transportation (government/private) like bus, auto or taxi.
Where?  Location: During morning hours while going to school/college and offices, and in the afternoon while returning; also, emergency transportation for healthcare services.
Why?    An Ideal Solution: Based on accurate data supplied to the service providers and the AI system for transportation, a proper timetable and services can be devised for the public in rural areas.

Data Acquisition

After the goal of our project is finalized, the next step is to look at the various data features which
affect the problem. As data is the fuel for AI-based projects, the goal is to collect the right kind of
data. In the above project, the affecting factors are:-
a. Number of students and people using the transportation system
b. Time of usage of the transportation system
c. Cost incurred during transportation
d. Days with heavy traffic load
e. Days with very light traffic
On studying the pattern of the data for a week, the database has to be created. The various
techniques for collecting the data will be face-to-face interviews, telephonic interviews, pilot
surveys and household surveys.
To collect the travel data, GPS and GIS technologies are used. GPS data helps provide real-
time spatial information and shows travel behaviour, including distance, travel speed and trip
time. The data collected can be used for the development of transportation policy.

Data Exploration

The next step after the creation of the database is to analyze the data and interpret it.
The required information extracted from the curated dataset is cleaned so that no errors
or missing elements remain in it.

Modelling

Model selection is the next step after the dataset is ready. A clustering algorithm, an
unsupervised machine learning technique, is chosen for the transportation system. Clustering has proven
its efficiency in developing intelligent transportation systems and is applied in various
categories of transportation planning, such as trip generation, traffic zone division and trip
distribution.
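The clustering step can be illustrated with a toy sketch. The one-dimensional k-means routine and the daily passenger counts below are illustrative assumptions only, not the study's actual model or data:

```python
# A minimal, pure-Python sketch of k-means clustering (k = 2).
# Illustrative only -- a real project would use a library implementation.
def kmeans_1d(values, k=2, iterations=10):
    # Start the centroids at the smallest and largest values.
    centroids = [min(values), max(values)]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical passenger counts: weekdays and weekends form two groups.
trips = [120, 130, 125, 128, 40, 35, 45]
centroids, clusters = kmeans_1d(trips)
print(sorted(round(c) for c in centroids))   # [40, 126]
```

In practice the data would have more dimensions (time of day, route, cost) and a library implementation would be used, but the assign-then-update loop is the same idea.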

Evaluation

After training the model on the dataset, the accuracy of the algorithm is checked to verify
whether it is working correctly.

1. The trained algorithm is fed data regarding the number of trips, the number of persons
travelling and the travelling time.
2. It is then fed data regarding the capacity of the public transportation utilized.
3. The algorithm then works upon the entries according to the training it got at the
modelling stage.
4. The Model predicts the number of trips to be run for practical usage of public
transportation.
5. The prediction is compared to the testing dataset value.
6. The model is tested for ten testing datasets.
7. Prediction values of the testing dataset are compared to the actual values.
8. If the prediction value is the same or almost similar to the actual values, the model is
accurate. Otherwise, either the model selection is changed, or the model is trained on
more data for better accuracy. Once the model is able to achieve optimum efficiency, it is
ready to be deployed for real-time usage.
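Steps 5 to 8 above can be sketched in a few lines. The trip counts below are hypothetical, and treating a prediction within one trip of the actual value as "almost similar" is an assumed tolerance:

```python
# A hedged sketch of the evaluation step: comparing predicted trip
# counts against actual values from a (hypothetical) testing dataset.
actual    = [10, 12, 9, 14, 11]
predicted = [10, 11, 9, 15, 11]

# Count predictions within +/-1 trip of the actual value as correct.
tolerance = 1
hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
accuracy = hits / len(actual)
print(f"Accuracy: {accuracy:.0%}")   # Accuracy: 100%
```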
DATA COLLECTION

Our society is highly dependent on data. Accurate data collection is necessary to make precise
business decisions, safeguard quality assurance and keep research reliable. The systematic
approach of gathering observations and creating a database for data analysis is Data Collection. The
process of data collection is non-technical and does not require experts.

SOURCES OF DATA

Data collection is the fundamental basis of data analysis in Data Science. Raw facts and figures
are known as data. The sources of data are mainly of two types:

i. Primary data source: Data collected at its origin, i.e. from the reports and
records published within the organization, is a primary data source. Primary data is
original and hence a more reliable source of data.
ii. Secondary data source: Data collected from an outside agency or organization is
a secondary data source. Secondary data is not original; it has already been analyzed and has
undergone some statistical operations.

DATA COLLECTION TECHNIQUES

Offline Techniques:-
• Interviews
• Questionnaires
• In-person Surveys

Online Techniques:-
• Sales Reports
• Business Journals
• Government Records (e.g., census, tax records, Social Security info)

While accessing data from any of the data sources, following points should be kept in mind:

1. Data which is available for public usage only should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone’s privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random
sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of data which helps in proper training of
the AI model.

TYPES OF DATA

There are two general types of data:-


• QUANTITATIVE DATA
Quantitative data is measurable information expressed as a numeric value.
Examples include:
a) Class attendance.
b) Marks obtained by a child in a subject
c) Price of any item like Book/Pen/Pencil etc.
d) Distance of the School from the railway station
• QUALITATIVE DATA
Qualitative data is information about qualities, which can’t be measured.
Examples include:
a) Sharing the experience of the first day at the School.
b) Recommendation of a book / Book review

FILE / DATA FORMATS IN DATA SCIENCE


Some of the popular forms of file formats are:-
a) CSV: CSV is a widely accepted data format. CSV stands for comma-separated values.
The data in CSV files are stored in text files separated by a delimiter, and the default
delimiter is a comma.
b) TSV: TSV stands for Tab-Separated Values and is quite similar to the CSV format. TSV
files are a popular method of exchanging data among databases and spreadsheets.
c) XLSX: In XLS/XLSX format, data is stored and organized in rows and columns of a
spreadsheet file created with Microsoft Excel.
d) SQL: Full form of SQL is Structured Query Language. It is a language used to store,
retrieve and modify data from a database.
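The CSV format in (a) can be written and read with Python's built-in csv module; the file name and values below are hypothetical:

```python
import csv

# Write a small CSV file so the example is self-contained.
with open("students.csv", "w", newline="") as f:
    writer = csv.writer(f)              # default delimiter is ','
    writer.writerow(["name", "marks"])  # header row
    writer.writerow(["Asha", 91])
    writer.writerow(["Ravi", 84])

# Read it back; each row comes out as a list of strings.
with open("students.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows)   # [['name', 'marks'], ['Asha', '91'], ['Ravi', '84']]
```

The same calls handle TSV files by passing delimiter="\t" to csv.reader and csv.writer.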

ERRORS IN DATA COLLECTION

During the data collection process, there can be some errors in the data as follows:-
a. Erroneous Data: Erroneous data means inaccurate data, which includes wrong information,
corrupted values, etc. It occurs in two ways:-
• Incorrect values: The values in the dataset are not correct. For example, in an employee
table, an employee's name is entered in the designation column. Since the expected
value was a designation, the analysis of the data will be incorrect.
• Invalid or null values: Some of the values in a dataset get corrupted and hence
become invalid. Such values often appear as NaN in the dataset and are null values. NaN
can arise, for example, from taking the square root of a negative number or from division
by zero. These values are removed from the database.

b. Missing Data: In some datasets, the data is absent or is missing, i.e. some of the observations
in datasets are blank. The absence of data/ missing data cannot be inferred as an error. For example,
In surveys, some respondents are unreachable, and hence dataset for them is not recorded.

c. Outliers: Data that lies outside the range of the other values in the set is an outlier, i.e.
data that differs significantly from the rest. Reasons for outliers include inappropriately
scaled data, errors during data entry, etc.
For example, when finding the average height of students of a particular grade, if some
entries are wrongly entered, the result would be an incorrect average.
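Invalid (NaN) values and outliers can both be handled with a short sketch. The height values, the deliberate NaN entry and the crude 50 cm cut-off are all illustrative assumptions:

```python
import math
import statistics

heights = [150.0, 152.0, 149.0, float("nan"), 151.0, 500.0]  # in cm

# Remove invalid (NaN) values before analysis.
clean = [h for h in heights if not math.isnan(h)]

# Flag outliers: values far from the median (a crude 50 cm cut-off).
median = statistics.median(clean)
outliers = [h for h in clean if abs(h - median) > 50]
print(outliers)   # [500.0] -- likely a data-entry error
```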
DATA ACCESS

Python supports various packages for accessing the tabular data and processing the data. Some of
these packages are:-

NUMPY

NumPy stands for Numerical Python. It is a widely used Python library (a separate package,
not part of the standard library) that provides a simple and robust data structure, the
n-dimensional array, used for data analysis and scientific computing. An array is a group of
homogeneous elements stored together under one name.

Lists vs. Arrays (ndarray):-

• A list is a collection of heterogeneous elements, i.e. elements can be of different data types, for example [100, 3.14, ‘hello’, ‘artificial intelligence’]. An array is a group of homogeneous elements, i.e. all elements are of the same data type; for example, an array with float values may be [1.25, 5.5, 2.75, 12.56].
• Elements of a list are not stored contiguously in memory, whereas all the elements of an array are stored in contiguous memory locations.
• Lists take more space in memory and are less efficient, because Python stores the type information of each element of the list; a NumPy array uses less space, because arrays do not require space to store the data type of each element separately.
• List is one of the core data types of Python, whereas an array is part of the NumPy library.
PANDAS
Python Pandas is Python’s library for analyzing datasets and drawing conclusions based on
statistical theories. The name ‘Pandas’ is derived from ‘Panel Data’. As relevant data is
essential in data science, Pandas helps clean datasets to make them suitable for analysis.
Pandas supports high-performance, easy-to-use data structures: Series (1-dimensional) and
DataFrames (2-dimensional).
Capabilities of Pandas library:
• Handling substantial datasets.
• Supports multiple data file formats for storing data.
• Supports operations on independent groups within the datasets.
• Selection/ filtration of subsets from bulky datasets and even merging numerous datasets.
• Reshaping and pivoting of datasets in many forms.
• Functionality to find and fill missing data.
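A minimal sketch of the two structures and the missing-data capability (the city names and trip counts are hypothetical):

```python
import pandas as pd

# A one-dimensional Series.
s = pd.Series([10, 20, 30])
print(s.mean())   # 20.0

# A two-dimensional DataFrame with one missing value.
df = pd.DataFrame({"city": ["Agra", "Pune"], "trips": [120, None]})
df["trips"] = df["trips"].fillna(0)   # fill missing data with 0
print(df["trips"].tolist())           # [120.0, 0.0]
```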

MATPLOTLIB

Matplotlib is a library for 2-dimensional plotting with Python and is used for creating static,
animated and interactive 2D plots or figures. Plots aid in understanding data and deriving
decisions based on the trends and patterns visible in the graphs. Matplotlib supports the following
types of charts:-

Bar Chart Histogram Pie Chart Scatter Chart Area Chart

Features of Matplotlib:

• Matplotlib is an open-source drawing library that supports various chart types.
• It supports features to make plots more communicative and descriptive.
• Simple code is required to generate plots like histograms, bar charts and other types of
charts.
• It can be used in web application servers, shells and Python scripts.

STATISTICS WITH PYTHON

Python is a prevalent language when it comes to data analysis and statistics. Python has a
built-in module, ‘statistics’, that is used to perform mathematical statistics on numeric data.
A few of the functions of the module are:-

• Mean: Also known as the Arithmetic Average, it is the sum of all observations in the
dataset divided by the number of observations.
• Median: Also known as the 50th percentile, it is the middle element of all the
observations arranged in ascending order in a dataset.
• Mode: The most common observation occurring in a dataset.
• Standard Deviation: The square root of the variance; it is used to identify outliers
in the data.
• Variance: The average of the squares of the differences between the mean and each
observation of the dataset.
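The five functions can be tried directly with the built-in module (the marks below are hypothetical):

```python
import statistics

marks = [72, 85, 85, 90, 68]

print(statistics.mean(marks))      # 80   -> sum 400 / 5 observations
print(statistics.median(marks))    # 85   -> middle of 68, 72, 85, 85, 90
print(statistics.mode(marks))      # 85   -> occurs twice
print(statistics.variance(marks))  # 89.5 -> sample variance
print(statistics.stdev(marks))     # square root of the variance
```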

DATA VISUALIZATION

The pictorial or graphical representation of data using a graph, chart, etc., is known as Data
Visualisation. Visualization is a great tool for effectively communicating results to the user.
Traffic symbols and the speedometer of a vehicle are a few examples of visualization
that we encounter in our daily lives. Visualization of data is used effectively in fields like science,
health, finance, engineering, etc.
Python supports many visualization libraries, such as Matplotlib, Seaborn and Folium. As we have
already discussed, various plots can be drawn with the help of Matplotlib. Pyplot is a
Matplotlib module that provides an interface for drawing different types of plots. Let us discuss five
key plots for basic data visualization:
• Scatter Plot
• Line Plot
• Bar Chart
• Histogram Plot
• Box Plot
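As a sketch, a line plot of hypothetical daily trip counts can be drawn with Pyplot (saved to a file here; the Agg backend avoids needing a display):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; no display required
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
trips = [120, 130, 125, 128, 140]   # hypothetical daily trip counts

fig, ax = plt.subplots()
ax.plot(days, trips, marker="o")    # line plot with point markers
ax.set_xlabel("Day")
ax.set_ylabel("Number of trips")
ax.set_title("Daily public-transport trips")
fig.savefig("trips.png")            # write the figure to an image file
```

Swapping ax.plot for ax.bar, ax.scatter, ax.hist or ax.boxplot gives the other chart types listed above.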
