TLM Week 1: Introduction to Data Science
Data science can be seen as the interdisciplinary field that deals with the creation of insights or data
products from a given set of data files (usually in unstructured form), using analytics methodologies.
The data it handles is often what is commonly known as “big data,” although data science is also
applied to conventional data streams, such as those found in a business’s databases, spreadsheets,
and text documents. We’ll take a closer look at big data in the next section.
Data science is not a guaranteed tool for finding the answers to the questions we have about the
data, though it does a good job at shedding some light on what we are investigating. For example, we
may be interested in figuring out the answer to “How can we predict customer attrition based on the
demographics data we have on them?” This is something that may not be possible with that data
alone.
However, investigating the data may help us come up with other questions, like “Can demographics
data supplement a prediction system for attrition, based on the orders customers have made?” Also,
data science is only as good as the data we have, so it doesn’t make sense to expect breathtaking
insights if the data we have is of low quality.
The term “data science” combines two key elements: “data” and “science.”
1. Data: It refers to the raw information that is collected, stored, and processed. In today’s digital
age, enormous amounts of data are generated from various sources such as sensors, social
media, transactions, and more. This data can come in structured formats (e.g., databases) or
unstructured formats (e.g., text, images, videos).
2. Science: It refers to the systematic study and investigation of phenomena using scientific
methods and principles. Science involves forming hypotheses, conducting experiments,
analyzing data, and drawing conclusions based on evidence.
What is data science used for?
1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening in
the data environment. It is characterized by data visualizations such as pie charts, bar charts, line
graphs, tables, or generated narratives. For example, a flight booking service may record data like the
number of tickets booked each day. Descriptive analysis will reveal booking spikes, booking slumps,
and high-performing months for this service.
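To make the descriptive analysis example a bit more concrete, here is a minimal Python (pandas) sketch, assuming a hypothetical bookings.csv file with date and tickets_booked columns; it simply summarizes monthly booking totals to reveal spikes and slumps:

    import pandas as pd

    # Hypothetical daily booking records with columns: date, tickets_booked
    bookings = pd.read_csv("bookings.csv", parse_dates=["date"])

    # Aggregate daily records into monthly totals
    monthly = bookings.groupby(bookings["date"].dt.to_period("M"))["tickets_booked"].sum()

    print(monthly.describe())               # overall summary statistics
    print(monthly.sort_values().tail(3))    # the three highest-performing months

The output of such a summary is exactly the kind of material that would feed the pie charts, bar charts, and line graphs mentioned above.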
2. Diagnostic analysis
Diagnostic analysis is a deep-dive examination of data to understand why something happened. It is
characterized by techniques such as drill-down, data discovery, data mining, and correlations. For
example, the flight booking service might drill down into a booking slump to check whether it
coincided with a price change, a website outage, or a seasonal dip in demand.
3. Predictive analysis
Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur
in the future. It is characterized by techniques such as machine learning, forecasting, pattern
matching, and predictive modeling. In each of these techniques, computers are trained to identify
patterns and relationships in historical data. For example, at the start of each year the flight service
team might use data science to predict flight booking patterns for the coming year. The program may
look at past data and predict booking spikes for certain destinations in May. Having anticipated its
customers’ future travel requirements, the company could start targeted advertising for those cities
from February.
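As a rough sketch of the predictive analysis idea, the following Python example (scikit-learn) fits a simple linear trend to invented monthly booking totals and forecasts the next twelve months; the numbers are made up, and a real system would also model seasonality (such as the May spike):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical monthly booking totals for the past two years
    history = np.array([120, 135, 160, 210, 260, 240, 230, 225, 200, 180, 150, 140,
                        130, 150, 175, 230, 290, 265, 250, 245, 215, 190, 160, 150])
    months = np.arange(len(history)).reshape(-1, 1)

    # Learn a simple linear trend from past data
    model = LinearRegression().fit(months, history)

    # Forecast the next 12 months
    future = np.arange(len(history), len(history) + 12).reshape(-1, 1)
    print(model.predict(future).round())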
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to
happen but also suggests an optimum response to that outcome. It can analyze the potential
implications of different choices and recommend the best course of action. It uses graph analysis,
simulation, complex event processing, neural networks, and recommendation engines from machine
learning.
Returning to the flight booking example, prescriptive analysis could examine historical marketing
campaigns to make the most of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These forecasts
would give the flight booking company greater confidence in its marketing decisions.
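A minimal sketch of the prescriptive step, assuming hypothetical spend levels and booking forecasts (all numbers invented), could compare the projected net return of each option and recommend the best one:

    # Hypothetical ad-spend options and the bookings a predictive model projects for each
    spend_levels = [10_000, 20_000, 30_000, 40_000]
    projected_bookings = [1_100, 1_900, 2_300, 2_450]
    revenue_per_booking = 80

    # Recommend the spend level with the highest projected net return
    net_returns = [b * revenue_per_booking - s
                   for s, b in zip(spend_levels, projected_bookings)]
    best = max(range(len(spend_levels)), key=lambda i: net_returns[i])
    print(f"Recommended spend: ${spend_levels[best]:,} "
          f"(projected net return ${net_returns[best]:,})")

Real prescriptive systems use far richer techniques (simulation, graph analysis, recommendation engines), but the underlying idea of comparing the implications of different choices is the same.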
Data science is used across many industries, including:
Finance
Marketing
Retail
Transportation
Education
Entertainment
Manufacturing
Energy
Government
DATA SCIENCE VS. BUSINESS INTELLIGENCE
Data Science: Data science is a field in which information and knowledge are extracted from data
using various scientific methods, algorithms, and processes. It can be defined as a combination of
mathematical tools, algorithms, statistics, and machine learning techniques that are used to find
hidden patterns and insights in the data, which help in the decision-making process. Data science
deals with both structured and unstructured data, and it is related to both data mining and big data.
It involves studying historic trends and using those conclusions to redefine present trends and
predict future trends.
Business Intelligence: Business intelligence, in contrast, refers to the technologies and practices used
to collect, integrate, and analyze existing business data in order to report on what has happened and
support day-to-day decision-making.
S. No. | Factor    | Data Science                                                 | Business Intelligence
7.     | Expertise | Its expertise lies with the data scientist.                  | Its expertise lies with the business user.
8.     | Questions | It deals with the questions of what will happen and what if. | It deals with the question of what happened.
DATA SCIENCE VS. STATISTICS
Statistics is a field that is similar to data science and business intelligence, but it has its own domain.
Namely, it involves doing basic manipulations on a set of data (usually tidy and easy to work with) and
applying a set of tests and models to that data. It’s like a conventional vehicle that you drive on city
roads. It does a decent job, but you wouldn’t want to take that vehicle onto country roads or off-road.
For this kind of terrain you’ll need something more robust and better equipped for messy data: data
science. If you have data that comes straight from a database, it’s fairly clean, and all you want to do
is create a simple regression model or check whether February sales are significantly different from
January sales, conventional statistics will work. That’s why statisticians remain in business, even if
most of the methods they use are not as effective as the techniques a data scientist employs.
Scientists make use of statistics, though it is not formally a scientific field. This is an important point.
In fact, even mathematicians look down on the field of statistics, for the simple reason that it fails to
create robust theories that can be generalized to other aspects of mathematics. So, even though
statistical techniques are employed in various areas, they are often seen as less rigorous than most
principles of mathematics and science. Also, statistics is not a fool-proof framework when it comes to
drawing inferences about the data. Despite the confidence metrics it provides, its results are only as
good as the assumptions it makes about the distribution of each variable, and how well these
assumptions hold. This is why many scientists also employ simulation methods to ensure that the
conclusions their statistical models come up with are indeed viable and robust enough to be used in
the real world.
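For the January-versus-February sales check mentioned above, a minimal SciPy sketch might run a two-sample t-test; the daily sales figures here are invented:

    from scipy import stats

    # Hypothetical daily sales for January and February
    january = [210, 195, 230, 220, 205, 215, 225, 200, 240, 210]
    february = [235, 250, 225, 260, 245, 230, 255, 240, 265, 250]

    # Two-sample t-test (unequal variances); a small p-value suggests a significant difference
    t_stat, p_value = stats.ttest_ind(january, february, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

As the paragraph above notes, the p-value is only as trustworthy as the test's assumptions (for example, roughly normal daily sales), which is one reason simulation methods are often used as a cross-check.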
BIG DATA
Big data refers to extremely large datasets that are complex, grow rapidly, and require advanced
techniques and technologies for storage, analysis, and processing. Here’s an overview of big data, its
characteristics, and its applications:
Characteristics of Big Data
1. Volume: The sheer amount of data generated every second from various sources, such as social
media, sensors, transactions, and more.
2. Velocity: The speed at which new data is generated and the pace at which it must be processed
to be useful.
3. Variety: The different types of data, including structured, semi-structured, and unstructured data
(e.g., text, images, videos, sensor data).
4. Veracity: The quality and accuracy of the data, which can vary and affect the reliability of
analysis.
5. Value: The potential insights and benefits that can be derived from analyzing the data.
6. Variability: The way the same data can carry multiple meanings or be formatted differently across
separate data sources.
Technologies and Tools for Big Data
1. Storage Solutions: Distributed file systems like Hadoop Distributed File System (HDFS) and
cloud storage solutions such as Amazon S3.
2. Processing Frameworks:
Spark: A fast and general-purpose cluster computing system for big data processing (see the
PySpark sketch after this list).
3. Data Management: NoSQL databases (e.g., MongoDB, Cassandra) for handling unstructured
data.
4. Data Integration: Tools like Apache Kafka and Apache NiFi for data ingestion and streaming.
5. Analytics and Machine Learning: Tools and frameworks like Apache Mahout, TensorFlow, and
H2O.ai for big data analytics and machine learning.
6. Visualization: Tools like Tableau, Power BI, and D3.js for visualizing large datasets.
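To make the Spark entry above concrete, here is a minimal PySpark sketch, assuming a hypothetical bookings.csv file with a destination column; it counts bookings per destination on a local Spark session:

    from pyspark.sql import SparkSession

    # Start a local Spark session
    spark = SparkSession.builder.appName("BookingCounts").getOrCreate()

    # Read a (hypothetical) CSV file and count bookings per destination
    df = spark.read.csv("bookings.csv", header=True, inferSchema=True)
    df.groupBy("destination").count().orderBy("count", ascending=False).show(10)

    spark.stop()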
Applications of Big Data
1. Healthcare: Predictive analytics for patient care, personalized medicine, and epidemic outbreak
prediction.
2. Government: Smart city initiatives, public safety, and efficient resource management.
Benefits of Big Data
Cost savings. Big data can be used to pinpoint ways businesses can enhance operational
efficiency. For example, analysis of big data on a company's energy use can help it become more
efficient.
Positive social impact. Big data can be used to identify solvable problems, such as improving
healthcare or tackling poverty in a certain area.
Challenges of Big Data
Skill requirements. Deploying and managing big data systems also requires new skills compared
to those that database administrators and developers focused on relational software typically
possess.
Costs. Using a managed cloud service can help keep costs under control. However, IT managers
still must keep a close eye on cloud computing use to make sure costs don't get out of hand.
Migration. Migrating on-premises data sets and processing workloads to the cloud can be a
complex process.
Accessibility. Among the main challenges in managing big data systems is making the data
accessible to data scientists and analysts, especially in distributed environments that include a
mix of different platforms and data stores. To help analysts find relevant data, data management
and analytics teams are increasingly building data catalogs that incorporate metadata
management and data lineage functions.
Integration. The process of integrating sets of big data is also complicated, particularly when data
variety and velocity are factors.
MACHINE LEARNING
Machine learning is a branch of data science in which algorithms learn patterns from data and use
them to make predictions or decisions without being explicitly programmed for each task. Its main
types are:
1. Supervised Learning: The algorithm is trained on labeled data, where each training example is
paired with an output label, and it learns to predict the output from the input.
2. Unsupervised Learning: The algorithm is trained on unlabeled data and tries to learn the underlying
structure of the data without explicit instructions on what to predict.
3. Semi-Supervised Learning: This approach combines a small amount of labeled data with a large
amount of unlabeled data during training. It falls between supervised and unsupervised learning.
4. Reinforcement Learning: An agent learns by interacting with an environment, receiving rewards or
penalties for its actions, with the goal of maximizing cumulative rewards.
Machine learning is used in various domains to solve complex problems and automate tasks. Some
common applications include:
1. Natural Language Processing (NLP): Used in language translation, sentiment analysis, and
chatbots.
2. Computer Vision: Applied in facial recognition, image classification, and autonomous vehicles.
3. Healthcare: Used for disease prediction, personalized treatment plans, and medical imaging
analysis.
4. Finance: Used for fraud detection, algorithmic trading, and credit scoring.
A typical machine learning project moves through stages such as data preparation, data wrangling,
data analysis, and deployment.
Types of Machine Learning Algorithms
Machine learning algorithms can be broadly classified into several categories based on their
learning styles and the nature of tasks they are designed to solve. Here are the primary types of
machine learning algorithms:
1. Supervised Learning Algorithms
Supervised learning algorithms are trained on labeled data. This means that each training example
is paired with an output label. The algorithm learns to predict the output from the input data.
Linear Regression: Used for predicting continuous values. It models the relationship between
a dependent variable and one or more independent variables using a linear equation.
Logistic Regression: Used for binary classification problems. It predicts the probability of a
binary outcome using a logistic function.
Support Vector Machines (SVM): Used for both classification and regression tasks. It finds the
optimal hyperplane that separates data points of different classes with the maximum margin.
Decision Trees: Used for classification and regression tasks. It splits the data into subsets
based on the value of input features, forming a tree-like structure.
Random Forests: An ensemble learning method that combines multiple decision trees to
improve predictive performance and reduce overfitting.
k-Nearest Neighbors (k-NN): A simple, instance-based learning algorithm that classifies a data
point based on the majority class among its k nearest neighbors.
Naive Bayes: Based on Bayes' theorem, it assumes independence between features and is
used for classification tasks.
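As a minimal illustration of two of the supervised algorithms listed above (logistic regression and random forests), the following scikit-learn sketch trains both on a small bundled dataset and reports held-out accuracy:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Labeled data: feature matrix X and binary labels y
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    for model in (LogisticRegression(max_iter=5000),
                  RandomForestClassifier(random_state=0)):
        model.fit(X_train, y_train)                # learn from labeled examples
        accuracy = model.score(X_test, y_test)     # evaluate on held-out data
        print(type(model).__name__, round(accuracy, 3))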
2. Unsupervised Learning Algorithms
Unsupervised learning algorithms are trained on unlabeled data. They try to learn the underlying
structure of the data without any explicit instructions on what to predict.
K-Means Clustering: Partitions data into k clusters based on feature similarity. Each data
point is assigned to the nearest cluster center.
Hierarchical Clustering: Builds a hierarchy of clusters either by merging smaller clusters into
larger ones (agglomerative) or by splitting larger clusters into smaller ones (divisive).
t-Distributed Stochastic Neighbor Embedding (t-SNE): Used for dimensionality reduction and
visualization of high-dimensional data.
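The following sketch shows k-means clustering, one of the unsupervised algorithms above, on synthetic two-dimensional data (the data is generated on the spot, so nothing here depends on an external file):

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled data: two blobs of 2-D points around different centers
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),
                   rng.normal(5, 1, (100, 2))])

    # Partition the points into k=2 clusters based on feature similarity
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)     # learned cluster centers
    print(kmeans.labels_[:5])          # cluster assignments of the first few points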
3. Semi-Supervised Learning Algorithms
Semi-supervised learning algorithms use a combination of a small amount of labeled data and a
large amount of unlabeled data. This approach helps improve learning accuracy when labeled data
is scarce.
Self-Training: Uses a model trained on labeled data to predict labels for the unlabeled data.
The model is then retrained on the combined dataset.
Co-Training: Utilizes two or more models trained on different views of the data to label the
unlabeled data, iteratively improving each other.
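Here is a minimal self-training sketch using scikit-learn's SelfTrainingClassifier; it starts from a fully labeled toy dataset and hides most of the labels (marking them with -1) to mimic the scarce-label setting:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    # Hide about 80% of the labels; -1 marks an unlabeled sample
    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    y_partial = y.copy()
    y_partial[rng.random(len(y)) < 0.8] = -1

    # A base classifier iteratively labels the unlabeled points and is retrained
    model = SelfTrainingClassifier(SVC(probability=True, random_state=0))
    model.fit(X, y_partial)
    print(round(model.score(X, y), 3))   # accuracy against the true labels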
4. Reinforcement Learning Algorithms
Reinforcement learning algorithms learn by interacting with an environment. The algorithm, known
as an agent, takes actions and receives rewards or penalties based on the outcomes of those
actions. The goal is to learn a strategy that maximizes cumulative rewards.
Q-Learning: A value-based method that aims to learn the value of taking a particular action in
a particular state.
Deep Q-Networks (DQN): Combines Q-learning with deep neural networks to handle high-
dimensional state spaces.
Policy Gradient Methods: Directly optimizes the policy by adjusting the parameters in the
direction that maximizes expected rewards.
Proximal Policy Optimization (PPO): An advanced policy gradient method that balances
exploration and exploitation to improve training stability.
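As a sketch of the Q-learning idea, the following example learns action values on a made-up five-state corridor where the agent must walk right to reach a reward; the environment, reward, and hyperparameters are all invented for illustration:

    import numpy as np

    # Tiny corridor: states 0..4, actions 0=left and 1=right, reward on reaching state 4
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate
    rng = np.random.default_rng(0)

    for episode in range(500):
        state = 0
        while state != 4:
            # Epsilon-greedy action selection: sometimes explore, otherwise exploit
            action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
            next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
            reward = 1.0 if next_state == 4 else 0.0
            # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    print(Q.round(2))   # learned values should favor moving right in states 0 through 3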
5. Ensemble Learning Algorithms
Ensemble learning algorithms combine the predictions of multiple base models to produce a final
prediction. This approach often improves the accuracy and robustness of the model.
Bagging (Bootstrap Aggregating): Builds multiple models from different subsamples of the
training dataset and aggregates their predictions. Random Forest is a popular bagging
algorithm.
Boosting: Builds models sequentially, each trying to correct the errors of the previous one.
Common boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
Stacking: Combines multiple models by training a meta-model to make the final prediction
based on the outputs of the base models.
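To close the section, here is a minimal scikit-learn sketch that compares bagging (a random forest), boosting (gradient boosting), and stacking on the same small bundled dataset using cross-validation:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier, StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # One representative of each ensemble style
    models = {
        "bagging (random forest)": RandomForestClassifier(random_state=0),
        "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
        "stacking": StackingClassifier(
            estimators=[("rf", RandomForestClassifier(random_state=0)),
                        ("gb", GradientBoostingClassifier(random_state=0))],
            final_estimator=LogisticRegression(max_iter=5000),   # the meta-model
        ),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")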