
Q.1 Explain process of working with data from files in Data Science.
Working with data from files is a crucial step in the data science workflow. Here's an overview of the process:
Step 1: Data Ingestion
- Collect and gather data from various file sources (e.g., CSV, Excel, JSON, text files).
- Use programming languages like Python, R, or SQL to read and import data from files.
Step 2: Data Inspection
- Examine the data to understand its structure, quality, and content.
- Use summary statistics, data visualization, and data profiling techniques to identify patterns, outliers, and missing values.
Step 3: Data Cleaning
- Handle missing values, duplicates, and inconsistent data entries.
- Perform data normalization, feature scaling, and data transformation as needed.
Step 4: Data Transformation
- Convert data types, perform data aggregation, and create new features.
- Use data manipulation techniques, such as pivoting, melting, and merging.
Step 5: Data Storage
- Store cleaned and transformed data in a suitable format (e.g., Pandas DataFrame, NumPy array, SQL database).
- Consider using data storage solutions like data warehouses, data lakes, or cloud storage.
Step 6: Data Analysis
- Apply statistical and machine learning techniques to extract insights and meaning from the data.
- Use data visualization tools to communicate findings and results.
Step 7: Data Visualization and Communication
- Present findings and insights to stakeholders using clear and effective visualizations.
- Use storytelling techniques to convey the significance and impact of the results.
Some popular tools and technologies used in this process include:
- Pandas and NumPy for data manipulation and analysis
- Matplotlib and Seaborn for data visualization
- Scikit-learn and TensorFlow for machine learning
- SQL and NoSQL databases for data storage
- Jupyter Notebooks and R Studio for interactive data exploration and analysis
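Example (a minimal sketch of these steps with Pandas; the file name sales.csv and its columns are assumed for illustration):
import pandas as pd
# Step 1: Ingest data from a file
df = pd.read_csv('sales.csv')
# Step 2: Inspect structure, types, and summary statistics
print(df.info())
print(df.describe())
# Step 3: Clean - remove duplicates and fill missing numeric values
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))
# Step 5: Store the cleaned data for later analysis
df.to_csv('sales_clean.csv', index=False)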
Q2. Explain use of NumPy arrays for efficient data manipulation.
NumPy (Numerical Python) arrays are a fundamental data structure in scientific computing and data analysis. They provide an efficient way to store and manipulate large datasets. Here's how NumPy arrays support efficient data manipulation:
Advantages of NumPy Arrays
1. Vectorized Operations: NumPy arrays enable vectorized operations, which allow you to perform operations on entire arrays at once. This eliminates the need for loops, making your code faster and more concise.
2. Memory Efficiency: NumPy arrays store data in a contiguous block of memory, which reduces memory usage and improves cache locality. This leads to faster data access and manipulation.
3. Broadcasting: NumPy arrays support broadcasting, which allows you to perform operations on arrays with different shapes and sizes.
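Example (a small sketch contrasting a vectorized operation with the loop it replaces):
import numpy as np
prices = np.array([10.0, 20.0, 30.0, 40.0])
quantities = np.array([2, 3, 4, 5])
# Vectorized: element-wise multiplication over the whole arrays, no explicit loop
revenue = prices * quantities
print(revenue)        # [ 20.  60. 120. 200.]
print(revenue.sum())  # 400.0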
Q3. Explain structure of data in Pandas and its importance in large datasets
Pandas is a powerful Python library used for data manipulation and analysis. The structure of data in Pandas is based on two primary data structures: Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure with columns of potentially different types).
Pandas Data Structures:
1. Series: A Series is a one-dimensional labeled array of values. It's similar to a column in a spreadsheet or a column in a relational database. Each value in the Series is associated with a unique index label.
Example:

import pandas as pd
# Create a Series
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(series)
Output:
a    1
b    2
c    3
d    4
e    5
dtype: int64
2. DataFrame: A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's similar to an Excel spreadsheet or a table in a relational database. Each column in the DataFrame is a Series, and each row is identified by a unique index label.
Example:
import pandas as pd
# Create a DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'UK', 'Australia', 'Germany']}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age    Country
0   John   28        USA
1   Anna   24         UK
2  Peter   35  Australia
3  Linda   32    Germany
Importance of Pandas Data Structures in Large Datasets:
1. Efficient Data Storage: Pandas DataFrames and Series provide efficient data storage, allowing you to store large datasets in memory.
2. Fast Data Manipulation: Pandas provides various methods for fast data manipulation, such as filtering, sorting, grouping, and merging.
3. Data Alignment: Pandas DataFrames and Series provide data alignment, which ensures that data is properly aligned and indexed, making it easier to perform data analysis.
4. Missing Data Handling: Pandas provides built-in support for missing data handling, allowing you to easily detect and handle missing values in your dataset.
5. Integration with Other Libraries: Pandas integrates well with other popular data science libraries, such as NumPy, Matplotlib, Scikit-learn, and Statsmodels.
In summary, Pandas data structures provide an efficient and flexible way to store and manipulate large datasets, making it an essential library for data science tasks.
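Example (a short sketch, with made-up values, of the data alignment and missing-data handling mentioned above):
import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
# Values are aligned by index label; labels present in only one Series give NaN
total = s1 + s2
print(total)
# Detect and fill the missing values
print(total.isna())
print(total.fillna(0))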
Q4. Explain different data loading and storage formats for Data Science projects.
In Data Science projects, data loading and storage formats play a crucial role in efficient data processing, analysis, and modeling. Here are different data loading and storage formats commonly used in Data Science:
1. CSV (Comma Separated Values)
- A plain text file format used for tabular data.
- Widely supported by most data science tools and libraries.
- Easy to read and write, but can be slow for large datasets.
2. JSON (JavaScript Object Notation)
- A lightweight, human-readable data interchange format.
- Suitable for semi-structured data, such as web scraping or API data.
- Can be slower to read and write compared to binary formats.
3. HDF5 (Hierarchical Data Format 5)
- A binary format designed for large, complex datasets.
- Supports hierarchical data structures and efficient data compression.
- Widely used in scientific computing, but may require additional libraries.
4. Apache Parquet
- A columnar storage format designed for big data analytics.
- Optimized for querying and processing large datasets.
- Supported by many big data technologies, including Hadoop and Spark.
5. Apache Arrow
- A cross-language, columnar memory format for big data analytics.

- Designed for high-performance data processing and interchange.
- Supported by many big data technologies, including Pandas, NumPy, and Spark.
6. Pickle
- A Python-specific binary format for serializing and deserializing data.
- Fast and efficient, but specific to Python and may not be compatible with other languages.
7. SQL Databases
- Relational databases that store data in tables with defined schemas.
- Suitable for structured data and support SQL queries for data analysis.
- Examples include MySQL, PostgreSQL, and SQLite.
8. NoSQL Databases
- Non-relational databases that store data in flexible, dynamic schemas.
- Suitable for semi-structured or unstructured data, such as documents or graphs.
- Examples include MongoDB, Cassandra, and Neo4j.
When choosing a data loading and storage format, consider factors such as:
- Data size and complexity
- Performance requirements
- Compatibility with tools and libraries
- Data structure and schema
- Security and data governance requirements
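Example (a rough sketch writing and reading the same DataFrame in a few of these formats; Parquet support assumes the pyarrow or fastparquet package is installed):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'value': [10.5, 20.1, 30.7]})
# CSV: plain text, widely supported
df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv')
# JSON: human-readable, good for semi-structured data
df.to_json('data.json', orient='records')
df_json = pd.read_json('data.json')
# Parquet: columnar binary format, efficient for large datasets
df.to_parquet('data.parquet')
df_parquet = pd.read_parquet('data.parquet')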
Q5. Explain the process of reshaping and pivoting data for effective analysis.
Reshaping and pivoting data are essential steps in data preparation for effective analysis. Here's a step-by-step guide on how to reshape and pivot data:
Reshaping Data
Reshaping data involves transforming data from a wide format to a long format or vice versa.
1. Wide Format: In a wide format, each row represents a single observation, and each column represents a variable.
2. Long Format: In a long format, each row represents a single observation-variable pair.
Tools for Reshaping Data
1. Pandas melt() function: Use the melt() function to transform data from wide to long format.
2. Pandas pivot() function: Use the pivot() function to transform data from long to wide format.
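Example (a small sketch of melt() going from wide to long format; the column names are made up):
import pandas as pd
wide = pd.DataFrame({'Region': ['North', 'South'],
                     'Product A': [100, 150],
                     'Product B': [200, 250]})
# Wide to long: one row per Region/Product pair
long = pd.melt(wide, id_vars='Region', var_name='Product', value_name='Sales')
print(long)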
Pivoting Data
Pivoting data involves rotating data from a state of rows to columns or vice versa, creating a spreadsheet-style pivot table.
Types of Pivoting
1. Simple Pivoting: Rotate data from rows to columns.
2. Aggregated Pivoting: Rotate data from rows to columns and perform aggregation operations (e.g., sum, mean, count).
Tools for Pivoting Data
1. Pandas pivot_table() function: Use the pivot_table() function to create a pivot table.
2. Pandas pivot() function: Use the pivot() function to perform simple pivoting.
Example Use Case
Suppose we have a dataset containing sales data for different products across various regions.
| Region | Product A | Product B | Product C |
| --- | --- | --- | --- |
| North | 100 | 200 | 300 |
| South | 150 | 250 | 350 |
| East | 200 | 300 | 400 |
| West | 250 | 350 | 450 |
To analyze sales data by product and region, we can pivot the data using the pivot_table() function.

import pandas as pd
# Create a sample dataset
data = {'Region': ['North', 'South', 'East', 'West'],
        'Product A': [100, 150, 200, 250],
        'Product B': [200, 250, 300, 350],
        'Product C': [300, 350, 400, 450]}
df = pd.DataFrame(data)
# Pivot the data
pivoted_df = pd.pivot_table(df, values=['Product A', 'Product B', 'Product C'],
                            index='Region', aggfunc='sum')
print(pivoted_df)
Output:
| Region | Product A | Product B | Product C |
| --- | --- | --- | --- |
| East | 200 | 300 | 400 |
| North | 100 | 200 | 300 |
| South | 150 | 250 | 350 |
| West | 250 | 350 | 450 |
By pivoting the data, we can easily analyze sales data by product and region.
Q6. Explain role of data exploration in Data Science projects

Data exploration is a crucial step in Data Science projects that involves visually and statistically examining the data to understand its underlying structure, patterns, and relationships. The primary goal of data exploration is to gain insights into the data, identify potential issues, and inform the subsequent steps of the project.
Key Objectives of Data Exploration:
1. Understand the data distribution: Examine the distribution of values in each variable, including central tendency, dispersion, and skewness.
2. Identify patterns and relationships: Look for correlations, trends, and relationships between variables.
3. Detect outliers and anomalies: Identify data points that are significantly different from the rest of the data.
4. Assess data quality: Check for missing values, duplicates, and inconsistencies in the data.
5. Inform feature engineering: Use insights gained during exploration to inform the creation of new features or transformation of existing ones.
Techniques Used in Data Exploration:
1. Summary statistics: Calculate mean, median, mode, standard deviation, and variance for numerical variables.
2. Data visualization: Use plots, charts, and heatmaps to visualize the data distribution, patterns, and relationships.
3. Correlation analysis: Examine the correlation between numerical variables using correlation coefficients (e.g., Pearson's r).
4. Scatter plots: Visualize the relationship between two numerical variables.
5. Box plots: Compare the distribution of numerical variables across different categories.
Tools Used in Data Exploration:
1. Pandas: A popular Python library for data manipulation and analysis.
2. Matplotlib: A Python library for creating static, animated, and interactive visualizations.
3. Seaborn: A Python library built on top of Matplotlib for creating informative and attractive statistical graphics.
4. Plotly: A Python library for creating interactive, web-based visualizations.
5. Jupyter Notebook: A web-based interactive environment for working with data and visualizing results.
Best Practices for Data Exploration:
1. Start with a clear question or objective: Focus your exploration on a specific question or hypothesis.
2. Use a combination of techniques: Employ multiple techniques, such as summary statistics, visualization, and correlation analysis, to gain a comprehensive understanding of the data.
3. Be iterative: Refine your exploration as you gain insights and identify new questions or areas of interest.
4. Document your findings: Record your observations, insights, and conclusions to inform subsequent steps in the project.
Data cleaning and sampling are crucial steps in a data science project that
Techniques Used in Data Exploration: ensure the quality and reliability of the data. Here's a step-by-step guide on the
1. Summary statistics: Calculate mean, median, mode, standard deviation, and process of data cleaning and sampling:
variance for numerical variables. Data Cleaning
2. Data visualization: Use plots, charts, and heatmaps to visualize the data Data cleaning involves identifying and correcting errors, inconsistencies, and
distribution, patterns, and relationships. inaccuracies in the data.
3. Correlation analysis: Examine the correlation between numerical variables Steps in Data Cleaning
using correlation coefficients (e.g., Pearson's r). 1. Data Inspection: Examine the data to identify errors, inconsistencies, and
4. Scatter plots: Visualize the relationship between two numerical variables. inaccuracies.
5. Box plots: Compare the distribution of numerical variables across different 2. Handling Missing Values: Decide on a strategy to handle missing values,
categories. such as imputation, interpolation, or deletion.
Tools Used in Data Exploration: 3. Data Normalization: Normalize data to ensure consistency in formatting
1. Pandas: A popular Python library for data manipulation and analysis. and scaling.
2. Matplotlib: A Python library for creating static, animated, and interactive 4. Data Transformation: Transform data to ensure it meets the requirements of
visualizations. the analysis or model.
3. Seaborn: A Python library built on top of Matplotlib for creating 5. Data Quality Check: Perform a final quality check to ensure the data is
informative and attractive statistical graphics. accurate, complete, and consistent.
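Example (a minimal cleaning sketch with Pandas; the columns and values are made up):
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [25, np.nan, 47, 47],
                   'city': ['Pune', 'Delhi', None, None]})
# Inspect: count missing values per column
print(df.isna().sum())
# Handle missing values: impute the numeric column, drop rows missing 'city'
df['age'] = df['age'].fillna(df['age'].mean())
df = df.dropna(subset=['city'])
# Remove duplicate rows
df = df.drop_duplicates()
print(df)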
Data Sampling

Data sampling involves selecting a subset of data from the original dataset to reduce the size of the data while maintaining its representativeness.
Types of Data Sampling
1. Random Sampling: Select a random subset of data from the original dataset.
2. Stratified Sampling: Divide the data into subgroups based on relevant characteristics and select a random subset from each subgroup.
3. Cluster Sampling: Divide the data into clusters based on relevant characteristics and select a random subset of clusters.
Steps in Data Sampling
1. Determine the Sampling Method: Choose a suitable sampling method based on the characteristics of the data and the goals of the analysis.
2. Determine the Sample Size: Calculate the required sample size based on the desired level of precision, confidence, and power.
3. Select the Sample: Use the chosen sampling method to select the sample from the original dataset.
4. Evaluate the Sample: Assess the representativeness of the sample and its suitability for the analysis or model.
Tools and Techniques for Data Cleaning and Sampling
1. Pandas: A popular Python library for data manipulation and analysis.
2. NumPy: A library for efficient numerical computation in Python.
3. Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python.
4. Scikit-learn: A machine learning library for Python that includes tools for data preprocessing, feature selection, and model evaluation.
By following these steps and using these tools and techniques, you can ensure that your data is clean, reliable, and representative, which is essential for accurate analysis and modeling in data science projects.
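Example (a small sketch of random and stratified sampling with Pandas; the dataset is made up):
import pandas as pd
df = pd.DataFrame({'region': ['North'] * 50 + ['South'] * 50,
                   'sales': range(100)})
# Random sampling: select 10% of the rows
random_sample = df.sample(frac=0.1, random_state=42)
# Stratified sampling: select 10% of the rows within each region
stratified_sample = df.groupby('region', group_keys=False).sample(frac=0.1, random_state=42)
print(random_sample.shape, stratified_sample.shape)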
Q8. Explain the concept of broadcasting in NumPy. How does it help in data processing?
Broadcasting is a powerful feature in NumPy that allows you to perform operations on arrays with different shapes and sizes. It enables you to write concise and efficient code for various data processing tasks.
What is Broadcasting?
Broadcasting is the process of aligning arrays with different shapes and sizes to perform element-wise operations. When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when:
1. They are equal.
2. One of them is 1.
If these conditions are not met, a ValueError is raised.
How Broadcasting Works
Here's an example to illustrate broadcasting:
import numpy as np
# Create two arrays
a = np.array([1, 2, 3])  # shape: (3,)
b = np.array([4])        # shape: (1,)
# Perform element-wise addition
result = a + b
print(result)  # Output: [5 6 7]
In this example, a has shape (3,) and b has shape (1,). To perform the addition, NumPy broadcasts b to match the shape of a. The resulting array has shape (3,).
Benefits of Broadcasting
Broadcasting provides several benefits in data processing:
1. Concise Code: Broadcasting allows you to write concise and expressive code, reducing the need for explicit loops.
2. Efficient Computation: By avoiding explicit loops, broadcasting enables efficient computation and reduces overhead.
3. Flexibility: Broadcasting supports operations on arrays with different shapes and sizes, making it a versatile tool for data processing.
Common Use Cases for Broadcasting
1. Element-wise Operations: Broadcasting is commonly used for element-wise operations like addition, subtraction, multiplication, and division.
2. Array Multiplication: Broadcasting is useful for multiplying arrays with different shapes, such as multiplying a matrix by a row or column vector.

3. Data Transformation: Broadcasting can be used to transform data by applying element-wise functions, such as scaling, normalization, or feature extraction.
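Example (a further sketch: broadcasting a 1-D array across the rows of a 2-D array, a common way to scale or center columns):
import numpy as np
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])  # shape: (2, 3)
col_means = data.mean(axis=0)       # shape: (3,)
# The (3,) array is broadcast across both rows of the (2, 3) array
centered = data - col_means
print(centered)  # each column now has mean 0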
In summary, broadcasting is a powerful feature in NumPy that enables efficient and concise data processing. By understanding how broadcasting works, you can leverage its benefits to simplify your code and improve performance.
Q9. Explain essential functionalities of Pandas for data analysis?
Pandas is a powerful Python library for data analysis that provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.
Essential Functionalities of Pandas:
1. Data Structures: Pandas provides two primary data structures:
- Series (1-dimensional labeled array of values)
- DataFrame (2-dimensional labeled data structure with columns of potentially different types)
2. Data Manipulation: Pandas offers various methods for data manipulation, including:
- Filtering: Selecting specific rows or columns based on conditions
- Sorting: Sorting data by one or more columns
- Grouping: Grouping data by one or more columns and performing aggregation operations
- Merging: Combining data from multiple sources based on common columns
- Reshaping: Transforming data from wide to long format or vice versa
3. Data Analysis: Pandas provides various methods for data analysis, including:
- Summary Statistics: Calculating mean, median, mode, standard deviation, and variance
- Correlation Analysis: Calculating correlation coefficients between columns
- Data Visualization: Integrating with visualization libraries like Matplotlib and Seaborn to create plots and charts
4. Data Input/Output: Pandas supports various data input/output formats, including:
- CSV: Reading and writing comma-separated values files
- Excel: Reading and writing Excel files
- JSON: Reading and writing JSON files
- SQL: Reading and writing data from SQL databases
5. Data Cleaning: Pandas provides methods for data cleaning, including:
- Handling Missing Values: Detecting and filling missing values
- Data Normalization: Normalizing data to ensure consistency in formatting and scaling
- Data Transformation: Transforming data to ensure it meets the requirements of the analysis
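Example (a short sketch of a few of these operations on a made-up DataFrame):
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32],
                   'Country': ['USA', 'UK', 'USA', 'Germany']})
# Filtering: rows where Age is greater than 25
print(df[df['Age'] > 25])
# Sorting by a column
print(df.sort_values('Age'))
# Grouping and aggregation: average age per country
print(df.groupby('Country')['Age'].mean())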
Key Benefits of Using Pandas:
1. Efficient Data Manipulation: Pandas provides fast and efficient data manipulation capabilities.
2. Flexible Data Structures: Pandas offers flexible data structures that can handle a wide range of data types and formats.
3. Easy Data Analysis: Pandas provides a simple and intuitive API for data analysis, making it easy to perform common data analysis tasks.
4. Integration with Other Libraries: Pandas integrates well with other popular data science libraries, including NumPy, Matplotlib, and Scikit-learn.
Q10. Explain how data is loaded, stored, and formatted in different file types for analysis.
Here's an overview of how data is loaded, stored, and formatted in different file types for analysis:
Text Files (.txt, .csv)
1. Loading: Text files can be loaded using programming languages like Python, R, or SQL.
2. Storage: Text files store data in plain text format, with each row representing a single observation and each column representing a variable.
3. Formatting: Text files typically use commas (CSV) or tabs (TSV) to separate columns, and newlines to separate rows.
Comma Separated Values (.csv)

1. Loading: CSV files can be loaded using programming languages like Python, R, or SQL.
2. Storage: CSV files store data in plain text format, with each row representing a single observation and each column representing a variable.
3. Formatting: CSV files use commas to separate columns and newlines to separate rows.
Excel Files (.xls, .xlsx)
1. Loading: Excel files can be loaded using programming languages like Python, R, or SQL.
2. Storage: Excel files store data in a binary format, with each row representing a single observation and each column representing a variable.
3. Formatting: Excel files use a proprietary format to store data, with support for formatting, formulas, and charts.
JSON Files (.json)
1. Loading: JSON files can be loaded using programming languages like Python, R, or SQL.
2. Storage: JSON files store data in a lightweight, human-readable format, with each object representing a single observation and each key representing a variable.
3. Formatting: JSON files use key-value pairs to represent data, with support for nested objects and arrays.
HDF5 Files (.h5)
1. Loading: HDF5 files can be loaded using programming languages like Python, R, or SQL.
2. Storage: HDF5 files store data in a binary format, with support for large datasets and high-performance I/O.
3. Formatting: HDF5 files use a hierarchical format to store data, with support for groups, datasets, and attributes.
Relational Databases (e.g., MySQL, PostgreSQL)
1. Loading: Relational databases can be loaded using SQL queries.
2. Storage: Relational databases store data in tables, with each row representing a single observation and each column representing a variable.
3. Formatting: Relational databases use a structured format to store data, with support for data types, constraints, and relationships between tables.
In summary, different file types and databases have their own strengths and weaknesses when it comes to loading, storing, and formatting data for analysis.
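Example (a rough sketch of loading several of these file types into Pandas DataFrames; the file names are placeholders, Excel support assumes the openpyxl package, and HDF5 support assumes PyTables):
import pandas as pd
df_csv = pd.read_csv('data.csv')              # text / CSV
df_excel = pd.read_excel('data.xlsx')         # Excel
df_json = pd.read_json('data.json')           # JSON
df_hdf = pd.read_hdf('data.h5', key='table')  # HDF5
# Relational databases are queried through a connection, e.g. with SQLAlchemy:
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///data.db')
# df_sql = pd.read_sql('SELECT * FROM sales', engine)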
Q11. What is data science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves using various techniques from computer science, statistics, and domain-specific knowledge to turn data into actionable insights.
Data science encompasses a range of activities, including:
1. Data collection: Gathering data from various sources, such as databases, APIs, files, and sensors.
2. Data cleaning: Preprocessing data to remove errors, inconsistencies, and missing values.
3. Data transformation: Converting data into a suitable format for analysis.
4. Data visualization: Using plots, charts, and other visualizations to communicate insights and patterns in the data.
5. Machine learning: Using algorithms to train models that can make predictions or classify data.
6. Statistical analysis: Applying statistical techniques to identify trends, patterns, and correlations in the data.
7. Insight generation: Interpreting the results of the analysis to extract meaningful insights and recommendations.
Data science has many applications across various industries, including:
1. Business: Customer segmentation, market analysis, and predictive modeling.
2. Healthcare: Disease diagnosis, patient outcome prediction, and personalized medicine.
3. Finance: Risk analysis, portfolio optimization, and predictive modeling.
4. Marketing: Customer behavior analysis, campaign optimization, and social media monitoring.

5. Environmental science: Climate modeling, air quality monitoring, and natural disaster prediction.
The data science process typically involves the following steps:
1. Problem formulation: Defining the problem or question to be addressed.
2. Data collection: Gathering relevant data from various sources.
3. Data analysis: Applying various techniques to extract insights from the data.
4. Insight generation: Interpreting the results of the analysis to extract meaningful insights.
5. Communication: Presenting the insights and recommendations to stakeholders.
6. Deployment: Implementing the insights and recommendations into production.
Data science requires a combination of technical skills, including:
1. Programming: Proficiency in languages such as Python, R, or SQL.
2. Data analysis: Knowledge of statistical techniques, machine learning algorithms, and data visualization tools.
3. Data management: Familiarity with data storage solutions, data governance, and data quality.
4. Communication: Ability to effectively communicate insights and recommendations to stakeholders.
Overall, data science is a rapidly evolving field that requires a unique blend of technical, business, and communication skills to extract insights from data and drive business value.
Data Science Process Life Cycle
Some steps are necessary for any task in the field of data science to derive fruitful results from the data at hand.
- Data Collection – After formulating the problem statement, the main task is to collect data that can help in the analysis and manipulation. Sometimes data is collected through surveys, and at other times through web scraping.
- Data Cleaning – Most real-world data is not structured and requires cleaning and conversion into structured data before it can be used for any analysis or modeling.
- Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the data at hand. We analyze which factors affect the target variable and to what extent, how the independent features are related to each other, and what can be done to achieve the desired results. This also gives us a direction in which to work when starting the modeling process.
- Model Building – Different types of machine learning algorithms and techniques have been developed that can easily identify complex patterns in the data, which would be a very tedious task for a human.
- Model Deployment – After a model is developed and gives good results on the holdout or real-world dataset, we deploy it and monitor its performance. This is the part where we apply what we have learned from the data to real-world applications and use cases.
Key Components of Data Science Process
- Data Analysis – There are times when there is no need to apply advanced deep learning or complex methods to the data at hand to derive patterns from it. For this reason, before moving on to modeling, we first perform an exploratory data analysis to get a basic idea of the data and the patterns available in it; this gives us a direction to work in if we want to apply more complex analysis methods to the data.
- Statistics – It is a natural phenomenon that many real-life datasets follow a normal distribution, and when we already know that a particular dataset follows a known distribution, most of its properties can be analyzed at once. Descriptive statistics and the correlation and covariance between two features of the dataset also help us better understand how one factor is related to another.

- Data Engineering – When we deal with a large amount of data, we have to make sure that the data is kept safe from online threats and that it is easy to retrieve and modify. Data Engineers play a crucial role in ensuring that the data is used efficiently.
- Advanced Computing
  - Machine Learning – Machine Learning has opened new horizons that have helped us build advanced applications and methodologies, so that machines become more efficient, provide a personalized experience to each individual, and perform in an instant tasks that earlier required heavy human labor and time.
  - Deep Learning – This is also a part of Artificial Intelligence and Machine Learning, but it is a bit more advanced than machine learning itself. High computing power and huge corpora of data have led to the emergence of this field in data science.
Knowledge and Skills for Data Science Professionals
Becoming proficient in Data Science requires a combination of skills, including:
- Statistics: Wikipedia defines it as the study of the collection, analysis, interpretation, presentation, and organization of data. Therefore, it shouldn't be a surprise that data scientists need to know statistics.
- Programming Language R/Python: Python and R are among the most widely used languages by Data Scientists. The primary reason is the number of packages available for numeric and scientific computing.
- Data Extraction, Transformation, and Loading: Suppose we have multiple data sources like a MySQL database, MongoDB, and Google Analytics. You have to extract data from such sources, then transform it into a proper format or structure for the purposes of querying and analysis, and finally load it into the data warehouse, where you will analyze the data. So, for people with an ETL (Extract, Transform and Load) background, Data Science can be a good career option.
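A minimal ETL sketch with Pandas and SQLAlchemy (the connection strings, table, and column names are all hypothetical):
import pandas as pd
from sqlalchemy import create_engine
# Extract: read raw data from a source database
source = create_engine('mysql+pymysql://user:password@localhost/shop')
orders = pd.read_sql('SELECT * FROM orders', source)
# Transform: clean and aggregate into the structure needed for analysis
orders = orders.dropna(subset=['order_id'])
daily = orders.groupby('order_date')['amount'].sum().reset_index()
# Load: write the transformed data into the data warehouse
warehouse = create_engine('postgresql://user:password@localhost/warehouse')
daily.to_sql('daily_sales', warehouse, if_exists='replace', index=False)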
Steps for Data Science Processes:
Step 1: Define the Problem and Create a Project Charter
Clearly defining the research goals is the first step in the Data Science Process. A project charter outlines the objectives, resources, deliverables, and timeline, ensuring that all stakeholders are aligned.
Step 2: Retrieve Data
Data can be stored in databases, data warehouses, or data lakes within an organization. Accessing this data often involves navigating company policies and requesting permissions.
Step 3: Data Cleansing, Integration, and Transformation
Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data integration combines datasets from different sources, while data transformation prepares the data for modeling by reshaping variables or creating new features.
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like scatter plots, histograms, and box plots are used to visualize data and identify trends. This phase helps in selecting the right modeling techniques.
Step 5: Build Models
In this step, machine learning or deep learning models are built to make predictions or classifications based on the data. The choice of algorithm depends on the complexity of the problem and the type of data.
Step 6: Present Findings and Deploy Models
Once the analysis is complete, results are presented to stakeholders. Models are deployed into production systems to automate decision-making or support ongoing analysis.
Benefits and uses of data science and big data
- Governmental organizations are also aware of data's value. A data scientist in a governmental organization gets to work on diverse projects such as detecting fraud and other criminal activity or optimizing project funding.
- Nongovernmental organizations (NGOs) are also no strangers to using data. They use it to raise money and defend their causes. The World Wildlife Fund (WWF), for instance, employs data scientists to increase the effectiveness of their fundraising efforts.
- Universities use data science in their research but also to enhance the study experience of their students. For example: MOOCs (Massive Open Online Courses).