
Unit-II

Handling Large Data on a Single Computer: The problem in handling large data, General techniques for handling large volumes of data, General programming tips for dealing with large data sets, Case Studies
Handling Large Data on a Single Computer
• Large data sets can still be handled by a single computer if you adopt the right techniques and tools.
• "Large data" here means data that causes problems to work with in terms of memory or speed but can still be handled by a single computer.
• Three types of solutions overcome these problems:
• adopt the right algorithms,
• choose the right data structures, and
• pick the right tools.
• Two case studies:
• detecting malicious URLs
• a recommender engine inside a database
1. The problems in handling large data
• A large volume of data poses new challenges, such as overloaded memory and algorithms that never stop running.
• It forces you to adapt and expand your repertoire of techniques.
• Take care of issues such as I/O (input/output) and CPU starvation, because these can cause speed problems.
• Go through the three steps:
• problems,
• solutions, and
• tips.
1. The problems in handling large data…

Figure: Overview of problems encountered when working with more data than can fit in memory
2. General techniques for handling large volumes of data
1. Choosing the right algorithms
• Choosing the right algorithm can solve more problems than adding more or better hardware.
• An algorithm that's well suited for handling large data doesn't need to load the entire data set into memory to make predictions.
• Ideally, the algorithm also supports parallelized calculations.
2. General techniques for handling large volumes of data
1. Choosing the right algorithms…
• Online Algorithms
• This "use and forget" way of working is the perfect solution for the memory problem.
• A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1).
• Example: Will the customer buy or not?
• Data feed methods (see the sketch below):
■ Full batch learning / statistical learning
Feed the algorithm all the data at once.
■ Mini-batch learning
Feed the algorithm a spoonful (100, 1,000, …, depending on what your hardware can handle) of observations at a time.
■ Online learning
Feed the algorithm one observation at a time.
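A minimal sketch of mini-batch/online learning with scikit-learn's Perceptron, which supports partial_fit; the synthetic data, batch size, and feature count are illustrative assumptions, not part of the original example.

import numpy as np
from sklearn.linear_model import Perceptron

clf = Perceptron()
classes = np.array([0, 1])               # all possible labels must be declared up front

rng = np.random.RandomState(0)
for _ in range(100):                      # 100 mini-batches of 1,000 observations each
    X_batch = rng.rand(1000, 10)          # illustrative synthetic features
    y_batch = (X_batch[:, 0] > 0.5).astype(int)          # illustrative synthetic target
    clf.partial_fit(X_batch, y_batch, classes=classes)   # learn from the batch, then forget it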
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• Block matrices
• Certain algorithms can be translated into algorithms that use blocks of matrices instead of full matrices.
• When you partition a matrix into a block matrix, you divide the full matrix into parts and work with the smaller parts instead of the full matrix.
• Load the smaller matrices into memory and perform calculations, thereby avoiding an out-of-memory error.
• Python tools:
■ bcolz is a Python library that can store data arrays compactly and uses the hard drive when the array no longer fits into main memory.
■ Dask is a library that enables you to optimize the flow of calculations and makes performing calculations in parallel easier.
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• Block matrices…
• Example: Matrix addition A + B computed on submatrices (sketched below).
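A minimal NumPy sketch of block-wise matrix addition; the matrix size and the way the halves are split are illustrative assumptions.

import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)
B = np.ones((4, 4))

# Split each matrix into four 2x2 blocks and add block by block.
# With truly large matrices, each block would be loaded from disk in turn.
C = np.empty_like(A)
for i in (slice(0, 2), slice(2, 4)):
    for j in (slice(0, 2), slice(2, 4)):
        C[i, j] = A[i, j] + B[i, j]

assert np.allclose(C, A + B)   # block-wise result equals the full-matrix result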
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• MAPREDUCE
• MapReduce algorithms are easy to understand with an analogy.
• Example: national elections
• The country has 25 parties, 1,500 voting offices, and 2 million people.
• You could gather all the voting tickets from every office and count them centrally, or
• you could ask the local offices to count the votes for the 25 parties and hand over the results, which you then aggregate by party.

• MapReduce pseudocode example:

For each person in voting office:        # map: emit (key, value) pairs
    Yield (voted_party, 1)
For each (voted_party, 1) pair:          # reduce: aggregate per key
    add_vote_to_party()
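A runnable Python sketch of the same idea; the party names and ballots are made up for illustration.

from collections import Counter

votes = ["Party A", "Party B", "Party A", "Party C", "Party A"]   # illustrative ballots

# Map step: emit a (party, 1) pair for every ballot.
mapped = ((party, 1) for party in votes)

# Reduce step: sum the counts per party.
totals = Counter()
for party, count in mapped:
    totals[party] += count

print(totals)   # Counter({'Party A': 3, 'Party B': 1, 'Party C': 1})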
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
• Data structures have different storage requirements and also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set.
• Three are important:
1. Sparse data 2. Tree data 3. Hash data
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
Sparse data
• The resulting large matrix can cause memory problems even though it contains little information.
• Support for working with sparse matrices is growing in Python.
• Many algorithms now accept or return sparse matrices.
• Example (see the SciPy sketch below): data = [(2,9,1)] means row 2, column 9 holds the value 1.
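A minimal sketch using SciPy's COO sparse format to store only the non-zero entries; the matrix shape is chosen for illustration.

from scipy.sparse import coo_matrix

# Only the non-zero entry (row 2, column 9, value 1) is stored.
rows, cols, vals = [2], [9], [1]
sparse = coo_matrix((vals, (rows, cols)), shape=(5, 10))

print(sparse.nnz)        # 1 non-zero entry
print(sparse.toarray())  # dense view, only sensible for small matrices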


2. General techniques for handling large volumes of data…
2. Choosing the right data structure…
Tree data
• Trees retrieve information much faster than scanning through a table.
• A tree always has a root value and subtrees of children, each with its own children, and so on (see the lookup sketch below).
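A minimal sketch of the idea behind tree-based retrieval: keep the keys ordered and halve the search space at every step. Python's bisect module does this on a sorted list; a database index uses a B-tree for the same effect. The key values here are illustrative.

import bisect

keys = sorted(range(0, 1_000_000, 2))   # illustrative sorted keys

def contains(sorted_keys, value):
    # Binary search: O(log n) instead of scanning the whole table.
    i = bisect.bisect_left(sorted_keys, value)
    return i < len(sorted_keys) and sorted_keys[i] == value

print(contains(keys, 123456))   # True
print(contains(keys, 123457))   # False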
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
Hash Table
• Hash tables are data structures that calculate a key for every value in the data and put the keys in buckets.
• You can quickly retrieve the information by looking in the right bucket when you encounter the data.
• Dictionaries in Python are a hash table implementation; they are key-value stores.
• Hash tables are used extensively in databases as indices for fast information retrieval.
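A minimal sketch of constant-time lookup with a Python dictionary; the URLs and labels are made up for illustration.

# Build the hash table once...
url_labels = {
    "example.com/login": "benign",
    "phish.example.net": "malicious",
}

# ...then every lookup goes straight to the right bucket, no scanning.
print(url_labels.get("phish.example.net", "unknown"))   # malicious
print(url_labels.get("never-seen.org", "unknown"))       # unknown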
2. General techniques for handling large volumes of data…
3. Choose the right tools
• The right tool can be a Python library or a tool that's controlled from Python.
2. General techniques for handling large volumes of data…
3. Choose the right tools…
Python Tools:
• Cython, a superset of Python, solves the speed problem by having the programmer specify data types while developing the program. Once the compiler has this information, it runs programs much faster.

• Numexpr is a numerical expression evaluator for NumPy and can be many times faster than plain NumPy.

• Numba helps you achieve greater speed by compiling code right before you execute it, also known as just-in-time compiling.

• Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy.

• Theano enables you to work directly with the graphical processing unit (GPU) and do symbolic simplifications whenever possible.

• Dask enables you to optimize the flow of calculations and execute them efficiently. It also enables you to distribute calculations.
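Two minimal sketches, assuming the numexpr and numba packages are installed; the array sizes and the function being compiled are illustrative.

import numpy as np
import numexpr as ne
from numba import njit

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Numexpr evaluates the whole expression in one multi-threaded pass,
# avoiding the temporary arrays that plain NumPy would allocate.
result = ne.evaluate("2 * a + b ** 2")

# Numba compiles the Python function to machine code just in time.
@njit
def dot(x, y):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * y[i]
    return total

print(dot(a, b))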
2. General techniques for handling large volumes of data…
3. Choose the right tools…
Use Python as a master to control other tools
• Most software and tool producers support a Python interface to their software.
• This enables you to tap into specialized pieces of software with the ease and productivity that comes with Python.
• In this way Python sets itself apart from other popular data science languages such as R and SAS.
3. General programming tips for dealing with large data sets
1. Don't reinvent the wheel.
Use tools and libraries developed by others.
2. Get the most out of your hardware.
A machine is rarely used to its full potential; with simple adaptations you can make it work harder.
3. Reduce the computing need.
Slim down memory and processing needs as much as possible.
3. General programming tips for dealing with large data sets…
1. Don't reinvent the wheel.
Solving a problem that has already been solved is a waste of time.
Data scientists adopt the following two rules, which help when dealing with large data and make them much more productive:
■ Exploit the power of databases.
• Prepare analytical base tables inside a database when working with large data sets.
• When this preparation involves advanced modeling, find out whether it's possible to employ user-defined functions and procedures.
■ Use optimized libraries.
• Creating libraries like Mahout, Weka, and other machine learning packages requires time and knowledge.
• They are highly optimized and incorporate best practices and state-of-the-art technologies.
• Spend your time on getting things done, not on reinventing and repeating other people's efforts, unless it's for the sake of understanding how things work.
3. General programming tips for dealing with large data sets…
2. Get the most out of your hardware
Some resources on a computer can be idle while other resources are over-utilized.
This slows down programs and can even make them fail.
Sometimes it's possible (and necessary) to shift the workload from an overtaxed resource to an underutilized one using the following techniques:

■ Feed the CPU compressed data
• A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data.
• This shifts more work from the hard disk to the CPU (see the sketch below).
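A minimal sketch of reading a compressed file directly, so the disk moves fewer bytes and the CPU does the decompression; the file name and column layout are illustrative assumptions.

import gzip

# The CPU decompresses on the fly; the disk only has to deliver the
# (much smaller) compressed bytes.
with gzip.open("observations.csv.gz", "rt") as f:
    for line in f:
        fields = line.rstrip("\n").split(",")
        # ... process one observation at a time ...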
3. General programming tips for dealing with large data sets…
2. Get the most out of your hardware...

■ Make use of the GPU
• Sometimes the CPU, not memory, is the bottleneck. If the computations are parallelizable, it is better to switch to the GPU.
• The GPU offers higher throughput for parallelizable computations than a CPU.
• The GPU is enormously efficient at parallelizable jobs but has less cache than the CPU.
• It is pointless to switch to the GPU when your hard disk is the problem.
• Several Python packages, such as Theano and NumbaPro, will use the GPU without much programming effort.

■ Use multiple threads
• Parallelize computations on the CPU.
• This can be achieved with normal Python threads (see the sketch below for a process-pool variant).
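A minimal sketch of parallelizing independent work across CPU cores with the standard library's concurrent.futures; a process pool is used here because, for CPU-bound work in CPython, the GIL limits what plain threads can gain. The workload function is illustrative.

from concurrent.futures import ProcessPoolExecutor

def heavy_computation(n):
    # Illustrative CPU-bound task.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workloads = [10_000_000] * 8
    with ProcessPoolExecutor() as pool:              # one worker per core by default
        results = list(pool.map(heavy_computation, workloads))
    print(results[:2])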
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs
"Working smart + hard = achievement." Apply this to the programs you write.
The best way to avoid large-data problems is to remove as much of the work as possible up front and let the computer work only on the part that can't be skipped.
The following list contains methods to help you achieve this:

■ Profile your code and remediate slow pieces of code, e.g. with Python's cProfile or line_profiler.
• Not every piece of code needs to be optimized.
• Use a profiler to detect slow parts inside your program and remediate those parts (see the sketch below).

■ Use compiled code whenever possible, certainly when loops are involved.
• Whenever possible, use functions from packages that are optimized for numerical computations instead of implementing everything yourself.
• The code in these packages is often highly optimized and compiled.
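A minimal sketch of profiling with the standard library's cProfile; the function being profiled is illustrative.

import cProfile
import pstats

def slow_sum(n):
    # Illustrative function worth profiling.
    return sum(i ** 2 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
profiler.disable()

# Print the functions that consumed the most cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)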
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Compile the code yourself
• If you can't use an existing package, use either a just-in-time compiler or implement the slowest parts of your code in a lower-level language such as C or Fortran and integrate that with your codebase.

■ Avoid pulling data into memory
• When you work with data that doesn't fit in memory, avoid pulling everything into memory.
• Read the data in chunks and parse it on the fly (see the sketch below).
• This won't work with every algorithm, but it enables calculations on extremely large data sets.
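A minimal sketch of chunked processing with pandas; the file name, chunk size, and column name are illustrative assumptions.

import pandas as pd

total = 0.0
# Read 100,000 rows at a time instead of the whole file.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()   # aggregate per chunk, keep only the running total

print(total)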
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Use generators to avoid intermediate data storage
• Generators are a powerful Python tool that allows efficient, memory-friendly processing of large datasets by producing elements on the fly without storing them in memory.
• Generators return the data per observation instead of in batches.
• This way you avoid storing intermediate results.
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…
■ Use generators to avoid intermediate data storage

Suppose we have a large CSV file with millions of records that we need to process
line by line. Instead of reading the entire file into memory and then processing it,
we can use a generator to iterate over each line one at a time, allowing us to
avoid storing the entire file in memory.
def read_csv(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip().split(',')

# Process each line of the CSV file using the generator
for row in read_csv('data.csv'):
    # Perform some processing on each row
    print(row)
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Use as little data as possible
• If no large-scale algorithm is available and you're not willing to implement such a technique, you can still train on only a sample of the original data.

■ Use your math skills to simplify calculations as much as possible
• Example: (a + b)² = a² + 2ab + b²
• The left side is computed much faster than the right side of the equation.
Case study 1: Predicting malicious URLs
• Many companies, such as Google, try to protect us from fraud by detecting malicious websites.
• Doing so is no easy task, because the internet has billions of web pages to scan.
• How do you work with a data set that no longer fits in memory?

■ Data—The project contains data from 120 days (120 files), and each observation has approximately 3,200,000 features.
• The target variable is 1 if the website is malicious and -1 otherwise.

■ The scikit-learn library—Simple and efficient tools for predictive data analysis.
Case study 1: Predicting malicious URLs…
Step 1: Defining the research goal
• The goal of the project is to detect whether certain URLs can be trusted or not.
• We aim to do this in a memory-friendly way.

Step 2: Acquiring the URL data
• Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and place it in a folder.
• Choose the data in SVMLight format.
• SVMLight is a text-based format with one observation per row. To save space, it leaves out the zeros (see the illustration below).
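For illustration only (these lines are made up, not taken from the actual data set), an SVMLight file stores the label followed by only the non-zero feature:value pairs:

+1 4:0.091 1250:1 38422:1
-1 17:0.230 512:1 907311:1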
Case study 1: Predicting malicious URLs…
Step 2: Acquiring the URL data…
The following listing and figure 4.10 show what happens when you try to read in 1 file
out of the 120 and create the normal matrix
Case study 1: Predicting malicious URLs…
Step 2: Acquiring the URL data…

TOOLS AND TECHNIQUES
■ Use a sparse representation of the data
■ Feed the algorithm compressed data instead of raw data
■ Use an online algorithm to make predictions

Step 3: Preparation of data
Data preparation and cleansing is not necessary in this case because the URL data comes pre-cleaned.
Case study 1: Predicting malicious URLs…
Step 4: Data exploration
• Find out whether the data really does contain lots of zeros.
• Check this with the following piece of code (written here for Python 3):

print("number of non-zero entries %2.6f" % (float(X.nnz) / (float(X.shape[0]) * float(X.shape[1]))))

• This outputs: number of non-zero entries 0.000033

• Data that contains little information compared to zeros is called sparse data.

• This can be stored more compactly if you save the data as [(0,0,1),(4,4,1)]
  instead of [[1,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,1]]
Case study 1: Predicting malicious URLs…
Step 4: Data exploration…
• The file format is SVMLight.
• Keep the data compressed while checking for the maximum number of observations and variables.
• Read the data in file by file (this way you consume even less memory).
• Feed the CPU compressed files:
• keep the data packed in tar.gz format,
• unpack a file only when it is needed,
• work on the first 5 files (as sketched below).
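A minimal sketch of iterating over the compressed archive file by file; the archive name and member names are illustrative assumptions.

import tarfile

# Open the archive but extract only one member at a time, in memory.
with tarfile.open("url_svmlight.tar.gz", "r:gz") as archive:
    files = [m for m in archive.getmembers() if m.name.endswith(".svm")]
    for member in files[:5]:                 # work on the first 5 files only
        f = archive.extractfile(member)      # file-like object, nothing written to disk
        # ... pass f to the SVMLight loader (see the modeling step) ...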
Case study 1: Predicting malicious URLs…
Step 4: Data exploration…
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…

• Binary cross-entropy is a loss


function used in binary
classification problems where the
target variable has two possible
outcomes, 0 and 1
• It measures the performance of
the classification model whose
output is a probability is a value
between them.
• The goal of the model is to
minimize this loss function during
training to improve its predictive
accuracy.
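For reference, the standard form of this loss for a single observation with true label y ∈ {0, 1} and predicted probability p is

L(y, p) = -[ y · log(p) + (1 - y) · log(1 - p) ]

and the model minimizes the average of this loss over all training observations.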
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…
Precision and recall are two key metrics used in evaluating the performance of classification models in machine learning.
They are particularly important for imbalanced datasets or applications where false positives and false negatives carry different consequences.
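In terms of true positives (TP), false positives (FP), and false negatives (FN):

Precision = TP / (TP + FP)   (of the sites flagged as malicious, how many really are malicious)
Recall    = TP / (TP + FN)   (of the truly malicious sites, how many were flagged)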
Case study 1: Predicting malicious URLs…
Step 5: Data modeling
Now apply the sparse representation of the compressed files and use an online algorithm (a sketch follows below).
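A minimal sketch of how such an online model might be trained, assuming scikit-learn's SGDClassifier with logistic loss and the load_svmlight_file helper; the archive iteration mirrors the earlier tarfile sketch, and the feature count shown is illustrative (use the maximum feature index found during exploration so all files share the same matrix width).

import tarfile
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

clf = SGDClassifier(loss="log_loss")   # logistic regression trained online ("log" in older scikit-learn)
classes = [-1, 1]

with tarfile.open("url_svmlight.tar.gz", "r:gz") as archive:
    files = [m for m in archive.getmembers() if m.name.endswith(".svm")][:5]
    for member in files:
        f = archive.extractfile(member)
        X, y = load_svmlight_file(f, n_features=3231952)   # sparse matrix, never densified
        clf.partial_fit(X, y, classes=classes)              # learn from this file, then discard it

# Evaluate on the last file read (for illustration only).
print(classification_report(y, clf.predict(X)))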
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…

Precision 0.97: about 3% (1 - 0.97) of the sites flagged as malicious are falsely accused.
Recall 0.94: about 6% (1 - 0.94) of the malicious sites are not detected.
This is a decent result, so we can conclude that the methodology works.

Step 6: Presentation and automation
Case study 2: Building a recommender system inside a database
This case study covers how to use the hash table data structure and how to use Python to control other tools.
TOOLS
■ pandas Python library
■ MySQL database—
• Install a MySQL community server; you can download it from www.mysql.com.
• Appendix C, "Installing a MySQL server," explains how to set it up.
■ MySQL database connection Python library—
• Install SQLAlchemy or another library capable of communicating with MySQL.
• Here, we use MySQLdb.
• First install Binstar (another package management service) and look for the appropriate mysql-python package for your Python setup.
• Execute the following commands (shown for Windows):
conda install binstar
binstar search -t conda mysql-python
conda install --channel https://conda.binstar.org/krisvanneste mysql-python
Case study 2: Building a recommender system inside a database…
k-nearest neighbors in machine learning
Example
• Look for customers who have rented movies similar to the ones you have rented, and then suggest the movies they have watched but you have not seen yet. This technique is called k-nearest neighbors in machine learning.
Locality-sensitive hashing:
• Construct hash functions that
• put similar customers close together, in a bucket with the same label, and
• make sure that objects that are different are put in different buckets.
• The Hamming distance is used to calculate how much two labels differ.
• A hash function maps any range of input to a fixed output.
• The simplest hash function concatenates the values from several random columns.
• This turns many input columns (scalable) into a single, fixed output column.
Case study 2: Building a recommender system inside a database…
Locality-sensitive hashing…
Example
Use three hash functions to find similar customers.
The three functions take the values of three movies each:
■ The first function takes the values of movies 10, 15, and 28.
■ The second function takes the values of movies 7, 18, and 22.
■ The last function takes the values of movies 16, 19, and 30.
Then compare the customers within the same bucket with each other using the Hamming distance: the number of positions at which two strings differ (see the sketch below).
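A minimal sketch of the idea; the customer vectors are made up for illustration, and only the movie columns named above are used by the hash functions.

# 1 = the customer has rented the movie, 0 = has not (movies indexed from 1).
customer_a = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1] * 3   # 30 illustrative movie columns
customer_b = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1] * 3

def hash_label(movies, columns):
    # Concatenate the values of a few chosen columns into a bucket label.
    return "".join(str(movies[c - 1]) for c in columns)

hash_functions = [(10, 15, 28), (7, 18, 22), (16, 19, 30)]
label_a = "".join(hash_label(customer_a, cols) for cols in hash_functions)
label_b = "".join(hash_label(customer_b, cols) for cols in hash_functions)

def hamming(s1, s2):
    # Number of positions at which the two labels differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(label_a, label_b, hamming(label_a, label_b))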
Case study 2: Building a recommender system inside a database…
Locality-sensitive hashing…
• Comparing multiple columns is an expensive operation.
• Apply a trick to speed this process up:
• the columns contain a binary (0 or 1) variable that indicates whether a customer has rented a movie or not;
• concatenate this information so that the same information is contained in a single new column.
• The table shows the "movies" variable, which contains as much information as all the movie columns combined.

Now finding similar customers becomes very simple.
Case study 2: Building a recommender system inside a database…
Step 1: Set the research goal
Use what movies people rent to predict what other movies they might like: an automated system that learns people's preferences and recommends movies the customers haven't tried yet.

• Create a memory-friendly recommender system.
• Also skip the data exploration step and move straight into model building.

Step 2: Data retrieval
• We create the data ourselves for this case study, so we can skip the data retrieval step and move right into data preparation.
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation
• Create the data.
• Connect Python to MySQL to create the data.
• Make a connection to MySQL using your username and password (a sketch follows below).
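A minimal sketch of such a connection and of generating the data; the user name, password, database name, table name, and number of customers/movies are illustrative assumptions, not the book's actual values.

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Connection string format: mysql+mysqldb://<user>:<password>@<host>/<database>
engine = create_engine("mysql+mysqldb://root:secret@localhost/test")

# Simulate 100 customers and 32 movies: 1 = rented, 0 = not rented.
rng = np.random.RandomState(42)
data = pd.DataFrame(rng.randint(0, 2, size=(100, 32)),
                    columns=["movie%d" % i for i in range(1, 33)])
data.to_sql("cust", engine, if_exists="replace", index=True, index_label="cust_id")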
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…

To efficiently query our database, need additional data preparation as


follows:
■ Creating bit strings:
The bit strings are compressed versions of the columns’ content (0
and 1 values). First these binary values are concatenated; then the
resulting bit string is reinterpreted as a number.

■ Defining hash functions:


The hash functions will create the bit strings.

■ Adding an index to the table:


to quicken data retrieval.
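A minimal sketch of turning blocks of eight 0/1 movie columns into numbers, continuing from the data frame created in the earlier connection sketch; the helper name, the column names bul1..bul4, and the grouping of 32 movies into four groups of eight are assumptions modeled on the description above.

def create_bitstring(row, columns):
    # Concatenate eight 0/1 values and reinterpret the result as an integer.
    bits = "".join(str(int(row[c])) for c in columns)   # e.g. "01101001"
    return int(bits, 2)                                  # e.g. 105

# Four groups of eight movies -> four numbers per customer.
groups = [["movie%d" % i for i in range(start, start + 8)] for start in (1, 9, 17, 25)]

store = data.copy()
for g, columns in enumerate(groups, start=1):
    store["bul%d" % g] = data.apply(create_bitstring, axis=1, columns=columns)

# Write the compressed representation back to the database.
store.to_sql("cust", engine, if_exists="replace", index=True, index_label="cust_id")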
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…

By converting the information from 32 columns into 4 numbers, we compressed it.
The figure shows the first 2 observations (customer movie-view history) in this new format:
store[0:2]

Next, determine whether two customers have similar behavior.


Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 4: Data exploration
Skip the data exploration step and move to model building.

Step 5: Model building
To use the Hamming distance in the database, we need to define it as a function.
CREATING THE HAMMING DISTANCE FUNCTION: We implement this as a user-defined function.
This function calculates the distance for a 32-bit integer (actually 4 * 8 bits), as shown in the following listing (a sketch of such a function is given below).
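A sketch of what such a user-defined function could look like, sent to MySQL from Python; the function name, argument names, and the use of four integer arguments are assumptions modeled on the description above, not the book's exact listing.

from sqlalchemy import text

hamming_udf = text("""
CREATE FUNCTION HAMMINGDISTANCE(
    A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
    B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
     + BIT_COUNT(A2 ^ B2) + BIT_COUNT(A3 ^ B3)
""")

with engine.connect() as connection:   # engine from the earlier connection sketch
    connection.execute(hamming_udf)    # XOR the pieces, count the differing bits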
Case study 2: Building a recommender system inside a database…
Step 5: Model building
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation

The application needs to perform two steps:
■ Look for similar customers.
■ Suggest movies the customer has yet to see, based on what he or she has already viewed and the viewing history of the similar customers.

FINDING A SIMILAR CUSTOMER
In the following listing, customer 27 is selected, and we look for the best movie to recommend to him.
We need to select customers with a similar viewing history (a query sketch is given below).
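A sketch of what such a query could look like from Python, assuming the table, column, and function names used in the earlier sketches (cust, bul1..bul4, HAMMINGDISTANCE); it ranks the other customers by their Hamming distance to customer 27.

import pandas as pd

target = pd.read_sql("SELECT bul1, bul2, bul3, bul4 FROM cust WHERE cust_id = 27", engine).iloc[0]

query = """
SELECT cust_id,
       HAMMINGDISTANCE(bul1, bul2, bul3, bul4, %d, %d, %d, %d) AS distance
FROM cust
WHERE cust_id <> 27
ORDER BY distance ASC
LIMIT 3
""" % (target.bul1, target.bul2, target.bul3, target.bul4)

similar_customers = pd.read_sql(query, engine)
print(similar_customers)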
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation
Table 4.5 shows customers 2 and 97 to be the most similar to customer 27.
The data was generated randomly, so anyone replicating this example might receive
different results.

Now we can finally select a movie for customer 27 to watch.


Case study 2: Building a recommender system inside a database…
