
Unit-II

Handling Large Data on a Single Computer: The problem in handling large data, General techniques for handling large volumes of data, General programming tips for dealing with large data sets, Case Studies
Handling Large Data on a Single Computer
• Large data sets can still be handled by a single computer if you adopt the right techniques and tools.
• "Large data" here means data that causes problems to work with in terms of memory or speed but can still be handled by a single computer.
• Three types of solutions overcome these problems:
• adopt the right algorithms,
• choose the right data structures, and
• pick the right tools.
• Two case studies:
• detecting malicious URLs
• a recommender engine inside a database
1. The problems in handling large data
• A large volume of data poses new challenges, such as overloaded memory and algorithms that never stop running.
• It forces you to adapt and expand your repertoire of techniques.
• Take care of issues such as I/O (input/output) and CPU starvation, because these can cause speed problems.
• Go through the three steps:
• problems,
• solutions, and
• tips.
1. The problems in handling large data…

Figure: Overview of problems encountered when working with more data than can fit in memory
2. General techniques for handling large volumes of data
1. Choosing the right algorithms
• Choosing the right algorithm can solve more problems than adding more or better hardware.
• An algorithm that's well suited for handling large data doesn't need to load the entire data set into memory to make predictions.
• Ideally, the algorithm also supports parallelized calculations.
2. General techniques for handling large volumes of data
1. Choosing the right algorithms…
• Online Algorithms
• This "use and forget" way of working is the perfect solution for the memory problem.
• A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1).
• Example: Will the customer buy or not?
• Data feed methods (see the sketch below):
■ Full batch learning / statistical learning
Feed the algorithm all the data at once.
■ Mini-batch learning
Feed the algorithm a spoonful (100, 1,000, …, depending on what your hardware can handle) of observations at a time.
■ Online learning
Feed the algorithm one observation at a time.
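A minimal sketch of mini-batch/online learning with scikit-learn's Perceptron, which supports partial_fit; the synthetic data, batch size, and feature count are illustrative assumptions, not part of the original example.

import numpy as np
from sklearn.linear_model import Perceptron

clf = Perceptron()
classes = np.array([0, 1])               # all possible labels must be declared up front

rng = np.random.RandomState(0)
for _ in range(100):                      # 100 mini-batches of 1,000 observations each
    X_batch = rng.rand(1000, 10)          # illustrative synthetic features
    y_batch = (X_batch[:, 0] > 0.5).astype(int)          # illustrative synthetic target
    clf.partial_fit(X_batch, y_batch, classes=classes)   # learn from the batch, then forget it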
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• Block matrices
• Certain algorithms can be translated into algorithms that use blocks of matrices instead of full matrices.
• When you partition a matrix into a block matrix, you divide the full matrix into parts and work with the smaller parts instead of the full matrix.
• Load the smaller matrices into memory and perform calculations, thereby avoiding an out-of-memory error.
• Python tools:
■ bcolz is a Python library that can store data arrays compactly and uses the hard drive when the array no longer fits into main memory.
■ Dask is a library that enables you to optimize the flow of calculations and makes performing calculations in parallel easier.
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• Block matrices…
• Example: Matrix addition A + B computed on submatrices (sketched below).
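A minimal NumPy sketch of block-wise matrix addition; the matrix size and the way the halves are split are illustrative assumptions.

import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)
B = np.ones((4, 4))

# Split each matrix into four 2x2 blocks and add block by block.
# With truly large matrices, each block would be loaded from disk in turn.
C = np.empty_like(A)
for i in (slice(0, 2), slice(2, 4)):
    for j in (slice(0, 2), slice(2, 4)):
        C[i, j] = A[i, j] + B[i, j]

assert np.allclose(C, A + B)   # block-wise result equals the full-matrix result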
2. General techniques for handling large volumes of data…
1. Choosing the right algorithms…
• MAPREDUCE
• MapReduce algorithms are easy to understand with an analogy.
• Example: national elections
• The country has 25 parties, 1,500 voting offices, and 2 million people.
• You could gather all the voting tickets from every office and count them centrally, or
• you could ask the local offices to count the votes for the 25 parties and hand over the results, which you then aggregate by party.

• MapReduce pseudocode example:

For each person in voting office:        # map: emit (key, value) pairs
    Yield (voted_party, 1)
For each (voted_party, 1) pair:          # reduce: aggregate per key
    add_vote_to_party()
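A runnable Python sketch of the same idea; the party names and ballots are made up for illustration.

from collections import Counter

votes = ["Party A", "Party B", "Party A", "Party C", "Party A"]   # illustrative ballots

# Map step: emit a (party, 1) pair for every ballot.
mapped = ((party, 1) for party in votes)

# Reduce step: sum the counts per party.
totals = Counter()
for party, count in mapped:
    totals[party] += count

print(totals)   # Counter({'Party A': 3, 'Party B': 1, 'Party C': 1})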
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
• Data structures have different storage requirements and also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set.
• Three are important:
1. Sparse data 2. Tree data 3. Hash data
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
Sparse data
• The resulting large matrix can cause memory problems even though it contains little information.
• Support for working with sparse matrices is growing in Python.
• Many algorithms now accept or return sparse matrices.
• Example (see the SciPy sketch below): data = [(2,9,1)] means row 2, column 9 holds the value 1.
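A minimal sketch using SciPy's COO sparse format to store only the non-zero entries; the matrix shape is chosen for illustration.

from scipy.sparse import coo_matrix

# Only the non-zero entry (row 2, column 9, value 1) is stored.
rows, cols, vals = [2], [9], [1]
sparse = coo_matrix((vals, (rows, cols)), shape=(5, 10))

print(sparse.nnz)        # 1 non-zero entry
print(sparse.toarray())  # dense view, only sensible for small matrices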


2. General techniques for handling large volumes of data…
2. Choosing the right data structure…
Tree data
• Trees retrieve information much faster than scanning through a table.
• A tree always has a root value and subtrees of children, each with its own children, and so on (see the lookup sketch below).
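A minimal sketch of the idea behind tree-based retrieval: keep the keys ordered and halve the search space at every step. Python's bisect module does this on a sorted list; a database index uses a B-tree for the same effect. The key values here are illustrative.

import bisect

keys = sorted(range(0, 1_000_000, 2))   # illustrative sorted keys

def contains(sorted_keys, value):
    # Binary search: O(log n) instead of scanning the whole table.
    i = bisect.bisect_left(sorted_keys, value)
    return i < len(sorted_keys) and sorted_keys[i] == value

print(contains(keys, 123456))   # True
print(contains(keys, 123457))   # False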
2. General techniques for handling large volumes of data…
2. Choosing the right data structure
Hash Table
• Hash tables are data structures that calculate a key for every value in the data and put the keys in buckets.
• You can quickly retrieve the information by looking in the right bucket when you encounter the data.
• Dictionaries in Python are a hash table implementation; they are key-value stores.
• Hash tables are used extensively in databases as indices for fast information retrieval.
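A minimal sketch of constant-time lookup with a Python dictionary; the URLs and labels are made up for illustration.

# Build the hash table once...
url_labels = {
    "example.com/login": "benign",
    "phish.example.net": "malicious",
}

# ...then every lookup goes straight to the right bucket, no scanning.
print(url_labels.get("phish.example.net", "unknown"))   # malicious
print(url_labels.get("never-seen.org", "unknown"))       # unknown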
2. General techniques for handling large volumes of data…
3. Choose the right tools
• The right tool can be a Python library or a tool that's controlled from Python.
2. General techniques for handling large volumes of data…
3. Choose the right tools…
Python Tools:
• Cython, a superset of Python, solves the speed problem by having the programmer specify data types while developing the program. Once the compiler has this information, it runs programs much faster.

• Numexpr is a numerical expression evaluator for NumPy and can be many times faster than plain NumPy.

• Numba helps you achieve greater speed by compiling code right before you execute it, also known as just-in-time compiling.

• Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy.

• Theano enables you to work directly with the graphical processing unit (GPU) and do symbolic simplifications whenever possible.

• Dask enables you to optimize the flow of calculations and execute them efficiently. It also enables you to distribute calculations.
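Two minimal sketches, assuming the numexpr and numba packages are installed; the array sizes and the function being compiled are illustrative.

import numpy as np
import numexpr as ne
from numba import njit

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Numexpr evaluates the whole expression in one multi-threaded pass,
# avoiding the temporary arrays that plain NumPy would allocate.
result = ne.evaluate("2 * a + b ** 2")

# Numba compiles the Python function to machine code just in time.
@njit
def dot(x, y):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * y[i]
    return total

print(dot(a, b))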
2. General techniques for handling large volumes of data…
3. Choose the right tools…
Use Python as a master to control other tools
• Most software and tool producers support a Python interface to their software.
• This enables you to tap into specialized pieces of software with the ease and productivity that comes with Python.
• In this way Python sets itself apart from other popular data science languages such as R and SAS.
3. General programming tips for dealing with large data sets
1. Don't reinvent the wheel.
Use tools and libraries developed by others.
2. Get the most out of your hardware.
A machine is rarely used to its full potential; with simple adaptations you can make it work harder.
3. Reduce the computing need.
Slim down memory and processing needs as much as possible.
3. General programming tips for dealing with large data sets…
1. Don't reinvent the wheel.
Solving a problem that has already been solved is a waste of time.
Data scientists adopt the following two rules, which help when dealing with large data and make them much more productive:
■ Exploit the power of databases.
• Prepare analytical base tables inside a database when working with large data sets.
• When this preparation involves advanced modeling, find out whether it's possible to employ user-defined functions and procedures.
■ Use optimized libraries.
• Creating libraries like Mahout, Weka, and other machine learning packages requires time and knowledge.
• They are highly optimized and incorporate best practices and state-of-the-art technologies.
• Spend your time on getting things done, not on reinventing and repeating other people's efforts, unless it's for the sake of understanding how things work.
3. General programming tips for dealing with large data sets…
2. Get the most out of your hardware
Some resources on a computer can be idle while other resources are over-utilized.
This slows down programs and can even make them fail.
Sometimes it's possible (and necessary) to shift the workload from an overtaxed resource to an underutilized one using the following techniques:

■ Feed the CPU compressed data
• A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data.
• This shifts more work from the hard disk to the CPU (see the sketch below).
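A minimal sketch of reading a compressed file directly, so the disk moves fewer bytes and the CPU does the decompression; the file name and column layout are illustrative assumptions.

import gzip

# The CPU decompresses on the fly; the disk only has to deliver the
# (much smaller) compressed bytes.
with gzip.open("observations.csv.gz", "rt") as f:
    for line in f:
        fields = line.rstrip("\n").split(",")
        # ... process one observation at a time ...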
3. General programming tips for dealing with large data sets…
2. Get the most out of your hardware...

■ Make use of the GPU
• Sometimes the CPU, not memory, is the bottleneck. If the computations are parallelizable, it is better to switch to the GPU.
• The GPU offers higher throughput for parallelizable computations than a CPU.
• The GPU is enormously efficient at parallelizable jobs but has less cache than the CPU.
• It is pointless to switch to the GPU when your hard disk is the problem.
• Several Python packages, such as Theano and NumbaPro, will use the GPU without much programming effort.

■ Use multiple threads
• Parallelize computations on the CPU.
• This can be achieved with normal Python threads (see the sketch below for a process-pool variant).
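A minimal sketch of parallelizing independent work across CPU cores with the standard library's concurrent.futures; a process pool is used here because, for CPU-bound work in CPython, the GIL limits what plain threads can gain. The workload function is illustrative.

from concurrent.futures import ProcessPoolExecutor

def heavy_computation(n):
    # Illustrative CPU-bound task.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workloads = [10_000_000] * 8
    with ProcessPoolExecutor() as pool:              # one worker per core by default
        results = list(pool.map(heavy_computation, workloads))
    print(results[:2])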
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs
"Working smart + hard = achievement." Apply this to the programs you write.
The best way to avoid large-data problems is to remove as much of the work as possible up front and let the computer work only on the part that can't be skipped.
The following list contains methods to help you achieve this:

■ Profile your code and remediate slow pieces of code, e.g. with Python's cProfile or line_profiler.
• Not every piece of code needs to be optimized.
• Use a profiler to detect slow parts inside your program and remediate those parts (see the sketch below).

■ Use compiled code whenever possible, certainly when loops are involved.
• Whenever possible, use functions from packages that are optimized for numerical computations instead of implementing everything yourself.
• The code in these packages is often highly optimized and compiled.
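A minimal sketch of profiling with the standard library's cProfile; the function being profiled is illustrative.

import cProfile
import pstats

def slow_sum(n):
    # Illustrative function worth profiling.
    return sum(i ** 2 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
profiler.disable()

# Print the functions that consumed the most cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)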
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Compile the code yourself
• If you can't use an existing package, use either a just-in-time compiler or implement the slowest parts of your code in a lower-level language such as C or Fortran and integrate that with your codebase.

■ Avoid pulling data into memory
• When you work with data that doesn't fit in memory, avoid pulling everything into memory.
• Read the data in chunks and parse it on the fly (see the sketch below).
• This won't work with every algorithm, but it enables calculations on extremely large data sets.
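A minimal sketch of chunked processing with pandas; the file name, chunk size, and column name are illustrative assumptions.

import pandas as pd

total = 0.0
# Read 100,000 rows at a time instead of the whole file.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total += chunk["amount"].sum()   # aggregate per chunk, keep only the running total

print(total)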
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Use generators to avoid intermediate data storage
• Generators are a powerful Python tool that allows efficient, memory-friendly processing of large datasets by producing elements on the fly without storing them in memory.
• Generators return the data per observation instead of in batches.
• This way you avoid storing intermediate results.
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…
■ Use generators to avoid intermediate data storage

Suppose we have a large CSV file with millions of records that we need to process
line by line. Instead of reading the entire file into memory and then processing it,
we can use a generator to iterate over each line one at a time, allowing us to
avoid storing the entire file in memory.
def read_csv(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip().split(',')

# Process each line of the CSV file using the generator
for row in read_csv('data.csv'):
    # Perform some processing on each row
    print(row)
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…

■ Use as little data as possible
• If no large-scale algorithm is available and you're not willing to implement such a technique, you can still train on only a sample of the original data.

■ Use your math skills to simplify calculations as much as possible
• Example: (a + b)² = a² + 2ab + b²
• The left side is computed much faster than the right side of the equation.
Case study 1: Predicting malicious URLs
• Many companies, such as Google, try to protect us from fraud by detecting malicious websites.
• Doing so is no easy task, because the internet has billions of web pages to scan.
• How do you work with a data set that no longer fits in memory?

■ Data—The project contains data from 120 days (120 files), and each observation has approximately 3,200,000 features.
• The target variable is 1 if the website is malicious and -1 otherwise.

■ The scikit-learn library—Simple and efficient tools for predictive data analysis.
Case study 1: Predicting malicious URLs…
Step 1: Defining the research goal
• The goal of the project is to detect whether certain URLs can be trusted or not.
• We aim to do this in a memory-friendly way.

Step 2: Acquiring the URL data
• Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and place it in a folder.
• Choose the data in SVMLight format.
• SVMLight is a text-based format with one observation per row. To save space, it leaves out the zeros (see the illustration below).
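For illustration only (these lines are made up, not taken from the actual data set), an SVMLight file stores the label followed by only the non-zero feature:value pairs:

+1 4:0.091 1250:1 38422:1
-1 17:0.230 512:1 907311:1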
Case study 1: Predicting malicious URLs…
Step 2: Acquiring the URL data…
The following listing and figure 4.10 show what happens when you try to read in 1 file
out of the 120 and create the normal matrix
Case study 1: Predicting malicious URLs…
Step 2: Acquiring the URL data…

TOOLS AND TECHNIQUES
■ Use a sparse representation of the data
■ Feed the algorithm compressed data instead of raw data
■ Use an online algorithm to make predictions

Step 3: Preparation of data
Data preparation and cleansing is not necessary in this case because the URL data comes pre-cleaned.
Case study 1: Predicting malicious URLs…
Step 4: Data exploration
• Find out whether the data really does contain lots of zeros.
• Check this with the following piece of code (written here for Python 3):

print("number of non-zero entries %2.6f" % (float(X.nnz) / (float(X.shape[0]) * float(X.shape[1]))))

• This outputs: number of non-zero entries 0.000033

• Data that contains little information compared to zeros is called sparse data.

• This can be stored more compactly if you save the data as [(0,0,1),(4,4,1)]
  instead of [[1,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,1]]
Case study 1: Predicting malicious URLs…
Step 4: Data exploration…
• The file format is SVMLight.
• Keep the data compressed while checking for the maximum number of observations and variables.
• Read the data in file by file (this way you consume even less memory).
• Feed the CPU compressed files:
• keep the data packed in tar.gz format,
• unpack a file only when it is needed,
• work on the first 5 files (as sketched below).
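A minimal sketch of iterating over the compressed archive file by file; the archive name and member names are illustrative assumptions.

import tarfile

# Open the archive but extract only one member at a time, in memory.
with tarfile.open("url_svmlight.tar.gz", "r:gz") as archive:
    files = [m for m in archive.getmembers() if m.name.endswith(".svm")]
    for member in files[:5]:                 # work on the first 5 files only
        f = archive.extractfile(member)      # file-like object, nothing written to disk
        # ... pass f to the SVMLight loader (see the modeling step) ...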
Case study 1: Predicting malicious URLs…
Step 4: Data exploration…
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…

• Binary cross-entropy is a loss


function used in binary
classification problems where the
target variable has two possible
outcomes, 0 and 1
• It measures the performance of
the classification model whose
output is a probability is a value
between them.
• The goal of the model is to
minimize this loss function during
training to improve its predictive
accuracy.
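For reference, the standard form of this loss for a single observation with true label y ∈ {0, 1} and predicted probability p is

L(y, p) = -[ y · log(p) + (1 - y) · log(1 - p) ]

and the model minimizes the average of this loss over all training observations.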
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…
Precision and recall are two key metrics used in evaluating the performance of classification models in machine learning.
They are particularly important for imbalanced datasets or applications where false positives and false negatives carry different consequences.
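In terms of true positives (TP), false positives (FP), and false negatives (FN):

Precision = TP / (TP + FP)   (of the sites flagged as malicious, how many really are malicious)
Recall    = TP / (TP + FN)   (of the truly malicious sites, how many were flagged)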
Case study 1: Predicting malicious URLs…
Step 5: Data modeling
Now apply the sparse representation of the compressed files and use an online algorithm (a sketch follows below).
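A minimal sketch of how such an online model might be trained, assuming scikit-learn's SGDClassifier with logistic loss and the load_svmlight_file helper; the archive iteration mirrors the earlier tarfile sketch, and the feature count shown is illustrative (use the maximum feature index found during exploration so all files share the same matrix width).

import tarfile
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

clf = SGDClassifier(loss="log_loss")   # logistic regression trained online ("log" in older scikit-learn)
classes = [-1, 1]

with tarfile.open("url_svmlight.tar.gz", "r:gz") as archive:
    files = [m for m in archive.getmembers() if m.name.endswith(".svm")][:5]
    for member in files:
        f = archive.extractfile(member)
        X, y = load_svmlight_file(f, n_features=3231952)   # sparse matrix, never densified
        clf.partial_fit(X, y, classes=classes)              # learn from this file, then discard it

# Evaluate on the last file read (for illustration only).
print(classification_report(y, clf.predict(X)))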
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…

Precision 0.97: about 3% (1 - 0.97) of the sites flagged as malicious are falsely accused.
Recall 0.94: about 6% (1 - 0.94) of the malicious sites are not detected.
This is a decent result, so we can conclude that the methodology works.

Step 6: Presentation and automation
Case study 2: Building a recommender system inside a database
This case study covers how to use the hash table data structure and how to use Python to control other tools.
TOOLS
■ pandas Python library
■ MySQL database—
• Install a MySQL community server; you can download it from www.mysql.com.
• Appendix C, "Installing a MySQL server," explains how to set it up.
■ MySQL database connection Python library—
• Install SQLAlchemy or another library capable of communicating with MySQL.
• Here, we use MySQLdb.
• First install Binstar (another package management service) and look for the appropriate mysql-python package for your Python setup.
• Execute the following commands (shown for Windows):
conda install binstar
binstar search -t conda mysql-python
conda install --channel https://conda.binstar.org/krisvanneste mysql-python
Case study 2: Building a recommender system inside a database…
k-nearest neighbors in machine learning
Example
• Look for customers who have rented movies similar to the ones you have rented, and then suggest the movies they have watched but you have not seen yet. This technique is called k-nearest neighbors in machine learning.
Locality-sensitive hashing:
• Construct hash functions that
• put similar customers close together, in a bucket with the same label, and
• make sure that objects that are different are put in different buckets.
• The Hamming distance is used to calculate how much two labels differ.
• A hash function maps any range of input to a fixed output.
• The simplest hash function concatenates the values from several random columns.
• This turns many input columns (scalable) into a single, fixed output column.
Case study 2: Building a recommender system inside a database…
Locality-sensitive hashing…
Example
Use three hash functions to find similar customers.
The three functions take the values of three movies each:
■ The first function takes the values of movies 10, 15, and 28.
■ The second function takes the values of movies 7, 18, and 22.
■ The last function takes the values of movies 16, 19, and 30.
Then compare the customers within the same bucket with each other using the Hamming distance: the number of positions at which two strings differ (see the sketch below).
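A minimal sketch of the idea; the customer vectors are made up for illustration, and only the movie columns named above are used by the hash functions.

# 1 = the customer has rented the movie, 0 = has not (movies indexed from 1).
customer_a = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1] * 3   # 30 illustrative movie columns
customer_b = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1] * 3

def hash_label(movies, columns):
    # Concatenate the values of a few chosen columns into a bucket label.
    return "".join(str(movies[c - 1]) for c in columns)

hash_functions = [(10, 15, 28), (7, 18, 22), (16, 19, 30)]
label_a = "".join(hash_label(customer_a, cols) for cols in hash_functions)
label_b = "".join(hash_label(customer_b, cols) for cols in hash_functions)

def hamming(s1, s2):
    # Number of positions at which the two labels differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(label_a, label_b, hamming(label_a, label_b))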
Case study 2: Building a recommender system inside a database…
Locality-sensitive hashing…
• Comparing multiple columns is an expensive operation.
• Apply a trick to speed this process up:
• the columns contain a binary (0 or 1) variable that indicates whether a customer has rented a movie or not;
• concatenate this information so that the same information is contained in a single new column.
• The table shows the "movies" variable, which contains as much information as all the movie columns combined.

Now finding similar customers becomes very simple.
Case study 2: Building a recommender system inside a database…
Step 1: Set the research goal
Use what movies people rent to predict what other movies they might like: an automated system that learns people's preferences and recommends movies the customers haven't tried yet.

• Create a memory-friendly recommender system.
• Also skip the data exploration step and move straight into model building.

Step 2: Data retrieval
• We create the data ourselves for this case study, so we can skip the data retrieval step and move right into data preparation.
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation
• Create the data.
• Connect Python to MySQL to create the data.
• Make a connection to MySQL using your username and password (a sketch follows below).
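A minimal sketch of such a connection and of generating the data; the user name, password, database name, table name, and number of customers/movies are illustrative assumptions, not the book's actual values.

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Connection string format: mysql+mysqldb://<user>:<password>@<host>/<database>
engine = create_engine("mysql+mysqldb://root:secret@localhost/test")

# Simulate 100 customers and 32 movies: 1 = rented, 0 = not rented.
rng = np.random.RandomState(42)
data = pd.DataFrame(rng.randint(0, 2, size=(100, 32)),
                    columns=["movie%d" % i for i in range(1, 33)])
data.to_sql("cust", engine, if_exists="replace", index=True, index_label="cust_id")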
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…

To efficiently query our database, need additional data preparation as


follows:
■ Creating bit strings:
The bit strings are compressed versions of the columns’ content (0
and 1 values). First these binary values are concatenated; then the
resulting bit string is reinterpreted as a number.

■ Defining hash functions:


The hash functions will create the bit strings.

■ Adding an index to the table:


to quicken data retrieval.
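A minimal sketch of turning blocks of eight 0/1 movie columns into numbers, continuing from the data frame created in the earlier connection sketch; the helper name, the column names bul1..bul4, and the grouping of 32 movies into four groups of eight are assumptions modeled on the description above.

def create_bitstring(row, columns):
    # Concatenate eight 0/1 values and reinterpret the result as an integer.
    bits = "".join(str(int(row[c])) for c in columns)   # e.g. "01101001"
    return int(bits, 2)                                  # e.g. 105

# Four groups of eight movies -> four numbers per customer.
groups = [["movie%d" % i for i in range(start, start + 8)] for start in (1, 9, 17, 25)]

store = data.copy()
for g, columns in enumerate(groups, start=1):
    store["bul%d" % g] = data.apply(create_bitstring, axis=1, columns=columns)

# Write the compressed representation back to the database.
store.to_sql("cust", engine, if_exists="replace", index=True, index_label="cust_id")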
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…

By converting the information from 32 columns into 4 numbers, we compressed it.
The figure shows the first 2 observations (customer movie-view history) in this new format:
store[0:2]

Next, determine whether two customers have similar behavior.


Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 3: Data preparation…
Case study 2: Building a recommender system inside a database…
Step 4: Data exploration
Skip the data exploration step and move to model building.

Step 5: Model building
To use the Hamming distance in the database, we need to define it as a function.
CREATING THE HAMMING DISTANCE FUNCTION: We implement this as a user-defined function.
This function calculates the distance for a 32-bit integer (actually 4 * 8 bits), as shown in the following listing (a sketch of such a function is given below).
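A sketch of what such a user-defined function could look like, sent to MySQL from Python; the function name, argument names, and the use of four integer arguments are assumptions modeled on the description above, not the book's exact listing.

from sqlalchemy import text

hamming_udf = text("""
CREATE FUNCTION HAMMINGDISTANCE(
    A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
    B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
     + BIT_COUNT(A2 ^ B2) + BIT_COUNT(A3 ^ B3)
""")

with engine.connect() as connection:   # engine from the earlier connection sketch
    connection.execute(hamming_udf)    # XOR the pieces, count the differing bits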
Case study 2: Building a recommender system inside a database…
Step 5: Model building
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation

The application needs to perform two steps:
■ Look for similar customers.
■ Suggest movies the customer has yet to see, based on what he or she has already viewed and the viewing history of the similar customers.

FINDING A SIMILAR CUSTOMER
In the following listing, customer 27 is selected, and we look for the best movie to recommend to him.
We need to select customers with a similar viewing history (a query sketch is given below).
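A sketch of what such a query could look like from Python, assuming the table, column, and function names used in the earlier sketches (cust, bul1..bul4, HAMMINGDISTANCE); it ranks the other customers by their Hamming distance to customer 27.

import pandas as pd

target = pd.read_sql("SELECT bul1, bul2, bul3, bul4 FROM cust WHERE cust_id = 27", engine).iloc[0]

query = """
SELECT cust_id,
       HAMMINGDISTANCE(bul1, bul2, bul3, bul4, %d, %d, %d, %d) AS distance
FROM cust
WHERE cust_id <> 27
ORDER BY distance ASC
LIMIT 3
""" % (target.bul1, target.bul2, target.bul3, target.bul4)

similar_customers = pd.read_sql(query, engine)
print(similar_customers)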
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation
Case study 2: Building a recommender system inside a database…
Step 6: Presentation and automation
Table 4.5 shows customers 2 and 97 to be the most similar to customer 27.
The data was generated randomly, so anyone replicating this example might receive
different results.

Now we can finally select a movie for customer 27 to watch.


Case study 2: Building a recommender system inside a database…
