DS Unit-2 PDF
• When you partition a matrix into a block matrix, you divide the full matrix into
parts and work with the smaller parts instead of the full matrix.
• Load the smaller matrices into memory and perform the calculations, thereby
avoiding an out-of-memory error.
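A minimal sketch of this idea with NumPy, assuming an 8,000 x 8,000 matrix-vector product computed 2,000 rows at a time (the sizes and the random data are purely illustrative; in practice each block would be read from disk):

import numpy as np

n, block = 8_000, 2_000
rng = np.random.default_rng(0)
x = rng.random(n)              # the vector to multiply with
result = np.empty(n)

for start in range(0, n, block):
    stop = start + block
    # Only one block of the matrix is in memory at any time.
    A_block = rng.random((block, n))
    result[start:stop] = A_block @ x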
• Python tools:
■ bcolz is a Python library that can store data arrays compactly and uses
the hard drive when the array no longer fits into the main memory.
• Numexpr is a numerical expression evaluator for NumPy and can be many times faster
than the original NumPy.
• Numba helps you achieve greater speed by compiling your code right before you execute it, also
known as just-in-time compiling.
• Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy.
• Theano enables you to work directly with the graphics processing unit (GPU) and performs
symbolic simplifications whenever possible.
• Dask enables you to optimize the flow of calculations and execute them efficiently. It also enables
you to distribute calculations.
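A minimal usage sketch for two of the tools above, Numexpr and Numba (the arrays and the summing function are illustrative assumptions):

import numpy as np
import numexpr as ne
from numba import njit

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Numexpr evaluates the whole expression in optimized, multithreaded compiled code.
c = ne.evaluate("2*a + 3*b")

# Numba compiles this function just in time, so the explicit loop runs at native speed.
@njit
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

print(total(a))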
2. General techniques for handling large volumes of data…
3. Choose the right tools…
■ Profile your code and remediate the slow pieces, e.g. in Python with cProfile or line_profiler.
• Not every piece of code needs to be optimized.
• Use a profiler to detect the slow parts inside your program and remediate those parts.
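A minimal profiling sketch with cProfile (slow_function is a hypothetical example of a slow piece of code):

import cProfile

def slow_function():
    # Deliberately inefficient: repeated string concatenation in a loop.
    text = ""
    for i in range(100_000):
        text += str(i)
    return text

# Print per-function timings, sorted by cumulative time, to spot the slow parts.
cProfile.run("slow_function()", sort="cumulative")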
■ Use compiled code whenever possible, certainly when loops are involved.
• Whenever possible, use functions from packages that are optimized for numerical
computations instead of implementing everything yourself.
• The code in these packages is often highly optimized and compiled.
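For example, summing an array with NumPy's compiled sum instead of a pure Python loop (the array size is illustrative; exact speedups vary by machine):

import numpy as np

values = np.arange(1_000_000)

# Pure Python loop: every iteration is interpreted.
total = 0
for v in values:
    total += v

# NumPy's sum runs in compiled code and is typically much faster.
total = values.sum()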
3. General programming tips for dealing with large data sets…
3. Reduce the computing needs…
• Generators are a powerful tool in Python that allow efficient, memory-friendly
processing of large datasets by generating elements on the fly, without the need
to store them all in memory.
Suppose we have a large CSV file with millions of records that we need to process
line by line. Instead of reading the entire file into memory and then processing it,
we can use a generator to iterate over each line one at a time, allowing us to
avoid storing the entire file in memory.
def read_csv(file_path):
    # Open the file and yield one parsed row at a time instead of loading everything.
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip().split(',')
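A usage sketch of this generator (the file name large_file.csv is an assumption): counting the rows this way never holds more than one line in memory.

row_count = sum(1 for row in read_csv('large_file.csv'))
print(row_count)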
■ Data—The project contains data from 120 days (120 files), and each
observation has approximately 3,200,000 features.
• The target variable contains 1 if it’s a malicious website and -1
otherwise.
Data preparation and cleansing is not necessary in this case because the URL data
comes pre-cleaned.
Case study 1: Predicting malicious URLs…
Step 4: Data exploration
• Find out whether the data contains lots of zeros.
• You can check this with a small piece of code (see the sketch after this list).
• Data that contains little information compared to zeros is called sparse data.
• This can be saved more compactly if you store the data as [(0,0,1),(4,4,1)]
instead of [[1,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,1]]
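A minimal sketch of such a check and of sparse storage, using the 5 x 5 example above and SciPy's COO format:

import numpy as np
from scipy.sparse import coo_matrix

data = np.array([[1, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 1]])

# Fraction of entries that are zero; a high value means the data is sparse.
print('zero fraction:', np.mean(data == 0))

# COO format keeps only the (row, column, value) triples of the non-zero entries.
sparse = coo_matrix(data)
print(sparse)    # (0, 0) 1   and   (4, 4) 1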
Case study 1: Predicting malicious URLs…
Step 4: Data exploration…
• The file format is SVMLight.
• Keep the data compressed while checking for the maximum number of
observations and variables.
• Read the data in file by file (this way you consume even less memory).
• Feed the CPU compressed files:
• Pack the data in tar.gz format.
• Unpack a file only when it is needed.
• Work on the first 5 files.
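A minimal reading sketch, assuming the archive is named url_svmlight.tar.gz and contains SVMLight files ending in .svm (both names are assumptions); it uses scikit-learn's load_svmlight_file and unpacks one member at a time:

import tarfile
from sklearn.datasets import load_svmlight_file

with tarfile.open('url_svmlight.tar.gz', 'r:gz') as tar:
    svm_files = [m for m in tar.getmembers() if m.name.endswith('.svm')]
    for member in svm_files[:5]:                 # work on the first 5 files
        f = tar.extractfile(member)              # unpack only this one member
        X, y = load_svmlight_file(f)             # X is returned as a sparse matrix
        print(member.name, X.shape)              # observations x variables per file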
Case study 1: Predicting malicious URLs…
Step 5: Data modeling…