
PPDS - Unit 3

This document provides an overview of Python modules and packages, explaining how to create and use them for organizing code. It covers built-in modules like os, sys, random, and statistics, detailing their functions and methods. Additionally, it introduces multithreading concepts in Python, including thread creation and management, as well as handling binary files.

Uploaded by

gopi kesavan

UNIT III – Python for Data Science

Module in Python
A module is a Python file with a .py extension that contains a collection of functions,
classes and global variables. Its code can be imported into another Python program, and
a module can also be run directly as a script. In simple terms, we can consider a module
to be the same as a code library: a file that contains a set of functions that you want
to include in your application. To organize related modules, Python has the concept
called Package.
The Python standard library contains well over 200 modules, although the exact
number varies between distributions.
Create a module
Save this code in a file named mymodule.py
def greeting(name):
    print("Hello, " + name)
Use a Module
Now we can use the module we just created, by using the import statement:
import mymodule
mymodule.greeting("Jonathan")
Packages
Python packages are a way to organize and structure code by grouping related
modules into directories.
A package is essentially a folder that contains an __init__.py file and one or more
Python files (modules).
This organization helps manage and reuse code effectively, especially in larger
projects. It also allows functionality to be easily shared and distributed across different
applications.
Packages act like toolboxes, storing and organizing tools (functions and classes) for
efficient access and reuse.
Key Components of a Python Package
● Module: A single Python file containing reusable code (e.g., math.py).
● Package: A directory containing modules and a special __init__.py file.
● Sub-Packages: Packages nested within other packages for deeper organization.
How to create and access packages in python
1. Create a Directory: Make a directory for your package. This will serve as the
root folder.
2. Add Modules: Add Python files (modules) to the directory, each representing
specific functionality.
3. Include __init__.py: Add an __init__.py file (can be empty) to the directory to
mark it as a package.
4. Add Sub packages (Optional): Create subdirectories with their own
__init__.py files for sub packages.
5. Import Modules: Use dot notation to import, e.g., from mypackage.module1
import greet.
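The five steps above can be tried end to end with the following minimal sketch. It builds a throwaway package in a temporary directory so that it runs anywhere; the names mypackage, module1 and greet are only the illustrative names used in step 5, and the temporary directory is an assumption made for the demonstration (a real project would keep the package next to the script).

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Step 1: create a directory for the package (hypothetical name: mypackage).
root = Path(tempfile.mkdtemp())
package_dir = root / "mypackage"
package_dir.mkdir()

# Step 3: an empty __init__.py marks the directory as a package.
(package_dir / "__init__.py").write_text("")

# Step 2: add one module containing one function.
(package_dir / "module1.py").write_text(
    "def greet(name):\n"
    "    return 'Hello, ' + name\n"
)

# Step 5: make the parent directory importable, then import with dot notation.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
from mypackage.module1 import greet

message = greet("Jonathan")
print(message)
```

Only the parent directory of the package has to be on sys.path; Python then finds module1 through the package name.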
Python os Module

Python has a built-in os module with methods for interacting with the operating
system, like creating files and directories, management of files and directories, input,
output, environment variables, process management, etc.

OS Methods

Method                   Description
os._exit()               Exits the process with the specified status
os.abort()               Terminates a running process immediately
os.access()              Uses the real uid/gid to check access to a path
os.add_dll_directory()   Adds a path to the DLL search path
os.chdir()               Change the current working directory
os.chflags()             Sets the flags of path to the numeric flags
os.chmod()               Changes the mode of path to the numeric mode
os.getlogin()            Returns the name of the user logged in to the terminal
os.getgid()              Returns the real group id of the current process
os.getpid()              Returns the process id of the current process
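A few of the methods from the table can be combined in a short sketch. It only uses calls that are available on every platform (os.getcwd, os.chdir, os.access, os.getpid); the temporary directory is an assumption made so the example does not touch real files.

```python
import os
import tempfile

start_dir = os.getcwd()               # remember where we started
pid = os.getpid()                     # process id of this interpreter

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                     # change the current working directory
    inside = os.getcwd()
    # os.access() checks permissions on a path: here, existence and writability
    can_write = os.access(tmp, os.F_OK) and os.access(tmp, os.W_OK)
    os.chdir(start_dir)               # go back before the directory is deleted

print(pid, inside, can_write)
```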
OS Objects
os.name
os.name is an attribute (not a function) that holds the name of the operating-system-
dependent module that was imported. Its value depends on the underlying operating
system. In Python 3 the registered values are ‘posix’, ‘nt’ and ‘java’. Let’s execute
this on the system:
>>> import os
>>> print(os.name)
posix
os.environ
environ is not a function but a mapping object (similar to a dictionary) through which
we can access the environment variables of the process. Let’s see a sample code snippet:
import os
output = os.environ['HOME']
print(output)

os.getuid
This os module function returns the current process’s user ID, or UID as it is popularly
known. It is available on Unix-like systems only.
>>> os.getuid()
501
os.rename
This function renames a file or directory; the source path must already exist.
import os
fileDir = "JournalDev.txt"
os.rename(fileDir, 'JournalDev_Hi.txt')
os.system
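The os.system function allows us to run a command from a Python script, just as if
we were running it in the shell. It returns the exit status of the command, not its
output. For example:
import os
# Writes the list of logged-in users to users.txt; a status of 0 usually means success
exit_status = os.system("users > users.txt")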
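Because os.system only reports an exit status, capturing what a command prints needs a different tool. The sketch below uses subprocess.run from the standard library, which these notes do not otherwise cover, so treat it as a side example. To stay portable it starts a second Python interpreter instead of a shell command such as users.

```python
import subprocess
import sys

# Run a child Python process so the example works on any platform.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child process')"],
    capture_output=True,   # collect stdout and stderr instead of printing them
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit status
)

output = result.stdout.strip()
print(result.returncode, output)
```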
os.getpid
This function returns the current process ID, or PID as it is popularly known.
>>> os.getpid()
71622
OS Constants
os.EX_CANTCREAT : User specified output file could not be created
os.EX_CONFIG : Some kind of configuration error occurred
os.EX_DATAERR : Input data was incorrect
os.EX_IOERR : An error occurred while doing input/output on some file
os.EX_NOHOST : The host did not exist
os.EX_NOINPUT : The input file did not exist or was not readable
Python sys Module
The sys module in Python provides various functions and variables that are used to
manipulate different parts of the Python runtime environment. It allows operating on
the interpreter as it provides access to the variables and functions that interact strongly
with the interpreter.
Python sys.version
sys.version is a string containing the version number of the Python interpreter along
with some additional information, such as the build date and the compiler used.
import sys
print(sys.version)
3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0]
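sys.version is meant for display. When a program has to compare versions, the sys module also offers sys.version_info, which is easier to test. A minimal sketch follows; the 3.8 threshold is an arbitrary choice for illustration.

```python
import sys

# sys.version_info is a tuple-like object: (major, minor, micro, releaselevel, serial)
major, minor = sys.version_info.major, sys.version_info.minor
version_label = f"{major}.{minor}"

# Tuples compare element by element, so this checks for "at least 3.8".
is_at_least_3_8 = sys.version_info >= (3, 8)

print(version_label, is_at_least_3_8)
```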
Input and Output using Python Sys
The sys module provides variables for better control over input and output:
● stdin
● stdout
● stderr
Read from stdin in Python
stdin: A file object used for standard input. It can be used to read input from the
command line directly. The built-in input() function reads from it internally. Each
line read from sys.stdin keeps its trailing ‘\n’.
Example
import sys
for line in sys.stdin:
    if 'q' == line.rstrip():
        break
    print(f'Input : {line}')
print("Exit")

sys.stdout Method
stdout: A built-in file object that corresponds to the interpreter’s standard output
stream in Python. stdout is used to display output directly to the screen console.
Output can be of any form: it can be output from a print statement, an expression
statement, and even a prompt for input. By default, streams are in text mode.
import sys
sys.stdout.write('Geeks')
sys.stderr in Python
stderr: The standard error stream. Error messages, including the traceback of an
uncaught exception, are written to sys.stderr.
Example
import sys
def print_to_stderr(*a):
    print(*a, file=sys.stderr)
print_to_stderr("Hello World")
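To see that stdout and stderr really are two separate streams, both can be captured into in-memory buffers. This is a small sketch using contextlib and io from the standard library, which are assumptions of the example and not part of the notes above.

```python
import contextlib
import io
import sys

out_buffer = io.StringIO()
err_buffer = io.StringIO()

# Temporarily point sys.stdout and sys.stderr at in-memory buffers.
with contextlib.redirect_stdout(out_buffer), contextlib.redirect_stderr(err_buffer):
    print("normal result")                          # goes to sys.stdout
    print("something went wrong", file=sys.stderr)  # goes to sys.stderr
    sys.stdout.write("no newline added")            # write() does not append '\n'

captured_out = out_buffer.getvalue()
captured_err = err_buffer.getvalue()
print(repr(captured_out))
print(repr(captured_err))
```

Keeping errors on stderr means a user can redirect the normal output of a script to a file and still see the error messages on the screen.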
Python Random Module
Python has a built-in module called random that you can use to generate random
numbers. These are pseudo-random numbers, which means they are produced by a
deterministic algorithm and are not truly random.
This module can be used to perform random actions such as generating random
numbers or picking a random value from a list or string.
Method
1. seed() : Initializes the random number generator
2. getstate() : Returns the current internal state of the random number generator
3. setstate() : Restores the internal state of the random number generator
4. getrandbits() : Returns an integer with the specified number of random bits
5. randrange() : Returns a random integer from the given range
6. choice() : Returns a random element from the given sequence
7. choices() : Returns a list of elements selected at random, with replacement, from
the given sequence
8. shuffle() : Shuffles a list in place and returns None
9. sample() : Returns a list of unique elements chosen at random from a sequence
10. random() : Returns a random float in the range [0.0, 1.0)
11. uniform() : Returns a random float number between two given parameters

Example 1

import random
random.seed(5)
print(random.random())
print(random.random())
Output
0.6229016948897019
0.7417869892607294
Example 2
Syntax: randint(start, end), which returns a random integer between start and end,
both included
import random
r1 = random.randint(5, 15)
print("Random number between 5 and 15 is % s" % (r1))
r2 = random.randint(-10, -2)
print("Random number between -10 and -2 is % d" % (r2))
Output (the values vary from run to run)
Random number between 5 and 15 is 10
Random number between -10 and -2 is -2
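Several of the listed methods can be seen side by side in the following sketch. It uses a private random.Random instance so that the seed does not affect other code; the seed values 42 and 7 and the cards list are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)          # a private, seeded generator; 42 is an arbitrary seed

cards = ["A", "K", "Q", "J", "10"]
picked = rng.choice(cards)                        # one element
hand = rng.sample(cards, k=3)                     # three distinct elements, cards is untouched
rolls = [rng.randrange(1, 7) for _ in range(5)]   # integers from 1 to 6
price = rng.uniform(10.0, 20.0)                   # float between 10.0 and 20.0

deck = list(cards)
returned = rng.shuffle(deck)                      # shuffles deck in place, returns None

# Two generators seeded with the same value produce the same sequence.
first = random.Random(7).random()
second = random.Random(7).random()
print(picked, hand, rolls, round(price, 2), deck, first == second)
```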

Python statistics Module


Python has a built-in module that you can use to calculate mathematical statistics of
numeric data.
Method
1. statistics.harmonic_mean() : Calculates the harmonic mean (central location) of
the given data
2. statistics.mean() : Calculates the mean (average) of the given data
3. statistics.median() : Calculates the median (middle value) of the given data
4. statistics.median_grouped() : Calculates the median of grouped continuous data
5. statistics.median_high() : Calculates the high median of the given data
6. statistics.median_low() : Calculates the low median of the given data
7. statistics.mode() : Calculates the mode (central tendency) of the given numeric
or nominal data
8. statistics.pstdev() : Calculates the standard deviation from an entire population
9. statistics.stdev() : Calculates the standard deviation from a sample of data
10. statistics.pvariance() : Calculates the variance of an entire population
11. statistics.variance() : Calculates the variance from a sample of data

Example 1

import statistics
dataset =[2, 4, 7, 7, 2, 2, 3, 6, 6, 8]
print("Calculated Mode % s" % (statistics.mode(dataset)))
Output

Calculated Mode 2

Example 2

import statistics
set1 = [4, 6, 2, 5, 7, 7]
print("Low median of data-set is % s " % (statistics.median_low(set1)))
Output:
Low median of data-set is 5
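The difference between the median variants, and between the population and sample versions of the standard deviation, is easiest to see on one small data set. The values below are made up for the illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample values, already sorted

mean = statistics.mean(data)            # arithmetic average: 40 / 8
median = statistics.median(data)        # average of the two middle values (4 and 5)
low = statistics.median_low(data)       # lower of the two middle values
high = statistics.median_high(data)     # higher of the two middle values
pop_sd = statistics.pstdev(data)        # divides the squared deviations by n
sample_sd = statistics.stdev(data)      # divides the squared deviations by n - 1

print(mean, median, low, high, pop_sd, round(sample_sd, 3))
```

Because stdev() divides by n - 1 instead of n, it always gives a slightly larger value than pstdev() for the same data.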
THREAD
Threads
● A thread is a sequence of code that runs within a process
● Threads share the virtual address space of the process
● Each thread has its own scheduling priority
● Threads can execute any part of the program code
Multithreading
● Multithreading is the application's version of multitasking
● Multithreading can improve application responsiveness
● Multithreading can reduce the computing resources used, because threads share the
memory of their process
● Multithreading can make programs more efficient
Multithreading in different programming languages
● In Java, multithreading is supported through the Thread class and, in recent
versions, virtual threads
● In Python, multithreading can be implemented with the threading module or a thread pool
Multithreading strategies
● Multithreading can be executed concurrently or in parallel
● Multithreading can be implemented using a one-to-one, many-to-one, or many-to-
many model
This section covers the basics of multithreading in the Python programming language.
Just like multiprocessing, multithreading is a way of achieving multitasking. In
multithreading, the concept of threads is used. Let us first understand the concept
of a thread in computer architecture.
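The thread pool mentioned in the list above can be sketched with concurrent.futures.ThreadPoolExecutor from the standard library. The fetch function is a hypothetical stand-in for an I/O-bound task, and the pool size of 3 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(item):
    """Stand-in for an I/O-bound task such as a download (hypothetical)."""
    time.sleep(0.05)          # simulate waiting on the network or disk
    return item * 2

items = [1, 2, 3, 4, 5]

# The pool starts up to 3 worker threads and reuses them for all items.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, items))   # results keep the input order

print(results)
```

A pool avoids creating one thread per task, which matters when there are many small tasks.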

What is a Process in Python?


In computing, a process is an instance of a computer program that is being executed.
Any process has 3 basic components:
● An executable program.
● The associated data needed by the program (variables, workspace, buffers, etc.)
● The execution context of the program (State of the process)

An Intro to Python Threading

A thread is an entity within a process that can be scheduled for execution. It is
the smallest unit of processing that can be performed in an OS (Operating System). In
simple words, a thread is a sequence of instructions within a program that can be
executed independently of other code. For simplicity, you can assume that a thread is
simply a subset of a process. A thread keeps the following information in a Thread
Control Block (TCB):

● Thread Identifier: Unique id (TID) is assigned to every new thread


● Stack pointer: Points to the thread’s stack in the process. The stack contains the
local variables under the thread’s scope.
● Program counter: a register that stores the address of the instruction currently
being executed by a thread.
● Thread state: can be running, ready, waiting, starting, or done.
● Thread’s register set: registers assigned to thread for computations.
● Parent process Pointer: A pointer to the Process control block (PCB) of the
process that the thread lives on.
Multiple threads can exist within one process where:

● Each thread contains its own register set and local variables (stored in the
stack).
● All threads of a process share global variables, heap memory and
the program code.
Multithreading in Python
In Python , the threading module provides a very simple and intuitive API for
spawning multiple threads in a program. Let us try to understand multithreading
code step-by-step.
Step 1: Import Module
First, import the threading module.
import threading
Step 2: Create a Thread
To create a new thread, we create an object of the Thread class. It takes
‘target’ and ‘args’ as keyword parameters. The target is the function to be executed
by the thread, whereas args is a tuple of the arguments to be passed to the target
function.
t1 = threading.Thread(target=target, args=args)
t2 = threading.Thread(target=target, args=args)
Step 3: Start a Thread
To start a thread, we use the start() method of the Thread class.
t1.start()
t2.start()
Step 4: Wait for the Threads to Finish
Once the threads start, the current program (you can think of it as the main thread)
also keeps on executing. To pause the execution of the current program until
a thread is complete, we use the join() method.
t1.join()
t2.join()
As a result, the current program will first wait for the completion of t1 and then t2.
Once they are finished, the remaining statements of the current program are
executed.
Example Program

import threading
import time
# Function that will run in a separate thread
def print_numbers():
    for i in range(5):
        time.sleep(1) # Simulate a delay
        print(i)
# Function that will run in a separate thread
def print_letters():
    for letter in 'abcde':
        time.sleep(1) # Simulate a delay
        print(letter)
# Create two threads
thread1 = threading.Thread(target=print_numbers)
thread2 = threading.Thread(target=print_letters)
# Start the threads
thread1.start()
thread2.start()
# Wait for both threads to complete
thread1.join()
thread2.join()
print("Both threads have finished executing.")
Output (the number and the letter printed in the same second may appear in either order)
0
a
1
b
2
c
3
d
4
e
Both threads have finished executing.
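The target functions in the example program take no arguments. The sketch below shows the args parameter from Step 2 in use; square_into and the sample values are hypothetical names chosen for the illustration. Each thread writes to its own slot of a shared list, so no locking is needed here.

```python
import threading

def square_into(results, index, value):
    """Store value squared in the slot reserved for this thread."""
    results[index] = value * value

values = [3, 4, 5]
results = [None] * len(values)     # one slot per thread, so no two threads write the same item

threads = [
    threading.Thread(target=square_into, args=(results, i, v))
    for i, v in enumerate(values)
]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait until every thread is done before reading results

print(results)
```

Note that args must be a tuple, so a single argument is written as args=(value,).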
Binary files
Binary files store data as a sequence of bytes. Each byte can represent a wide range of
values, from simple text characters to more complex data structures like images,
videos and executable programs.
Different Modes for Binary Files in Python
When working with binary files in Python, there are specific modes we can use to
open them:
● ‘rb’: Read binary – Opens the file for reading in binary mode.
● ‘wb’: Write binary – Opens the file for writing in binary mode.
● ‘ab’: Append binary – Opens the file for appending in binary mode.
Opening and Closing Binary Files
To work with files in Python, we use the open() function to open a file and the close()
method to close it. Using the with statement ensures that the file is properly closed
after its suite finishes.
Opening Files
with open('example.bin', 'rb') as f:
    data = f.read()
Closing Files
If we do not use the with statement, we need to manually close the file using the
close() method to ensure that all resources are released.
f = open('example.bin', 'rb')
try:
    data = f.read()
finally:
    f.close()
Reading Binary Files in Python
Using the open() Function with Binary Mode
Reading binary files means reading data that is stored in a binary format, which is not
human-readable. Unlike text files, which store data as readable characters, binary files
store data as raw bytes.
The open() function is used to open files in Python. When dealing with binary files,
we need to specify the mode as ‘rb’ (read binary).
f = open('example.bin', 'rb')
# Perform operations (the name data avoids shadowing the built-in bin() function)
data = f.read()
print(data)
# Closing the file
f.close()
Reading Binary file line by line
The readlines() method reads all lines of a file. In binary mode, it returns a list of
bytes objects, each ending with a newline byte (b’\n’), except possibly the last one.
with open('example.bin', 'rb') as f:
    lines = f.readlines()
    for i in lines:
        print(i)
Reading Binary File in Chunks
Reading a binary file in chunks is useful when dealing with large files that cannot be
read into memory all at once. This uses the read(size) method, which reads up to size
bytes from the file. If the size is not specified, it reads until the end of the file.
size = 1024 # Define the chunk size
with open('example.bin', 'rb') as f:
    while True:
        chunk = f.read(size)
        if not chunk:
            break
        # Process the chunk (for demonstration, we'll print it)
        print(chunk)
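The examples above only read. The ‘wb’ and ‘ab’ modes from the list of modes can be sketched as follows. The struct module is used to turn integers into raw bytes; it is an assumption of this example and is not covered in these notes, and the file is created in a temporary directory so nothing real is overwritten.

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.bin")   # throwaway file

# 'wb' creates (or truncates) the file and writes raw bytes.
with open(path, "wb") as f:
    f.write(struct.pack("<3i", 10, 20, 30))   # three little-endian 32-bit integers

# 'ab' adds bytes at the end without touching what is already there.
with open(path, "ab") as f:
    f.write(struct.pack("<i", 40))

# 'rb' reads the bytes back; 4 bytes per integer, so 16 bytes in total.
with open(path, "rb") as f:
    raw = f.read()

numbers = struct.unpack("<4i", raw)
print(len(raw), numbers)
```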
Python Command Line Arguments
Command line arguments provide a convenient way to pass some information to a
program at the command line when it is run. We usually pass these values along with
the name of the Python script.
To run a Python program, we execute the following command in the terminal of the
operating system (the command prompt on Windows).
$ python script.py arg1 arg2 arg3
Without command line arguments, the script has to ask for its input interactively
while it runs:
name = input("Enter your name: ")
print ("Hello {}. How are you?".format(name))
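Inside the script, the values typed after the script name are available in the list sys.argv. The sketch below wraps the logic in a hypothetical helper, build_greeting, so that it can be tried with a simulated argument list; in a real script the list would be sys.argv itself.

```python
import sys

def build_greeting(argv):
    """Build a greeting from an argument list shaped like sys.argv."""
    script_name = argv[0]        # the first item is the script name
    names = argv[1:]             # everything after it was typed by the user
    if not names:
        return f"Usage: python {script_name} NAME [NAME ...]"
    return "Hello {}. How are you?".format(", ".join(names))

# Simulated command line: python script.py Alice Bob
print(build_greeting(["script.py", "Alice", "Bob"]))

# In a real script you would pass the actual arguments instead:
# print(build_greeting(sys.argv))
```

All items in sys.argv are strings, so numeric arguments have to be converted with int() or float() before use.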
Shell variables
A shell variable is a character string in a shell that stores some value. It could be an
integer, filename, string, or some shell command itself. Basically, it is a name that
refers to the actual data stored in memory. Variables are commonly grouped into three
kinds:
● Local Variables
● Global Variables or Environment Variables
● Shell Variables or System Variables
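Local Variable
A local variable is a variable whose scope is limited to a specific function or block
of code. Local variables can override the same variable name in the larger scope. In a
shell script, a variable is local only if it is declared with the local keyword inside
a function (supported by bash and most other shells); a plain assignment inside a
function creates a global variable.
#!/bin/bash
getName()
{
    local NAME=SATYAJIT           #local variable
    echo "$NAME (from function)"  #valid if called using function
}
echo "$NAME - (outside function)" #invalid here: NAME is not set outside the function
getName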
Output:
- (outside function)
SATYAJIT (from function)
Global Variables
A global variable is a variable with global scope. It is accessible throughout the
program. Global variables are declared outside any block of code or function.
NAME=SATYAJIT #global variable
getName(){
    echo "$NAME (from function)"
}
echo "$NAME - (outside function)"
getName

Output :
SATYAJIT - (outside function)
SATYAJIT (from function)
System variables
In Python, shell variables or system variables typically refer to the environment
variables that are set in the operating system or shell from which the Python script is
run. These variables store system-level settings and configuration information that
programs can access during execution.
import os
path = os.environ.get('PATH') # The PATH variable
print("PATH:", path)
home = os.environ.get('HOME') # In Linux/Unix, it might be HOME
print("HOME:", home)
non_existing_var = os.environ.get('NON_EXISTING_VAR')
print("NON_EXISTING_VAR:", non_existing_var)
Common Shell/Environment Variables:
Some common environment variables that are often used in Python programs include:
● PATH: Specifies the system’s executable search path.
● HOME: The current user’s home directory (Linux/Unix).
● USER: The name of the currently logged-in user.
● SHELL: The current shell (e.g., /bin/bash for Linux).
● TEMP or TMP: Directory used for temporary files.
● LANG: Defines the system's language and locale.
● PYTHONPATH: A list of directories to search for Python modules.
Parallel Processing in python
Parallel processing lets a program work on several tasks at the same time, which
reduces the overall processing time. This helps to handle large-scale problems.
In this section we will cover the following topics:
● Introduction to parallel processing
● Multi Processing Python library for parallel processing
● IPython parallel framework
Introduction to parallel processing
For parallelism, it is important to divide the problem into sub-units that do not depend
on other sub-units (or depend on them as little as possible). A problem where the
sub-units are totally independent of other sub-units is called embarrassingly parallel.
An example is an element-wise operation on an array. In this case, the operation needs
to be aware only of the particular element it is handling at the moment.
In another scenario, the sub-units of a problem have to share some data to perform
their operations. This results in a performance cost because of the communication
between them.
There are two main ways to handle parallel programs:
● Shared Memory
In shared memory, the sub-units can communicate with each other through the same
memory space. The advantage is that you don’t need to handle the communication
explicitly, because it is sufficient to read from or write to the shared memory.
A problem arises when multiple processes access and change the same memory
location at the same time. This conflict can be avoided using synchronization
techniques.
● Distributed memory
In distributed memory, each process is totally separated and has its own memory
space. In this scenario, communication is handled explicitly between the processes.
Since the communication happens through a network interface, it is costlier compared
to shared memory.
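The shared-memory conflict and its fix with a synchronization technique can be sketched in a few lines. As a simplifying assumption, the example uses threads of one process as the sub-units (they share memory by default) and a threading.Lock as the synchronization technique; the counts are arbitrary.

```python
import threading

state = {"counter": 0}           # lives in memory shared by all threads
lock = threading.Lock()

def add_many(times):
    for _ in range(times):
        with lock:               # only one thread at a time may update the counter
            state["counter"] += 1

workers = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(state["counter"])
```

The update is a read followed by a write, so without the lock two threads could read the same old value and one increment would be lost.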
Benefits of Parallel Processing in Python
Improved Performance and Efficiency
When tasks are executed sequentially, the program has to wait for one task to
complete before moving on to the next. This can waste valuable processing time,
especially if some tasks are independent and don't need to wait for others to
complete. Parallel processing lets such independent tasks run at the same time.
Handling Large Data Sets
Handling large data sets is a common requirement for many modern applications.
These could range from data analysis and machine learning applications to big data
processing and more. Sequential processing can be quite inefficient and time-consuming
when dealing with such large volumes of data, whereas splitting the data into
parts that are processed in parallel shortens the total time.
Cost-Effectiveness
Parallel processing in Python can also be cost-effective. Tasks are completed faster,
so computational resources are occupied for a shorter time and are freed up sooner
for other tasks. This improved efficiency can translate into cost savings in the long
run.
Parallel Processing Methods in Python
Python Multi-Threading
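Multi-threading is a form of concurrency that allows a program to make progress on
multiple tasks in overlapping time periods. In Python, the threading module provides a
way to create and manage threads. Each thread can run a specific function or method,
and threads are scheduled independently of each other. In the standard CPython
interpreter, the global interpreter lock (GIL) lets only one thread execute Python
bytecode at a time, so threads mainly speed up I/O-bound work rather than CPU-bound work.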
Multi Threading Example
import threading
def print_numbers():
    for i in range(10):
        print(f" {i}")
def print_letters():
    for letter in 'abcdefghij':
        print(f" {letter}")
t1 = threading.Thread(target=print_numbers)
t2 = threading.Thread(target=print_letters)
t1.start()
t2.start()
t1.join()
t2.join()
Python Multiprocessing
Multiprocessing is another form of parallelism that involves running multiple
processes simultaneously. In Python, the multiprocessing module provides a way to
create and manage processes. Unlike threads, each process runs in its own Python
interpreter, which means that they can run in true parallel on a multiprocessor system.
Example of multiprocessing in Python:
from multiprocessing import Process
def print_numbers():
    for i in range(10):
        print(f" {i}")
def print_letters():
    for letter in 'abcdefghij':
        print(f" {letter}")
# The guard keeps child processes from re-running this block when they import the file
if __name__ == "__main__":
    p1 = Process(target=print_numbers)
    p2 = Process(target=print_letters)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Python Asynchronous Programming
Asynchronous programming is a form of concurrent programming that involves
executing tasks in a non-blocking manner. In Python, the asyncio module provides a
way to write asynchronous code. With asyncio, you can write code that performs IO-
bound tasks without blocking the execution of the rest of the program.
import asyncio
async def print_numbers():
    for i in range(10):
        print(f" {i}")
        await asyncio.sleep(1)
async def print_letters():
    for letter in 'abcdefghij':
        print(f" {letter}")
        await asyncio.sleep(1)
async def main():
    task1 = asyncio.create_task(print_numbers())
    task2 = asyncio.create_task(print_letters())
    await task1
    await task2
asyncio.run(main())
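When the coroutines return values, asyncio.gather is a convenient way to run them concurrently and collect the results. In this sketch, fetch is a hypothetical stand-in for an I/O-bound task and the delays are arbitrary.

```python
import asyncio

async def fetch(name, delay):
    """Pretend to wait on I/O for `delay` seconds (hypothetical task)."""
    await asyncio.sleep(delay)       # yields control to the event loop while waiting
    return f"{name} done"

async def main():
    # gather() runs the coroutines concurrently and returns results in call order.
    return await asyncio.gather(
        fetch("slow", 0.2),
        fetch("fast", 0.1),
    )

results = asyncio.run(main())
print(results)
```

The results come back in the order the coroutines were passed to gather(), not in the order they finished.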
List of Python libraries for Parallel Processing
Ray
Parallelizing or distributing Python code often means rewriting the existing code,
and sometimes even writing it from scratch. The Ray library provides an
efficient way to run the same code on more than one machine and helps handle large
objects as well as numerical data.
Dask
When developers in a data engineering team handle large data sets, they find Dask to
be a one-stop solution for data sets that are too large to fit in memory.
Joblib
Joblib is one of the python libraries that provides an easy-to-use interface for
performing parallel processing in python. This library is best-suited when you have
loops and each iteration through the loop calls some function that can take time to
complete.

Pandarallel
Pandarallel is an open-source library that is used to parallelize Pandas operations on
all the available CPUs. It lets you parallelize the functions such as apply() ,
applymap() , map() , groupby() , and rolling() on Pandas DataFrame & Series objects.
Dispy
Dispy is ideal for the data-parallel (SIMD) paradigm, where SIMD is an acronym for
‘Single Instruction/Multiple Data’: a computing method that enables the processing
of multiple data items with just a single instruction.
Ipyparallel
The main advantage of developing parallel applications using ipyparallel is that it can
be used interactively within the Jupyter platform. It also supports various types of
parallel processing approaches, such as single program, multiple data (SPMD)
parallelism and multiple program, multiple data (MPMD) parallelism.
pySpark
Spark is great for scaling up data science tasks and workloads! As long as you’re
using Spark data frames and libraries that operate on these data structures, you can
scale to massive data sets that distribute across a cluster.
Regular Expression or RegEx
A Regular Expression, or RegEx, is a special sequence of characters that forms a search
pattern used to find a string or set of strings.
It can detect the presence or absence of a text by matching it against a particular
pattern, and it can also split a pattern into one or more sub-patterns.
RegEx can be used to check if a string contains the specified search pattern.
RegEx Module
Python has a built-in package called re, which can be used to work with Regular
Expressions.
Import the re module:
import re
RegEx Functions
Function
re.findall() : Finds all matching occurrences and returns them in a list
re.compile() : Compiles a regular expression into a pattern object
re.split() : Splits a string by the occurrences of a character or a pattern
re.sub() : Replaces all occurrences of a character or pattern with a replacement string
re.escape() : Escapes the special characters in a string
re.search() : Searches for the first occurrence of a character or pattern
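Each function in the list can be tried on a short string. The sample text and the patterns below are made up for the illustration.

```python
import re

text = "Order 66 shipped on 2024-05-17; order 7 is pending."

numbers = re.findall(r"\d+", text)                 # every run of digits, as strings
parts = re.split(r"[;,]\s*", "a, b;c")             # split on ';' or ',' plus optional spaces
masked = re.sub(r"\d", "#", "PIN 1234")            # replace each digit with '#'

match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)   # first date-like substring
year = match.group(1) if match else None

pattern = re.compile(r"order", re.IGNORECASE)      # reusable, case-insensitive pattern
order_count = len(pattern.findall(text))

# re.escape() makes '+' match literally instead of meaning "one or more".
literal = re.search(re.escape("1+1=2"), "so 1+1=2 holds")

print(numbers, parts, masked, year, order_count, literal is not None)
```

re.search() returns None when nothing matches, so the result should be checked before calling group() on it.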
