Data Analytics Lab Manual

This manual outlines a series of experiments focused on setting up a Python environment for data science using libraries such as NumPy and Pandas, and on performing various data manipulation and analysis tasks. It includes detailed steps for installation, verification, and execution of basic programs, along with further experiments involving Hadoop, MongoDB, data visualization, and MapReduce techniques. Each experiment aims to provide practical experience with data science tools and techniques, culminating in expected outputs for successful execution.

Experiment 1

1. Introduction

In the rapidly evolving landscape of data science and analytics, Python has
emerged as a quintessential programming language, favored for its
readability, extensive libraries, and robust community support. Central to
Python’s dominance in this field are powerful libraries like NumPy and
Pandas, which provide highly optimized tools for numerical operations and
data manipulation, respectively.

2. Aim

The aim of Experiment 1 is:

- To successfully install, configure, and verify a Python programming environment suitable for data science.
- To explore and demonstrate basic data manipulation capabilities using the NumPy and Pandas libraries.

3. Background and Theoretical Context

3.1 Python for Data Science

As noted in the introduction, Python's readability, extensive library ecosystem, and strong community support have made it a de-facto standard for data science work. Within that ecosystem, NumPy supplies fast n-dimensional arrays and vectorized numerical operations, while Pandas builds labelled Series and DataFrame structures on top of NumPy for tabular data analysis.

4. Tools Required

- Python: Version 3.8 or higher (preferably installed via the Anaconda distribution for ease of package management).
- NumPy: Python library for numerical computing.
- Pandas: Python library for data analysis and manipulation.
- Integrated Development Environment (IDE) / Code Editor: Jupyter Notebook/Lab, VS Code, or PyCharm are recommended for interactive development.

5. Methodology and Tasks Performed

The experiment was conducted on a [Your Operating System, e.g., Windows 10 / macOS Ventura / Ubuntu 22.04 LTS] system. The following steps outline the installation process and the subsequent programming tasks.

5.1 Installation of Python, NumPy, and Pandas

Steps:

Download Anaconda:

Navigate to the official Anaconda website: https://www.anaconda.com/products/distribution

Download the appropriate graphical installer for your operating system (e.g., 64-bit Graphical Installer for Windows).

Install Anaconda:

Run the downloaded installer and follow the on-screen prompts. It is generally recommended to accept the default options, including adding Anaconda to your system PATH (the installer may warn against this for advanced users, but for a beginner it is often convenient).

Once installed, open your system's command prompt (Windows: cmd, macOS/Linux: Terminal).

Verify Python Installation:

Type the following command and press Enter:

python --version

Expected Output: Python 3.x.x (e.g., Python 3.9.12). This confirms Python is
installed and accessible.

Verify NumPy and Pandas Installation:

Since Anaconda typically pre-installs these libraries, we can verify their presence and version. Open a Python interpreter session by typing python in your terminal, or preferably, open a Jupyter Notebook/Lab.

Execute the following commands:

import numpy as np
print(f"NumPy Version: {np.__version__}")

import pandas as pd
print(f"Pandas Version: {pd.__version__}")

If these commands execute without a ModuleNotFoundError, it confirms that NumPy and Pandas are successfully installed.

(Optional: If for some reason they were not installed, or you are not using Anaconda, you can install them via pip or conda):

# Using pip
pip install numpy pandas

# Using conda (if using Anaconda/Miniconda)
conda install numpy pandas

5.2 Writing Basic Programs Using NumPy Arrays and Pandas DataFrames

Once the environment is set up, basic programs were written and executed
to demonstrate the core functionalities of NumPy and Pandas. These
programs were run in a Jupyter Notebook environment for interactive
development and clear output.

Task 1: Basic NumPy Array Operations

Objective: Create a NumPy array, perform element-wise operations, and demonstrate basic array attributes.

Task 2: Basic Pandas DataFrame Operations

Objective: Create a Pandas DataFrame, access specific columns/rows, and perform basic data selection.

6. Results and Observations

6.1 Installation Verification

The Python environment was successfully set up using the Anaconda distribution. The verification commands yielded the following outputs:

# Command to check Python version

python --version

# Output:

# Python 3.9.12

# Commands to check NumPy and Pandas versions in a Python interpreter/Jupyter Notebook

import numpy as np
print(f"NumPy Version: {np.__version__}")

import pandas as pd
print(f"Pandas Version: {pd.__version__}")

# Output:

# NumPy Version: 1.21.5

# Pandas Version: 1.4.2

(Note: Versions may vary based on the Anaconda distribution used at the
time of installation.)

The successful output confirms that Python, NumPy, and Pandas are correctly
installed and configured within the environment.
6.2 Execution of Basic Programs

Program 1: NumPy Array Operations

import numpy as np

print("--- NumPy Array Operations ---")

# 1. Create a NumPy array
data_list = [10, 20, 30, 40, 50]
numpy_array = np.array(data_list)
print("\n1. Original NumPy Array:")
print(numpy_array)
print(f"   Type: {type(numpy_array)}")
print(f"   Shape: {numpy_array.shape}")
print(f"   Data Type (dtype): {numpy_array.dtype}")

# 2. Perform element-wise operation (e.g., add 5 to each element)
modified_array = numpy_array + 5
print("\n2. Array after adding 5 to each element:")
print(modified_array)

# 3. Perform another operation (e.g., multiply by 2)
multiplied_array = numpy_array * 2
print("\n3. Array after multiplying each element by 2:")
print(multiplied_array)

# 4. Calculate sum of array elements
array_sum = np.sum(numpy_array)
print(f"\n4. Sum of array elements: {array_sum}")

# 5. Create a 2D array and perform operations
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print("\n5. 2D NumPy Array (Matrix):")
print(matrix)
print(f"   Shape of matrix: {matrix.shape}")
print(f"   Sum of all elements in matrix: {np.sum(matrix)}")

Output of Program 1:

--- NumPy Array Operations ---

1. Original NumPy Array:

[10 20 30 40 50]

Type: <class 'numpy.ndarray'>

Shape: (5,)

Data Type (dtype): int64

2. Array after adding 5 to each element:

[15 25 35 45 55]

3. Array after multiplying each element by 2:

[ 20 40 60 80 100]

4. Sum of array elements: 150


5. 2D NumPy Array (Matrix):

[[1 2 3]

[4 5 6]]

Shape of matrix: (2, 3)

Sum of all elements in matrix: 21

Program 2: Pandas DataFrame Operations

import pandas as pd

print("\n--- Pandas DataFrame Operations ---")

# 1. Create a Pandas DataFrame from a dictionary
data = {
    'Student_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Score': [85, 92, 78, 95, 88],
    'Course': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics']
}

df = pd.DataFrame(data)
print("\n1. Original DataFrame:")
print(df)
print(f"\n   Type: {type(df)}")
print(f"   Shape: {df.shape}")
print(f"   Columns: {list(df.columns)}")

# 2. Access a specific column
names_column = df['Name']
print("\n2. 'Name' Column:")
print(names_column)
print(f"   Type of 'Name' column: {type(names_column)}")

# 3. Access multiple columns
selected_columns = df[['Name', 'Score']]
print("\n3. 'Name' and 'Score' Columns:")
print(selected_columns)

# 4. Select rows based on a condition (e.g., Score > 90)
high_scorers = df[df['Score'] > 90]
print("\n4. Students with Score > 90:")
print(high_scorers)

# 5. Get basic descriptive statistics for numerical columns
print("\n5. Descriptive Statistics for numerical columns:")
print(df.describe())

Output of Program 2:

--- Pandas DataFrame Operations ---

1. Original DataFrame:

Student_ID Name Score Course

0 101 Alice 85 Math


1 102 Bob 92 Physics

2 103 Charlie 78 Chemistry

3 104 David 95 Math

4 105 Eve 88 Physics

Type: <class 'pandas.core.frame.DataFrame'>

Shape: (5, 4)

Columns: ['Student_ID', 'Name', 'Score', 'Course']

2. 'Name' Column:
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

Type of 'Name' column: <class 'pandas.core.series.Series'>

3. 'Name' and 'Score' Columns:

Name Score

0 Alice 85

1 Bob 92

2 Charlie 78

3 David 95

4 Eve 88

4. Students with Score > 90:

Student_ID Name Score Course

1 102 Bob 92 Physics


3 104 David 95 Math

5. Descriptive Statistics for numerical columns:

       Student_ID      Score
count    5.000000   5.000000
mean   103.000000  87.600000
std      1.581139   6.580274
min    101.000000  78.000000
25%    102.000000  85.000000
50%    103.000000  88.000000
75%    104.000000  92.000000
max    105.000000  95.000000

Experiment 1: Install, Configure, and Run Python, NumPy, and Pandas

Aim:

To install and set up the Python environment and explore basic data
manipulation using NumPy and Pandas.

Steps:
1. Install Python from python.org.

2. Open terminal/command prompt and install libraries:

pip install numpy pandas

3. Create two programs:

NumPy Program (numpy_demo.py):

import numpy as np

# create a numpy array

arr = np.array([1, 2, 3, 4, 5])

print("Array:", arr)

# perform operations

print("Mean:", np.mean(arr))

print("Squared:", arr ** 2)

Pandas Program (pandas_demo.py):


import pandas as pd

# create a DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [20, 21, 19],

'Score': [85, 90, 88]}

df = pd.DataFrame(data)

print("DataFrame:\n", df)

print("Mean Score:", df['Score'].mean())

Expected Output:

Successful execution of array operations and DataFrame manipulations.

---

Experiment 2: Install, Configure, and Run Hadoop and HDFS

Aim:

To set up Hadoop and interact with the Hadoop Distributed File System
(HDFS).

Steps:

1. Install Hadoop, set environment variables (JAVA_HOME, HADOOP_HOME).


2. Configure core-site.xml and hdfs-site.xml.

3. Start NameNode and DataNode.

4. Run commands:

hdfs dfs -mkdir /mydata

hdfs dfs -put localfile.txt /user/hadoop/

hdfs dfs -ls /user/hadoop/

Expected Output:

HDFS directories and files should be created, listed, and accessible.
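
The same HDFS commands can also be driven from Python when the verification needs to be scripted rather than typed interactively. A minimal sketch, assuming the hdfs binary is on the PATH and the NameNode/DataNode from the steps above are running (the file and directory names are illustrative):

import subprocess

local_file = "localfile.txt"   # illustrative local file
hdfs_dir = "/user/hadoop"      # illustrative HDFS directory

def run(cmd):
    # Run one HDFS shell command and fail loudly if it returns a non-zero exit code
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed: {' '.join(cmd)}\n{result.stderr}")
    return result.stdout

run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])            # create directory
run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir])  # upload file
print(run(["hdfs", "dfs", "-ls", hdfs_dir]))              # list contents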

---

Experiment 3: Visualize Data Using Basic Plotting Techniques

Aim:

To create visualizations using Matplotlib and Seaborn.

Steps & Code:


import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# sample dataset

data = {'Name': ['Alice', 'Bob', 'Charlie'],

'Math': [85, 78, 92],

'Science': [90, 88, 95],

'English': [80, 75, 85]}

df = pd.DataFrame(data)

# Line Plot

df.set_index('Name')[['Math','Science','English']].plot(kind='line')

plt.title("Line Plot - Scores")

plt.show()

# Bar Chart

sns.barplot(x='Name', y='Math', data=df)

plt.title("Bar Chart - Math Scores")

plt.show()

# Pie Chart

df[['Math','Science','English']].sum().plot.pie(autopct='%1.1f%%')

plt.title("Pie Chart - Total Marks")

plt.show()
# Histogram

plt.hist(df['Science'], bins=5, edgecolor='black')

plt.title("Histogram - Science Marks")

plt.show()

Expected Output:

Line, bar, pie, and histogram plots.
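
If the plots need to be captured for the lab record rather than only displayed on screen, matplotlib's savefig can write each figure to an image file before plt.show() is called. A minimal self-contained sketch (the output file name is illustrative):

import matplotlib.pyplot as plt

# Build a small plot, save it to disk, then display it
plt.bar(["Alice", "Bob", "Charlie"], [85, 78, 92])
plt.title("Bar Chart - Math Scores")
plt.savefig("bar_chart_math.png", dpi=150, bbox_inches="tight")  # file name is illustrative
plt.show()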

---

Experiment 4: CRUD Operations in MongoDB

Aim:

To perform CRUD operations and manage arrays in MongoDB using Python.

Code (mongodb_crud.py):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["studentDB"]

collection = db["students"]

# Insert

collection.insert_many([
    {"name": "Alice", "math": 85, "skills": ["Python", "SQL"]},
    {"name": "Bob", "math": 90, "skills": ["Java"]},
    {"name": "Charlie", "math": 78, "skills": ["C++"]}
])

# Read

print("All Students:")

for doc in collection.find():
    print(doc)

# Update

collection.update_one({"name": "Alice"}, {"$set": {"math": 95}})

print("\nUpdated Alice:", collection.find_one({"name": "Alice"}))

# Delete

collection.delete_one({"name": "Charlie"})

print("\nAfter Deletion:")

for doc in collection.find():
    print(doc)

# Query Array

print("\nStudents with Python:")

for doc in collection.find({"skills": "Python"}):

print(doc)

Expected Output:

Insertion, retrieval, update, deletion, and array query results.
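
The aim also mentions managing arrays, while the program above only queries the skills array. A short sketch of array updates on the same studentDB collection using MongoDB's $push and $pull update operators (the skill values added and removed are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["studentDB"]["students"]

# Add a skill to Bob's skills array
collection.update_one({"name": "Bob"}, {"$push": {"skills": "MongoDB"}})

# Remove a skill from Alice's skills array
collection.update_one({"name": "Alice"}, {"$pull": {"skills": "SQL"}})

print("Bob after $push:", collection.find_one({"name": "Bob"}))
print("Alice after $pull:", collection.find_one({"name": "Alice"}))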


---

Experiment 5: Advanced MongoDB Operations

Aim:

To use Count, Sort, Limit, Skip, and Aggregate functions in MongoDB.

Code:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

db = client["companyDB"]

collection = db["employees"]

collection.insert_many([
    {"name": "A", "salary": 50000, "dept": "HR"},
    {"name": "B", "salary": 60000, "dept": "IT"},
    {"name": "C", "salary": 70000, "dept": "IT"},
    {"name": "D", "salary": 80000, "dept": "Finance"},
    {"name": "E", "salary": 75000, "dept": "Finance"}
])

print("Count:", collection.count_documents({}))

print("\nSorted by Salary:")
for doc in collection.find().sort("salary", -1):

print(doc)

print("\nTop 3 Salaries:")

for doc in collection.find().sort("salary", -1).limit(3):

print(doc)

print("\nSkip 2:")

for doc in collection.find().sort("salary", -1).skip(2):

print(doc)

print("\nAverage Salary by Dept:")

for doc in collection.aggregate([
    {"$group": {"_id": "$dept", "avgSalary": {"$avg": "$salary"}}}
]):
    print(doc)
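
Expected Output (a sketch derived from the five inserted documents; _id values are omitted and the order of the aggregate results may vary):

Count: 5
Sorted by Salary: D (80000), E (75000), C (70000), B (60000), A (50000)
Top 3 Salaries: D (80000), E (75000), C (70000)
Skip 2: C (70000), B (60000), A (50000)
Average Salary by Dept: HR = 50000.0, IT = 65000.0, Finance = 77500.0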

---

Experiment 6: Word Count Using MapReduce

Aim:

To implement word frequency using MapReduce in Python.

Code:
def mapper(line):
    return [(word, 1) for word in line.strip().split()]

def shuffle_and_sort(mapped):
    grouped = {}
    for word, count in mapped:
        grouped.setdefault(word, []).append(count)
    return grouped

def reducer(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

def main():
    with open("input.txt", "r") as f:
        lines = f.readlines()

    mapped = []
    for line in lines:
        mapped.extend(mapper(line))

    grouped = shuffle_and_sort(mapped)
    reduced = reducer(grouped)
    print("Word Count:", reduced)

if __name__ == "__main__":
    main()
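
Expected Output: a sketch using an illustrative two-line input.txt (word counts follow the first appearance of each word):

Sample input.txt:

big data makes big demands
data drives decisions

Word Count: {'big': 2, 'data': 2, 'makes': 1, 'demands': 1, 'drives': 1, 'decisions': 1}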
---

Experiment 7: MapReduce on Dataset (Average Salary)

Aim:

To compute average salary per department using MapReduce.

Code:

import csv

def mapper(row):
    dept, salary = row[1], float(row[2])
    return (dept, (salary, 1))

def shuffle_and_sort(mapped):
    grouped = {}
    for dept, (salary, count) in mapped:
        if dept not in grouped:
            grouped[dept] = []
        grouped[dept].append((salary, count))
    return grouped

def reducer(grouped):
    results = {}
    for dept, values in grouped.items():
        total_salary = sum(s for s, _ in values)
        total_count = sum(c for _, c in values)
        results[dept] = round(total_salary / total_count, 2)
    return results

def main():
    with open("employees.csv") as f:
        reader = csv.reader(f)
        next(reader)
        mapped = [mapper(row) for row in reader]

    grouped = shuffle_and_sort(mapped)
    reduced = reducer(grouped)
    print("Average Salary by Department:", reduced)

if __name__ == "__main__":
    main()
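
The mapper assumes a CSV layout whose second column is the department and whose third column is the salary, with a header row that main() skips (the first column is presumably a name or ID). A sample employees.csv consistent with that assumption, and the output it would produce:

name,dept,salary
Alice,IT,60000
Bob,IT,70000
Charlie,HR,50000
David,Finance,80000

Expected Output (sketch):

Average Salary by Department: {'IT': 65000.0, 'HR': 50000.0, 'Finance': 80000.0}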

---

Experiment 8: Clustering with Spark MLlib

Aim:

To perform clustering using Spark MLlib’s K-Means.


Code (PySpark):

from pyspark.ml.clustering import KMeans

from pyspark.ml.feature import VectorAssembler

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clustering").getOrCreate()

# load dataset

data = [(1, 2.0, 3.0), (2, 10.0, 15.0), (3, 25.0, 30.0)]

df = spark.createDataFrame(data, ["id", "x", "y"])

# assemble features

vec = VectorAssembler(inputCols=["x", "y"], outputCol="features")

df = vec.transform(df)

# k-means

kmeans = KMeans(k=2, seed=1)

model = kmeans.fit(df)

print("Cluster Centers:")

for center in model.clusterCenters():
    print(center)
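
To see which cluster each point was assigned to (not shown in the program above), the fitted model can transform the same DataFrame. The lines below are a sketch meant to be appended to the end of the PySpark program above; "prediction" is MLlib's default output column name:

# Assign each point to a cluster and display the assignments
predictions = model.transform(df)
predictions.select("id", "x", "y", "prediction").show()

spark.stop()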

---

Experiment 9: MongoDB + Hadoop Integration (Mini-Project)

Aim:
To build a mini-project that stores student data in MongoDB, exports it to
HDFS (Hadoop Distributed File System), and then processes it (e.g., count
total records).

Python Program (experiment9.py):

from pymongo import MongoClient
import pandas as pd
import subprocess
import os

# 1. Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["bigdataDB"]
collection = db["students"]

# Insert sample student data
students = [
    {"name": "Alice", "dept": "CS", "score": 85},
    {"name": "Bob", "dept": "IT", "score": 90},
    {"name": "Charlie", "dept": "CS", "score": 78},
    {"name": "David", "dept": "IT", "score": 88},
    {"name": "Eve", "dept": "Math", "score": 92}
]

collection.insert_many(students)
print("✅ Data inserted into MongoDB.")

# 2. Export data from MongoDB to CSV
data = list(collection.find({}, {"_id": 0}))  # Exclude _id field
df = pd.DataFrame(data)
csv_file = "students.csv"
df.to_csv(csv_file, index=False)
print("✅ Exported data to students.csv")

# 3. Put CSV file into HDFS (requires Hadoop running)
hdfs_dir = "/bigdata_exp9"

try:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", csv_file, hdfs_dir], check=True)
    print(f"✅ File uploaded to HDFS at {hdfs_dir}/{csv_file}")
except Exception as e:
    print("⚠️ HDFS commands failed. Make sure Hadoop is running.")
    print(e)

# 4. Simple Processing Simulation (count rows)
print("\nProcessing Data:")
print(f"Total number of student records: {len(df)}")
print("Average Score by Department:")
print(df.groupby("dept")["score"].mean())

Steps to Run:

1. Start MongoDB (mongod) and Hadoop (start-dfs.sh).

2. Save the program as experiment9.py.

3. Run:

python experiment9.py

Expected Output:

✅ Data inserted into MongoDB.

✅ Exported data to students.csv

✅ File uploaded to HDFS at /bigdata_exp9/students.csv

Processing Data:

Total number of student records: 5


Average Score by Department:
dept
CS      81.5
IT      89.0
Math    92.0
Name: score, dtype: float64

👉 This program demonstrates:

- Storage in MongoDB
- Export to Hadoop HDFS
- Processing (average scores per department)
