Experiment 1
1. Introduction
In the rapidly evolving landscape of data science and analytics, Python has
emerged as a quintessential programming language, favored for its
readability, extensive libraries, and robust community support. Central to
Python’s dominance in this field are powerful libraries like NumPy and
Pandas, which provide highly optimized tools for numerical operations and
data manipulation, respectively.
2. Aim
The aims of Experiment 1 are:
1. To successfully install, configure, and verify a Python programming environment suitable for data science.
2. To explore and demonstrate basic data manipulation capabilities using the NumPy and Pandas libraries.
3. Background and Theoretical Context
3.1 Python for Data Science
4. Tools Required
Python: Version 3.8 or higher (preferably installed via the Anaconda distribution
for ease of package management).
NumPy: Python library for numerical computing.
Pandas: Python library for data analysis and manipulation.
Integrated Development Environment (IDE) / Code Editor: Jupyter
Notebook/Lab, VS Code, or PyCharm are recommended for interactive
development.
5. Methodology and Tasks Performed
The experiment was conducted on a [Your Operating System, e.g., Windows
10 / macOS Ventura / Ubuntu 22.04 LTS] system. The following steps outline
the installation process and the subsequent programming tasks.
5.1 Installation of Python, NumPy, and Pandas
Steps:
Download Anaconda:
Navigate to the official Anaconda website:
https://www.anaconda.com/products/distribution
Download the appropriate graphical installer for your operating system (e.g.,
64-bit Graphical Installer for Windows).
Install Anaconda:
Run the downloaded installer.
Follow the on-screen prompts. It is generally recommended to accept the
default options, including adding Anaconda to your system PATH (the installer
warns advanced users against this, but for a beginner it is often convenient).
Once installed, open your system’s command prompt (Windows: cmd,
macOS/Linux: Terminal).
Verify Python Installation:
Type the following command and press Enter:
python --version
Expected Output: Python 3.x.x (e.g., Python 3.9.12). This confirms Python is
installed and accessible.
Verify NumPy and Pandas Installation:
Since Anaconda typically pre-installs these libraries, we can verify their
presence and version.
Open a Python interpreter session by typing python in your terminal, or
preferably, open a Jupyter Notebook/Lab.
Execute the following commands:
import numpy as np
print(f"NumPy Version: {np.__version__}")
import pandas as pd
print(f"Pandas Version: {pd.__version__}")
If these commands execute without ModuleNotFoundError, it confirms that
NumPy and Pandas are successfully installed.
(Optional: if they are not already installed, or you are not using Anaconda,
install them via pip or conda):
# Using pip
pip install numpy pandas
# Using conda (if using Anaconda/Miniconda)
conda install numpy pandas
5.2 Writing Basic Programs Using NumPy Arrays and Pandas DataFrames
Once the environment is set up, basic programs were written and executed
to demonstrate the core functionalities of NumPy and Pandas. These
programs were run in a Jupyter Notebook environment for interactive
development and clear output.
Task 1: Basic NumPy Array Operations
Objective: Create a NumPy array, perform element-wise operations, and
demonstrate basic array attributes.
Task 2: Basic Pandas DataFrame Operations
Objective: Create a Pandas DataFrame, access specific columns/rows, and
perform basic data selection.
6. Results and Observations
6.1 Installation Verification
The Python environment was successfully set up using the Anaconda
distribution. The verification commands yielded the following outputs:
# Command to check Python version
python --version
# Output:
# Python 3.9.12
# Commands to check NumPy and Pandas versions in a Python
interpreter/Jupyter Notebook
import numpy as np
print(f"NumPy Version: {np.__version__}")
import pandas as pd
print(f"Pandas Version: {pd.__version__}")
# Output:
# NumPy Version: 1.21.5
# Pandas Version: 1.4.2
(Note: Versions may vary based on the Anaconda distribution used at the
time of installation.)
The successful output confirms that Python, NumPy, and Pandas are correctly
installed and configured within the environment.
6.2 Execution of Basic Programs
Program 1: NumPy Array Operations
import numpy as np

print("--- NumPy Array Operations ---")

# 1. Create a NumPy array
data_list = [10, 20, 30, 40, 50]
numpy_array = np.array(data_list)
print("\n1. Original NumPy Array:")
print(numpy_array)
print(f"   Type: {type(numpy_array)}")
print(f"   Shape: {numpy_array.shape}")
print(f"   Data Type (dtype): {numpy_array.dtype}")

# 2. Perform an element-wise operation (add 5 to each element)
modified_array = numpy_array + 5
print("\n2. Array after adding 5 to each element:")
print(modified_array)

# 3. Perform another operation (multiply by 2)
multiplied_array = numpy_array * 2
print("\n3. Array after multiplying each element by 2:")
print(multiplied_array)

# 4. Calculate the sum of array elements
array_sum = np.sum(numpy_array)
print(f"\n4. Sum of array elements: {array_sum}")

# 5. Create a 2D array and perform operations
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print("\n5. 2D NumPy Array (Matrix):")
print(matrix)
print(f"   Shape of matrix: {matrix.shape}")
print(f"   Sum of all elements in matrix: {np.sum(matrix)}")
Output of Program 1:
--- NumPy Array Operations ---
1. Original NumPy Array:
[10 20 30 40 50]
Type: <class 'numpy.ndarray'>
Shape: (5,)
Data Type (dtype): int64
2. Array after adding 5 to each element:
[15 25 35 45 55]
3. Array after multiplying each element by 2:
[ 20 40 60 80 100]
4. Sum of array elements: 150
5. 2D NumPy Array (Matrix):
[[1 2 3]
[4 5 6]]
Shape of matrix: (2, 3)
Sum of all elements in matrix: 21
Program 2: Pandas DataFrame Operations
import pandas as pd

print("\n--- Pandas DataFrame Operations ---")

# 1. Create a Pandas DataFrame from a dictionary
data = {
    'Student_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Score': [85, 92, 78, 95, 88],
    'Course': ['Math', 'Physics', 'Chemistry', 'Math', 'Physics']
}
df = pd.DataFrame(data)
print("\n1. Original DataFrame:")
print(df)
print(f"\n   Type: {type(df)}")
print(f"   Shape: {df.shape}")
print(f"   Columns: {list(df.columns)}")

# 2. Access a specific column
names_column = df['Name']
print("\n2. 'Name' Column:")
print(names_column)
print(f"   Type of 'Name' column: {type(names_column)}")

# 3. Access multiple columns
selected_columns = df[['Name', 'Score']]
print("\n3. 'Name' and 'Score' Columns:")
print(selected_columns)

# 4. Select rows based on a condition (Score > 90)
high_scorers = df[df['Score'] > 90]
print("\n4. Students with Score > 90:")
print(high_scorers)

# 5. Get basic descriptive statistics for numerical columns
print("\n5. Descriptive Statistics for numerical columns:")
print(df.describe())
Output of Program 2:
--- Pandas DataFrame Operations ---
1. Original DataFrame:
Student_ID Name Score Course
0 101 Alice 85 Math
1 102 Bob 92 Physics
2 103 Charlie 78 Chemistry
3 104 David 95 Math
4 105 Eve 88 Physics
   Type: <class 'pandas.core.frame.DataFrame'>
   Shape: (5, 4)
   Columns: ['Student_ID', 'Name', 'Score', 'Course']
2. 'Name' Column:
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object
   Type of 'Name' column: <class 'pandas.core.series.Series'>
3. 'Name' and 'Score' Columns:
Name Score
0 Alice 85
1 Bob 92
2 Charlie 78
3 David 95
4 Eve 88
4. Students with Score > 90:
Student_ID Name Score Course
1 102 Bob 92 Physics
3 104 David 95 Math
5. Descriptive Statistics for numerical columns:
       Student_ID      Score
count    5.000000   5.000000
mean   103.000000  87.600000
std      1.581139   6.580274
min    101.000000  78.000000
25%    102.000000  85.000000
50%    103.000000  88.000000
75%    104.000000  92.000000
max    105.000000  95.000000
Experiment 1: Install, Configure, and Run Python, NumPy, and Pandas
Aim:
To install and set up the Python environment and explore basic data
manipulation using NumPy and Pandas.
Steps:
1. Install Python from python.org.
2. Open terminal/command prompt and install libraries:
pip install numpy pandas
3. Create two programs:
NumPy Program (numpy_demo.py):
import numpy as np
# create a numpy array
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)
# perform operations
print("Mean:", np.mean(arr))
print("Squared:", arr ** 2)
Pandas Program (pandas_demo.py):
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [20, 21, 19],
'Score': [85, 90, 88]}
df = pd.DataFrame(data)
print("DataFrame:\n", df)
print("Mean Score:", df['Score'].mean())
Expected Output:
Successful execution of array operations and DataFrame manipulations.
---
Experiment 2: Install, Configure, and Run Hadoop and HDFS
Aim:
To set up Hadoop and interact with the Hadoop Distributed File System
(HDFS).
Steps:
1. Install Hadoop, set environment variables (JAVA_HOME, HADOOP_HOME).
2. Configure core-site.xml and hdfs-site.xml (minimal single-node examples are sketched after this experiment's expected output).
3. Start NameNode and DataNode.
4. Run commands:
hdfs dfs -mkdir /mydata
hdfs dfs -put localfile.txt /user/hadoop/
hdfs dfs -ls /user/hadoop/
Expected Output:
HDFS directories and files should be created, listed, and accessible.
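For reference, minimal single-node configurations for step 2 might look like the sketches below; the localhost address, port 9000, and a replication factor of 1 are common single-node defaults, not values required by this experiment:

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>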
---
Experiment 3: Visualize Data Using Basic Plotting Techniques
Aim:
To create visualizations using Matplotlib and Seaborn.
Steps & Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# sample dataset
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 78, 92],
'Science': [90, 88, 95],
'English': [80, 75, 85]}
df = pd.DataFrame(data)
# Line Plot
df.set_index('Name')[['Math','Science','English']].plot(kind='line')
plt.title("Line Plot - Scores")
plt.show()
# Bar Chart
sns.barplot(x='Name', y='Math', data=df)
plt.title("Bar Chart - Math Scores")
plt.show()
# Pie Chart
df[['Math','Science','English']].sum().plot.pie(autopct='%1.1f%%')
plt.title("Pie Chart - Total Marks")
plt.show()
# Histogram
plt.hist(df['Science'], bins=5, edgecolor='black')
plt.title("Histogram - Science Marks")
plt.show()
Expected Output:
Line, bar, pie, and histogram plots.
---
Experiment 4: CRUD Operations in MongoDB
Aim:
To perform CRUD operations and manage arrays in MongoDB using Python.
Code (mongodb_crud.py):
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["studentDB"]
collection = db["students"]
# Insert
collection.insert_many([
{"name": "Alice", "math": 85, "skills": ["Python", "SQL"]},
{"name": "Bob", "math": 90, "skills": ["Java"]},
{"name": "Charlie", "math": 78, "skills": ["C++"]}
])
# Read
print("All Students:")
for doc in collection.find():
    print(doc)
# Update
collection.update_one({"name": "Alice"}, {"$set": {"math": 95}})
print("\nUpdated Alice:", collection.find_one({"name": "Alice"}))
# Delete
collection.delete_one({"name": "Charlie"})
print("\nAfter Deletion:")
for doc in collection.find():
    print(doc)
# Query Array
print("\nStudents with Python:")
for doc in collection.find({"skills": "Python"}):
    print(doc)
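The program above only queries an array field; since the aim also covers managing arrays, here is a brief sketch of in-place array updates using MongoDB's $push and $pull update operators (the skill values are illustrative):

# Add a skill to Bob's skills array
collection.update_one({"name": "Bob"}, {"$push": {"skills": "MongoDB"}})
# Remove a skill from Alice's skills array
collection.update_one({"name": "Alice"}, {"$pull": {"skills": "SQL"}})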
Expected Output:
Insertion, retrieval, update, deletion, and array query results.
---
Experiment 5: Advanced MongoDB Operations
Aim:
To use Count, Sort, Limit, Skip, and Aggregate functions in MongoDB.
Code:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["companyDB"]
collection = db["employees"]
collection.insert_many([
{"name": "A", "salary": 50000, "dept": "HR"},
{"name": "B", "salary": 60000, "dept": "IT"},
{"name": "C", "salary": 70000, "dept": "IT"},
{"name": "D", "salary": 80000, "dept": "Finance"},
{"name": "E", "salary": 75000, "dept": "Finance"}
])
print("Count:", collection.count_documents({}))
print("\nSorted by Salary:")
for doc in collection.find().sort("salary", -1):
    print(doc)
print("\nTop 3 Salaries:")
for doc in collection.find().sort("salary", -1).limit(3):
    print(doc)
print("\nSkip 2:")
for doc in collection.find().sort("salary", -1).skip(2):
    print(doc)
print("\nAverage Salary by Dept:")
for doc in collection.aggregate([
    {"$group": {"_id": "$dept", "avgSalary": {"$avg": "$salary"}}}
]):
    print(doc)
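Expected Output (abridged):
With the five documents above, the count is 5, and the aggregation should report average salaries of 50000.0 for HR, 65000.0 for IT, and 77500.0 for Finance (the order of $group results is not guaranteed).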
---
Experiment 6: Word Count Using MapReduce
Aim:
To implement word frequency using MapReduce in Python.
Code:
def mapper(line):
    # emit (word, 1) for every word on the line
    return [(word, 1) for word in line.strip().split()]

def shuffle_and_sort(mapped):
    # group the counts by word
    grouped = {}
    for word, count in mapped:
        grouped.setdefault(word, []).append(count)
    return grouped

def reducer(grouped):
    # sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

def main():
    with open("input.txt", "r") as f:
        lines = f.readlines()
    mapped = []
    for line in lines:
        mapped.extend(mapper(line))
    grouped = shuffle_and_sort(mapped)
    reduced = reducer(grouped)
    print("Word Count:", reduced)

if __name__ == "__main__":
    main()
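As a quick sanity check: with a hypothetical input.txt containing the single line "big data big ideas", the program would print Word Count: {'big': 2, 'data': 1, 'ideas': 1}. Note that the split is case- and punctuation-sensitive, so "Big" and "big" would be counted separately.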
---
Experiment 7: MapReduce on Dataset (Average Salary)
Aim:
To compute average salary per department using MapReduce.
Code:
import csv

def mapper(row):
    # emit (dept, (salary, 1)) for each employee row
    dept, salary = row[1], float(row[2])
    return (dept, (salary, 1))

def shuffle_and_sort(mapped):
    # group the (salary, count) pairs by department
    grouped = {}
    for dept, (salary, count) in mapped:
        if dept not in grouped:
            grouped[dept] = []
        grouped[dept].append((salary, count))
    return grouped

def reducer(grouped):
    # compute the average salary per department
    results = {}
    for dept, values in grouped.items():
        total_salary = sum(s for s, _ in values)
        total_count = sum(c for _, c in values)
        results[dept] = round(total_salary / total_count, 2)
    return results

def main():
    with open("employees.csv") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        mapped = [mapper(row) for row in reader]
    grouped = shuffle_and_sort(mapped)
    reduced = reducer(grouped)
    print("Average Salary by Department:", reduced)

if __name__ == "__main__":
    main()
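Note that mapper() assumes employees.csv has a header row followed by rows whose second column is the department and whose third column is the salary. A hypothetical file matching that layout:

name,dept,salary
Alice,HR,50000
Bob,IT,60000
Charlie,IT,72000

With this input the program would print Average Salary by Department: {'HR': 50000.0, 'IT': 66000.0}.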
---
Experiment 8: Clustering with Spark MLlib
Aim:
To perform clustering using Spark MLlib’s K-Means.
Code (PySpark):
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Clustering").getOrCreate()
# load dataset
data = [(1, 2.0, 3.0), (2, 10.0, 15.0), (3, 25.0, 30.0)]
df = spark.createDataFrame(data, ["id", "x", "y"])
# assemble features
vec = VectorAssembler(inputCols=["x", "y"], outputCol="features")
df = vec.transform(df)
# k-means
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
print("Cluster Centers:")
for center in model.clusterCenters():
    print(center)
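To see which cluster each point was assigned to, the fitted model can transform the DataFrame; KMeansModel appends a prediction column by default:

# assign each row to its nearest cluster centre
model.transform(df).select("id", "prediction").show()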
Experiment 9: MongoDB + Hadoop Integration (Mini-Project)
Aim:
To build a mini-project that stores student data in MongoDB, exports it to
HDFS (Hadoop Distributed File System), and then processes it (e.g., count
total records).
Python Program (experiment9.py):
from pymongo import MongoClient
import pandas as pd
import subprocess

# 1. Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["bigdataDB"]
collection = db["students"]

# Insert sample student data
students = [
    {"name": "Alice", "dept": "CS", "score": 85},
    {"name": "Bob", "dept": "IT", "score": 90},
    {"name": "Charlie", "dept": "CS", "score": 78},
    {"name": "David", "dept": "IT", "score": 88},
    {"name": "Eve", "dept": "Math", "score": 92}
]
collection.insert_many(students)
print("✅ Data inserted into MongoDB.")

# 2. Export data from MongoDB to CSV
data = list(collection.find({}, {"_id": 0}))  # exclude the _id field
df = pd.DataFrame(data)
csv_file = "students.csv"
df.to_csv(csv_file, index=False)
print("✅ Exported data to students.csv")

# 3. Put the CSV file into HDFS (requires Hadoop running)
hdfs_dir = "/bigdata_exp9"
try:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", csv_file, hdfs_dir], check=True)
    print(f"✅ File uploaded to HDFS at {hdfs_dir}/{csv_file}")
except Exception as e:
    print("⚠️ HDFS commands failed. Make sure Hadoop is running.")
    print(e)

# 4. Simple processing (count rows, average scores)
print("\nProcessing Data:")
print(f"Total number of student records: {len(df)}")
print("Average Score by Department:")
print(df.groupby("dept")["score"].mean())
Steps to Run:
1. Start MongoDB (mongod) and Hadoop (start-dfs.sh).
2. Save the program as experiment9.py.
3. Run:
python experiment9.py
Expected Output:
✅ Data inserted into MongoDB.
✅ Exported data to students.csv
✅ File uploaded to HDFS at /bigdata_exp9/students.csv
Processing Data:
Total number of student records: 5
Average Score by Department:
CS 81.5
IT 89.0
Math 92.0
Name: score, dtype: float64
👉 This program demonstrates:
Storage in MongoDB
Export to Hadoop HDFS
Processing (average scores per department)
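To verify the upload independently of the script, the HDFS directory can be inspected from a terminal (assuming the Hadoop binaries are on the PATH):

hdfs dfs -ls /bigdata_exp9
hdfs dfs -cat /bigdata_exp9/students.csv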