Experiment3.Ipynb - Colab

The document contains a Jupyter notebook that processes a dataset of US machine learning job postings (1000_ml_jobs_us.csv). It covers data inspection, column-name cleaning, missing-value handling, duplicate removal, and saving the cleaned dataset. The loaded data has shape (997, 10), with attributes such as job titles, company names, and descriptions.

Uploaded by

Madhavi Sah

5/14/25, 12:57 PM experiment3.ipynb - Colab

# 1. Load the data
import pandas as pd

df = pd.read_csv('1000_ml_jobs_us.csv')

print("Initial data shape:", df.shape)

Initial data shape: (997, 10)

# 2. Inspect the data


print(df.head())

Unnamed: 0 job_posted_date company_address_locality company_address_region \


0 0 2024-10-31 Indianapolis Indiana
1 1 2025-03-14 San Francisco California
2 2 2025-04-09 San Jose CA
3 3 2025-03-22 Mountain View California
4 4 2025-03-28 Boston Massachusetts

company_name company_website \
0 Upper Hand https://upperhand.com
1 Ikigai https://www.ikigailabs.io
2 Adobe http://www.adobe.com
3 Waymo https://waymo.com/careers/
4 HMH http://www.hmhco.com

company_description \
0 Upper Hand is the leading provider of full-sui...
1 Built upon years of MIT research, Ikigai is a ...
2 Adobe is the global leader in digital media an...
3 On the journey to be the world's most trusted ...
4 We are an adaptive learning company that empow...

job_description_text seniority_level \
0 OverviewUpper Hand is embarking on an exciting... Internship
1 Company DescriptionThe Ikigai platform unlocks... Mid-Senior level
2 Our CompanyChanging the world through digital ... Entry level
3 Waymo is an autonomous driving technology comp... Entry level
4 Job Title: Machine Learning EngineerLocation: ... Mid-Senior level

job_title
0 Internship - Machine Learning Engineer & Data ...
1 Machine Learning Engineer
2 Machine Learning Engineer
3 Machine Learning Engineer, Training
4 Machine Learning Engineer
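The `Unnamed: 0` column in the output above simply repeats the row number, which usually means the CSV was written with its index included. A minimal sketch of reading such a file with `index_col=0`, assuming that is the case here; the two-row `csv_text` is a hypothetical stand-in for the real file, not data taken from it beyond the values shown above:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for '1000_ml_jobs_us.csv'. The empty first
# header cell is what pandas names 'Unnamed: 0' on a default read.
csv_text = (
    ",job_posted_date,company_name,job_title\n"
    "0,2024-10-31,Upper Hand,Internship - Machine Learning Engineer\n"
    "1,2025-03-14,Ikigai,Machine Learning Engineer\n"
)

# Default read: the saved index appears as an extra data column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

Whether to drop the column this way depends on whether later steps rely on it; the notebook itself keeps it.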

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997 entries, 0 to 996
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 997 non-null int64
1 job_posted_date 997 non-null object
2 company_address_locality 950 non-null object
3 company_address_region 884 non-null object
4 company_name 997 non-null object
5 company_website 983 non-null object
6 company_description 985 non-null object
7 job_description_text 996 non-null object
8 seniority_level 988 non-null object
9 job_title 997 non-null object
dtypes: int64(1), object(9)
memory usage: 78.0+ KB
None
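The `info()` output lists `job_posted_date` with dtype `object`, so the dates are stored as strings. The notebook does not convert them; the following is only a sketch of how that could be done with `pd.to_datetime`, assuming the `YYYY-MM-DD` format seen in `df.head()`. The `dates` frame and its invalid third value are made up for illustration:

```python
import pandas as pd

# Hypothetical sample; the first two values match the format shown in df.head().
dates = pd.DataFrame(
    {"job_posted_date": ["2024-10-31", "2025-03-14", "not a date"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising.
dates["job_posted_date"] = pd.to_datetime(
    dates["job_posted_date"], format="%Y-%m-%d", errors="coerce"
)

print(dates.dtypes)
print(dates["job_posted_date"].isna().sum(), "value(s) could not be parsed")
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.month` become available for grouping postings by period.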

print(df.describe(include='all'))

Unnamed: 0 job_posted_date company_address_locality \ 


count 997.000000 997 950
unique NaN 116 178
top NaN 2025-04-09 San Francisco
freq NaN 87 148
mean 498.000000 NaN NaN
std 287.953411 NaN NaN
min 0.000000 NaN NaN
25% 249.000000 NaN NaN

50% 498.000000 NaN NaN 
75% 747.000000 NaN NaN
max 996.000000 NaN NaN

company_address_region company_name \
count 884 997
unique 87 488
top California TikTok
freq 308 88
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN

company_website \
count 983
unique 478
top https://www.tiktok.com/about?lang=en
freq 88
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN

company_description \
count 985
unique 480
top TikTok is the world's leading destination for ...
freq 88
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN

job_description_text seniority_level \
count 996 988
unique 795 7
top Meta is embarking on the most transformative c... Mid-Senior level
freq 12 371
mean NaN NaN
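The `head()` output earlier shows both `California` and `CA` in `company_address_region`, so some of the 87 unique regions reported by `describe()` may be different spellings of the same state. The notebook does not normalize them. A rough sketch of one way to do it, assuming a lookup from abbreviation to full name; the `abbrev_to_name` mapping is hypothetical and deliberately partial:

```python
import pandas as pd

# Sample values in the style of the head() output, plus one missing entry.
regions = pd.Series(
    ["Indiana", "California", "CA", "California", "Massachusetts", None]
)

# Hypothetical, partial mapping; a real one would cover every state.
abbrev_to_name = {"CA": "California", "IN": "Indiana", "MA": "Massachusetts"}

# Strip stray whitespace, then swap known abbreviations for full names.
# Values not in the mapping (including missing ones) are left unchanged.
normalized = regions.str.strip().replace(abbrev_to_name)

print(normalized.value_counts(dropna=False))
```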

# 3. Clean column names


df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("Columns after cleaning:", df.columns.tolist())

Columns after cleaning: ['unnamed:_0', 'job_posted_date', 'company_address_locality', 'company_address_region', 'company_name', 'com
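Because the cleaning step only replaces spaces, `Unnamed: 0` becomes `unnamed:_0` and keeps its colon. If fully "safe" identifiers are wanted, a regular expression can collapse every run of non-alphanumeric characters into one underscore. This is a sketch of an alternative, not what the notebook does; the sample `cols` index is illustrative:

```python
import pandas as pd

# Illustrative column names, including the one the notebook actually has.
cols = pd.Index(["Unnamed: 0", " Job Title ", "company_name"])

cleaned = (
    cols.str.strip()
    .str.lower()
    # Replace each run of characters other than a-z and 0-9 with "_".
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    # Remove underscores left at either end.
    .str.strip("_")
)

print(cleaned.tolist())
```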

 


# 4. Check and handle missing values
print("Missing values:\n", df.isnull().sum())

Missing values:
unnamed:_0 0
job_posted_date 0
company_address_locality 47
company_address_region 113
company_name 0
company_website 14
company_description 12
job_description_text 1
seniority_level 9
job_title 0
dtype: int64
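Raw counts are easier to judge as a share of rows: 113 missing regions out of 997 rows is about 11.3%. The sketch below computes that share per column and fills missing categorical values with a placeholder. The `"Unknown"` label and the four-row `sample` frame are assumptions for illustration; the notebook leaves these columns unfilled.

```python
import pandas as pd

# Hypothetical four-row sample using column names from the dataset.
sample = pd.DataFrame(
    {
        "company_address_region": ["Indiana", None, "CA", None],
        "seniority_level": ["Internship", "Entry level", None, "Entry level"],
        "job_title": ["Machine Learning Engineer"] * 4,
    }
)

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (sample.isnull().mean() * 100).round(1)
print(missing_pct)

# Fill selected text columns with an explicit placeholder (an assumption,
# chosen so that missing values stay visible in later group-bys).
filled = sample.fillna(
    {"company_address_region": "Unknown", "seniority_level": "Unknown"}
)
print(filled)
```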

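# Example: Fill missing values in 'company_size' with mode
# ('company_size' is not among the 10 columns listed above, so this guard
# skips the fill for this dataset.)
if 'company_size' in df.columns:
    df['company_size'] = df['company_size'].fillna(df['company_size'].mode()[0])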

# 5. Drop rows with critical missing values (if any)


# 'company_address_locality' could be added to the subset to also drop rows without a locality
df = df.dropna(subset=['job_title', 'company_name'])

# 6. Convert 'founded' to numeric and handle missing values
if 'founded' in df.columns:
    df['founded'] = pd.to_numeric(df['founded'], errors='coerce')
    df['founded'] = df['founded'].fillna(0).astype(int)
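Filling a missing `founded` year with 0 makes "unknown" look like a real value and would distort statistics such as the minimum or mean. One possible alternative, shown here only as a sketch, is pandas' nullable `Int64` dtype, which keeps integers while preserving missing entries. The `founded` series below is invented sample data, since this dataset has no such column:

```python
import pandas as pd

# Invented sample: two valid years, one unparseable string, one missing value.
founded = pd.Series(["1998", "2012", "unknown", None])

# errors="coerce" converts anything non-numeric to NaN.
as_number = pd.to_numeric(founded, errors="coerce")

# The notebook's approach: missing values become 0.
zero_filled = as_number.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps missing values as <NA>.
nullable = as_number.astype("Int64")

print(zero_filled.min())  # the 0 placeholder dominates the minimum
print(nullable.min())     # missing values are skipped
```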

# 9. Remove duplicates if any


df = df.drop_duplicates()
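Note that `drop_duplicates()` with no arguments compares all columns, including `unnamed:_0`. Since that column holds a distinct row number for every row, no two rows can match in full and the call removes nothing. If the goal is to drop postings that repeat in content, the comparison can be limited to the other columns. A sketch under that assumption, using a made-up three-row `jobs` frame; whether the real data contains such repeats is not shown in the notebook:

```python
import pandas as pd

# Made-up sample: rows 0 and 1 are identical except for the row-number column.
jobs = pd.DataFrame(
    {
        "unnamed:_0": [0, 1, 2],
        "company_name": ["TikTok", "TikTok", "Adobe"],
        "job_title": ["Machine Learning Engineer"] * 3,
    }
)

# Comparing every column: the unique row number prevents any match.
all_cols = jobs.drop_duplicates()

# Comparing only the content columns: the repeated posting is removed.
content_cols = [c for c in jobs.columns if c != "unnamed:_0"]
by_content = jobs.drop_duplicates(subset=content_cols)

print(len(all_cols), len(by_content))
```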

# 10. Reset index


df.reset_index(drop=True, inplace=True)

# 11. Save cleaned dataset


df.to_csv('1000_ml_jobs_us_cleaned.csv', index=False)
print("Cleaned dataset saved as '1000_ml_jobs_us_cleaned.csv'")

Cleaned dataset saved as '1000_ml_jobs_us_cleaned.csv'

a = pd.read_csv('1000_ml_jobs_us.csv')

print(a.head())

Unnamed: 0 job_posted_date company_address_locality company_address_region \


0 0 2024-10-31 Indianapolis Indiana
1 1 2025-03-14 San Francisco California
2 2 2025-04-09 San Jose CA
3 3 2025-03-22 Mountain View California
4 4 2025-03-28 Boston Massachusetts

company_name company_website \
0 Upper Hand https://upperhand.com
1 Ikigai https://www.ikigailabs.io
2 Adobe http://www.adobe.com
3 Waymo https://waymo.com/careers/
4 HMH http://www.hmhco.com

company_description \
0 Upper Hand is the leading provider of full-sui...
1 Built upon years of MIT research, Ikigai is a ...
2 Adobe is the global leader in digital media an...
3 On the journey to be the world's most trusted ...
4 We are an adaptive learning company that empow...

job_description_text seniority_level \
0 OverviewUpper Hand is embarking on an exciting... Internship
1 Company DescriptionThe Ikigai platform unlocks... Mid-Senior level
2 Our CompanyChanging the world through digital ... Entry level
3 Waymo is an autonomous driving technology comp... Entry level
4 Job Title: Machine Learning EngineerLocation: ... Mid-Senior level

job_title
0 Internship - Machine Learning Engineer & Data ...
1 Machine Learning Engineer
2 Machine Learning Engineer
3 Machine Learning Engineer, Training
4 Machine Learning Engineer
