5/14/25, 12:57 PM experiment3.ipynb - Colab
# 1. Load the data
import pandas as pd
df = pd.read_csv('1000_ml_jobs_us.csv')
print("Initial data shape:", df.shape)
Initial data shape: (997, 10)
# 2. Inspect the data
print(df.head())
   Unnamed: 0  job_posted_date  company_address_locality  company_address_region  \
0           0       2024-10-31              Indianapolis                 Indiana
1           1       2025-03-14             San Francisco              California
2           2       2025-04-09                  San Jose                      CA
3           3       2025-03-22             Mountain View              California
4           4       2025-03-28                    Boston           Massachusetts

   company_name             company_website  \
0    Upper Hand       https://upperhand.com
1        Ikigai   https://www.ikigailabs.io
2         Adobe        http://www.adobe.com
3         Waymo  https://waymo.com/careers/
4           HMH        http://www.hmhco.com

                                 company_description  \
0  Upper Hand is the leading provider of full-sui...
1  Built upon years of MIT research, Ikigai is a ...
2  Adobe is the global leader in digital media an...
3  On the journey to be the world's most trusted ...
4  We are an adaptive learning company that empow...

                                job_description_text   seniority_level  \
0  OverviewUpper Hand is embarking on an exciting...        Internship
1  Company DescriptionThe Ikigai platform unlocks...  Mid-Senior level
2  Our CompanyChanging the world through digital ...       Entry level
3  Waymo is an autonomous driving technology comp...       Entry level
4  Job Title: Machine Learning EngineerLocation: ...  Mid-Senior level

                                           job_title
0  Internship - Machine Learning Engineer & Data ...
1                          Machine Learning Engineer
2                          Machine Learning Engineer
3                Machine Learning Engineer, Training
4                          Machine Learning Engineer
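The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when the file was saved. A minimal sketch of one way to avoid loading it as data, assuming that reading is correct; the in-memory CSV and the name `df_sample` are illustrative stand-ins, not part of the notebook:

```python
import io
import pandas as pd

# Tiny stand-in for the real file (values taken from the head() output above).
sample_csv = io.StringIO(
    ",job_posted_date,company_name,job_title\n"
    "1,2025-03-14,Ikigai,Machine Learning Engineer\n"
    "2,2025-04-09,Adobe,Machine Learning Engineer\n"
)

# index_col=0 uses the unnamed first column as the row index
# instead of loading it as a data column called 'Unnamed: 0'.
df_sample = pd.read_csv(sample_csv, index_col=0)
print(df_sample.columns.tolist())
```

With the real file, the same idea would be `pd.read_csv('1000_ml_jobs_us.csv', index_col=0)`.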
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997 entries, 0 to 996
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Unnamed: 0                997 non-null    int64
 1   job_posted_date           997 non-null    object
 2   company_address_locality  950 non-null    object
 3   company_address_region    884 non-null    object
 4   company_name              997 non-null    object
 5   company_website           983 non-null    object
 6   company_description       985 non-null    object
 7   job_description_text      996 non-null    object
 8   seniority_level           988 non-null    object
 9   job_title                 997 non-null    object
dtypes: int64(1), object(9)
memory usage: 78.0+ KB
None
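The `info()` output lists `job_posted_date` as `object`, so the dates are stored as plain strings. A small sketch of converting them to datetimes, using dates from the `head()` output plus one invalid value added here purely for illustration; the variable names are hypothetical:

```python
import pandas as pd

# Four dates from the head() output, plus one bad value for illustration.
raw_dates = pd.Series(
    ["2024-10-31", "2025-03-14", "2025-04-09", "2025-03-22", "not a date"]
)

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# With a real datetime column, postings can be counted per month.
per_month = parsed.dt.to_period("M").value_counts().sort_index()
print(per_month)
```

In the notebook this would correspond to something like `df['job_posted_date'] = pd.to_datetime(df['job_posted_date'], errors='coerce')`.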
print(df.describe(include='all'))
        Unnamed: 0  job_posted_date  company_address_locality  \
count   997.000000              997                       950
unique         NaN              116                       178
top            NaN       2025-04-09             San Francisco
freq           NaN               87                       148
mean    498.000000              NaN                       NaN
std     287.953411              NaN                       NaN
min       0.000000              NaN                       NaN
25%     249.000000              NaN                       NaN
50%     498.000000              NaN                       NaN
75%     747.000000              NaN                       NaN
max     996.000000              NaN                       NaN

        company_address_region  company_name  \
count                      884           997
unique                      87           488
top                 California        TikTok
freq                       308            88
mean                       NaN           NaN
std                        NaN           NaN
min                        NaN           NaN
25%                        NaN           NaN
50%                        NaN           NaN
75%                        NaN           NaN
max                        NaN           NaN

                             company_website  \
count                                    983
unique                                   478
top     https://www.tiktok.com/about?lang=en
freq                                      88
mean                                     NaN
std                                      NaN
min                                      NaN
25%                                      NaN
50%                                      NaN
75%                                      NaN
max                                      NaN

                                      company_description  \
count                                                 985
unique                                                480
top     TikTok is the world's leading destination for ...
freq                                                   88
mean                                                  NaN
std                                                   NaN
min                                                   NaN
25%                                                   NaN
50%                                                   NaN
75%                                                   NaN
max                                                   NaN

                                     job_description_text   seniority_level  \
count                                                 996               988
unique                                                795                 7
top     Meta is embarking on the most transformative c...  Mid-Senior level
freq                                                   12               371
mean                                                  NaN               NaN
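The `describe()` output reports 87 unique values for `company_address_region`, and the `head()` output shows both `California` and `CA`, which suggests the same state is spelled more than one way. A minimal sketch of normalizing such values; the `ABBREVIATIONS` mapping is a hypothetical, deliberately partial example and would need to be extended for the real data:

```python
import pandas as pd

# Example region values in the style of the head() output, including a missing one.
regions = pd.Series(["California", "CA", "Massachusetts", None, "Indiana"])

# Hypothetical partial mapping from abbreviation to full state name.
ABBREVIATIONS = {"CA": "California", "MA": "Massachusetts", "IN": "Indiana"}

# Strip stray whitespace, then replace exact abbreviation matches.
normalized = regions.str.strip().replace(ABBREVIATIONS)
print(normalized.value_counts(dropna=False))
```

Missing regions stay missing here; whether to fill them is a separate decision.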
# 3. Clean column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("Columns after cleaning:", df.columns.tolist())
Columns after cleaning: ['unnamed:_0', 'job_posted_date', 'company_address_locality', 'company_address_region', 'company_name', 'com
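The cleaning step above only replaces spaces, which is why `Unnamed: 0` becomes `unnamed:_0` with the colon still in place. One possible variant, sketched on an empty demo frame (`df_demo` is an illustrative name), collapses every run of non-alphanumeric characters into a single underscore:

```python
import pandas as pd

# Empty demo frame; only the column names matter here.
df_demo = pd.DataFrame(columns=["Unnamed: 0", " Job Title ", "company_name"])

df_demo.columns = (
    df_demo.columns.str.strip()
    .str.lower()
    # Replace any run of characters other than a-z and 0-9 with one underscore.
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    # Drop underscores left at either end of a name.
    .str.strip("_")
)
print(df_demo.columns.tolist())
```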
# 4. Check and handle missing values
print("Missing values:\n", df.isnull().sum())
Missing values:
unnamed:_0                    0
job_posted_date               0
company_address_locality     47
company_address_region      113
company_name                  0
company_website              14
company_description          12
job_description_text          1
seniority_level               9
job_title                     0
dtype: int64
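Raw counts are easier to judge as a share of the 997 rows. The sketch below reuses the non-zero counts printed above and converts them to percentages; on the real frame the equivalent would be `df.isnull().mean() * 100`. The names `n_rows`, `missing`, and `missing_pct` are illustrative:

```python
import pandas as pd

n_rows = 997  # from the initial df.shape

# Non-zero missing counts copied from the output above.
missing = pd.Series({
    "company_address_locality": 47,
    "company_address_region": 113,
    "company_website": 14,
    "company_description": 12,
    "job_description_text": 1,
    "seniority_level": 9,
})

# Percentage of rows missing each field, largest first.
missing_pct = (missing / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

`company_address_region` stands out at roughly 11% missing, while the other columns are each below 5%.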
# Example: fill missing values in 'company_size' with the mode.
# This dataset has no 'company_size' column, so the guard makes this a no-op here.
if 'company_size' in df.columns:
    df['company_size'] = df['company_size'].fillna(df['company_size'].mode()[0])
# 5. Drop rows with critical missing values (if any)
df = df.dropna(subset=['job_title', 'company_name'])  # optionally also 'company_address_locality'
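# 6. Convert 'founded' to numeric and handle missing values
# (this dataset has no 'founded' column, so the guard skips this block)
if 'founded' in df.columns:
    df['founded'] = pd.to_numeric(df['founded'], errors='coerce')
    # Note: 0 is used as a sentinel for a missing or unparseable year
    df['founded'] = df['founded'].fillna(0).astype(int)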
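Filling a missing year with 0 keeps the column as plain integers but mixes a fake value into the data. As an alternative sketch, pandas' nullable `Int64` dtype keeps missing entries as `<NA>`; the `founded` values below are made up for illustration, since the dataset has no such column:

```python
import pandas as pd

# Made-up example values; 'unknown' and None stand for unusable entries.
founded = pd.Series(["1982", "2009", "unknown", None])

# errors="coerce" converts anything non-numeric to NaN.
as_numeric = pd.to_numeric(founded, errors="coerce")

# Nullable integer dtype: valid years become integers, NaN becomes <NA>.
as_int = as_numeric.astype("Int64")
print(as_int)
```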
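# 9. Remove duplicates if any
# Note: 'unnamed:_0' appears to be the saved row index (0 to 996), so a
# whole-row comparison is unlikely to find any duplicates
df = df.drop_duplicates()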
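Because `drop_duplicates()` compares entire rows, a column that differs on every row (as `unnamed:_0` appears to) prevents any row from counting as a duplicate. A minimal sketch of comparing only the content columns, on a small made-up frame (`df_dup` and `content_cols` are illustrative names):

```python
import pandas as pd

# Made-up frame: rows 0 and 1 have identical content but different index values.
df_dup = pd.DataFrame({
    "unnamed:_0": [0, 1, 2],
    "company_name": ["Adobe", "Adobe", "Waymo"],
    "job_title": ["Machine Learning Engineer"] * 3,
})

# Whole-row comparison keeps every row because 'unnamed:_0' always differs.
print(len(df_dup.drop_duplicates()))

# Compare on every column except the leftover index column.
content_cols = [c for c in df_dup.columns if c != "unnamed:_0"]
deduped = df_dup.drop_duplicates(subset=content_cols).reset_index(drop=True)
print(len(deduped))
```

Whether repeated postings should actually be removed depends on the analysis; this only shows how to detect them.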
# 10. Reset index
df.reset_index(drop=True, inplace=True)
# 11. Save cleaned dataset
df.to_csv('1000_ml_jobs_us_cleaned.csv', index=False)
print("Cleaned dataset saved as '1000_ml_jobs_us_cleaned.csv'")
Cleaned dataset saved as '1000_ml_jobs_us_cleaned.csv'
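A quick way to confirm a save like this is to read the file back and compare it with the frame in memory. The sketch below does that round trip in a temporary directory with a small made-up frame; `df_out`, `df_back`, and the file name are illustrative:

```python
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({
    "company_name": ["Adobe", "Waymo"],
    "job_title": ["Machine Learning Engineer", "Machine Learning Engineer, Training"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    df_out.to_csv(path, index=False)
    df_back = pd.read_csv(path)

# True when column names, dtypes, and values all survived the round trip.
print(df_back.equals(df_out))
```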
# Reload the original (uncleaned) CSV; the output below still shows the raw column names
a = pd.read_csv('1000_ml_jobs_us.csv')
print(a.head())
   Unnamed: 0  job_posted_date  company_address_locality  company_address_region  \
0           0       2024-10-31              Indianapolis                 Indiana
1           1       2025-03-14             San Francisco              California
2           2       2025-04-09                  San Jose                      CA
3           3       2025-03-22             Mountain View              California
4           4       2025-03-28                    Boston           Massachusetts

   company_name             company_website  \
0    Upper Hand       https://upperhand.com
1        Ikigai   https://www.ikigailabs.io
2         Adobe        http://www.adobe.com
3         Waymo  https://waymo.com/careers/
4           HMH        http://www.hmhco.com

                                 company_description  \
0  Upper Hand is the leading provider of full-sui...
1  Built upon years of MIT research, Ikigai is a ...
2  Adobe is the global leader in digital media an...
3  On the journey to be the world's most trusted ...
4  We are an adaptive learning company that empow...

                                job_description_text   seniority_level  \
0  OverviewUpper Hand is embarking on an exciting...        Internship
1  Company DescriptionThe Ikigai platform unlocks...  Mid-Senior level
2  Our CompanyChanging the world through digital ...       Entry level
3  Waymo is an autonomous driving technology comp...       Entry level
4  Job Title: Machine Learning EngineerLocation: ...  Mid-Senior level

                                           job_title
0  Internship - Machine Learning Engineer & Data ...
1                          Machine Learning Engineer
2                          Machine Learning Engineer
3                Machine Learning Engineer, Training
4                          Machine Learning Engineer
https://colab.research.google.com/drive/1k2l4_kYDz70rpj2uxDVuStwdkCymXwjn#printMode=true 3/3