FACULTY OF INFORMATION TECHNOLOGY
DATA SCIENCE 500
SEMESTER 2 ASSIGNMENT
Name & Surname: ___________________________ ITS No: _____________________
Qualification: __________________ Semester: _____ Module Name: ____________________
Date Submitted: _________________
ASSESSMENT CRITERIA            MARK ALLOCATION   EXAMINER MARKS   MODERATOR MARKS
MARKS FOR CONTENT
QUESTION ONE                   15
QUESTION TWO                   15
QUESTION THREE                 15
QUESTION FOUR                  20
QUESTION FIVE                  25
TOTAL                          90
MARKS FOR TECHNICAL ASPECTS
TABLE OF CONTENTS              2
ASSIGNMENT LAYOUT              5
REFERENCES                     3
TOTAL MARKS FOR ASSIGNMENT     100
EXAMINERS COMMENTS:
MODERATORS COMMENTS:
SIGNATURE OF EXAMINER: SIGNATURE OF MODERATOR:
ASSIGNMENT INSTRUCTIONS
1. All assignments must be typed, not handwritten.
2. Every assignment should include the cover page, table of contents and a reference list or
bibliography at the end of the document.
3. A minimum of three current sources (references) should be used in all assignments, and these
should be reflected in both in-text citations and the reference list or bibliography.
4. In-text citations and a reference list or bibliography must be provided. Use the Harvard Style
for in-text citations and the reference list or bibliography.
5. Assignments submitted without citations and accompanying reference lists will be penalised.
6. Students are not allowed to share assignments with fellow students. Any shared assignments
will attract stiff penalties.
7. Using and copying content from websites such as chegg.com, studocu.com, transtutors.com,
sparknotes.com or any other assignment-assistance websites is strictly prohibited. This also
applies to Wiki sites, blogs, and YouTube.
8. Any pictures and diagrams used in the Assignment should be labelled appropriately and
referenced.
9. Answers must be formatted correctly: font size 12, font style Calibri, line spacing of 1.15 and
justified text. Each question should begin on a new page.
10. All assignments must be saved as PDF files, using the correct naming convention, before they
are uploaded to Moodle, e.g. StudentNumber_CourseCode_Assignment
(402999999_DSC500_Assignment).
Question 1 (15 marks)
1.1. Differentiate between organised and unorganised data. (4 marks)
1.2. Explain the purpose of the following libraries: (6 marks)
a. pandas
b. matplotlib
c. numpy
1.3. What is the purpose of the BeautifulSoup package? Provide an example. (5 marks)
Question 2 (15 marks)
The data below give the average daily steps, recorded over one week, for each of the 15
participants in a fitness study:
Daily_Steps: 6532, 8741, 5403, 7829, 9126, 6087, 7324, 8560, 5972, 7645, 6891, 8102, 7456,
6213, 9034
Write a Python script to perform the following tasks:
1. Create an array from the given data. Then, sort this array in descending order. (3 marks)
2. Calculate the mean and standard deviation of the daily steps, rounding to the nearest whole
number. Display the results. (5 marks)
3. Determine the 25th, 50th (median), and 75th percentiles of the data. Display these results on
screen. (3 marks)
4. Find how many participants averaged more than 7500 steps daily and display this count on
screen. (4 marks)
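The four tasks above could be approached along the following lines. This is an illustrative sketch rather than a model answer. It assumes that "array" means a NumPy array, that the population standard deviation (NumPy's default, ddof=0) is wanted, and that NumPy's default linear interpolation is acceptable for the percentiles. The variable names are illustrative only.

```python
import numpy as np

# Average daily steps for the 15 participants, as given in the question.
daily_steps = np.array([6532, 8741, 5403, 7829, 9126, 6087, 7324, 8560,
                        5972, 7645, 6891, 8102, 7456, 6213, 9034])

# Task 1: sort ascending, then reverse the result to get descending order.
sorted_desc = np.sort(daily_steps)[::-1]
print("Sorted (descending):", sorted_desc)

# Task 2: mean and standard deviation, rounded to the nearest whole number.
# Assumption: population standard deviation (ddof=0). Use np.std(..., ddof=1)
# if the sample standard deviation is required instead.
mean_steps = int(round(float(np.mean(daily_steps))))
std_steps = int(round(float(np.std(daily_steps))))
print("Mean:", mean_steps)
print("Standard deviation:", std_steps)

# Task 3: 25th, 50th and 75th percentiles (default linear interpolation).
q25, q50, q75 = np.percentile(daily_steps, [25, 50, 75])
print("25th percentile:", q25)
print("50th percentile (median):", q50)
print("75th percentile:", q75)

# Task 4: a boolean mask marks participants above 7500; summing it counts them.
above_7500 = int(np.sum(daily_steps > 7500))
print("Participants averaging more than 7500 steps:", above_7500)
```

A boolean mask is used for the count because it avoids an explicit loop and reads close to the wording of the task.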
Question 3 (15 marks)
One hundred and seventy (170) companies from the JSE were randomly selected and classified
by sector and size. The table below shows the frequencies for the two categorical random
variables, ‘sector’ and ‘size’.
                        Company Size
Sector          Small   Medium   Large   Row Total
Mining              3        8      30          41
Financial           9       21      42          72
Service            10        6       8          24
Retail             14       13       6          33
Column Total       36       48      86         170
a. What is the probability that a randomly selected JSE company will be small and operate in the
service sector? (3 marks)
b. What is the probability that a randomly selected JSE company will be both small and medium-
sized? (3 marks)
c. What is the probability that a randomly selected JSE company will be either a small company,
a service sector company, or both? (3 marks)
d. What is the probability that a randomly selected company is a retail company, given that it is
known (in advance) to be a medium-sized company? (3 marks)
e. What is the probability of selecting a small retail company from the JSE-listed sample
companies? (3 marks)
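The joint, union and conditional probabilities asked for in (a) to (e) can all be read off the contingency table. The sketch below shows one way to organise the calculation in Python, using exact fractions. It is illustrative only. It assumes each company falls into exactly one size category, so that "small and medium-sized" is an impossible event. The helper names (sector_total, size_total, p_joint) are hypothetical.

```python
from fractions import Fraction

# Frequencies from the table: counts[sector][size].
counts = {
    "Mining":    {"Small": 3,  "Medium": 8,  "Large": 30},
    "Financial": {"Small": 9,  "Medium": 21, "Large": 42},
    "Service":   {"Small": 10, "Medium": 6,  "Large": 8},
    "Retail":    {"Small": 14, "Medium": 13, "Large": 6},
}
total = sum(sum(row.values()) for row in counts.values())


def sector_total(sector):
    """Row total: number of companies in the given sector."""
    return sum(counts[sector].values())


def size_total(size):
    """Column total: number of companies of the given size."""
    return sum(row[size] for row in counts.values())


def p_joint(sector, size):
    """Joint probability P(sector and size)."""
    return Fraction(counts[sector][size], total)


# (a) Joint probability: small and service.
p_a = p_joint("Service", "Small")

# (b) Assumption: size categories are mutually exclusive, so the probability is 0.
p_b = Fraction(0)

# (c) Addition rule: P(small) + P(service) - P(small and service).
p_c = Fraction(size_total("Small") + sector_total("Service")
               - counts["Service"]["Small"], total)

# (d) Conditional probability: P(retail | medium) = n(retail and medium) / n(medium).
p_d = Fraction(counts["Retail"]["Medium"], size_total("Medium"))

# (e) Joint probability: small and retail.
p_e = p_joint("Retail", "Small")

for label, p in zip("abcde", [p_a, p_b, p_c, p_d, p_e]):
    print(f"({label}) {p} = {float(p):.4f}")
```

Fractions are used so that each probability stays exact until it is printed as a decimal.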
Question 4 (20 marks)
A bookstore categorises its sales into four main genres: Fiction, Non-Fiction, Children's Books, and
Textbooks. The table shows data for a bookstore over the past five years.
Year    Fiction    Non-Fiction    Children's Books    Textbooks
2019    120        95             55                  80
2020    110        100            65                  85
2021    130        110            70                  75
2022    125        115            75                  90
2023    140        120            80                  95
Your task is to:
1. Create a stacked bar graph using matplotlib, where each bar represents a year, and the
segments of each bar represent the sales for each book genre.
2. Use a different colour for each genre and include a legend.
3. Add appropriate labels for the x-axis (years) and y-axis (sales in thousands of dollars).
4. Include a title for the graph.
5. Add text labels on each stacked bar segment to show the exact sales figure for that genre and
year.
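The five requirements above could be met with a sketch along these lines. It is illustrative only, not a model answer. The colours, figure size and output filename (bookstore_sales.png) are assumptions, and the non-interactive "Agg" backend is selected so that the script runs without a display. In an interactive session, plt.show() could be used in place of saving the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen and save to a file
import matplotlib.pyplot as plt
import numpy as np

years = ["2019", "2020", "2021", "2022", "2023"]
sales = {
    "Fiction": [120, 110, 130, 125, 140],
    "Non-Fiction": [95, 100, 110, 115, 120],
    "Children's Books": [55, 65, 70, 75, 80],
    "Textbooks": [80, 85, 75, 90, 95],
}
# Assumed colour choice: one distinct colour per genre.
colours = {
    "Fiction": "tab:blue",
    "Non-Fiction": "tab:orange",
    "Children's Books": "tab:green",
    "Textbooks": "tab:red",
}

x = np.arange(len(years))          # one bar position per year
bottom = np.zeros(len(years))      # running height of each stacked bar

fig, ax = plt.subplots(figsize=(9, 6))
for genre, values in sales.items():
    values = np.array(values)
    # Draw this genre's segment on top of the segments drawn so far.
    ax.bar(x, values, bottom=bottom, label=genre, color=colours[genre])
    # Write the exact figure in the middle of each segment.
    for xi, base, value in zip(x, bottom, values):
        ax.text(xi, base + value / 2, str(value),
                ha="center", va="center", color="white")
    bottom += values

ax.set_xticks(x)
ax.set_xticklabels(years)
ax.set_xlabel("Year")
ax.set_ylabel("Sales (thousands of dollars)")
ax.set_title("Bookstore Sales by Genre, 2019-2023")
ax.legend(title="Genre")

fig.savefig("bookstore_sales.png")  # hypothetical output filename
```

The running bottom array is what stacks the segments: each genre is drawn starting at the combined height of the genres plotted before it.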
Question 5 (25 marks)
Analyse the relationship between years of experience, education level, and salary for software
engineers. Use the following data:
Years of Experience: [1, 2, 3, 5, 7, 8, 10, 12, 15, 18]
Education Level (0: Bachelor's, 1: Master's, 2: PhD): [0, 0, 1, 1, 0, 2, 1, 2, 1, 2]
Salary (in thousands of USD): [50, 55, 65, 75, 80, 95, 90, 105, 110, 125]
1. Create a scatter plot using matplotlib, with different colours for each education level.
2. Add lines of best fit for each education level where possible.
3. Use sklearn's LinearRegression to fit a multiple linear regression model to the data.
4. Display the coefficients and intercept of the regression model.
5. Use the model to predict the salary for:
a. A software engineer with six years of experience and a Bachelor's degree
b. A software engineer with nine years of experience and a PhD
6. Create a new graph that shows the original data points, lines of best fit, and the predicted
values from step 5.
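The six steps above could be sketched as follows. This is an illustrative outline, not a model answer. It assumes education level is treated as a single numeric feature coded 0, 1, 2 (one-hot encoding would be an alternative), that the per-level lines of best fit are simple straight lines fitted with np.polyfit, and that a level needs at least two data points for a line to be fitted. The colours, marker style and output filename (salary_regression.png) are also assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen and save to a file
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([1, 2, 3, 5, 7, 8, 10, 12, 15, 18])
education = np.array([0, 0, 1, 1, 0, 2, 1, 2, 1, 2])
salary = np.array([50, 55, 65, 75, 80, 95, 90, 105, 110, 125])
level_names = {0: "Bachelor's", 1: "Master's", 2: "PhD"}
level_colours = {0: "tab:blue", 1: "tab:green", 2: "tab:red"}  # assumed colours

# Steps 3 and 4: multiple linear regression on (years, education level).
X = np.column_stack([years, education])
model = LinearRegression().fit(X, salary)
print("Coefficients (years, education):", model.coef_)
print("Intercept:", model.intercept_)

# Step 5: predictions for (6 years, Bachelor's) and (9 years, PhD).
new_engineers = np.array([[6, 0], [9, 2]])
predicted = model.predict(new_engineers)
for (yrs, level), value in zip(new_engineers, predicted):
    print(f"{yrs} years, {level_names[level]}: {value:.1f} thousand USD")

# Steps 1, 2 and 6: scatter plot, per-level lines of best fit, predicted points.
fig, ax = plt.subplots(figsize=(9, 6))
for level, name in level_names.items():
    mask = education == level
    ax.scatter(years[mask], salary[mask], color=level_colours[level], label=name)
    if mask.sum() >= 2:  # a straight line needs at least two points
        slope, intercept = np.polyfit(years[mask], salary[mask], 1)
        xs = np.array([years[mask].min(), years[mask].max()])
        ax.plot(xs, slope * xs + intercept,
                color=level_colours[level], linestyle="--")
ax.scatter(new_engineers[:, 0], predicted, marker="*", s=200,
           color="black", label="Predicted (step 5)")

ax.set_xlabel("Years of Experience")
ax.set_ylabel("Salary (thousands of USD)")
ax.set_title("Salary vs Experience by Education Level")
ax.legend()
fig.savefig("salary_regression.png")  # hypothetical output filename
```

Coding education as 0, 1, 2 assumes each step up in qualification adds the same amount to salary. Whether that is appropriate is a modelling choice worth discussing in the answer.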