
Assignment 2: Document Distance

Weightage: 6%
Deadline: Sunday 1st December 2024, 11:59 PM

Introduction

Objectives

• Introducing the concept of dictionaries and their applications


• Writing and calling helper functions
• Basic file handling

Collaboration

• Students should write up and hand in their assignments separately. Students may not submit the exact
same code; there is NO reason for multiple students to have similar-looking code.
• Students are NOT permitted to look at or copy each other’s code, ‘code structure’, or logic.

Although this handout is long, the information is here to provide you with context, useful examples, and
hints, so be sure to read carefully.

Getting Started

A) File Setup

Download the files document_distance.py, test_a2_student.py, and the various text and lyric documents
within tests/student_tests. When you are done, make sure you run the tester file test_a2_student.py
to check your code against some of our test cases.
You will edit ONLY document_distance.py.

B) Document Distance Overview

Given two words (or even two documents), you will calculate a score between 0 and 1 that will tell you how
similar they are. If the words or documents are the same, they will get a score of 1. If the documents are
completely different, they will get a score of 0. If they are somewhat similar then they will get a floating
value score between 0 and 1.

You can use this to detect plagiarism, find similar documents, or even to recommend similar songs or movies
to a user.
You will calculate the score in two different ways and observe whether one works better than the other.
The first way will use single word frequencies in the two texts. The second will use the TF-IDF (Term
Frequency-Inverse Document Frequency) of words in a file.
Note that you do NOT need to worry about case sensitivity throughout this assignment. All
inputs will be lower case.

1 Text to List
The first step in any data analysis problem is prepping your data. We have provided a function called
load_file to read a text file and output all the text in the file into a string. This function takes in a variable
called filename, which is a string of the filename you want to load, including the extension. It removes all
punctuation and saves the text as a string. Do not modify this function.
Here’s an example usage:

# hello_world.txt looks like this: 'hello world, hello'
>>> text = load_file("tests/student_tests/hello_world.txt")
>>> text
'hello world hello'

You will further prepare the text by taking the string and transforming it into a list representation of the
text. Given the example from above, here is what we expect:

>>> text_to_list('hello world hello')
['hello', 'world', 'hello']

Implement text_to_list in document_distance.py as per the given instructions and docstring. In addition
to running the tester file, you can quickly check your implementation on the provided examples for each
problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    ## Tests Problem 0: Prep Data
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he....')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    print(world)   # should print ['hello', 'world', 'hello']
    print(friend)  # should print ['hello', 'friends']

Note: You can assume that the only kinds of white space in the text documents we provide
will be new lines or space(s) between words (i.e. there are no tabs).
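Because the only whitespace is spaces and new lines, Python's built-in str.split() (called with no
argument) is enough to tokenize the text. A minimal sketch of one possible approach, not necessarily the
exact structure your docstring expects:

```python
# A minimal sketch of text_to_list. str.split() with no arguments splits
# on any run of whitespace (spaces and new lines alike), which matches the
# whitespace assumption stated above.
def text_to_list(input_text):
    """Return the words of input_text as a list, in order of appearance."""
    return input_text.split()

print(text_to_list('hello world hello'))  # ['hello', 'world', 'hello']
```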

2 Get Frequencies
Let’s start by calculating the frequency of each element in a given list. The goal is to return a dictionary
with each unique element as the key, and the number of times the element occurs in the list as the value.
Consider the following examples:
Example 1:

>>> get_frequencies(['h', 'e', 'l', 'l', 'o'])
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
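As a hint, the whole function is a single pass over the list with a dictionary. A minimal sketch of the
behavior shown in the examples:

```python
# A minimal sketch of get_frequencies: one dictionary pass over the list.
def get_frequencies(elements):
    """Map each unique element to the number of times it occurs in elements."""
    freqs = {}
    for element in elements:
        # dict.get with a default of 0 handles first occurrences cleanly
        freqs[element] = freqs.get(element, 0) + 1
    return freqs

print(get_frequencies(['hello', 'world', 'hello']))  # {'hello': 2, 'world': 1}
```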

Implement get_frequencies in document_distance.py using the above instructions and the docstring
provided. In addition to running the tester file, you can quickly check your implementation on the provided
examples for each problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    # Tests Problem 1: Get Frequencies
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he....')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    print(world_word_freq)   # should print {'hello': 2, 'world': 1}
    print(friend_word_freq)  # should print {'hello': 1, 'friends': 1}

3 Letter Frequencies
Now, given a word in the form of a string, let’s create a dictionary with each letter as the key and how many
times each letter occurs in the word as the value. That sounds very similar to get_frequencies...
You must call get_frequencies in your get_letter_frequencies to get full credit.
Example 1:
>>> get_letter_frequencies('hello')
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_letter_frequencies('that')
{'t': 2, 'h': 1, 'a': 1}
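Since a Python string is an iterable of single-character strings, get_letter_frequencies can simply
delegate to get_frequencies. A minimal sketch (get_frequencies is repeated here only to keep the
example self-contained):

```python
# Helper from the previous problem, repeated so this sketch runs on its own.
def get_frequencies(elements):
    freqs = {}
    for element in elements:
        freqs[element] = freqs.get(element, 0) + 1
    return freqs

# A minimal sketch of get_letter_frequencies: converting the word to a list
# of its letters lets get_frequencies do all the counting.
def get_letter_frequencies(word):
    """Map each letter in word to its number of occurrences."""
    return get_frequencies(list(word))

print(get_letter_frequencies('that'))  # {'t': 2, 'h': 1, 'a': 1}
```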

Implement get_letter_frequencies in document_distance.py using the above instructions and the
docstring provided. In addition to running the tester file, you can quickly check your implementation on
the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 2: Get Letter Frequencies
    freq1 = get_letter_frequencies('hello')
    freq2 = get_letter_frequencies('that')
    print(freq1)  # should print {'h': 1, 'e': 1, 'l': 2, 'o': 1}
    print(freq2)  # should print {'t': 2, 'h': 1, 'a': 1}

4 Similarity
Now it’s time to calculate similarity! Complete the function calculate_similarity_score based on the
definition of similarity found in the next paragraph. Your function should be able to be used with the outputs
of get_frequencies or get_letter_frequencies.

Consider two lists L1 and L2. Let U be a list made up of all elements in L1 and L2, but with no repeats
(e.g. if L1 = [‘a’, ‘b’] and L2 = [‘b’, ‘c’], then U = [‘a’, ‘b’, ‘c’]). For an element e in L1 or L2, let:

    count(e, Li) = number of times e appears in Li,  if e is in Li        (1)
                 = 0,                                if e is not in Li

We can then define:

• δ(e) = |count(e, L1) − count(e, L2)|
• σ(e) = count(e, L1) + count(e, L2)

Similarity is defined as:

    1 − (δ(u1) + δ(u2) + δ(u3) + ···) / (σ(u1) + σ(u2) + σ(u3) + ···)

where the sums are taken over all the elements u1, u2, u3, ··· of U, and the result is rounded to two
decimal places.

Example (where elements are words):

• Suppose:
– L1 = [‘hello’, ‘world’, ‘hello’], and
– L2 = [‘hello’, ‘friends’]
• The list of unique elements U is U = [‘hello’, ‘world’, ‘friends’].
• The frequency differences δ(u) are:
– δ(‘hello’) = |2 − 1| = 1
– δ(‘world’) = |1 − 0| = 1
– δ(‘friends’) = |0 − 1| = 1
• The frequency totals σ(u) are:
– σ(‘hello’) = 2 + 1 = 3
– σ(‘world’) = 1 + 0 = 1
– σ(‘friends’) = 0 + 1 = 1
• Thus, similarity is calculated as:

    1 − (1 + 1 + 1)/(3 + 1 + 1) = 1 − 3/5 = 0.40

(0.40 is already rounded to two decimal places; Python displays it as 0.4.)

The same calculation, with an alternate (but equivalent) explanation, can be found in
calculate_similarity_score’s docstring.
IMPORTANT: Be sure to round your final similarity calculation to 2 decimal places.
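The definition above translates almost line-for-line into code. A minimal sketch, assuming the inputs are
frequency dictionaries like those returned by get_frequencies or get_letter_frequencies:

```python
# A minimal sketch of calculate_similarity_score, following the delta/sigma
# definition directly. Inputs are assumed to be frequency dictionaries.
def calculate_similarity_score(freq_dict1, freq_dict2):
    """Return the similarity of the two frequency dicts, rounded to 2 places."""
    union = set(freq_dict1) | set(freq_dict2)  # the unique elements U
    delta_sum = 0
    sigma_sum = 0
    for e in union:
        c1 = freq_dict1.get(e, 0)  # count(e, L1); 0 if e is not in L1
        c2 = freq_dict2.get(e, 0)  # count(e, L2); 0 if e is not in L2
        delta_sum += abs(c1 - c2)
        sigma_sum += c1 + c2
    return round(1 - delta_sum / sigma_sum, 2)

# The worked example above: L1 = ['hello', 'world', 'hello'], L2 = ['hello', 'friends']
print(calculate_similarity_score({'hello': 2, 'world': 1},
                                 {'hello': 1, 'friends': 1}))  # 0.4
```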
Implement the function calculate_similarity_score in document_distance.py as per the given
instructions and docstring. In addition to running the tester file, you can quickly check your implementation
on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 3: Similarity
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he...')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    word1_freq = get_letter_frequencies('toes')
    word2_freq = get_letter_frequencies('that')
    word3_freq = get_frequencies('nah')
    word_similarity1 = calculate_similarity_score(word1_freq, word1_freq)
    word_similarity2 = calculate_similarity_score(word1_freq, word2_freq)
    word_similarity3 = calculate_similarity_score(word1_freq, word3_freq)
    word_similarity4 = calculate_similarity_score(world_word_freq, ...)
    print(word_similarity1)  # should print 1.0
    print(word_similarity2)  # should print 0.25
    print(word_similarity3)  # should print 0.0
    print(word_similarity4)  # should print 0.4

5 Most Frequent Word(s)


Next, you will find out which word(s) occur the most frequently across two dictionaries. You’ll count
how many times every word occurs combined across both texts and return a list of the most frequent
word(s). The most frequent word does not need to appear in both dictionaries: it is based on
the combined word frequencies across both dictionaries. If a word occurs in both dictionaries, consider the
sum of its frequencies as the combined word frequency. If multiple words are tied (i.e. have the same highest
frequency), return an alphabetically ordered list of all these words.
For example, consider the following usage:

>>> freq1 = {"hello": 5, "world": 1}
>>> freq2 = {"hello": 1, "world": 5}
>>> get_most_frequent_words(freq1, freq2)
["hello", "world"]
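One way to approach this: merge the two dictionaries by summing counts, find the maximum combined
count, then return the tied words in sorted order. A minimal sketch:

```python
# A minimal sketch of get_most_frequent_words: combine the two dictionaries,
# then collect every word tied for the maximum combined frequency.
def get_most_frequent_words(freq_dict1, freq_dict2):
    """Return an alphabetized list of the word(s) with the highest combined count."""
    combined = dict(freq_dict1)
    for word, count in freq_dict2.items():
        combined[word] = combined.get(word, 0) + count
    highest = max(combined.values())
    # sorted() gives the alphabetical order required for ties
    return sorted(word for word, count in combined.items() if count == highest)

print(get_most_frequent_words({'hello': 5, 'world': 1},
                              {'hello': 1, 'world': 5}))  # ['hello', 'world']
```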

Implement the function get_most_frequent_words in document_distance.py as per the given
instructions and docstring. In addition to running the tester file, you can quickly check your implementation
on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 4: Most Frequent Word(s)
    freq_dict1, freq_dict2 = {"hello": 5, "world": 1}, {"hello": 1, "world": 5}
    most_frequent = get_most_frequent_words(freq_dict1, freq_dict2)
    print(most_frequent)  # should print ["hello", "world"]

6 Term Frequency - Inverse Document Frequency (TF-IDF)


In this part, you will calculate the Term Frequency-Inverse Document Frequency, a numerical measure
that signifies the importance of word(s) in a document. You will do so by first calculating the term
frequency and the inverse document frequency, then combining the two to get the TF-IDF.
The term frequency (TF) is calculated as:

    TF(w) = (number of times word w appears in the document) / (total number of words in the document)

The inverse document frequency (IDF) is calculated as:

    IDF(w) = log10( (total number of documents) / (number of documents with word w in it) )

where log10 is log base 10 and can be called with math.log10.


We can then combine TF and IDF to form TF-IDF(w) = TF(w) × IDF(w), where the higher the value, the
rarer the term, and vice versa. For this assignment, we’ll only be working with individual words, but TF-IDF
works for larger groupings of words as well (e.g. bigrams, trigrams, etc.).
For the get_tf function that you’ll implement, you’ll be given a file name stored in a variable named
text_file. You will need to load the file, prep the data, and determine the TF value of each word that
appears in text_file. The output should be a dictionary mapping each word to its TF. Think about how
you could re-use previous functions.
For the get_idf function that you’ll implement, you’ll be given a list of text files stored in a variable named
text_files. You will need to load each of the files, prep the data, and determine the IDF values of all
words that appear in any of the documents in text_files. The output should be a dictionary mapping each
word to its IDF.
For the get_tfidf function that you’ll implement, you’ll be given a file name text_file and a list of file
names text_files. You will need to load the file, prep the data, and determine the TF-IDF of all words
in text_file. The output should be a sorted list of tuples (in increasing TF-IDF score), where each tuple
is of the form (word, TF-IDF). In case of words with the same TF-IDF, the words should be sorted in
increasing alphabetical order.
For example:

>>> text_file = "tests/student_tests/hello_world.txt"
>>> get_tf(text_file)
{"hello": 0.6666666666666666, "world": 0.3333333333333333}
# Explanation: There are 3 total words in "hello_world.txt". 2 of the 3 are
# "hello", giving a TF of 2/3 for "hello" and 1/3 for "world".

>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_idf(text_files)
{"hello": 0.0, "world": 0.3010299956639812, "friends": 0.3010299956639812}
# Explanation: There are a total of 2 documents in this example. "hello" is in
# both documents, giving "hello" an IDF of 0.0. "world" and "friends" are in
# only one document each, giving them an IDF of 0.3010299956639812.

>>> text_file = "tests/student_tests/hello_world.txt"
>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_tfidf(text_file, text_files)
[('hello', 0.0), ('world', 0.10034333188799373)]
# Explanation: We multiply the corresponding TF and IDF values for each word in
# "hello_world.txt" and get the TF-IDF values. "hello" has a TF of 2/3 and an
# IDF of 0.0, giving it a TF-IDF of 0.0. "world" has a TF of 1/3 and an IDF of
# 0.3010299956639812, giving it a TF-IDF of 0.10034333188799373.
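To make the arithmetic concrete, the sketch below computes TF, IDF, and TF-IDF directly on word lists
rather than file names (the real get_tf, get_idf, and get_tfidf must load and prep the files first; the
helper names tf_from_words, idf_from_docs, and tfidf_from_words are illustrative, not part of the
assignment):

```python
import math

# TF of each word in one document, given the document as a word list.
def tf_from_words(words):
    return {w: words.count(w) / len(words) for w in set(words)}

# IDF of every word appearing in any of the documents (each a word list).
def idf_from_docs(docs):
    all_words = set(w for doc in docs for w in doc)
    return {w: math.log10(len(docs) / sum(1 for doc in docs if w in doc))
            for w in all_words}

# TF-IDF pairs for one document, sorted by increasing score,
# with alphabetical order breaking ties.
def tfidf_from_words(words, docs):
    tf, idf = tf_from_words(words), idf_from_docs(docs)
    pairs = [(w, tf[w] * idf[w]) for w in tf]
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))

doc1 = ['hello', 'world', 'hello']   # the prepped hello_world.txt
doc2 = ['hello', 'friends']          # the prepped hello_friends.txt
print(tfidf_from_words(doc1, [doc1, doc2]))
# 'hello' gets 0.0; 'world' gets (1/3) * log10(2) ≈ 0.1003
```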

Implement the functions get_tf, get_idf, and get_tfidf in document_distance.py as per the given
instructions. In addition to running the tester file, you can quickly check your implementation on the provided
examples for each problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    # Tests Problem 5: Find TF-IDF
    tf_text_file = 'tests/student_tests/hello_world.txt'
    idf_text_files = ['tests/student_tests/hello_world.txt', 'tests/...']
    tf = get_tf(tf_text_file)
    idf = get_idf(idf_text_files)
    tf_idf = get_tfidf(tf_text_file, idf_text_files)
    print(tf)      # should print {'.....'}
    print(idf)     # should print {'.....'}
    print(tf_idf)  # should print ['.....']

When you are done, make sure you run the tester file test_a2_student.py to check your code
against our test cases.

7 Hand-in Procedure

7.1 Naming Files

Save your solutions with the original file name: document_distance.py. Do not ignore this step or save
your file with a different name! The autograder will not be able to find your file if you do, and
you will receive no marks.

7.2 Final Submission


• Be sure to run the student tester and make sure all the tests pass. However, the student
tester contains only a subset of the tests that will be run to determine the problem set grade. Passing
all of the provided test cases does not guarantee full credit on assignment 2.
• Exact instructions for submitting your assignment will be provided within a few days.

8 Supplemental Reading about Document Similarity


This assignment is a greatly simplified version of a very pertinent problem in Information Retrieval. Appli-
cations of document similarity range from retrieving search engine results to comparing genes and proteins
to improving machine translation.
More advanced techniques for calculating document distance include transforming the text into a vector space
and computing the cosine similarity, Jaccard index, or some other metric on the vectors.
