
Assignment 2: Document Distance

Weightage: 6%
Deadline: Sunday 1st December 2024, 11:59 PM

Introduction

Objectives

• Introducing the concept of dictionaries and their applications


• Writing and calling helper functions
• Basic file handling

Collaboration

• Students should write up and hand in their assignments separately. Students may not submit the exact
same code; there is NO reason for multiple students to have similar-looking code.
• Students are NOT permitted to look at or copy each other’s code, ‘code structure’, or logic.

Although this handout is long, the information is here to provide you with context, useful examples, and
hints, so be sure to read carefully.

Getting Started

A) File Setup

Download the files document_distance.py, test_a2_student.py, and the various text and lyric documents
within tests/student_tests. When you are done, make sure you run the tester file test_a2_student.py
to check your code against some of our test cases.
You will edit ONLY document_distance.py.

B) Document Distance Overview

Given two words (or even two documents), you will calculate a score between 0 and 1 that will tell you how
similar they are. If the words or documents are the same, they will get a score of 1. If the documents are
completely different, they will get a score of 0. If they are somewhat similar then they will get a floating
value score between 0 and 1.

You can use this to detect plagiarism, find similar documents, or even to recommend similar songs or movies
to a user.
You will calculate the score in two different ways and observe whether one works better than the other.
The first way will use single word frequencies in the two texts. The second will use the TF-IDF (Term
Frequency-Inverse Document Frequency) of words in a file.
Note that you do NOT need to worry about case sensitivity throughout this assignment. All
inputs will be lower case.

1 Text to List
The first step in any data analysis problem is prepping your data. We have provided a function called
load_file to read a text file and output all the text in the file into a string. This function takes in a variable
called filename, which is a string of the filename you want to load, including the extension. It removes all
punctuation and saves the text as a string. Do not modify this function.
Here’s an example usage:

# hello_world.txt looks like this: 'hello world, hello'
>>> text = load_file("tests/student_tests/hello_world.txt")
>>> text
'hello world hello'

You will further prepare the text by taking the string and transforming it into a list representation of the
text. Given the example from above, here is what we expect:

>>> text_to_list('hello world hello')
['hello', 'world', 'hello']

Implement text_to_list in document_distance.py as per the given instructions and docstring. In addition
to running the tester file, you can quickly check your implementation on the provided examples for each
problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    ## Tests Problem 0: Prep Data
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he....')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    print(world)   # should print ['hello', 'world', 'hello']
    print(friend)  # should print ['hello', 'friends']

Note: You can assume that the only kinds of white space in the text documents we provide
will be new lines or space(s) between words (i.e. there are no tabs).
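Because the only whitespace is spaces and new lines, Python's built-in str.split() (called with no
argument) is enough to tokenize the text. A minimal sketch of one possible approach, not necessarily the
exact structure your docstring expects:

```python
# A minimal sketch of text_to_list. str.split() with no arguments splits
# on any run of whitespace (spaces and new lines alike), which matches the
# whitespace assumption stated above.
def text_to_list(input_text):
    """Return the words of input_text as a list, in order of appearance."""
    return input_text.split()

print(text_to_list('hello world hello'))  # ['hello', 'world', 'hello']
```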

2 Get Frequencies
Let’s start by calculating the frequency of each element in a given list. The goal is to return a dictionary
with each unique element as the key, and the number of times the element occurs in the list as the value.
Consider the following examples:
Example 1:

>>> get_frequencies(['h', 'e', 'l', 'l', 'o'])
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
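As a hint, the whole function is a single pass over the list with a dictionary. A minimal sketch of the
behavior shown in the examples:

```python
# A minimal sketch of get_frequencies: one dictionary pass over the list.
def get_frequencies(elements):
    """Map each unique element to the number of times it occurs in elements."""
    freqs = {}
    for element in elements:
        # dict.get with a default of 0 handles first occurrences cleanly
        freqs[element] = freqs.get(element, 0) + 1
    return freqs

print(get_frequencies(['hello', 'world', 'hello']))  # {'hello': 2, 'world': 1}
```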

Implement get_frequencies in document_distance.py using the above instructions and the docstring
provided. In addition to running the tester file, you can quickly check your implementation on the provided
examples for each problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    # Tests Problem 1: Get Frequencies
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he....')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    print(world_word_freq)   # should print {'hello': 2, 'world': 1}
    print(friend_word_freq)  # should print {'hello': 1, 'friends': 1}

3 Letter Frequencies
Now, given a word in the form of a string, let’s create a dictionary with each letter as the key and how many
times each letter occurs in the word as the value. That sounds very similar to get_frequencies...
You must call get_frequencies in your get_letter_frequencies to get full credit.
Example 1:
>>> get_letter_frequencies('hello')
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_letter_frequencies('that')
{'t': 2, 'h': 1, 'a': 1}
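Since a Python string is an iterable of single-character strings, get_letter_frequencies can simply
delegate to get_frequencies. A minimal sketch (get_frequencies is repeated here only to keep the
example self-contained):

```python
# Helper from the previous problem, repeated so this sketch runs on its own.
def get_frequencies(elements):
    freqs = {}
    for element in elements:
        freqs[element] = freqs.get(element, 0) + 1
    return freqs

# A minimal sketch of get_letter_frequencies: converting the word to a list
# of its letters lets get_frequencies do all the counting.
def get_letter_frequencies(word):
    """Map each letter in word to its number of occurrences."""
    return get_frequencies(list(word))

print(get_letter_frequencies('that'))  # {'t': 2, 'h': 1, 'a': 1}
```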

Implement get_letter_frequencies in document_distance.py using the above instructions and the
docstring provided. In addition to running the tester file, you can quickly check your implementation on
the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 2: Get Letter Frequencies
    freq1 = get_letter_frequencies('hello')
    freq2 = get_letter_frequencies('that')
    print(freq1)  # should print {'h': 1, 'e': 1, 'l': 2, 'o': 1}
    print(freq2)  # should print {'t': 2, 'h': 1, 'a': 1}

4 Similarity
Now it’s time to calculate similarity! Complete the function calculate_similarity_score based on the
definition of similarity found in the next paragraph. Your function should be able to be used with the outputs
of get_frequencies or get_letter_frequencies.

Consider two lists L1 and L2. Let U be a list made up of all elements in L1 and L2, but with no repeats
(e.g. if L1 = [‘a’, ‘b’] and L2 = [‘b’, ‘c’], then U = [‘a’, ‘b’, ‘c’]). For an element e in L1 or L2, let:

    count(e, Li) = number of times e appears in Li,  if e is in Li        (1)
                 = 0,                                if e is not in Li

We can then define:

• δ(e) = |count(e, L1) − count(e, L2)|
• σ(e) = count(e, L1) + count(e, L2)

Similarity is defined as:

    1 − (δ(u1) + δ(u2) + δ(u3) + ···) / (σ(u1) + σ(u2) + σ(u3) + ···)

where the sums are taken over all the elements u1, u2, u3, ··· of U, and the result is rounded to two
decimal places.

Example (where elements are words):

• Suppose:
– L1 = [‘hello’, ‘world’, ‘hello’], and
– L2 = [‘hello’, ‘friends’]
• The list of unique elements U is U = [‘hello’, ‘world’, ‘friends’].
• The frequency differences δ(u) are:
– δ(‘hello’) = |2 − 1| = 1
– δ(‘world’) = |1 − 0| = 1
– δ(‘friends’) = |0 − 1| = 1
• The frequency totals σ(u) are:
– σ(‘hello’) = 2 + 1 = 3
– σ(‘world’) = 1 + 0 = 1
– σ(‘friends’) = 0 + 1 = 1
• Thus, similarity is calculated as:

    1 − (1 + 1 + 1)/(3 + 1 + 1) = 1 − 3/5 = 0.40

(0.40 is already rounded to two decimal places; Python displays it as 0.4.)

The same calculation, with an alternate (but equivalent) explanation, can be found in
calculate_similarity_score’s docstring.
IMPORTANT: Be sure to round your final similarity calculation to 2 decimal places.
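The definition above translates almost line-for-line into code. A minimal sketch, assuming the inputs are
frequency dictionaries like those returned by get_frequencies or get_letter_frequencies:

```python
# A minimal sketch of calculate_similarity_score, following the delta/sigma
# definition directly. Inputs are assumed to be frequency dictionaries.
def calculate_similarity_score(freq_dict1, freq_dict2):
    """Return the similarity of the two frequency dicts, rounded to 2 places."""
    union = set(freq_dict1) | set(freq_dict2)  # the unique elements U
    delta_sum = 0
    sigma_sum = 0
    for e in union:
        c1 = freq_dict1.get(e, 0)  # count(e, L1); 0 if e is not in L1
        c2 = freq_dict2.get(e, 0)  # count(e, L2); 0 if e is not in L2
        delta_sum += abs(c1 - c2)
        sigma_sum += c1 + c2
    return round(1 - delta_sum / sigma_sum, 2)

# The worked example above: L1 = ['hello', 'world', 'hello'], L2 = ['hello', 'friends']
print(calculate_similarity_score({'hello': 2, 'world': 1},
                                 {'hello': 1, 'friends': 1}))  # 0.4
```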
Implement the function calculate_similarity_score in document_distance.py as per the given
instructions and docstring. In addition to running the tester file, you can quickly check your implementation
on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 3: Similarity
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'he...')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    word1_freq = get_letter_frequencies('toes')
    word2_freq = get_letter_frequencies('that')
    word3_freq = get_frequencies('nah')
    word_similarity1 = calculate_similarity_score(word1_freq, word1_freq)
    word_similarity2 = calculate_similarity_score(word1_freq, word2_freq)
    word_similarity3 = calculate_similarity_score(word1_freq, word3_freq)
    word_similarity4 = calculate_similarity_score(world_word_freq, ...)
    print(word_similarity1)  # should print 1.0
    print(word_similarity2)  # should print 0.25
    print(word_similarity3)  # should print 0.0
    print(word_similarity4)  # should print 0.4

5 Most Frequent Word(s)


Next, you will find out which word(s) occur the most frequently across two dictionaries. You’ll count
how many times every word occurs combined across both texts and return a list of the most frequent
word(s). The most frequent word does not need to appear in both dictionaries: it is based on
the combined word frequencies across both dictionaries. If a word occurs in both dictionaries, consider the
sum of its frequencies as the combined word frequency. If multiple words are tied (i.e. have the same highest
frequency), return an alphabetically ordered list of all these words.
For example, consider the following usage:

>>> freq1 = {"hello": 5, "world": 1}
>>> freq2 = {"hello": 1, "world": 5}
>>> get_most_frequent_words(freq1, freq2)
["hello", "world"]
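One way to approach this: merge the two dictionaries by summing counts, find the maximum combined
count, then return the tied words in sorted order. A minimal sketch:

```python
# A minimal sketch of get_most_frequent_words: combine the two dictionaries,
# then collect every word tied for the maximum combined frequency.
def get_most_frequent_words(freq_dict1, freq_dict2):
    """Return an alphabetized list of the word(s) with the highest combined count."""
    combined = dict(freq_dict1)
    for word, count in freq_dict2.items():
        combined[word] = combined.get(word, 0) + count
    highest = max(combined.values())
    # sorted() gives the alphabetical order required for ties
    return sorted(word for word, count in combined.items() if count == highest)

print(get_most_frequent_words({'hello': 5, 'world': 1},
                              {'hello': 1, 'world': 5}))  # ['hello', 'world']
```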

Implement the function get_most_frequent_words in document_distance.py as per the given
instructions and docstring. In addition to running the tester file, you can quickly check your implementation
on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:

if __name__ == "__main__":
    # Tests Problem 4: Most Frequent Word(s)
    freq_dict1, freq_dict2 = {"hello": 5, "world": 1}, {"hello": 1, "world": 5}
    most_frequent = get_most_frequent_words(freq_dict1, freq_dict2)
    print(most_frequent)  # should print ["hello", "world"]

6 Term Frequency - Inverse Document Frequency (TF-IDF)


In this part, you will calculate the Term Frequency-Inverse Document Frequency, a numerical measure
that signifies the importance of word(s) in a document. You will do so by first calculating the term
frequency and the inverse document frequency, then combining the two to get the TF-IDF.
The term frequency (TF) is calculated as:

    TF(w) = (number of times word w appears in the document) / (total number of words in the document)

The inverse document frequency (IDF) is calculated as:

    IDF(w) = log10( (total number of documents) / (number of documents with word w in it) )

where log10 is log base 10 and can be called with math.log10.


We can then combine TF and IDF to form TF-IDF(w) = TF(w) × IDF(w), where the higher the value, the
rarer the term, and vice versa. For this assignment, we’ll only be working with individual words, but TF-IDF
works for larger groupings of words as well (e.g. bigrams, trigrams, etc.).
For the get_tf function that you’ll implement, you’ll be given a file name stored in a variable named
text_file. You will need to load the file, prep the data, and determine the TF value of each word that
appears in text_file. The output should be a dictionary mapping each word to its TF. Think about how
you could re-use previous functions.
For the get_idf function that you’ll implement, you’ll be given a list of text files stored in a variable named
text_files. You will need to load each of the files, prep the data, and determine the IDF values of all
words that appear in any of the documents in text_files. The output should be a dictionary mapping each
word to its IDF.
For the get_tfidf function that you’ll implement, you’ll be given a file name text_file and a list of file
names text_files. You will need to load the file, prep the data, and determine the TF-IDF of all words
in text_file. The output should be a sorted list of tuples (in increasing TF-IDF score), where each tuple
is of the form (word, TF-IDF). In case of words with the same TF-IDF, the words should be sorted in
increasing alphabetical order.
For example:

>>> text_file = "tests/student_tests/hello_world.txt"
>>> get_tf(text_file)
{"hello": 0.6666666666666666, "world": 0.3333333333333333}
# Explanation: There are 3 total words in "hello_world.txt". 2 of the 3 are
# "hello", giving a TF of 2/3 for "hello" and 1/3 for "world".

>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_idf(text_files)
{"hello": 0.0, "world": 0.3010299956639812, "friends": 0.3010299956639812}
# Explanation: There are a total of 2 documents in this example. "hello" is in
# both documents, giving "hello" an IDF of 0.0. "world" and "friends" are in
# only one document each, giving them an IDF of 0.3010299956639812.

>>> text_file = "tests/student_tests/hello_world.txt"
>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_tfidf(text_file, text_files)
[('hello', 0.0), ('world', 0.10034333188799373)]
# Explanation: We multiply the corresponding TF and IDF values for each word in
# "hello_world.txt" and get the TF-IDF values. "hello" has a TF of 2/3 and an
# IDF of 0.0, giving it a TF-IDF of 0.0. "world" has a TF of 1/3 and an IDF of
# 0.3010299956639812, giving it a TF-IDF of 0.10034333188799373.
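To make the arithmetic concrete, the sketch below computes TF, IDF, and TF-IDF directly on word lists
rather than file names (the real get_tf, get_idf, and get_tfidf must load and prep the files first; the
helper names tf_from_words, idf_from_docs, and tfidf_from_words are illustrative, not part of the
assignment):

```python
import math

# TF of each word in one document, given the document as a word list.
def tf_from_words(words):
    return {w: words.count(w) / len(words) for w in set(words)}

# IDF of every word appearing in any of the documents (each a word list).
def idf_from_docs(docs):
    all_words = set(w for doc in docs for w in doc)
    return {w: math.log10(len(docs) / sum(1 for doc in docs if w in doc))
            for w in all_words}

# TF-IDF pairs for one document, sorted by increasing score,
# with alphabetical order breaking ties.
def tfidf_from_words(words, docs):
    tf, idf = tf_from_words(words), idf_from_docs(docs)
    pairs = [(w, tf[w] * idf[w]) for w in tf]
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))

doc1 = ['hello', 'world', 'hello']   # the prepped hello_world.txt
doc2 = ['hello', 'friends']          # the prepped hello_friends.txt
print(tfidf_from_words(doc1, [doc1, doc2]))
# 'hello' gets 0.0; 'world' gets (1/3) * log10(2) ≈ 0.1003
```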

Implement the functions get_tf, get_idf, and get_tfidf in document_distance.py as per the given
instructions. In addition to running the tester file, you can quickly check your implementation on the provided
examples for each problem by uncommenting the relevant lines of code at the bottom of document_distance.py:

if __name__ == "__main__":
    # Tests Problem 5: Find TF-IDF
    tf_text_file = 'tests/student_tests/hello_world.txt'
    idf_text_files = ['tests/student_tests/hello_world.txt', 'tests/...']
    tf = get_tf(tf_text_file)
    idf = get_idf(idf_text_files)
    tf_idf = get_tfidf(tf_text_file, idf_text_files)
    print(tf)      # should print {'.....'}
    print(idf)     # should print {'.....'}
    print(tf_idf)  # should print ['.....']

When you are done, make sure you run the tester file test_a2_student.py to check your code
against our test cases.

7 Hand-in Procedure

7.1 Naming Files

Save your solutions with the original file name: document_distance.py. Do not ignore this step or save
your file with a different name! The autograder will not be able to find your file if you do, and
you will receive no marks.

7.2 Final Submission


• Be sure to run the student tester and make sure all the tests pass. However, the student
tester contains only a subset of the tests that will be run to determine the problem set grade. Passing
all of the provided test cases does not guarantee full credit on assignment 2.
• Exact instructions for submitting your assignment will be provided within a few days.

8 Supplemental Reading about Document Similarity


This assignment is a greatly simplified version of a very pertinent problem in Information Retrieval. Appli-
cations of document similarity range from retrieving search engine results to comparing genes and proteins
to improving machine translation.
More advanced techniques for calculating document distance include transforming the text into a vector space
and computing the cosine similarity, Jaccard index, or some other metric on the vectors.
