Natural Language
Processing
Lecture 1: Course Overview and Introduction.
11/06/2024
COMS W4705
Daniel Bauer
The 4705 Team
• Instructor: Dr. Daniel Bauer (he/him/his)
Office hours: Fri 1:15-2:45pm (after class, 704 CEPSR and
on Zoom, starting 9/13).
• Course Assistants:
See Courseworks for contact info and office hour
schedule.
Important Dates
• Lectures: Fri 10:10-12:40pm (incl. 20 min break)
• Location: 417 IAB (or on Zoom for CVN or by
permission). All sessions will be recorded.
• Exam 1: Friday Oct 18
• Exam 2: Friday Dec 6
There is no additional final exam.
Course Resources
• Courseworks
• All course materials: videos, lecture notes, code,
announcements, assignments, reading materials
• Homework submission, grade book.
• Ed used for Q & A (shared between sections)
Textbook / Reading
• There is NO official textbook for this course.
• Recommended textbook 1 (somewhat outdated, we
won’t follow this too closely, but
references will be provided):
Dan Jurafsky & James Martin
Speech and Language Processing
2nd Ed. Prentice Hall (2009).
• Draft of most 3rd edition chapters:
https://web.stanford.edu/~jurafsky/slp3/
Textbook / Reading
• Recommended textbook 2:
Yoav Goldberg
Neural Network Methods for
Natural Language Processing
Morgan & Claypool. 2017
• Available as an ebook through the CU library
https://clio.columbia.edu/catalog/13676351
Prerequisites
• Data Structures (COMS W3134 or COMS W3137)
• Discrete Math (COMS W3202, recommended)
• Some experience with basic probability/statistics.
• Some previous or concurrent exposure to AI and machine
learning is beneficial, but not required.
• Some experience with Python is helpful.
Grading
• 40%: 5 programming assignments (lowest score dropped,
10% each)
• 50%: exams (25% each)
• 10%: Participation (in class attendance and on Ed)
Homework
• Homework uploaded through Courseworks. Do not email solutions.
Check your submission!
• ~ 2-week turnaround.
• 72h limit on regrade requests.
• Only programming. Theory done in class / online ungraded
exercises.
• Use Python 3!
• There will typically be some scaffolding code to start you off.
• Some assignments will require a GPU (we will use Google Colab)
Academic Honesty
• Submit your own answers and code.
• Review academic honesty policy on the syllabus.
• When in doubt, ask.
• When in trouble, ask for help (and early).
• We will talk _about_ ChatGPT et al. but please do not use
large language models for your homework.
NLP in the Movies
“I am fluent in over six million forms of communication.”
“Open the pod bay doors, HAL!”
“I’m sorry, Dave, I’m afraid I can’t do that!”
Natural Language
Processing
• Important and active research area within AI.
• Timely: Most of our activities online are text based
(web-pages, email, social media, blogs, news, product
descriptions and reviews, medical reports, course content, …)
• NLP leverages more and more available training data and
modern Machine Learning techniques, such as neural
networks (RNNs, transformers etc.)
• Communicating with computers is the “holy grail” of AI.
• NLP may be "AI complete".
Turing Test
(Alan Turing, 1950)
• A computer passes the test of intelligence if it can fool
a human interrogator into believing it is human.
• What skills are needed to build such a system?
• Language processing, knowledge representation,
reasoning, learning.
Image source: Russell & Norvig, Artificial Intelligence - A Modern Approach
Natural Language
Processing
[Venn diagram: NLP at the intersection of AI and Linguistics]
“Every time I fire a linguist, my performance goes up.” (Fred Jelinek)
Natural Language Processing
vs. Computational Linguistics
• NLP: Build systems that can understand and generate
natural language. Focus on applications.
• Computational Linguistics: Study human language
using computational approaches.
• Many overlapping techniques.
Applications: Information
Retrieval
[Diagram: a query is matched against an indexed document corpus, returning ranked results]
Applications: Text
Classification
• Spam filtering.
• Detecting topics / genre.
• Sentiment analysis, author recognition, forensic
linguistics, …
• Detecting hate speech.
Applications: Sentiment
Analysis
Fantastic... truly a wonderful family movie
I have a mixed feeling about this movie.
Well it is fun for sure but definitely not appropriate
for kids 10 and below
My kids loved it!!
The movie is very funny and entertaining. Big A+
I got so boooored...
Disappointed. They showed all fun details in the trailer
Cute but not for adults
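Reviews like these can be classified with a simple bag-of-words model. A minimal from-scratch Naive Bayes sketch (the training data, labels, and function names below are illustrative, not part of the course materials):

```python
from collections import Counter
import math

# Toy labeled reviews (illustrative data, loosely echoing the slide).
train = [
    ("fantastic truly a wonderful family movie", "pos"),
    ("the movie is very funny and entertaining", "pos"),
    ("my kids loved it", "pos"),
    ("i got so bored", "neg"),
    ("disappointed they showed all fun details in the trailer", "neg"),
    ("definitely not appropriate for kids", "neg"),
]

# Per-class word counts and class frequencies.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Naive Bayes with add-one (Laplace) smoothing, in log space."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior P(label)
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            # log likelihood P(word | label) with add-one smoothing
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("a wonderful and entertaining movie"))  # "pos"
print(predict("so bored and disappointed"))           # "neg"
```

On real data one would use far more training examples and features, but the same generative model underlies the Bayesian classifiers covered later in the course.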
Application: Question
Answering
“Where was George Washington born?”
[Diagram: a QA system draws on unstructured text and a knowledge base]
“Westmoreland County, Virginia”
Applications: Playing
Jeopardy! IBM Watson [2011]
William Wilkinson’s “An Account of the Principalities of Wallachia and
Moldavia“ inspired this author’s most famous novel.
Combines information extraction & natural language understanding.
Applications:
Summarization
Credit: Prof. Kathleen McKeown
Applications: Machine
Translation
Machine Translation
• One of the main research areas in NLP, and one of the oldest.
Historical motivation: Translate Russian to English.
• MT is really difficult:
• “Out of sight, out of mind” → “Invisible, imbecile”
• “The spirit is willing, but the flesh is weak”
English → Russian → English
“The vodka is good, but the meat is rotten”
• Challenges: Word order, multiple translations for a word
(need context), want to preserve meaning.
Machine Translation
• Until recently phrase-based translation was the
predominant framework.
• Today neural network models are used.
• Google Translate supports > 100 languages. Near-human
translation quality for some language pairs.
Applications: Virtual
Assistants
• Siri (Apple), Google Now, Cortana (Microsoft), Alexa
(Amazon).
• Subtasks: Speech recognition, language understanding
(in context?), speech generation, …
Applications: Large
Language Models
• Predictive text, content generation (“Generative AI”).
• See ChatGPT example.
Applications: Multi-modal
NLP
Image Captioning Visual Question Answering
“Man in black t-shirt is playing guitar.”
RoboNLP (instruction giving / following, summarization etc.)
Evolution of ML techniques
in NLP
• Rules and heuristics, patterns matching, formal grammars.
• Statistical NLP, generative probabilistic models.
• Discriminative models, support vector machines, logistic regression.
• Back to large generative models.
• Neural networks, phase 1 (RNNs including LSTMs, CNNs)
• Neural networks, phase 2, transformer models, large language
models, pretraining
• Few / zero-shot learning. Prompting.
NLP History
rules and heuristics, patterns
matching, formal grammars.
1950s Statistical NLP, generative probabilistic models.
Corpus-based NLP.
1980s and 90s
Discriminative models,
(e.g. logistic regression,
support vector machines)
early 2000s
neural nets, neural sequence models
RNN, LSTM
mid/late 2000s
pre-training,
embeddings
2010s
transformers
large pre-trained LMs
late 2010s, 2020s
GPT-2 Examples
GPT is a transformer-based language model created by OpenAI.
• GPT-2 example (Feb 2019, 1.5B parameters, trained on 8M web pages)
https://openai.com/blog/better-language-models/#sample1
GPT "prompting" examples
• GPT-3 (Jun 2020, 175B parameters, trained on 45TB of text)
• Fine-tuned model InstructGPT trained using Reinforcement Learning from
Human Feedback.
November 2022 ChatGPT
• Fine-tuned GPT 3.5 using dialog data, then optimized using Reinforcement
Learning from Human Feedback.
• Output may be indistinguishable from human output in many cases.
What You Will Learn In This
Course
• How can machines understand and generate natural
language?
• Theories about language (linguistics).
• Algorithms.
• Statistical / Machine Learning Methods, incl. neural
networks.
• Applications.
Course Overview
• Core NLP techniques.
• Language modeling, part-of-speech tagging, syntactic parsing, word-
sense disambiguation, semantic parsing, text similarity.
• Applications.
• text classification, machine translation, generation, image
captioning,...
• Machine Learning Techniques:
Supervised machine learning, Bayesian models, sequence models (n-
gram models, HMMs), deep learning techniques (RNNs, transformers, ...)
• Critical assessment of NLP methods and data sets (ethics in NLP).
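As a preview of the n-gram models listed above, here is a minimal bigram language model sketch (the toy corpus and names are illustrative; the course develops smoothing and evaluation properly):

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus; each sentence is padded with start/stop symbols.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1
        context_counts[prev] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][cur] / context_counts[prev]

print(bigram_prob("the", "cat"))  # 2 of the 6 tokens following "the" are "cat"
```

Unsmoothed MLE assigns probability zero to any unseen bigram, which is exactly the problem smoothing techniques in the language-modeling lectures address.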