LOGISTIC REGRESSION
Spring 2023 CS6431 Natural Language Processing
B1:
Speech and Language Processing (Third Edition draft
– Jan 2022)
Daniel Jurafsky, James H. Martin
Credits
1. B1
Assignment
Read:
B1: Chapter 5
Problems:
Generative and Discriminative Classifiers
Generative model: models how a document's features would be generated if it belonged to a particular class, i.e. P(d|c) and P(c)
E.g. Naïve Bayes
Discriminative model: directly computes P(c|d), giving more importance to features that are better at discriminating between the output classes
Logistic Regression
Logistic Regressor
A single-layer neural network with Sigmoid/SoftMax as the activation function
Cross-entropy loss function
Uses stochastic gradient descent for optimization
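As a quick illustration (not from the slides), here is a minimal scikit-learn sketch of such a classifier on made-up toy feature vectors; note that LogisticRegression's default solver is L-BFGS rather than SGD, which is sketched separately later.

# Minimal sketch on made-up data: a binary logistic regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[3.0, 2.0, 1.0],      # rows = documents, columns = features
              [0.0, 4.0, 2.0],
              [2.0, 0.0, 0.0],
              [1.0, 3.0, 3.0]])
y = np.array([1, 0, 1, 0])          # binary class labels

clf = LogisticRegression()          # sigmoid output, cross-entropy loss
clf.fit(X, y)                       # learns w and b from the training data
print(clf.coef_, clf.intercept_)    # learned weights w and bias b
print(clf.predict_proba(X[:1]))     # [P(y=0|x), P(y=1|x)] for the first document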
Sigmoid/SoftMax function
Inputs, weights, and bias: z = Σᵢ wᵢxᵢ + b
Or, in vector form: z = w·x + b
Sigmoid (logistic) function: σ(z) = 1 / (1 + e^(−z))
Output (in case of binary classification): P(y=1|x) = σ(w·x + b), P(y=0|x) = 1 − σ(w·x + b)
For sigmoid: let the corresponding six weights be w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7] and b = 0.1
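A small numpy sketch of this computation; the slide gives only the weights and the bias, so the feature vector x below is assumed for illustration (it mirrors the textbook's running sentiment example).

import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])   # weights from the slide
b = 0.1                                          # bias from the slide

# Illustrative feature vector (assumed here; the slide does not list it).
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])

z = np.dot(w, x) + b                # z = w·x + b
print(z, sigmoid(z))                # P(y=1|x) = sigma(z); here sigma(0.833) ≈ 0.70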
Period disambiguation
What sort of features would you suggest?
Hand-crafted features
Feature interactions
Feature templates
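A hypothetical sketch of hand-crafted features and a feature template for period disambiguation (the feature names and the abbreviation list below are illustrative choices, not from the slides):

import re

def period_features(tokens, i):
    """Hand-crafted features (illustrative) for deciding whether the period
    at position i ends a sentence or marks an abbreviation."""
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_is_abbrev": prev_word.lower() in {"dr", "mr", "mrs", "prof", "etc"},
        "prev_is_single_upper": bool(re.fullmatch(r"[A-Z]", prev_word)),   # initials like "J."
        "next_is_capitalized": next_word[:1].isupper(),
        "next_is_lowercase": next_word[:1].islower(),
        # feature template: one binary feature per (prev_word, next_word) pair
        f"pair={prev_word.lower()}_{next_word.lower()}": True,
    }

print(period_features(["He", "met", "Dr", ".", "Smith"], 3))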
Representation Learning
Scaling Input Features
Z-normalization: x′ = (x − μ) / σ
Zero mean
Unit variance
Or, simply rescale each feature into a fixed range such as [−1, +1]
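A short numpy sketch of both options on made-up feature values:

import numpy as np

X = np.array([[3.0, 200.0],
              [1.0,  50.0],
              [2.0, 125.0]])          # two features on very different scales

# Z-normalization: zero mean, unit variance per feature (column).
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# Alternative: rescale each feature into the range [-1, +1].
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_scaled = 2 * (X - X_min) / (X_max - X_min) - 1

print(X_z)
print(X_scaled)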
Logistic Regression vs. Naïve Bayes
Naïve Bayes has an overly strong conditional independence assumption
Problem with correlated features
Logistic Regression is much more robust to correlated features
Naïve Bayes pluses
Works well on small datasets
Easy to implement and fast to train (no optimization step)
Multinomial logistic regression
Or SoftMax Regression
Exactly one among more than two classes can be true for each input
Both predicted output ŷ and actual output y are of size k
ŷᵢ estimates P(yᵢ = 1|x)
Probabilistically normalized version of the sigmoid: softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ), for 1 ≤ i ≤ k
E.g., input: a vector z of k scores; output: a probability distribution over the k classes
For (multinomial) logistic regression: P(yᵢ = 1|x) = exp(wᵢ·x + bᵢ) / Σⱼ exp(wⱼ·x + bⱼ)
Or, in matrix form: ŷ = softmax(Wx + b)
In multinomial logistic regression, a feature can be evidence for or against each individual class.
An exclamation mark ‘!’ may indicate positive or negative emotion, but not
neutral
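A numerically stable softmax sketch on an assumed score vector z (the values below are illustrative, one score per class):

import numpy as np

def softmax(z):
    """Probabilistically normalized scores: exp(z_i) / sum_j exp(z_j)."""
    z = z - np.max(z)               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])   # assumed example scores
p = softmax(z)
print(p, p.sum())                                # a probability distribution: sums to 1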
Cross-entropy loss function
Conditional maximum likelihood estimation: Choose 𝑤 and 𝑏 that
maximize the log 𝑝(𝑦|𝑥) in the training data given the observations 𝑥.
Only two possible outcomes: Bernoulli distribution
p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Note: if y = 1, p(y|x) = ŷ; else if y = 0, p(y|x) = 1 − ŷ
Taking log: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
To make it a loss function, negate it (minimize rather than maximize): L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]
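A small sketch of the binary cross-entropy loss, showing that confident correct predictions are cheap and confident wrong ones expensive (the probability values are arbitrary illustrations):

import numpy as np

def cross_entropy_loss(y_hat, y):
    """Binary cross-entropy: L = -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True class is 1: a confident correct prediction gives a small loss,
# a confident wrong prediction a large one.
print(cross_entropy_loss(0.9, 1))   # ≈ 0.105
print(cross_entropy_loss(0.1, 1))   # ≈ 2.303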
Stochastic Gradient Descent
Figuring out in which direction the function’s slope is rising the most
steeply, and moving in the opposite direction
Logistic regression: convex error function
Vs. Neural network: non-convex (multiple local minima)
The partial derivative gives how steeply the loss rises along that dimension; for logistic regression, ∂L_CE/∂wⱼ = [σ(w·x + b) − y] · xⱼ
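A plain-numpy SGD sketch for binary logistic regression (the toy data, learning rate, and epoch count are arbitrary illustrations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, epochs=100):
    """Stochastic gradient descent for binary logistic regression (sketch).
    Per-example gradients: dL/dw_j = (sigmoid(w·x + b) - y) * x_j, dL/db = sigmoid(w·x + b) - y.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # one example at a time
            err = sigmoid(np.dot(w, X[i]) + b) - y[i]
            w -= lr * err * X[i]                     # step against the gradient
            b -= lr * err
    return w, b

# Toy usage (assumed data): a larger feature value tends to mean class 1.
X = np.array([[0.0], [1.0], [0.2], [0.9]])
y = np.array([0, 1, 0, 1])
print(sgd_logistic_regression(X, y))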
Regularization
Large weights => the model fits the training data too closely and generalizes poorly (overfitting)
Add a penalty for large weights
L1 Regularization: linear function of the weights, R(w) = Σⱼ |wⱼ|
L2 Regularization: quadratic function of the weight values, R(w) = Σⱼ wⱼ²
L1 vs. L2
L1
Linear, but not differentiable at 0; more complex derivative
Corresponds to a Laplace prior on the weights
Prefers a sparse weight vector with a few large weights
L2
Simple derivative
Corresponds to a Gaussian prior with zero mean
Prefers weight vectors with many small weights
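A sketch of adding either penalty to the loss (the penalty weight alpha and the example numbers are arbitrary illustrations):

import numpy as np

def regularized_loss(ce_loss, w, alpha=0.01, kind="l2"):
    """Add a penalty for large weights to the (summed) cross-entropy loss.
    L1: alpha * sum |w_j|   (Laplace prior, encourages sparsity)
    L2: alpha * sum w_j**2  (Gaussian prior, encourages many small weights)
    """
    if kind == "l1":
        return ce_loss + alpha * np.sum(np.abs(w))
    return ce_loss + alpha * np.sum(w ** 2)

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
print(regularized_loss(1.0, w, kind="l1"))   # 1.0 + 0.01 * 11.9  = 1.119
print(regularized_loss(1.0, w, kind="l2"))   # 1.0 + 0.01 * 37.43 ≈ 1.374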