
LOGISTIC REGRESSION

Spring 2023 CS6431 Natural Language Processing


B1: Speech and Language Processing (Third Edition draft, Jan 2022),
Daniel Jurafsky and James H. Martin
Credits
1. B1
Assignment
Read:
B1: Chapter 5

Problems:
Generative and Discriminative Classifiers
• Generative: models how the features of a document would be generated
  if it belonged to a particular class
  • E.g., Naïve Bayes
• Discriminative: directly computes 𝑃(𝑐|𝑑), giving more importance to
  features that are better at discriminating between the output classes
  • E.g., Logistic Regression
Logistic Regressor
• A single-layer neural network with
  • sigmoid/SoftMax as the activation function
  • cross-entropy as the loss function
• Uses stochastic gradient descent for optimization


Sigmoid/SoftMax function
• Inputs, weights, and bias:
  z = w₁x₁ + w₂x₂ + … + wₙxₙ + b
• Or, in vector form: z = w · x + b
• Sigmoid (logistic) function:
  σ(z) = 1 / (1 + exp(−z))
• Output (in case of binary classification):
  P(y = 1|x) = σ(w · x + b),  P(y = 0|x) = 1 − P(y = 1|x)
• For sigmoid, decide ŷ = 1 if P(y = 1|x) > 0.5, else ŷ = 0
Example: for six features of a document, let the corresponding weights be
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7] and b = 0.1
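A minimal sketch of this computation, using the weights and bias above; the feature vector x is an illustrative assumption (e.g., counts of positive words, negative words, "no", pronouns, "!", and the log word count of one document), not a value given in the slides.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias from the slide; x is a hypothetical feature vector.
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1
x = [3, 2, 1, 3, 0, 4.19]

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
p_positive = sigmoid(z)                        # P(y = 1 | x)
print(f"z = {z:.3f}, P(y = 1 | x) = {p_positive:.3f}")
```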
Period disambiguation
• What sort of features would you suggest?
  • Hand-crafted features (a small sketch follows below)
  • Feature interactions
  • Feature templates
• Representation learning (learn the features automatically)
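An illustrative sketch of hand-crafted features for deciding whether a period ends a sentence; the feature names, the abbreviation list, and the tokenization are assumptions, not the course's actual feature set.

```python
import re

ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc", "e.g", "i.e"}   # toy list

def period_features(tokens, i):
    """Features for the '.' token at position i in a tokenized text."""
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_is_abbreviation": prev_word.lower() in ABBREVIATIONS,
        "prev_is_single_letter": len(prev_word) == 1,
        "next_is_capitalized": bool(re.match(r"[A-Z]", next_word)),
        "next_is_lowercase": next_word[:1].islower(),
        # a feature interaction: abbreviation immediately followed by a capital
        "abbrev_then_capital": prev_word.lower() in ABBREVIATIONS
                               and bool(re.match(r"[A-Z]", next_word)),
    }

print(period_features(["Mr", ".", "Smith", "arrived", "."], 1))
```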
Scaling Input Features
• Z-normalization: rescale each feature to have
  • zero mean
  • unit variance
  xᵢ′ = (xᵢ − μᵢ) / σᵢ
• Or, simply min-max normalize each feature so that it lies in [−1, +1]
  (both options are sketched below)
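A small sketch of both options, assuming the data is a NumPy array of shape (n_samples, n_features) and that no feature column is constant (which would make a denominator zero).

```python
import numpy as np

def z_normalize(X):
    """Z-normalization: zero mean and unit variance for each feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minmax_normalize(X):
    """Rescale each feature column to lie in [-1, +1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 250.0]])
print(z_normalize(X))
print(minmax_normalize(X))
```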


Logistic Regression vs. Naïve Bayes
• Naïve Bayes makes an overly strong conditional independence assumption
  • Problems with correlated features
• Logistic regression is much more robust to correlated features
• Naïve Bayes pluses:
  • Works well on small datasets
  • Easy to implement and fast to train (no optimization step)
Multinomial logistic regression
• Also called SoftMax regression
• Only one among more than two classes can be true
• Both the predicted output ŷ and the actual output y are vectors of size k
  • ŷᵢ estimates P(yᵢ = 1|x)
• SoftMax is a probabilistically normalized version of the sigmoid
  (sketched below):
  softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ),  for i = 1, …, k
• E.g., an input vector of k scores is mapped to an output vector of k
  probabilities that are non-negative and sum to 1
• For multinomial logistic regression:
  P(yᵢ = 1|x) = exp(wᵢ · x + bᵢ) / Σⱼ exp(wⱼ · x + bⱼ)
• In the multinomial case, a feature can be evidence for or against each
  individual class
  • An exclamation mark ‘!’ may indicate positive or negative emotion,
    but not neutral
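A short sketch of the SoftMax computation; the score vector is illustrative, and subtracting the maximum is a standard numerical-stability trick rather than something stated in the slides.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])   # hypothetical w_i.x + b_i
probs = softmax(scores)
print(probs, probs.sum())          # a probability distribution over 6 classes
```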
Cross-entropy loss function
• Conditional maximum likelihood estimation: choose 𝑤 and 𝑏 that
  maximize log 𝑝(𝑦|𝑥) over the training data, given the observations 𝑥
• Only two possible outcomes: Bernoulli distribution
  p(y|x) = ŷ^y (1 − ŷ)^(1−y)
  • Note: if y = 1, p(y|x) = ŷ; if y = 0, p(y|x) = 1 − ŷ
• Taking the log:
  log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
• To make it a loss function (something to minimize), negate it
  (implemented in the sketch below):
  L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]
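The loss above translated directly into code; the clipping constant eps is an added assumption to avoid taking log(0).

```python
import math

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """L_CE(y_hat, y) = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]"""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # keep y_hat strictly inside (0, 1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(0.9, 1))   # small loss: confident and correct
print(binary_cross_entropy(0.9, 0))   # large loss: confident and wrong
```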


Stochastic Gradient Descent
• Figure out in which direction the function's slope rises most steeply,
  and move in the opposite direction
• Logistic regression: convex error function (no local minima to get stuck in)
• Vs. neural networks: non-convex (multiple local minima)
• The partial derivative with respect to each weight tells how steep the
  slope is along that dimension; for the cross-entropy loss,
  ∂L_CE/∂wⱼ = (σ(w · x + b) − y) xⱼ  (a training-loop sketch follows below)
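A minimal sketch of per-example SGD for binary logistic regression using the gradient above; the learning rate, number of epochs, and zero initialization are assumptions, and no regularization is included yet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, lr=0.1, epochs=100, seed=0):
    """Stochastic gradient descent, one training example at a time."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            y_hat = sigmoid(X[i] @ w + b)      # current prediction
            err = y_hat - y[i]                 # (y_hat - y)
            w -= lr * err * X[i]               # gradient step for the weights
            b -= lr * err                      # gradient step for the bias
    return w, b
```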
Regularization
• Large weights ⇒ the model fits the training data too closely and
  generalizes poorly (overfitting)
• Add a penalty for large weights to the objective
  (see the update sketch below):
  ŵ = argmax_w Σᵢ log P(y⁽ⁱ⁾|x⁽ⁱ⁾) − α R(w)
• L1 Regularization: a linear function of the weights
  R(w) = Σⱼ |wⱼ|
• L2 Regularization: a quadratic function of the weight values
  R(w) = Σⱼ wⱼ²
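A sketch of how either penalty changes a single gradient step; grad_ce stands for the unregularized cross-entropy gradient (e.g., (y_hat − y)·x from the SGD sketch above), and alpha is the regularization strength, both assumptions of this illustration.

```python
import numpy as np

def regularized_step(w, grad_ce, lr=0.1, alpha=0.01, penalty="l2"):
    """One SGD step with a weight penalty added to the loss gradient.
    L2 contributes a term proportional to w (from the quadratic penalty,
    with the constant factor folded into alpha); L1 contributes
    alpha * sign(w), a subgradient, because |w| is not differentiable at 0."""
    if penalty == "l2":
        grad = grad_ce + alpha * w
    elif penalty == "l1":
        grad = grad_ce + alpha * np.sign(w)
    else:
        grad = grad_ce
    return w - lr * grad
```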


L1 vs. L2
• L1
  • Linear, but not differentiable at 0, so the derivative is more complex
  • Corresponds to a Laplace prior on the weights
  • Prefers a sparse weight vector with a few large weights
• L2
  • Simple derivative
  • Corresponds to a Gaussian prior with zero mean
  • Prefers weight vectors with many small weights
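For comparison in practice, a toy illustration (not from the course): scikit-learn's LogisticRegression supports both penalties. The random data, the value of C (the inverse of the regularization strength), and the liblinear solver for L1 are arbitrary choices here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # toy data: 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # only the first two features matter

# L2 (the default): tends to keep many small, nonzero weights
clf_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# L1: tends to drive irrelevant weights to exactly zero (sparse solution)
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print("L1 zero weights:", int(np.sum(clf_l1.coef_ == 0)))
print("L2 zero weights:", int(np.sum(clf_l2.coef_ == 0)))
```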
