
Multimodal Machine Learning

Lecture 7.2: Multimodal Inference and Knowledge

Paul Liang

* Co-lecturer: Louis-Philippe Morency. Original course co-developed with Tadas Baltrusaitis. Spring 2021 and 2022 editions taught by Yonatan Bisk. Spring 2023 edition taught by Yonatan and Daniel Fried.
Midterm Project Report Instructions

§ Goal: Evaluate state-of-the-art models on your dataset and identify key issues
through a detailed error analysis
§ It will inform the design of your new research ideas
§ Report format: 2 column (ICML template)
§ The report should follow a similar structure to a research paper
§ Teams of 3: 8 pages, Teams of 4: 9 pages, Teams of 5: 10 pages.
§ Number of SOTA baseline models
§ Teams of N should have at least N-1 baseline models
§ Error analysis
§ This is one of the most important parts of this report. You need to understand where previous models can be improved.

2
Examples of Possible Error Analysis Approaches

§ Dataset-based:
§ Split correct/incorrect predictions by label
§ Manually inspect the samples that are incorrectly predicted
§ What are the commonalities?
§ What are the differences from the correctly predicted ones?
§ Sub-dataset analysis: length of question, rare words, cluttered images, high-frequency content in signals? (see the sketch below)
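For example, a minimal sketch of this kind of dataset-based analysis with pandas; the column names (`question`, `label`, `prediction`) and the file name are hypothetical placeholders:

```python
import pandas as pd

# Assumed columns: "question", "label", "prediction"
df = pd.read_csv("predictions.csv")
df["correct"] = df["label"] == df["prediction"]

# Split correct/incorrect by label: which classes drive the errors?
print(df.groupby("label")["correct"].mean().sort_values())

# Sub-dataset analysis: does accuracy drop for long questions?
df["q_len"] = df["question"].str.split().str.len()
print(df.groupby(pd.cut(df["q_len"], bins=[0, 5, 10, 20, 100]))["correct"].mean())

# Manually inspect a sample of incorrect predictions to look for commonalities
print(df[~df["correct"]].sample(20)[["question", "label", "prediction"]])
```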

3
Examples of Possible Error Analysis Approaches

§ Perturbation-based:
§ Make targeted changes to specific parts of the image.
§ Change one word, paraphrase, or add redundant tokens.
§ See whether the model remains robust (see the sketch below).
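A minimal sketch of a text-side perturbation test; the `model.predict(image, question)` interface and the word swaps are hypothetical placeholders:

```python
# Hypothetical interface: model.predict(image, question) -> answer string
def perturbation_test(model, image, question, swaps):
    """Swap single words and check whether the prediction stays the same."""
    original = model.predict(image, question)
    results = []
    for old, new in swaps:  # e.g., [("red", "crimson"), ("above", "over")]
        perturbed = question.replace(old, new)
        answer = model.predict(image, perturbed)
        results.append((perturbed, answer, answer == original))
    return original, results
```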

4
Examples of Possible Error Analysis Approaches

§ Model-based:
§ Visualize feature attributions: LIME, 1st/2nd-order gradients (see the sketch below)
§ Ablation studies to understand which model components are important
§ Theory-based:
§ Write out the math! From an optimization and learning perspective, does the model do what’s expected?
§ Some useful tools: consider the linear case (or another simplest case) and derive the solution; do empirical sanity checks first.
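As one concrete model-based tool, a minimal sketch of a first-order gradient (input × gradient) attribution in PyTorch; `model` and `inputs` are hypothetical placeholders:

```python
import torch

def input_x_gradient(model, inputs, target_class):
    """First-order gradient attribution: which input features move the target logit?"""
    inputs = inputs.clone().detach().requires_grad_(True)
    logits = model(inputs)                 # assumed shape: (batch, num_classes)
    score = logits[:, target_class].sum()
    score.backward()
    return inputs.grad * inputs            # elementwise input-times-gradient attribution
```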

5
Examples of Possible Error Analysis Approaches

[Reddi et al., On the Convergence of Adam and Beyond. ICLR 2018]

6
Examples of Possible Error Analysis Approaches

Finding: Image captioning models capture spurious correlations between gender and generated actions.

You’ll see more in today’s reasoning lecture and in quantification lectures


[Hendricks et al., Women also Snowboard: Overcoming Bias in Captioning Models. ECCV 2018]

7
Midterm Project Report Instructions

Main report sections:

§ Abstract
§ Introduction
§ Related work
§ Problem statement
§ Multimodal baseline models
§ Experimental methodology
§ Results and discussion
§ New research ideas

The structure is similar to a research paper submission.

8
Upcoming Deadlines

§ Sunday October 29 8pm: Midterm report deadline


§ Tuesday and Thursday (10/31 and 11/2): midterm presentations
§ All students are expected to attend both presentation sessions in person
§ Each team will present either Tuesday or Thursday
§ The focus of these presentations is on your research ideas
§ Feedback will be given by all students, instructors and TAs

9
Reasoning

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

[Diagram: Modality A and Modality B (local and aligned representations) are combined through reasoning to produce an output 𝒚.]
10
Reasoning

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

[Diagram: the four sub-challenges of reasoning — (a) structure modeling, (b) intermediate concepts, (c) inference paradigm, (d) external knowledge.]
11
Summary

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

[Diagram: the four sub-challenges — (a) structure modeling, (b) intermediate concepts, (c) inference paradigm, (d) external knowledge — annotated with the lecture schedule: last Thursday (temporal, hierarchical, continuous), Tuesday (interactive), today (discovery, discrete, causal, logical, knowledge, commonsense).]
12
Sub-Challenge 3a: Structure Modeling

[Diagram: reasoning axes — concepts (dense, neural), composition, and structure (single-step vs. multi-step; temporal, hierarchical, interactive, discovery).]
13
Structure Discovery
End-to-end neural module networks

Recall structure: leverage the syntactic structure of language, obtained by parsing.

[Diagram: the question “Is there a red shape above a circle?” is parsed into a layout of modules — Attend[red], Attend[circle], Re-attend[above], Combine[and], Measure[is] — which executes to answer YES.]

[Andreas et al., Neural Module Networks. CVPR 2016]
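A conceptual sketch of how such a module layout could be composed; the module implementations below are hypothetical stand-ins, not the ones from the paper:

```python
import torch
import torch.nn.functional as F

# Hypothetical module stand-ins operating on an image feature map of shape (C, H, W).
def attend(features, word_embedding):
    # Produce a spatial attention map for a word (e.g., "red", "circle").
    return torch.einsum("chw,c->hw", features, word_embedding).sigmoid()

def re_attend(attention, relation_weights):
    # Shift an attention map according to a relation (e.g., "above"); weights: (1, 1, kH, kW).
    return F.conv2d(attention[None, None], relation_weights)[0, 0].sigmoid()

def combine_and(att1, att2):
    return att1 * att2            # soft intersection of two attention maps

def measure_is(attention):
    return attention.max() > 0.5  # does anything satisfy the composed query?

# "Is there a red shape above a circle?"
# answer = measure_is(combine_and(attend(feats, emb_red),
#                                 re_attend(attend(feats, emb_circle), w_above)))
```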

14
Structure Discovery
End-to-end neural module networks

Can we learn the structure end-to-end?

[Diagram: the question “Is there a red shape above a circle?” is mapped to the modules Attend[red], Attend[circle], Re-attend[above], Combine[and], Measure[is]; the assembled NMN answers YES.]

[Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017]

15
Stochastic Optimization

[Diagram: agent–environment loop with state s, action a, reward r.]

In RL (at least for discrete actions):
- 𝜏 is a sequence of discrete actions
- p(𝜏; θ) is not reparameterizable
- r(𝜏) is a black-box function, i.e., the environment

REINFORCE is a general-purpose solution!

16
Revisiting REINFORCE
(we will revisit this equation for generative models)

We want to take gradients with respect to θ of the term E_{z∼q_θ(z)}[f(z)].

Using the log-derivative trick, ∇_θ E_{z∼q_θ(z)}[f(z)] = E_{z∼q_θ(z)}[f(z) ∇_θ log q_θ(z)].

We can now compute a Monte Carlo estimate: sample trajectories z⁽¹⁾, …, z⁽ᴷ⁾ ∼ q_θ(z) and compute (1/K) Σ_k f(z⁽ᵏ⁾) ∇_θ log q_θ(z⁽ᵏ⁾).

- z can be discrete or continuous!
- q(z) can be a discrete or continuous distribution!
- q(z) must allow for easy sampling and be differentiable with respect to θ
- f(z) can be a black box!
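A minimal sketch of the REINFORCE estimator in PyTorch, assuming for concreteness a categorical q_θ(z) and a black-box scalar reward f (both are illustrative placeholders):

```python
import torch

def reinforce_loss(logits, f, num_samples=16):
    """Surrogate loss whose gradient is the REINFORCE estimate of -∇θ E_q[f(z)].

    logits: unnormalized scores defining q_θ(z) (categorical here, for simplicity)
    f:      black-box reward mapping a sampled z to a scalar (no gradients needed)
    """
    dist = torch.distributions.Categorical(logits=logits)
    z = dist.sample((num_samples,))                         # sample z^(k) ~ q_θ(z)
    rewards = torch.tensor([f(zk.item()) for zk in z], dtype=torch.float)
    log_probs = dist.log_prob(z)                            # log q_θ(z^(k))
    # Minimizing this surrogate gives the gradient (1/K) Σ_k f(z^(k)) ∇θ log q_θ(z^(k))
    return -(rewards * log_probs).mean()
```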
17
Structure Discovery
End-to-end neural module networks

Can we learn the structure end-to-end?

[Diagram: an RNN predicts the module layout — Attend[red], Attend[circle], Re-attend[above], Combine[and], Measure[is] — from the question “Is there a red shape above a circle?”; the assembled NMN answers YES.]

[Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017]

18
Structure Discovery

Structure fully learned from optimization and data:

1. Define basic representation building blocks (e.g., ReLU, Layer norm, Conv, Self-attention)

2. Define basic fusion building blocks (e.g., Concat fuse, Attention fuse, Add fuse)

3. Automatically search for a composition using neural architecture search, evaluated on validation data, to produce 𝒚

Nice, but slow!

[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]

19
Continuous Structure Discovery
Biggest problem: discrete optimization is slow.
Differentiable optimization for structure learning:

1. Approximate the discrete selection with a softmax over candidate operations

2. Solve a bi-level optimization problem (architecture parameters on validation data, model weights on training data)

[Diagram: candidate fusion blocks (Concat fuse, Add fuse, Attention fuse) and layers (Conv, Layer norm) mixed to produce 𝒚.]

[Liu et al., DARTS: Differentiable Architecture Search. ICLR 2019]

20
Continuous Structure Discovery
Biggest problem: discrete optimization is slow.
Differentiable optimization for structure learning:

1. Approximate selection with softmax: ō(x) = Σ_o [exp(α_o) / Σ_{o′} exp(α_{o′})] · o(x)

2. Solve the bi-level optimization problem: min_α L_val(w*(α), α) s.t. w*(α) = argmin_w L_train(w, α)

3. Convert the softmax to an argmax

Faster, but still non-trivial (see the sketch below).
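A minimal sketch of the softmax relaxation over candidate operations (step 1 above); this is a simplification of a DARTS-style mixed operation, not the full bi-level training loop:

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Softmax-weighted mixture over candidate operations (DARTS-style relaxation)."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                      # candidate ops with matching shapes
        self.alpha = nn.Parameter(torch.zeros(len(ops)))   # architecture parameters

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)         # continuous relaxation of selection
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        return self.ops[self.alpha.argmax()]               # step 3: convert softmax to argmax
```

In the bi-level step, `alpha` would be updated on validation batches while the operations’ own weights are updated on training batches.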

[Liu et al., DARTS: Differentiable Architecture Search. ICLR 2019]

21
Continuous Structure Discovery
In general, optimization over directed acyclic graphs (DAGs):

Given a graph G with adjacency matrix W and data X, solve min_W ℓ(W; X) subject to G(W) being acyclic; NO TEARS replaces the combinatorial acyclicity constraint with a smooth equality constraint h(W) = 0.

[Zheng et al., DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018]

22
Continuous Structure Discovery

- The k-th power of the adjacency matrix W counts the number of k-step paths from one node to another.
- If the diagonal of the matrix power is all zeros, there are no k-step cycles.
- Acyclic = check all k = 1, 2, …, size of the graph.

Can now do continuous optimization to solve for W, but it is nonconvex (see the sketch below).

[Zheng et al., DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018]
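A minimal numpy sketch of the matrix-power cycle check described above, together with the smooth acyclicity penalty h(W) = tr(exp(W ∘ W)) − d used by NO TEARS:

```python
import numpy as np
from scipy.linalg import expm

def has_cycle_by_powers(W, tol=1e-8):
    """Check acyclicity by matrix powers: a nonzero diagonal of |W|^k means a k-step cycle."""
    A = (np.abs(W) > tol).astype(float)
    P = A.copy()
    for _ in range(A.shape[0]):          # k = 1, 2, ..., size of graph
        if np.trace(P) > 0:
            return True
        P = P @ A
    return False

def notears_penalty(W):
    """Smooth acyclicity measure h(W) = tr(exp(W ∘ W)) - d; zero iff W encodes a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d
```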

23
Continuous Structure Discovery

[Zheng et al., DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018]

24
Sub-Challenge 3b: Intermediate Concepts

Definition: The parameterization of individual multimodal concepts in the reasoning process.

[Diagram: intermediate concepts range from discrete (words, logical symbols) to continuous representations, composed over single-step or multi-step structure to produce 𝒚.]

25
Discrete Concepts via Hard Attention

Hard attention ‘gates’ (0/1) rather than soft attention (softmax weights between 0 and 1)
- Can be seen as discrete layers in between differentiable neural net layers

[Diagram: a controller applies hard attention scores (0/1) to the multimodal inputs; a classifier predicts sentiment/emotion; classification accuracy on validation data serves as the reward.]

[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015]
[Chen et al., Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning. ICMI 2017]

26
Discrete Concepts via Hard Attention

Hard attention ‘gates’ (0/1) rather than soft attention (softmax weights between 0 and 1)
- Can be seen as discrete layers in between differentiable neural net layers

Sentiment analysis,
emotion recognition

Image captioning

[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015]
[Chen et al., Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning. ICMI 2017]

27
Discrete Concepts via Language
• Large language/video/audio models interacting with each other
• Each language model has its own distinct domain knowledge
• Interaction is scripted and zero-shot

[Panels: guided multimodal discussion; combining domain knowledge.]


[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]
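A conceptual sketch of this scripted, zero-shot interaction; `caption_image` and `language_model` are hypothetical placeholders, not the actual Socratic Models implementation:

```python
# Hypothetical components: a vision-language captioner and a text-only language model.
def socratic_answer(image, question, caption_image, language_model):
    """Scripted multimodal 'discussion': the VLM describes the image, the LM reasons over text."""
    caption = caption_image(image)                  # e.g., via a zero-shot VLM
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Answer based only on the description:"
    )
    return language_model(prompt)                   # zero-shot LM completes the answer
```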

28
Discrete Concepts via Language
Image captioning

[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]

29
Discrete Concepts via Language
Robot perception and planning

[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]

30
Discrete Concepts via Language
Video reasoning

[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]

31
Sub-Challenge 3c: Inference Paradigm

Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.

Recall representation fusion: Modality A and Modality B are fused into a single representation.

Potential issues:
- Models may capture spurious correlations
- Not robust to targeted manipulations
- Lack of interpretability/control

[Diagram: inference over concepts on top of the fused representation, with single-step or multi-step structure.]

32
Sub-Challenge 3c: Inference Paradigm

Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.

Towards explicit inference paradigms:

1. Logical inference: given premises inferred from multimodal evidence, how can one derive logical conclusions?

[Diagram: logical inference composes premises (∧) into a conclusion (𝑡𝑟𝑢𝑒) on top of the representation.]

33
Logical Inference (recall error analysis!)

Inference through logical operators in the question:

- Basic premises: “Is there beer?” “Is the man wearing shoes?”
- Logical connectives: “Is the man NOT wearing shoes AND is there beer?”
- Adversarial antonyms: “Is there beer AND is there a WINE GLASS?”

Existing models struggle to capture logical connectives. How can we make them more logical?

[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]

34
Logical Inference
Inference through logical operators in the question

Differentiable AND composition operator! Also applies to other logical connectives: AND, OR, NOT.

Example: “Are they in a restaurant AND are they all boys?” is composed from the premises “Are they in a restaurant?” and “Are they all boys?”

[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]

35
Soft Logical Operators
Inference through logical operators in the question

Fréchet inequalities to make logical functions differentiable:
max(0, P(A) + P(B) − 1) ≤ P(A ∧ B) ≤ min(P(A), P(B))
max(P(A), P(B)) ≤ P(A ∨ B) ≤ min(1, P(A) + P(B))

[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]
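As one way to realize such soft operators, a sketch of differentiable AND/OR/NOT over predicted yes-probabilities that always lies within the Fréchet bounds; this is an illustrative simplification, not necessarily the exact parameterization used in VQA-LOL:

```python
import torch

def soft_not(p):
    return 1.0 - p

def soft_and(p, q):
    # Product t-norm; always inside the Fréchet bounds max(0, p+q-1) <= P(A∧B) <= min(p, q).
    return p * q

def soft_or(p, q):
    # Probabilistic sum; always inside the Fréchet bounds max(p, q) <= P(A∨B) <= min(1, p+q).
    return p + q - p * q

# Example: p = P("there is beer"), q = P("the man is wearing shoes")
# P("the man is NOT wearing shoes AND there is beer") ≈ soft_and(p, soft_not(q))
```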

36
Logical Inference Challenges (open challenges)

Many open directions

Differentiable knowledge base reasoning

[Yang et al., Differentiable Learning of Logical Rules for Knowledge Base Reasoning. NeurIPS 2017]

37
Sub-Challenge 3c: Inference Paradigm

Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.

Towards explicit inference paradigms:

1. Logical inference
2. Causal inference: how can one determine the actual causal effect of a variable 𝑧 in a larger system?

[Diagram: causal inference over a variable 𝑧 sits alongside logical inference (∧, 𝑡𝑟𝑢𝑒) on top of the representation.]

38
Causal Inference
Intervention

Causal inference is reliant on the idea of interventions — what outcome might have occurred if X happened (an intervention), possibly contrary to the observed data.

Causation vs. association: association describes how things are. Causation describes how things would have been under different circumstances.

(side note: correlation is a specific type of linear association)

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

39
Causal Inference
Intervention

Causal inference is reliant on the idea of interventions — what outcome might have occurred if X happened (an intervention), possibly contrary to the observed data.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

40
Causal Inference
Intervention

Causal inference is reliant on the idea of interventions — what outcome might have occurred if X happened (an intervention), possibly contrary to the observed data.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

41
Causal Inference
Intervention

Causal inference is reliant on the idea of interventions — what outcome might have occurred if X happened (an intervention), possibly contrary to the observed data.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

42
Causal Inference
Intervention

Let’s say I really want to set the value of x to 3.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

43
Causal Inference
Intervention

Let’s say I really want to set the value of x to 3. What happens to y?

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

44
Causal Inference
Intervention

The marginal distribution of y under intervention, p(y | do(x=3)), vs. the marginal distribution of y under conditioning, p(y | x=3).

The joint distribution of data alone is insufficient to predict behavior under interventions.
[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]
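A minimal numpy sketch of this difference, assuming a toy structural model with a confounder z → x and z → y (not necessarily the exact toy example from the blog post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: z is a confounder of x and y; x has no causal effect on y.
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)

# Observational conditional p(y | x ≈ 3): large, because observing x ≈ 3 tells us z ≈ 3.
near_three = np.abs(x - 3) < 0.1
print("E[y | x ≈ 3]   =", y[near_three].mean())    # close to 3

# Interventional p(y | do(x = 3)): the graph is mutilated and x is set by fiat,
# so y is still generated from z alone and does not move.
y_do = z + 0.1 * rng.normal(size=n)
print("E[y | do(x=3)] =", y_do.mean())             # close to 0
```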

45
Causal Inference
Causal diagrams: arrow pointing from cause to effect.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

46
Causal Inference
Intervention mutilates the graph by removing all edges that point into the variable on which
intervention is applied (in this case x).

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

47
Causal Inference
Intervention in real life is typically very hard!

E.g., does treatment x treat disease y?
Can I estimate the intervention p(y | do(X = x))?

Requires answering: all else being equal, what would the patient’s outcome have been if they had not taken the treatment?

[Diagram: treatment variable → outcome, with a confounding variable influencing both.]

Lots of work; see Judea Pearl, The Book of Why.

[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]

48
Causal Inference
Causal VQA: does my multimodal model capture causation or correlation?

Covariant VQA: edit the target object in the question (i.e., the treatment variable).

[Example: question about zebras; baselines predict 2. BUT: correlation or causation?]

[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]

49
Causal Inference (recall error analysis!)

Causal VQA: does my multimodal model capture causation or correlation?

Covariant VQA: edit the target object in the question (i.e., the treatment variable).

[Example: question about zebras; baseline predictions before and after the edit; interventional conditional 𝒑(𝒚|𝒅𝒐(𝒛𝒆𝒃𝒓𝒂𝒔 = 𝟏)).]

Existing models struggle to adapt to targeted causal interventions. How can we make them more robust to spurious correlations?

[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]

50
Causal Inference
Causal VQA: does my multimodal model capture causation or correlation?

Invariant VQA: edit a target-irrelevant object (the umbrella), i.e., a confounding variable.

[Example: question about the balloon; baselines predict “pink”.]

Is my model picking up irrelevant objects?

[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]

51
Causal Inference (recall error analysis!)

Causal VQA: does my multimodal model capture causation or correlation?

Invariant VQA: edit a target-irrelevant object (the umbrella), i.e., a confounding variable.

[Example: question about the balloon; the baseline prediction changes from “pink” to “red” after the edit; interventional conditional 𝒑(𝒚|𝒅𝒐(𝒏𝒐 𝒖𝒎𝒃𝒓𝒆𝒍𝒍𝒂)).]

Existing models struggle to adapt to targeted causal interventions. How can we make them more robust to spurious correlations?

[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]

52
Causal Inference
Causal inference via data augmentation:
- Covariance (targeted changes to the answer): with vs. without the relevant object.
- Invariance (answer stays the same): with vs. without the irrelevant object.
[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]

53
Sub-Challenge 3c: Inference Paradigm

Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.

Towards explicit inference paradigms:

1. Logical inference
2. Causal inference: how can one determine the actual causal effect of a variable 𝑧 in a larger system?

Nice, but you don’t get these for free!

[Diagram: causal and logical inference over the representation; example of the confounding umbrella variable affecting the balloon prediction.]

54
Sub-Challenge 3d: Knowledge

Definition: The derivation of knowledge in the study of inference, structure, and reasoning.

- Domain knowledge
- Knowledge graphs
- Knowledge in other unstructured formats

[Diagram: external knowledge (words, 𝑧, logical symbols) feeding into structure, concepts, and inference.]

55
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources

[Example question: “What kind of board is this?”] Requires knowledge of water sports, sports equipment, etc.

Existing models struggle when external knowledge is needed. How can we leverage external knowledge?

[Marino et al., OK-VQA: A visual question answering benchmark requiring external knowledge. CVPR 2019]

56
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources

Concepts: interpretable language. Structure: multi-step retrieval. Composition: neural.

[Diagram: for the question “What kind of board is this?”, an object detector finds “surfboard”; retrieved knowledge entries (“Wakeboard boat: boat designed to create a wake…”, “Wakeboarder: …”, “Kitesurfer: …”, “Skiboarding: …”, “Boardsport: …”) are passed to a language model.]

[Gui et al., KAT: A Knowledge Augmented Transformer for Vision-and-Language. NAACL 2022]
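A conceptual sketch of this multi-step structure (detect → retrieve → reason); `detect_objects`, `knowledge_base.lookup`, and `language_model` are hypothetical placeholders, not the KAT implementation:

```python
def knowledge_augmented_answer(image, question, detect_objects, knowledge_base, language_model):
    """Multi-step retrieval: visual concepts -> external knowledge entries -> answer."""
    concepts = detect_objects(image)                    # e.g., ["surfboard", "water", "person"]
    entries = []
    for concept in concepts:
        entries.extend(knowledge_base.lookup(concept))  # e.g., "Wakeboard boat: boat designed..."
    context = "\n".join(entries)
    prompt = f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"
    return language_model(prompt)
```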

57
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources

Concepts: interpretable
Structure: multi-step inference
Composition: graph-based

[Zhu et al., Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries. arXiv 2015]

58
External Knowledge Challenges (open challenges)

Atomic: If-then commonsense


[Sap et al., Atomic: An Atlas of Machine Commonsense for If-Then Reasoning. AAAI 2019]

59
External Knowledge Challenges (open challenges)

Delphi: Moral commonsense

Social Chemistry: Social commonsense


[Jiang et al., Can Machines Learn Morality? The Delphi Experiment. arXiv 2021]
[Forbes et al., Social Chemistry 101: Learning to Reason about Social and Moral Norms. EMNLP 2020]

60
Summary: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

[Diagram: Modality A and Modality B (local and aligned representations) are combined through reasoning to produce an output 𝒚.]
61
The Challenge of Compositionality

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

CLIP, ViLT, ViLBERT, etc. — all at random chance.

Compositional generalization: generalizing to novel combinations outside of the training data.

1. Structure: <subject> <verb> <object>
2. Concepts: ‘plants’, ‘lightbulb’
3. Inference: ‘surrounding’ – a spatial relation
4. Knowledge: from humans!

[Thrush et al., Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. CVPR 2022]

62
Sub-Challenge 3a: Structure Modeling

Definition: Defining or learning the relationships over which reasoning occurs.

[Diagram: structure modeling — single-step vs. multi-step; temporal, hierarchical, interactive, discovery.]
63
Sub-Challenge 3b: Intermediate Concepts

Definition: The parameterization of individual multimodal concepts in the reasoning process.

[Diagram: intermediate concepts range from discrete (words, logical symbols) to continuous representations, over single-step or multi-step structure.]
64
Sub-Challenge 3c: Inference Paradigm

Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.

[Diagram: the inference paradigm — logical and causal inference over discrete and continuous concepts (𝑧, ∧, 𝑡𝑟𝑢𝑒), representations, and single-step or multi-step structure.]
65
Sub-Challenge 3d: External Knowledge

Definition: Leveraging external knowledge in the study of structure, concepts, and inference.

[Diagram: external knowledge supports logical and causal inference over discrete and continuous concepts, representations, and single-step or multi-step structure.]
66
Summary: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

[Diagram: the four sub-challenges of reasoning — (a) structure modeling, (b) intermediate concepts, (c) inference paradigm, (d) external knowledge.]

67
More Reasoning Challenges (open challenges)

[Diagram: concepts (interpretable, dense), inference (logical, causal), knowledge, representation, and structure (single-step, multi-step).]

Open challenges:
- Structure: multi-step inference
- Concepts: interpretable + differentiable representations
- Composition: explicit, logical, causal…
- Knowledge: integrating explicit knowledge with pretrained models
- Probing pretrained models for reasoning capabilities

68
