Lecture 7.2: Multimodal Inference
§ Goal: Evaluate state-of-the-art models on your dataset and identify key issues through a detailed error analysis
§ This will inform the design of your new research ideas
§ Report format: 2 column (ICML template)
§ The report should follow a similar structure to a research paper
§ Teams of 3: 8 pages, Teams of 4: 9 pages, Teams of 5: 10 pages.
§ Number of SOTA baseline models
§ Teams of N should have at least N-1 baseline models
§ Error analysis
§ This is one of the most important parts of this report. You need to understand where previous models can be improved.
Examples of Possible Error Analysis Approaches
§ Dataset-based:
§ Split correct/incorrect by label
§ Manually inspect the samples that are incorrectly predicted
§ What are the commonalities?
§ What are differences with the correct ones?
§ Sub-dataset analysis: length of question, rare words, cluttered images, high-frequency signals?
Examples of Possible Error Analysis Approaches
§ Perturbation-based:
§ Make targeted changes to specific parts of the image.
§ Change one word/paraphrase/add redundant tokens.
§ See whether the model remains robust
Examples of Possible Error Analysis Approaches
§ Model-based:
§ Visualize feature attributions: LIME, 1st/2nd order gradients
§ Ablation studies to understand what model components are important
§ Theory-based:
§ Write out the math! From an optimization and learning perspective, does the model do what’s expected?
§ Some useful tools: consider the linear case (or another simplest case) and derive the solution; do empirical sanity checks first.
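As a concrete instance of the "simplest case" advice, here is a minimal sketch (all data and names are illustrative): in the linear case the exact solution can be derived in closed form, so an empirical sanity check is to verify that an iterative learner recovers it.

```python
import numpy as np

# Sanity check: in the linear case we can derive the exact solution,
# so any iterative learner should recover it (up to tolerance).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

# Closed-form OLS: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Plain gradient descent on the mean squared loss
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= lr * grad

assert np.allclose(w, w_closed, atol=1e-4)  # the empirical check passes
```

If the iterative solution diverges from the derived one, the discrepancy points at a bug or at an assumption the derivation made that the model violates.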
Midterm Project Report Instructions
Upcoming Deadlines
Reasoning
[Figure: Modality A and Modality B, via local and aligned representations, combined by reasoning into output 𝒚]
Reasoning
[Figure: reasoning over word concepts 𝑧 with logical operators (or, ∧) yielding 𝑡𝑟𝑢𝑒]
Summary
[Figure: course roadmap relating lectures (last Thursday, Tuesday, today) to reasoning topics: temporal, hierarchical, interactive, continuous, discrete, discovery, causal, logical, commonsense, knowledge]
Sub-Challenge 3a: Structure Modeling
[Figure: axes of reasoning: structure (single-step vs. multi-step; temporal, hierarchical, interactive, discovery), concepts (dense, neural), composition]
Structure Discovery
End-to-end neural module networks
Structure Discovery
End-to-end neural module networks
Example: “Is there a red shape above a circle?” is decomposed into modules Attend[red], Attend[circle], Combine[and], and Measure[is].
[Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017]
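To make the idea concrete, here is a toy sketch of module composition. Everything below (the grid world, the attribute sets, and the module behaviors) is invented for illustration and is much simpler than the attention-based modules of the actual paper.

```python
import numpy as np

# Toy world: a 4x4 grid of cells, each holding a set of attributes.
grid = np.empty((4, 4), dtype=object)
for i in range(4):
    for j in range(4):
        grid[i, j] = set()
grid[0, 1] = {"red", "square"}
grid[2, 1] = {"blue", "circle"}

def attend(attribute):
    """Attend[x]: attention map over cells carrying attribute x."""
    return np.array([[1.0 if attribute in grid[i, j] else 0.0
                      for j in range(4)] for i in range(4)])

def shift_above(att):
    """Re-attend to cells strictly above any attended cell."""
    out = np.zeros_like(att)
    for i in range(4):
        for j in range(4):
            if att[i, j] > 0:
                out[:i, j] = 1.0
    return out

def combine_and(a, b):
    """Combine[and]: intersection of two attention maps."""
    return np.minimum(a, b)

def measure_is(att):
    """Measure[is]: does any attended cell remain?"""
    return bool(att.max() > 0)

# "Is there a red shape above a circle?"
answer = measure_is(combine_and(attend("red"), shift_above(attend("circle"))))
# the red square at (0, 1) is above the circle at (2, 1), so answer is True
```

The point of the paper is that this layout is not hand-coded per question: a layout policy predicts which modules to compose from the question itself.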
Stochastic Optimization
In RL (at least for discrete actions):
- T is a sequence of discrete actions
- p(T; θ) is not reparameterizable
- r(T) is a black-box reward function, i.e., the environment
Revisiting REINFORCE
(we will revisit this equation for generative models)
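The equation on the original slide did not survive extraction; the standard REINFORCE (score-function) identity it refers to, in the notation of the previous slide, is:

```latex
\nabla_{\theta} \, \mathbb{E}_{T \sim p(T;\theta)}\!\left[ r(T) \right]
  = \mathbb{E}_{T \sim p(T;\theta)}\!\left[ r(T)\, \nabla_{\theta} \log p(T;\theta) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} r(T_i)\, \nabla_{\theta} \log p(T_i;\theta)
```

This sidesteps both problems above: it needs no reparameterization of p(T; θ) and treats r(T) purely as a black box evaluated on sampled trajectories.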
Structure Discovery
End-to-end neural module networks: the module layout (e.g., Attend[red], Attend[circle]) is predicted from the question and trained end to end.
[Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017]
Structure Discovery
Structure fully learned from optimization and data:
1. Define basic representation building blocks (e.g., ReLU, layer norm, conv, self-attention) and fusion operations (e.g., Add fuse).
Nice, but slow!
[Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021]
Continuous Structure Discovery
Biggest problem: discrete optimization is slow.
Differentiable optimization for structure learning:
2. Solve a bi-level optimization problem over candidate fusion operations (e.g., Concat fuse, Add fuse, Attention fuse).
Continuous Structure Discovery
In general, optimization over directed acyclic graphs (DAGs):
[Zheng et al., DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018]
Continuous Structure Discovery
- The k-th power of the adjacency matrix W counts the number of k-step paths from one node to another.
- If the diagonal of the k-th matrix power is all zeros, there are no k-step cycles.
- Acyclicity check: verify this for all k = 1, 2, …, up to the size of the graph.
[Zheng et al., DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 2018]
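The two formulations can be checked side by side in a short sketch: the exact matrix-power test above, and the smooth NO TEARS surrogate h(W) = tr(e^{W∘W}) − d, which is zero exactly when W is a DAG and is differentiable, so it can be used as a constraint in continuous optimization.

```python
import numpy as np
from scipy.linalg import expm

def is_acyclic_powers(W):
    """Exact check: tr(A^k) > 0 iff some k-step cycle exists."""
    d = W.shape[0]
    A = (W != 0).astype(float)  # binarized adjacency matrix
    P = np.eye(d)
    for _ in range(d):          # k = 1, 2, ..., d
        P = P @ A
        if np.trace(P) > 0:
            return False
    return True

def notears_h(W):
    """Smooth surrogate: h(W) = tr(e^{W∘W}) - d, zero iff W is a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the elementwise square

dag = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [0., 0., 0.]])   # acyclic: 0 -> 1 -> 2, 0 -> 2
cyc = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [1., 0., 0.]])   # 3-cycle: 0 -> 1 -> 2 -> 0

assert is_acyclic_powers(dag) and not is_acyclic_powers(cyc)
assert abs(notears_h(dag)) < 1e-8 and notears_h(cyc) > 0
```

The surrogate trades the d discrete power checks for one differentiable scalar, which is what makes gradient-based structure learning over DAGs possible.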
Sub-Challenge 3b: Intermediate Concepts
[Figure: the concepts axis (discrete, e.g., words and logical operators, vs. continuous) crossed with the structure axis (single-step vs. multi-step)]
Discrete Concepts via Hard Attention
Hard attention ‘gates’ (0/1) rather than soft attention (softmax weights between 0 and 1)
- Can be seen as discrete layers in between differentiable neural net layers
[Figure: multimodal inputs → hard-attention controller emitting 0/1 scores → sentiment/emotion classifier; classification accuracy on validation data serves as the reward]
[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015]
[Chen et al., Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning. ICMI 2017]
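Since the 0/1 gates are not differentiable, the controller is typically trained with the score-function trick from earlier. A minimal sketch (the reward and feature values are invented stand-ins, not the cited papers' setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_gates(scores, rng):
    """Sample 0/1 gates from per-feature Bernoulli(sigmoid(score))."""
    p = 1.0 / (1.0 + np.exp(-scores))
    g = (rng.random(p.shape) < p).astype(float)
    return g, p

def reinforce_grad(g, p, reward):
    """Score-function gradient of expected reward w.r.t. the gate scores:
    d/ds log Bernoulli(g; sigmoid(s)) = g - sigmoid(s)."""
    return reward * (g - p)

scores = np.zeros(5)              # controller logits, one per input feature
features = np.array([1., 0., 2., 0., 3.])

g, p = hard_gates(scores, rng)
gated = g * features              # discrete selection of features
reward = float(gated.sum())       # stand-in for downstream accuracy
grad = reinforce_grad(g, p, reward)
scores += 0.1 * grad              # gradient ascent on expected reward
```

Repeated over many samples, gates that contribute to higher reward get their scores pushed up, so the controller learns which inputs to keep.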
Discrete Concepts via Hard Attention
Applications: image captioning; sentiment analysis and emotion recognition.
[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015]
[Chen et al., Multimodal Sentiment Analysis with Word-level Fusion and Reinforcement Learning. ICMI 2017]
Discrete Concepts via Language
• Large language/video/audio models interacting with each other
• Each language model has its own distinct domain knowledge
• Interaction is scripted and zero-shot
Discrete Concepts via Language
Image captioning
[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]
Discrete Concepts via Language
Robot perception and planning
[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]
Discrete Concepts via Language
Video reasoning
[Zeng et al., Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. arXiv 2022]
Sub-Challenge 3c: Inference Paradigm
Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.
Potential issues:
- Models may capture spurious correlations
- Not robust to targeted manipulations
- Lack of interpretability/control
Sub-Challenge 3c: Inference Paradigm
Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.
[Figure: logical inference combining representations via ∧ to infer 𝑡𝑟𝑢𝑒]
Logical Inference (recall error analysis!)
[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]
Logical Inference
Inference through logical operators in the question, e.g., AND.
Differentiable AND composition operator!
[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]
Soft Logical Operators
Inference through logical operators in the question
[Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020]
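One common differentiable choice for such operators is the product t-norm, sketched below for illustration; note that VQA-LOL learns its composition operator rather than fixing it to this family.

```python
# Soft ("fuzzy") logical operators on probabilities in [0, 1].
# Product t-norm family: differentiable, so gradients flow through
# the logical composition of answer probabilities.
def soft_not(p):
    return 1.0 - p

def soft_and(p, q):
    return p * q

def soft_or(p, q):
    # De Morgan: p OR q = NOT(NOT p AND NOT q)
    return 1.0 - (1.0 - p) * (1.0 - q)

p_red = 0.9    # model's P("there is a red shape")
p_circ = 0.8   # model's P("there is a circle")

both = soft_and(p_red, p_circ)     # 0.72
either = soft_or(p_red, p_circ)    # 0.98
```

At the extremes (probabilities exactly 0 or 1) these reduce to ordinary Boolean logic, which is what makes them a natural relaxation.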
Logical Inference: Open Challenges
[Yang et al., Differentiable Learning of Logical Rules for Knowledge Base Reasoning. NeurIPS 2017]
Sub-Challenge 3c: Inference Paradigm
Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.
Causal Inference
Intervention
Causal inference relies on the idea of interventions: what outcome might have occurred if X happened (an intervention), possibly contrary to observed data.
Association describes how things are; causation describes how things would have been under different circumstances.
Causal Inference
Intervention
The interventional distribution of y, p(y | do(x=3)), differs from the observational conditional, p(y | x=3).
The joint distribution of the data alone is insufficient to predict behavior under interventions.
[Example from Ferenc Huszár: https://www.inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example/]
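The gap between conditioning and intervening can be simulated directly. The structural causal model below is a simplified confounded setup chosen for illustration (not necessarily the linked post's exact example): a hidden z drives both x and y, while x has no causal effect on y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural causal model with a hidden confounder: z -> x, z -> y.
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)

# Observational p(y | x ≈ 3): seeing x near 3 tells us z is near 3,
# so y shifts with it -- pure association.
mask = np.abs(x - 3.0) < 0.1
obs_mean = y[mask].mean()        # centered near 3

# Interventional p(y | do(x = 3)): setting x cuts the z -> x edge;
# y's mechanism never involved x, so its distribution is unchanged.
y_do = z + 0.1 * rng.normal(size=n)
do_mean = y_do.mean()            # centered near 0
```

The same joint distribution over (x, y) is consistent with both answers, which is exactly why observational data alone cannot settle what an intervention will do.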
Causal Inference
Causal diagrams: arrows point from cause to effect.
Causal Inference
An intervention mutilates the graph by removing all edges that point into the variable on which the intervention is applied (in this case x).
Causal Inference
Intervention in real life is typically very hard!
[Figure: treatment variable → outcome]
Causal Inference
Causal VQA: does my multimodal model capture causation or correlation?
Covariant VQA: edit the target object in the question (i.e., the treatment variable) and check whether the prediction changes accordingly.
[Figure: Covariant VQA example; the question targets zebras; baselines predict “2”]
[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]
Causal Inference (recall error analysis!)
[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]
Causal Inference
Causal VQA: does my multimodal model capture causation or correlation?
Invariant VQA: edit an object irrelevant to the question (i.e., a confounding variable) and check that the prediction stays the same.
[Figure: Invariant VQA example with an umbrella (confounding variable) and a balloon; baselines predict “pink”]
[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]
Causal Inference (recall error analysis!)
[Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020]
Causal Inference
Causal inference via data augmentation:
- Covariance: targeted changes to the answer
- Invariance: the answer stays the same
Sub-Challenge 3c: Inference Paradigm
Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.
Sub-Challenge 3d: Knowledge
Definition: The derivation of knowledge in the study of inference, structure, and reasoning.
[Figure: domain knowledge and knowledge graphs informing reasoning over word concepts 𝑧 and logical operators (or, ∧, 𝑡𝑟𝑢𝑒)]
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources
[Marino et al., OK-VQA: A visual question answering benchmark requiring external knowledge. CVPR 2019]
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources
[Figure: for the question “What kind of board is this?”, an object detector retrieves related knowledge entries (wakeboard boat, wakeboarder, kitesurfer, skiboarding, boardsport) from an external source, and a language model answers “surfboard”]
[Gui et al., KAT: A Knowledge Augmented Transformer for Vision-and-Language. NAACL 2022]
External Knowledge: Multimodal Knowledge Graphs
Knowledge can also be gained from external sources
Concepts: interpretable
Structure: multi-step inference
Composition: graph-based
[Zhu et al., Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries. arXiv 2015]
External Knowledge: Open Challenges
Summary: Reasoning
[Figure: Modality A and Modality B, via local and aligned representations, combined by reasoning into output 𝒚]
The Challenge of Compositionality
Compositional generalization: to novel combinations outside of the training data.
[Thrush et al., Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. CVPR 2022]
Sub-Challenge 3a: Structure Modeling
[Figure: the structure axis: single-step vs. multi-step; temporal, hierarchical, interactive, discovery]
Sub-Challenge 3b: Intermediate Concepts
[Figure: the concepts axis (discrete, e.g., words, vs. continuous) crossed with the structure axis (single-step vs. multi-step)]
Sub-Challenge 3c: Inference Paradigm
Definition: How increasingly abstract concepts are inferred from individual multimodal evidence.
[Figure: the inference axis (logical, causal) added to the concepts (discrete 𝑧 vs. continuous) and structure (single-step vs. multi-step) axes]
Sub-Challenge 3d: External Knowledge
Definition: Leveraging external knowledge in the study of structure, concepts, and inference.
[Figure: the knowledge axis added to the concepts, inference (logical, causal), and structure axes]
Summary: Reasoning
[Figure: reasoning over word concepts 𝑧 with logical operators (or, ∧) yielding 𝑡𝑟𝑢𝑒]
More Reasoning: Open Challenges
[Figure: the full landscape: structure (single-step vs. multi-step), concepts (interpretable vs. dense), inference (logical, causal), knowledge, representation]
Open challenges:
- Structure: multi-step inference
- Concepts: interpretable + differentiable representations
- Composition: explicit, logical, causal…
- Knowledge: integrating explicit knowledge with pretrained models
- Probing pretrained models for reasoning capabilities