Computer Science > Computer Vision and Pattern Recognition

arXiv:2307.02291 (cs)

[Submitted on 5 Jul 2023 (v1), last revised 4 Sep 2023 (this version, v2)]

Title: Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Specific Target Guided DeNoising

Authors: Junwen Chen, Yingcheng Wang, Keiji Yanai


Abstract: Recent one-stage transformer-based methods achieve notable gains in the Human-Object Interaction (HOI) detection task by leveraging the detection of DETR. However, the current methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and hard training. Furthermore, matching the predicted HOI instances with the ground truth is more challenging than in object detection, so simply adapting training strategies from object detection makes the training more difficult. To clear the ambiguity between human and object detection and share the prediction burden, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder. Moreover, we propose a novel Specific Target Guided (STG) DeNoising training strategy, which leverages learnable object and verb label embeddings to guide the training and accelerate the training convergence. In addition, for the inference part, the label-specific information is directly fed into the decoders by initializing the query embeddings from the learnable label embeddings. Without additional features or prior language knowledge, our method (SOV-STG) achieves higher accuracy than the state-of-the-art method in one-third of the training epochs. The code is available at this https URL.
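
The abstract does not spell out how the label embeddings enter denoising training, so the following is only a toy illustration of the general idea and not the authors' implementation. It assumes a label-flipping noise scheme of the kind used in denoising training for DETR-style detectors: each ground-truth label is replaced by a random label with some probability, and the embedding of the resulting label is used as a query. The function names, the `flip_prob` parameter, and the plain-list embedding table are all hypothetical stand-ins chosen to keep the sketch dependency-free.

```python
import random


def make_label_embeddings(num_labels, dim, rng):
    """Hypothetical stand-in for a learnable embedding table: one vector per label."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_labels)]


def noised_label_queries(gt_labels, table, flip_prob, rng):
    """Build denoising queries from ground-truth labels (assumed scheme).

    Each label is swapped for a uniformly random label with probability
    flip_prob; the query is the embedding of the (possibly flipped) label.
    """
    num_labels = len(table)
    noised = [
        rng.randrange(num_labels) if rng.random() < flip_prob else label
        for label in gt_labels
    ]
    # Copy the rows so that callers cannot mutate the table through the queries.
    queries = [list(table[label]) for label in noised]
    return noised, queries


rng = random.Random(0)
table = make_label_embeddings(num_labels=5, dim=4, rng=rng)
noised, queries = noised_label_queries([1, 3], table, flip_prob=0.2, rng=rng)
print(noised, len(queries), len(queries[0]))
```

In a real model the table would be a trainable parameter and the decoder would be trained to recover the original labels from the noised queries; here the table is fixed random data so that only the query construction is shown.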

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.02291 [cs.CV]
	(or arXiv:2307.02291v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.02291

Submission history

From: Junwen Chen
[v1] Wed, 5 Jul 2023 13:42:31 UTC (5,658 KB)
[v2] Mon, 4 Sep 2023 15:03:11 UTC (12,422 KB)
