Fully Transformer Detector with Multiscale Encoder and Dynamic Decoder

WCSE 2023
ISBN: 978-981-18-7950-0 DOI: 10.18178/wcse.2023.06.016

Dengdi Sun, Zhen Hu, Bin Luo, Zhuanlian Ding

Abstract—The recently proposed Detection Transformer (DETR) applies the transformer encoder-decoder architecture to object detection and achieves performance comparable to CNN-based detection frameworks. However, DETR and its variants usually use a CNN backbone, whose output features are poorly suited to transformer encoders and decoders. We therefore propose a CNN-free, end-to-end detector built entirely on a transformer encoder and decoder. Most detectors based on a transformer encoder and decoder also suffer from two problems: slow convergence and poor detection performance on small targets. To address these two problems, we improve both the encoder and the decoder. First, we introduce a multiscale encoder with feature interaction that uses only a few CNN operations. Second, we improve the content object queries and the positional object queries in the decoder's self-attention by introducing ground-truth label embeddings and dynamic anchor boxes, respectively. As a result, our method achieves 46.9% AP and 28.8% APs on the MS-COCO 2017 benchmark, a strong result among DETR-like detectors trained for 50 epochs that use an ImageNet-pre-trained ResNet50 backbone, with or without DC5. We also conduct experiments to confirm our analysis and verify the effectiveness of our method.

Index Terms—Object detection, Transformer, CNN-free, Encoder, Decoder

Dengdi Sun
School of Artificial Intelligence, Anhui University, CHINA
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, CHINA
Zhen Hu, Bin Luo
School of Computer Science and Technology, Anhui University, CHINA
Zhuanlian Ding
School of Internet, Anhui University, CHINA

Cite: Dengdi Sun, Zhen Hu, Bin Luo, Zhuanlian Ding, "Fully Transformer Detector with Multiscale Encoder and Dynamic Decoder" Proceedings of 2023 the 13th International Workshop on Computer Science and Engineering (WCSE 2023), pp. 101-111, June 16-18, 2023.
