Abstract: Multimodal pre-training models have been successfully applied to the vision domain, and a large number of researches have been conducted on this basis for video action recognition of short ...