Visual Question Answering (vilt-b32-finetuned-vqa)
Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
The model was introduced in the paper "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" by Kim et al. and first released in this repository.
Usage
https://openi.pcl.ac.cn/cubeai-model-zoo/hfdandelinvilt-b32-finetuned-vqa
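For local experimentation, the following is a minimal sketch of one way to query the checkpoint, not the method used by the hosted service above. It assumes the Hugging Face `transformers`, `torch`, and `Pillow` packages are installed and that the weights can be loaded under the Hub id `dandelin/vilt-b32-finetuned-vqa`. The helper names `pick_answer` and `answer_question` are hypothetical and introduced only for illustration. The fine-tuned model treats VQA as classification over a fixed answer vocabulary, so the answer is the label with the highest logit.

```python
from typing import Mapping, Sequence

# Assumed Hub id, taken from the model source URL of this page.
MODEL_ID = "dandelin/vilt-b32-finetuned-vqa"


def pick_answer(logits: Sequence[float], id2label: Mapping[int, str]) -> str:
    """Return the label whose logit is largest (hypothetical helper)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]


def answer_question(image, question: str) -> str:
    """Answer a question about a PIL image (hypothetical helper).

    The heavy imports are local so that pick_answer stays usable
    without transformers or torch installed.
    """
    import torch
    from transformers import ViltForQuestionAnswering, ViltProcessor

    processor = ViltProcessor.from_pretrained(MODEL_ID)
    model = ViltForQuestionAnswering.from_pretrained(MODEL_ID)

    # The processor tokenizes the question and preprocesses the image together.
    encoding = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, number_of_answers)

    return pick_answer(logits[0].tolist(), model.config.id2label)
```

Calling `answer_question(Image.open("example.jpg"), "How many cats are there?")` with a PIL image would return a short answer string. The file name and the question are placeholders.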
Model source
https://hf-mirror.com/dandelin/vilt-b32-finetuned-vqa