Shikra: A Visual Question Answering Model with Coordinate References

Categories: ai, shikra, pytorch, visual-question-answ, multimodal
Open-source repository: https://modelscope.cn/models/haolan/shikra
License: Apache License 2.0

Project Details

Model Description

Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic

Shikra is an MLLM designed to kick off referential dialogue by excelling at spatial coordinate inputs and outputs in natural language, without additional vocabularies, position encoders, pre-/post-detection stages, or external plug-in models.

[Project Page] [Paper]
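For illustration, here is a minimal sketch of how a bounding box can be expressed directly in prompt text. The normalized [x1,y1,x2,y2] notation with three decimals follows the paper's approach, but the helper name box_to_text and the 640x512 image size are assumptions for this example, not part of the released API:

# Hypothetical sketch: embedding a bounding box directly in the prompt text.
# Shikra-style models read and write boxes as plain-text coordinates, so no
# special tokens or position encoders are needed.
def box_to_text(box, width, height):
    """Normalize a pixel box (x1, y1, x2, y2) to [0, 1] and render it as text."""
    x1, y1, x2, y2 = box
    return f"[{x1/width:.3f},{y1/height:.3f},{x2/width:.3f},{y2/height:.3f}]"

prompt = f"What is the person {box_to_text((148, 99, 576, 497), 640, 512)} scared of?"
print(prompt)  # What is the person [0.231,0.193,0.900,0.971] scared of?

Because the coordinates are ordinary text, the same tokenizer and decoding loop handle grounding with no architectural changes.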

Operating Environment

Install

1. Clone the repository (a blob-less partial clone that fetches only the files needed for inference):
git clone --depth=1 --filter=blob:none --no-checkout https://www.modelscope.cn/haolan/shikra.git
cd shikra
git checkout master -- ms_wrapper.py
git checkout master -- mllm
git checkout master -- requirements.txt
2. Install the packages (a quick environment check follows after this step):
conda create -n shikra python=3.10 -y
conda activate shikra
pip install -r requirements.txt
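Before running inference, a quick sanity check of the new environment can save time. This is a minimal sketch; it assumes requirements.txt installs PyTorch, which the inference code needs:

# Verify the freshly created conda environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # a GPU is strongly recommended for inference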

Code Example

from modelscope.pipelines import pipeline

# Build the Shikra pipeline; weights are downloaded from ModelScope on first use.
inference = pipeline('shikra-task', model='haolan/shikra')

data = {
    'image_path': "mllm/demo/assets/man.jpg",              # demo image shipped with the repo
    'user_input': "What is the person<boxes> scared of?",  # <boxes> marks a referenced region
    'boxes_value': [[148, 99, 576, 497]],                  # bounding boxes as [x1, y1, x2, y2] in pixels
    'boxes_seq': [[0]]                                     # maps each <boxes> tag to indices in boxes_value
}
output = inference(data)

print(output)

# Note: loading the model may take a few minutes.
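To double-check which image region the <boxes> placeholder refers to, the input box can be drawn onto the image before it is sent to the pipeline. This is a hedged sketch using Pillow; the output filename man_with_box.jpg is arbitrary, and the box is taken from boxes_value above, which appears to be in pixel coordinates:

# Visualize boxes_value[0] from the example above (assumed pixel x1, y1, x2, y2).
from PIL import Image, ImageDraw

image = Image.open("mllm/demo/assets/man.jpg").convert("RGB")
draw = ImageDraw.Draw(image)
draw.rectangle([148, 99, 576, 497], outline="red", width=3)
image.save("man_with_box.jpg")  # hypothetical output path for inspection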

Citation

@article{chen2023shikra,
  title={Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic},
  author={Chen, Keqin and Zhang, Zhao and Zeng, Weili and Zhang, Richong and Zhu, Feng and Zhao, Rui},
  journal={arXiv preprint arXiv:2306.15195},
  year={2023}
}