OpenFlamingo: A Framework for Training Large Multimodal Models (LMMs)

At its core, OpenFlamingo is a framework for training and evaluating large multimodal models (LMMs). It is an open-source reproduction of DeepMind's Flamingo model.

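It consists of the following:

- A Python framework for training Flamingo-style LMMs (based on Lucidrains' flamingo implementation and David Hansmair's flamingo-mini repository).
- A large-scale multimodal dataset with interleaved image and text sequences.
- An in-context learning evaluation benchmark for vision-language tasks.
- A first version of the OpenFlamingo-9B model (based on LLaMA).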

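The OpenFlamingo architecture, shown in the figure below, uses cross-attention layers to fuse a pretrained vision encoder with a language model.

[Figure: OpenFlamingo architecture]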

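Installation

To install the package in an existing environment, run

    pip install open-flamingo

or, to create a conda environment for running OpenFlamingo, run

    conda env create -f environment.yml

Usage

We provide an initial OpenFlamingo 9B model that uses a CLIP ViT-Large vision encoder and a LLaMA-7B language model. In general, any CLIP vision encoder is supported. For the language model, LLaMA, OPT, GPT-Neo, GPT-J, and Pythia models are supported.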

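Note: to use LLaMA models, you need to install the latest version of transformers:

    pip install git+https://github.com/huggingface/transformers

Then use the weight conversion script to convert the LLaMA weights to Hugging Face format.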

Initializing an OpenFlamingo model

    from open_flamingo import create_model_and_transforms

    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14",
        clip_vision_encoder_pretrained="openai",
        lang_encoder_path="",  # set to the path of the LLaMA weights in Hugging Face format
        tokenizer_path="",  # set to the path of the LLaMA tokenizer in Hugging Face format
        cross_attn_every_n_layers=4,
    )

    # Grab the model checkpoint from the Hugging Face Hub
    from huggingface_hub import hf_hub_download
    import torch

    checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
    model.load_state_dict(torch.load(checkpoint_path), strict=False)
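To make the cross_attn_every_n_layers=4 argument concrete, the sketch below lists which decoder layers would receive a cross-attention block. It is an illustration only: the helper name cross_attn_layer_indices is hypothetical, the 1-based "every n-th layer" convention is an assumption rather than something stated above, and the 32 decoder layers correspond to the LLaMA-7B configuration.

```python
from typing import List


def cross_attn_layer_indices(num_layers: int, every_n: int) -> List[int]:
    """Return 0-based indices of layers that get a cross-attention block.

    Assumption: a block is attached to every `every_n`-th layer, counting
    layers from 1 (so with every_n=4 the 4th, 8th, ... layers are chosen).
    """
    if every_n < 1:
        raise ValueError("every_n must be a positive integer")
    return [i for i in range(num_layers) if (i + 1) % every_n == 0]


layers = cross_attn_layer_indices(num_layers=32, every_n=4)
print(layers)       # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(layers))  # 8
```

Under this assumption, a smaller value of every_n adds more cross-attention blocks, and therefore more newly trained parameters, on top of the frozen language model.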

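Generating text

The following example generates text conditioned on interleaved images and text. In this case it performs few-shot image captioning.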

    from PIL import Image
    import requests

    """
    Step 1: Load images
    """
    demo_image_one = Image.open(
        requests.get(
            "https://images.cocodataset.org/val2017/000000039769.jpg", stream=True
        ).raw
    )

    demo_image_two = Image.open(
        requests.get(
            "https://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
        ).raw
    )

    query_image = Image.open(
        requests.get(
            "https://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
        ).raw
    )

    """
    Step 2: Preprocessing images
    Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
     batch_size x num_media x num_frames x channels x height x width.
     In this case batch_size = 1, num_media = 3, num_frames = 1
     (this will always be one except for video, which we don't support yet),
     channels = 3, height = 224, width = 224.
    """
    vision_x = [
        image_processor(demo_image_one).unsqueeze(0),
        image_processor(demo_image_two).unsqueeze(0),
        image_processor(query_image).unsqueeze(0),
    ]
    vision_x = torch.cat(vision_x, dim=0)
    vision_x = vision_x.unsqueeze(1).unsqueeze(0)

    """
    Step 3: Preprocessing text
    Details: In the text we expect an <image> special token to indicate where an image is.
     We also expect an <|endofchunk|> special token to indicate the end of the text
     portion associated with an image.
    """
    tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
    lang_x = tokenizer(
        ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
        return_tensors="pt",
    )

    """
    Step 4: Generate text
    """
    generated_text = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )

    print("Generated text: ", tokenizer.decode(generated_text[0]))
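The tensor bookkeeping in Step 2 is easy to get wrong, so here is a shape-only sketch. It uses NumPy arrays filled with zeros as stand-ins for the preprocessed images (an assumption made so the snippet runs without the model or any downloads). The real pipeline uses torch tensors produced by image_processor, where unsqueeze(k) plays the role of np.expand_dims(..., k).

```python
import numpy as np

# Stand-ins for image_processor(img): one (channels, height, width) array each.
fake_images = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(3)]

# Mirror of `image_processor(img).unsqueeze(0)` followed by `torch.cat(..., dim=0)`.
vision_x = np.concatenate([img[np.newaxis] for img in fake_images], axis=0)
print(vision_x.shape)  # (3, 3, 224, 224): num_media, channels, height, width

# Mirror of `.unsqueeze(1).unsqueeze(0)`: add num_frames first, then batch_size.
vision_x = np.expand_dims(vision_x, axis=1)
vision_x = np.expand_dims(vision_x, axis=0)
print(vision_x.shape)  # (1, 3, 1, 3, 224, 224)
```

The final shape reads as batch_size = 1, num_media = 3, num_frames = 1, channels = 3, height = 224, width = 224, matching the layout described in Step 2.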

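Approach

OpenFlamingo is a multimodal language model that can be used for a variety of tasks. It is trained on a large multimodal dataset (e.g. Multimodal C4) and can be used to generate text conditioned on interleaved images and text. For example, OpenFlamingo can be used to generate a caption for an image, or to generate a question given an image and a text passage. The benefit of this approach is that the model can be adapted rapidly to new tasks using in-context learning.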
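Adapting to a new task in context amounts to changing the prompt rather than the weights. The sketch below assembles a few-shot captioning prompt from example captions. The helper build_few_shot_prompt is hypothetical; the <|endofchunk|> token appears in the generation example above, while the <image> placeholder is assumed here to be the token that marks an image position.

```python
IMAGE_TOKEN = "<image>"          # assumed marker for "an image goes here"
END_OF_CHUNK = "<|endofchunk|>"  # closes the text that belongs to one image


def build_few_shot_prompt(example_captions, query_prefix="An image of"):
    """Build an interleaved prompt: one closed chunk per example, then an open query."""
    chunks = [f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in example_captions]
    return "".join(chunks) + f"{IMAGE_TOKEN}{query_prefix}"


prompt = build_few_shot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."]
)
print(prompt)
```

With two example captions the prompt contains three image markers, one per example plus one for the query, which should line up with the three images passed to the model.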

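Model architecture

OpenFlamingo fuses a pretrained vision encoder and a language model using cross-attention layers. The model architecture is shown in the figure below.

[Figure: OpenFlamingo model architecture]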
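As a rough illustration of fusing the two pretrained models through cross-attention, the NumPy sketch below lets text hidden states attend to visual features and adds the result back through a tanh gate. This is a single-head toy under stated assumptions, not OpenFlamingo's implementation: the function name, the weight shapes, and the random inputs are made up for the example, and the zero-initialized tanh gate follows the description in the Flamingo paper.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(text_h, vis_h, w_q, w_k, w_v, gate):
    """Text tokens (queries) attend to visual tokens (keys and values)."""
    q = text_h @ w_q                         # (n_text, d)
    k = vis_h @ w_k                          # (n_vis, d)
    v = vis_h @ w_v                          # (n_vis, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_text, n_vis)
    attn = softmax(scores, axis=-1)          # per text token: weights over visual tokens
    return text_h + np.tanh(gate) * (attn @ v)  # gated residual connection


rng = np.random.default_rng(0)
d = 8
text_h = rng.normal(size=(5, d))  # 5 text tokens (hypothetical sizes)
vis_h = rng.normal(size=(4, d))   # 4 visual tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out_closed = gated_cross_attention(text_h, vis_h, w_q, w_k, w_v, gate=0.0)
out_open = gated_cross_attention(text_h, vis_h, w_q, w_k, w_v, gate=1.0)
print(np.allclose(out_closed, text_h))  # True: a zero gate leaves the text states unchanged
print(out_open.shape)                   # (5, 8)
```

The gate is the design point worth noticing: when it starts at zero the new layers are a no-op, so the pretrained language model behaves exactly as before and visual information is blended in only as the gate is learned.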
