{
"cells": [
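{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction to the UIE Model\n",
"\n",
"UIE (Universal Information Extraction): Yaojie Lu et al. proposed UIE, a unified framework for universal information extraction, at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks in a unified way, and gives these tasks good transfer and generalization ability across one another. To make UIE's capabilities easy to use, PaddleNLP followed the paper's approach and, building on the ERNIE 3.0 knowledge-enhanced pre-trained model, trained and open-sourced the first Chinese universal information extraction model, UIE. The model supports key information extraction without restricting the industry domain or the extraction target, enables fast zero-shot cold start, and fine-tunes well on few-shot data to adapt quickly to specific extraction targets.\n",
"\n",
"| Model | Finance 0-shot | Finance 5-shot | Medical 0-shot | Medical 5-shot | Internet 0-shot | Internet 5-shot |\n",
"| --- | --- | --- | --- | --- | --- | --- |\n",
"| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |\n",
"| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |\n",
"| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |\n",
"| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |\n",
"| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |\n",
"| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |\n",
"| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |\n",
"\n",
"0-shot means predicting directly with `paddlenlp.Taskflow` and no training data; 5-shot means fine-tuning the model with 5 labeled examples per category.\n",
"\n",
"`paddlenlp.Taskflow` provides universal information extraction, opinion extraction, and related capabilities. It can extract many types of information, including but not limited to named entities (such as person, place, and organization names), relations (such as a film's director or a song's release date), events (such as a traffic accident at an intersection or an earthquake in some location), and aspects, opinion words, and sentiment polarity. Users can define extraction targets in natural language and extract the corresponding information from input text in a unified way, with no training required."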
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Environment Setup\n",
"Install PaddleNLP"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install --upgrade paddlenlp\n",
"! pip show paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from paddlenlp import Taskflow"
]
},
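{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Entity Extraction\n",
"\n",
"Named Entity Recognition (NER) identifies entities with a specific meaning in text. In open-domain information extraction the categories to extract are unrestricted, so users can define their own.\n",
"\n",
"For example, to extract the entity types \"时间\" (time), \"选手\" (athlete), and \"赛事名称\" (event name), call the model as follows:"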
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['时间', '选手', '赛事名称']  # Define the schema for entity extraction\n",
"ie = Taskflow('information_extraction', schema=schema, model='uie-base')\n",
"iee = Taskflow('information_extraction', schema=schema, model='uie-base-en')\n",
"pprint(ie(\"2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!\"))  # Better print results using pprint"
]
},
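{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Relation Extraction\n",
"\n",
"Relation Extraction (RE) identifies entities in text and extracts the semantic relations between them, producing triples of the form (subject, predicate, object).\n",
"\n",
"For example, with \"竞赛名称\" (competition name) as the subject and the relation types \"主办方\" (organizer), \"承办方\" (co-organizer), and \"已举办次数\" (number of times held), call the model as follows:"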
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']}  # Define the schema for relation extraction\n",
"ie.set_schema(schema)  # Reset schema\n",
"pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。'))"
]
},
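{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Event Extraction\n",
"\n",
"Event Extraction (EE) extracts predefined event triggers and event arguments from natural-language text and combines them into structured event information.\n",
"\n",
"For example, to extract the \"地震强度\" (magnitude), \"时间\" (time), \"震中位置\" (epicenter location), and \"震源深度\" (focal depth) of a \"地震\" (earthquake) event, call the model as follows:"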
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']}  # Define the schema for event extraction\n",
"ie.set_schema(schema)  # Reset schema\n",
"ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')"
]
},
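{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 Opinion Extraction\n",
"\n",
"Opinion extraction pulls out the aspects being evaluated and the opinion words attached to them in a piece of text.\n",
"\n",
"For example, to extract each \"评价维度\" (aspect) together with its \"观点词\" (opinion word) and \"情感倾向\" (sentiment polarity), call the model as follows:"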
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']}  # Define the schema for opinion extraction\n",
"ie.set_schema(schema)  # Reset schema\n",
"pprint(ie(\"店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队\"))  # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The English model is called as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]\n",
"iee.set_schema(schema)\n",
"pprint(iee(\"The teacher is very nice.\"))"
]
},
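{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.5 Sentiment Classification\n",
"\n",
"Sentence-level sentiment classification decides whether a sentence's sentiment is \"正向\" (positive) or \"负向\" (negative). Call the model as follows:"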
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = '情感倾向[正向,负向]'  # Define the schema for sentence-level sentiment classification\n",
"ie.set_schema(schema)  # Reset schema\n",
"ie('这个产品用起来真的很流畅,我非常喜欢')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The English model is called as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = 'Sentiment classification [negative, positive]'\n",
"iee.set_schema(schema)\n",
"iee('I am sorry but this is the worst film I have ever seen in my life.')"
]
},
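{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.6 Cross-Task Extraction\n",
"\n",
"For example, in a legal scenario, entity extraction and relation extraction can be run on the same text at once. Call the model as follows:"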
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]\n",
"ie.set_schema(schema)\n",
"pprint(ie(\"北京市海淀区人民法院\\n民事判决书\\n(199x)建初字第xxx号\\n原告:张三。\\n委托代理人李四,北京市 A律师事务所律师。\\n被告:B公司,法定代表人王五,开发公司总经理。\\n委托代理人赵六,北京市 C律师事务所律师。\"))  # Better print results using pprint"
]
},
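{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.7 Model Training\n",
"For simple extraction targets, `paddlenlp.Taskflow` can be used directly for zero-shot extraction. For specialized scenarios, we recommend the lightweight customization workflow (labeling a small amount of data and fine-tuning the model) to further improve results. For training details, see the UIE training customization guide."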
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Related Papers and Citation Information\n",
"\",
"-