To address the scarcity of resources for named entity recognition (NER) on electronic medical records, a unified annotation method for disease-specific entity corpora was formulated under the guidance of medical experts, and two such corpora were constructed: the Pediatric Bronchopneumonia Entity Corpus and the Diabetes Entity Corpus. To verify the effectiveness of the proposed annotation method, the Pediatric Bronchopneumonia Entity Corpus was first compared against publicly available data using the BERT-BiLSTM-CRF and ERNIE-BiLSTM-CRF models. The method was then reapplied to diabetes electronic medical records to evaluate robustness. Both self-built corpora achieved higher F1 scores than the public data, which suggests that the proposed annotation method is robust.
Keywords: electronic medical record; named entity recognition; corpus construction; Pediatric Bronchopneumonia Entity Corpus; Diabetes Entity Corpus