Named Entity Recognition in Persian Texts Using Deep Learning

Title in another language

Named Entity Recognition in Persian Text using Deep Learning

Authors

Momtazi, Saeedeh, Amirkabir University of Technology, Tehran - Faculty of Computer Engineering and Information Technology; Torabi, Farzaneh, Amirkabir University of Technology, Tehran - Faculty of Computer Engineering and Information Technology

Number of pages

From page

To page

112

Keywords

Named entity recognition, natural language processing, semantic word representation, deep learning

Persian abstract

Named entity recognition is one of the fundamental tasks in natural language processing and, more generally, a subtask of information extraction. Its goal is to find nominal elements in text and classify them into predefined categories such as names of persons, organizations, locations, religions, book titles, movie titles, and so on. This paper introduces a named entity recognition system that draws on recent methods in the field: two different neural-network-based semantic word representations, one built from the word and one that also uses its constituent characters, together with deep learning methods. In addition, as part of this research, an annotated corpus of three thousand abstracts from the Persian Wikipedia, comprising ninety thousand tokens and annotated with fifteen different labels, is presented, which is an important step toward advancing future research in this field. Evaluation of the proposed system shows that an F-measure of 72.09 can be reached using the introduced data.
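The F-measure reported above is the harmonic mean of precision and recall. The abstract does not say whether the score is computed per token or per entity, so the following is only a minimal sketch that assumes entity-level exact matching. The function name `f_measure`, the span format, and the example spans are hypothetical and are not taken from the paper.

```python
def f_measure(gold, predicted):
    """Return (precision, recall, F1) for two collections of entity spans.

    Assumption: a predicted entity counts as correct only if its start,
    end, and label all match a gold entity exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans in the form (start_token, end_token, label).
gold = [(0, 1, "PER"), (5, 5, "LOC"), (8, 9, "ORG")]
predicted = [(0, 1, "PER"), (5, 5, "ORG"), (8, 9, "ORG")]

precision, recall, f1 = f_measure(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case two of the three predicted spans match a gold span, so precision, recall, and F1 all equal 2/3.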

English abstract

Named entity recognition is a fundamental task in natural language processing and a subtask of information extraction. It aims to find proper nouns in text and classify them into predefined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer that uses neural network-based approaches for both word representation and entity tagging. In the word representation part of the proposed model, two different vector representations are used and compared: (1) a semantic representation of words based on their context, using the word2vec continuous skip-gram model, and (2) a semantic representation of words based on their context as well as the characters forming them, using fastText. While the former captures the semantic content of words, the latter also captures their morphological similarity. For entity identification, a deep Bidirectional Long Short-Term Memory (BiLSTM) network is used. An LSTM model takes the preceding text into account when predicting entities; the BiLSTM model extends this idea by using the context on both sides of each word. Moreover, in line with the present research, an annotated corpus containing 3000 abstracts (90000 tokens) from the Persian Wikipedia is provided. In contrast to the available datasets in the field, which include up to 7 label types, the new dataset contains 15 different labels, namely person individual, person group, organizations, locations, religions, books, magazines, movies, languages, nationalities, events, jobs, dates, fields, and other. Developing this dataset is an important step in promoting future research in this field, especially for tasks such as question answering that need a wider range of entity types.
The results of the proposed system show that, using the introduced model and the provided data, the system achieves an F-measure of 72.92.

Publication year

1398

Journal title

پردازش علائم و داده ها (Signal and Data Processing)

PDF file

7755528

Link to this document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=1123803