Unsupervised Entity Linking in Persian Social Media Texts

Original title (Persian)

پيونددهي موجوديت‌ها با روش بدون نظارت در متون فارسي رسانه‌هاي اجتماعي

Authors

Asgari Bidhendi, Majid (majid.asgari@gmail.com), Iran University of Science and Technology; Minaei Bidgoli, Behrouz (b_minaei@iust.ac.ir), Iran University of Science and Technology

Number of pages

Keywords

entity linking, entity disambiguation, Persian language, FarsBase, knowledge graph, social media corpus

Publication year

1398 (Solar Hijri)

Conference title

5th International Conference on Web Research

Document language

Persian

Abstract (translated from the Persian)

Social media data has grown exponentially in recent years, to the point that it can be counted among the largest data sources in the world. A large part of this data is natural language text, but natural language is highly ambiguous. Entity linking is the task of linking entity mentions in a text to their corresponding entities in a knowledge base. Most entity linking systems start by searching for candidate entities, then disambiguate them, and finally select the best candidate. Until recently, this task had not been carried out for Persian because no Persian knowledge graph existed. Fortunately, in 1397 (Solar Hijri), FarsBase was introduced as a Persian knowledge graph with almost half a million entities. On this basis, this paper proposes an unsupervised Persian entity linking system that uses context-dependent and context-independent features to link the entities in a text to the FarsBase knowledge base. To this end, we publish the first entity linking corpus for Persian composed of social media texts, built from a number of popular Persian channels on the Telegram social network. The experimental results show that the proposed method performs very well and is comparable with state-of-the-art methods for English.

English abstract

In recent years, social media data has grown exponentially, making it one of the largest data repositories in the world. A large portion of this data is natural language text. However, natural language is highly ambiguous, particularly because entities are frequently referred to by polysemous words or phrases. Entity linking is the task of linking entity mentions in a text to their corresponding entities in a knowledge base. Most entity linking systems begin by searching for candidate entities, then disambiguate them, and finally choose the best candidate. Until recently, the lack of a Persian knowledge graph meant that this task had not been addressed for the Persian language. FarsBase has since been introduced as a Persian knowledge graph with almost half a million entities. Building on it, this paper proposes an unsupervised Persian entity linking system that uses context-dependent and context-independent features. For this purpose, we also publish the first entity linking corpus for the Persian language, composed of social media texts from a number of popular Persian channels on the Telegram social network. The results show that the proposed method performs very well and is comparable with the corresponding state of the art for the English language.
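The three-stage pipeline named in the abstract (candidate search, disambiguation, selection of the best candidate) can be illustrated with a minimal Python sketch. This is not the authors' method. The toy knowledge base, the `Entity` fields, the `toy:` identifiers, the word-overlap similarity, and the equal weighting of the two scores are all assumptions made for illustration. The abstract does not specify which context-dependent and context-independent features the system uses, and nothing here reflects the real FarsBase schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str     # hypothetical identifier, not a real FarsBase ID
    aliases: tuple     # lowercase surface forms that may refer to the entity
    description: str   # short lowercase text describing the entity
    prior: float       # assumed context-independent popularity in [0, 1]


# Toy knowledge base with one ambiguous surface form ("shiraz").
TOY_KB = [
    Entity("toy:Shiraz_city", ("shiraz",), "city in fars province iran", 0.6),
    Entity("toy:Shiraz_wine", ("shiraz",), "red wine grape variety", 0.4),
]


def generate_candidates(mention, kb):
    """Stage 1: collect every entity whose aliases contain the mention."""
    key = mention.lower()
    return [entity for entity in kb if key in entity.aliases]


def context_overlap(context, entity):
    """Context-dependent feature: Jaccard overlap between the words of the
    surrounding text and the words of the entity description."""
    context_words = set(context.lower().split())
    description_words = set(entity.description.split())
    if not context_words or not description_words:
        return 0.0
    shared = context_words & description_words
    return len(shared) / len(context_words | description_words)


def link(mention, context, kb, weight=0.5):
    """Stages 2 and 3: score each candidate and return the best one.

    The score is a weighted sum of the context-independent prior and the
    context-dependent overlap. Returns None when no candidate exists.
    """
    candidates = generate_candidates(mention, kb)
    if not candidates:
        return None

    def score(entity):
        overlap = context_overlap(context, entity)
        return weight * entity.prior + (1 - weight) * overlap

    return max(candidates, key=score)


if __name__ == "__main__":
    best = link("Shiraz", "a glass of red wine from the grape harvest", TOY_KB)
    print(best.entity_id if best else None)
```

No labelled training data is needed, since the score is computed directly from the knowledge base and the surrounding text. That is the sense in which the sketch is unsupervised. When the context shares no words with any description, the prior alone decides the outcome.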

Country

Iran

Link to this document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=36&DC=315527