Record number :
1017929
Article title (translated from Persian) :
A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows
Title in another language :
A Probabilistic Topic Model based on Local Word Relationships in Overlapped Windows
Authors :
Rahimi, Marzieh (Shahrood University of Technology) , Zahedi, Morteza (Shahrood University of Technology) , Mashayekhi, Hoda (Shahrood University of Technology)
Number of pages :
14
From page :
57
To page :
70
Keywords :
Probabilistic topic models , text clustering , graphical models , co-occurrence , Gibbs sampling
Abstract (translated from Persian) :
Many topic models, such as LDA, rely on document-level word co-occurrence and cannot exploit local word relationships. Some topic models, such as BTM, have tried to solve this problem by combining topics with n-gram language models. BTM, however, depends on exact word order and therefore suffers from sparsity. This paper introduces a new probabilistic topic model that captures local word relationships using overlapping windows. According to the co-occurrence hypothesis, the co-occurrence of words in shorter windows is stronger evidence that they are semantically related. In the proposed model, each document is treated as a set of overlapping windows, each of which corresponds to one of the words in the text. Topics are extracted from word co-occurrence in these overlapping windows. In other words, the proposed model captures local word relationships without depending on exact word order. Our experiments show that the proposed method produces more coherent topics and, in document clustering, is more accurate than both LDA and BTM.
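To make the window idea concrete, here is a minimal Python sketch of one way to build one overlapping window per word and count word co-occurrence inside those windows. It is only an illustration, not the authors' implementation: the function names, the window size, the choice to start each window at its word, and the shorter windows at the end of the text are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations


def overlapping_windows(tokens, size):
    """Return one window per token (assumption: the window for position i
    starts at token i and spans up to `size` tokens, so windows near the
    end of the text are shorter)."""
    return [tokens[i:i + size] for i in range(len(tokens))]


def window_cooccurrence(tokens, size):
    """Count unordered pairs of distinct words, once per window they share.

    Word order inside a window is ignored, so ("model", "topic") is counted
    the same way whether "model" precedes or follows "topic".
    """
    counts = Counter()
    for window in overlapping_windows(tokens, size):
        # sorted(set(...)) removes repeats and fixes one key per word pair
        for a, b in combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts


# Toy input (hypothetical, not taken from the paper)
tokens = ["topic", "model", "extracts", "topic", "words"]
windows = overlapping_windows(tokens, size=3)
pair_counts = window_cooccurrence(tokens, size=3)
```

Because neighbouring windows share most of their words, a pair of nearby words is counted in several windows, while words that are far apart never share a window. Shrinking `size` restricts the counts to closer neighbours, in line with the co-occurrence hypothesis stated above.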
English abstract :
A probabilistic topic model assumes that documents are generated through a process involving topics, and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the earliest and most popular topic models. In the document generation process assumed by LDA, each document is a distribution over topics, and each word in the document is sampled from a topic chosen from that distribution. LDA treats a document as a bag of words and ignores word order. Probabilistic topic models such as LDA, which extract topics from document-level word co-occurrences, are not equipped to benefit from local word relationships. Models such as the Bigram Topic Model (BTM) address this problem by combining topics and n-grams. BTM modifies the document generation process slightly by assuming that each topic has several word distributions, each of which corresponds to a vocabulary word. Each word in a document is sampled from one of the distributions of its selected topic, and the previous word determines which distribution is used. BTM therefore relies on exact word order to extract local word relationships and is thus challenged by sparseness. Another way to solve the problem is to break each document into smaller parts, for example paragraphs, and apply LDA to these parts to extract more local word relationships. This again leads to sparseness, and it is well known that LDA does not work well on short documents. In this paper, a new probabilistic topic model is introduced that treats a document as a set of overlapping windows but does not break the document into those parts; the whole document is still modeled as a single distribution over topics. Each window corresponds to a fixed number of words in the document. In the assumed generation process, we walk through the windows and decide on the topic of their corresponding words.
Topics are extracted based on word co-occurrences in the overlapping windows, and the overlap affects the document generation process because the topic of a word is taken into account in all the other windows that overlap on that word. In other words, the proposed model encodes local word relationships without relying on exact word order or breaking the document into smaller parts. The model does, however, take word order into account implicitly, because the windows overlap. Topics are still treated as distributions over words. The proposed model is evaluated on its ability to extract coherent topics and on its clustering performance on the 20 Newsgroups dataset. The results show that the proposed model extracts more coherent topics and outperforms LDA and BTM in document clustering.
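The LDA and BTM generation processes described in the abstract can be contrasted with a small Python sketch. This is a toy illustration under stated assumptions, not code from the paper: the function names, the vocabulary, the probabilities, and the `<s>` start marker used for the first word are all hypothetical, and priors and inference (for example Gibbs sampling) are left out.

```python
import random


def generate_lda(theta, phi, vocab, length, rng):
    """LDA-style generation: draw a topic from the document's topic
    distribution theta, then draw a word from that topic's distribution."""
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=phi[z])[0])
    return words


def generate_btm(theta, phi_bigram, vocab, length, rng, start="<s>"):
    """Bigram-topic-style generation: the word distribution is selected by
    the drawn topic AND the previous word (assumed start marker `start`)."""
    words, prev = [], start
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=phi_bigram[z][prev])[0]
        words.append(w)
        prev = w
    return words


# Toy parameters (hypothetical): 2 topics over a 4-word vocabulary
vocab = ["gene", "cell", "ball", "team"]
theta = [0.7, 0.3]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
# For brevity every previous word reuses the same row; a real bigram model
# would need a separately estimated row for each (topic, previous word) pair.
phi_bigram = [{prev: phi[z] for prev in vocab + ["<s>"]} for z in range(2)]

rng = random.Random(0)
lda_doc = generate_lda(theta, phi, vocab, 6, rng)
btm_doc = generate_btm(theta, phi_bigram, vocab, 6, rng)

# Word probabilities to estimate: K*V for LDA versus roughly K*V*V for the
# bigram variant, which is one way to see where the sparseness comes from.
lda_params = len(theta) * len(vocab)
btm_params = len(theta) * len(vocab) * len(vocab)
```

With a realistic vocabulary of tens of thousands of words, the extra factor of V in the bigram case means that most (topic, previous word, word) combinations are never observed in training text.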
Publication year :
1397 (Solar Hijri calendar)
Journal title :
Signal and Data Processing (پردازش علائم و داده ها)
PDF file :
7500391