Authors:
Hasannejad Nameghi, Hossein, Shahrood University of Technology - Faculty of Computer Engineering, Shahrood, Iran; Mashayekhi, Hoda, Shahrood University of Technology - Faculty of Computer Engineering, Shahrood, Iran; Zahedi, Morteza, Shahrood University of Technology - Faculty of Computer Engineering, Shahrood, Iran
Keywords:
Data stream, Ensemble learning, Concept drift, Semi-supervised classifier, Entropy
Persian abstract (translated):
A data stream is a sequence of data generated by various information sources at high speed and in large volume. One of the most important challenges in data stream analysis is the presence of concept drift, that is, a change in the statistical properties of the data. To cope with the unbounded length of the stream or with concept drift, many existing studies use approaches that assume true labels are available for all data. However, because labeling a data stream is costly, it is generally assumed that only part of the data is labeled. This paper presents a semi-supervised ensemble learning method that uses entropy variation to detect concept drift in data stream classification. The proposed ensemble model is trained with a limited amount of initial labeled data; then, when concept drift is observed, it uses unlabeled data to update the ensemble classifier. The proposed method can detect changes in the dataset and, by updating the learning model, help improve the accuracy of the algorithm. Experimental results show that the proposed method performs better than other methods in several respects.
English abstract:
A data stream is a sequence of data generated by various information sources at high speed and in high volume. Classifying data streams poses three challenges: unlimited length, online processing, and concept drift. To handle unlimited stream length, related research commonly divides the stream into fixed-size windows or applies gradual forgetting. Concept drift refers to changes in the statistical properties of the data and is divided into four categories: sudden, gradual, incremental, and recurring. It is generally handled by periodically updating the classifier or by employing an explicit change detector to determine when to update. These approaches assume that the true labels are available for all data samples. However, because labeling instances is costly, access to only partially labeled data is more realistic. A number of studies that use semi-supervised learning obtain labels from the user, in the form of active learning, to update their models. The purpose of this study is to classify samples in an unlimited data stream in the presence of concept drift, using only a limited set of initial labeled data. To this end, a semi-supervised ensemble learning algorithm for data streams is proposed, which uses entropy variation to detect concept drift and is applicable to sudden and gradual drifts. The proposed model is trained with a limited initial labeled set. When concept drift occurs, unlabeled data are used to update the ensemble model, without requiring labels from the user. In contrast to many current studies, the proposed algorithm uses an ensemble of K-NN classifiers. It constructs a group of clustering-based classification models, each trained on a batch of data. When a new sample is received, the algorithm first determines whether it is an outlier. If the sample falls within a cluster, its class is determined by majority voting.
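The outlier check and majority vote described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: each ensemble member is represented as a hypothetical list of `(centroid, radius, label)` clusters, a sample counts as an outlier when no cluster of any member contains it, and each member votes with the label of its nearest containing cluster (a simplification of the K-NN step). The names `covering_label` and `classify` are invented for this sketch.

```python
import math
from collections import Counter

# Assumption: a model is a list of (centroid, radius, label) clusters.

def covering_label(model, x):
    """Return the label of the nearest cluster that contains x, or None."""
    best = None
    for centroid, radius, label in model:
        d = math.dist(centroid, x)
        if d <= radius and (best is None or d < best[0]):
            best = (d, label)
    return None if best is None else best[1]

def classify(ensemble, x):
    """Majority vote over the models that cover x; None marks x as an outlier."""
    votes = [lab for lab in (covering_label(m, x) for m in ensemble)
             if lab is not None]
    if not votes:
        return None  # no cluster in any model contains x
    return Counter(votes).most_common(1)[0][0]

# Toy ensemble of three hypothetical models over 2-D points.
ensemble = [
    [((0.0, 0.0), 1.0, "a"), ((5.0, 5.0), 1.0, "b")],
    [((0.2, 0.0), 1.0, "a"), ((5.0, 4.8), 1.0, "b")],
    [((0.0, 0.3), 1.0, "b")],
]
print(classify(ensemble, (0.1, 0.1)))  # two votes for "a", one for "b"
print(classify(ensemble, (9.0, 9.0)))  # outside every cluster
```

Returning `None` for uncovered samples keeps outlier handling separate from voting, so a caller could buffer such samples for a later model update.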
When a window of the stream has been received, the possibility of concept drift is examined based on entropy variation, and the classifier is updated by a semi-supervised approach if necessary. The model itself determines the required data labels. The proposed method can detect concept drift in the data and improve its accuracy by updating the learning model with appropriate samples received from the stream. As a result, it requires only a small initial labeled set. Experiments are performed on five real and synthetic datasets, and the model's performance is compared with that of three other approaches. The results show that the proposed method outperforms the compared approaches in terms of precision, recall, and F1 score.
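As a rough illustration of drift detection by entropy variation, the sketch below compares the Shannon entropy of the predicted-label distribution in a reference window with that of the current window. The abstract does not state which quantity the entropy is computed over or how the threshold is chosen, so both the use of predicted labels and the `threshold=0.2` default are assumptions made for this example, as are the function names.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def drift_detected(reference, window, threshold=0.2):
    """Flag drift when the window entropies differ by more than threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return abs(shannon_entropy(window) - shannon_entropy(reference)) > threshold

# Toy windows of predicted labels.
reference = ["a"] * 9 + ["b"] * 1
stable = ["a"] * 9 + ["b"] * 1
shifted = ["a"] * 5 + ["b"] * 5

print(drift_detected(reference, stable))
print(drift_detected(reference, shifted))
```

A change in entropy is only a proxy: two windows with different class proportions can share the same entropy (for example, 90/10 and 10/90), so a practical detector would likely combine this signal with other evidence.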