Regional Information Center for Science and Technology - Tutorial IV: Computational Intelligence for Data Analytics

Abstract:

Humankind has been collecting data since record keeping began, but in the last decade, with considerable advances in computing and storage technologies, the rise of cloud computing, and the development of ubiquitous connectivity and the Internet of Things, there has been an explosion in the size and variety of collected data. Nevertheless, one can be data-rich and knowledge-poor, and this is where data analytics and the development and application of machine learning models become a necessity for gaining insight into complex processes in order to prove scientific theories and discoveries, support decision making and enhance strategic planning in different areas of the economy, finance, industry, healthcare, etc. Recently, there has been an influx of polymorphic, unstructured and multimodal data (social media, images, audio, video, etc.), which further complicates data processing and knowledge extraction. But even traditional structured datasets present problems that need to be addressed and overcome in the early stages of data pre-processing, feature extraction and feature selection. This is because they usually contain a variety of data formats, e.g., categorical, continuous and ordinal, and frequently contain missing data (usually the result of sensor faults, human errors, or collection, transportation or storage problems). The most popular approaches to dealing with missing data generally fall into three groups: deletion methods, single imputation methods, and model-based methods. In this tutorial I will talk about the third group, the model-based methods, which are considered to be the most popular modern approaches. In particular, the Multiple Imputation (MI) method will be introduced and discussed, in addition to K-Nearest Neighbour Imputation (KNN-I) and Bagged Tree Imputation (BTI). Subsequently, MI, KNN-I and BTI will be applied in a case study on pre-processing a large real-world radar signal dataset (more than 30,000 samples).
The dataset comprises intercepted and collected pulse train characteristics, which typically include signal frequencies, type of modulation, scan period, pulse repetition intervals, etc. These usually consist of a mixture of continuous, discrete and categorical data, and also frequently include missing values. Missing values are an inherent part of real-world datasets, and radar datasets are no exception. I will then briefly talk about supervised and unsupervised learning and the use of three supervised approaches, Neural Networks (NN), Random Forests (RF) and Support Vector Machines (SVM), for solving the radar signal classification and source identification problem. Results from applying NN, RF and SVM (using R and Matlab) to the complete data subset (without missing data) and to the full dataset with its missing data (up to 60%) imputed by MI, KNN-I and BTI will be critically analysed and discussed. Finally, I'll talk about the opportunities and challenges in applying computational intelligence and machine learning techniques to Big Data, and about the available software for Big Data.
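
To give a flavour of the KNN-I idea mentioned above, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the implementation used in the case study (which relies on R and Matlab): the function name `knn_impute`, the choice of k = 2, and the toy two-column values are all made up for this example. The sketch handles numeric features only, whereas real pulse train data would also need feature scaling and a treatment of categorical fields.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration; numeric features only.
    """
    imputed = [list(r) for r in rows]
    n_cols = len(rows[0])
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with an observed value in column j.
                if m == i or other[j] is None:
                    continue
                # Compare only the columns observed in both rows.
                shared = [c for c in range(n_cols)
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda t: t[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Invented toy data: two numeric features per row, None marks a missing value.
data = [
    [1.0, 10.0],
    [1.5, 12.0],
    [5.0, 50.0],
    [5.5, 52.0],
    [1.25, None],
    [None, 51.0],
]

filled = knn_impute(data, k=2)
for original, completed in zip(data, filled):
    print(original, "->", completed)
```

Each missing entry is filled from the two rows that lie closest on the features both rows have observed, so the first incomplete row borrows from the low-valued group and the second from the high-valued group. Because the distance here is unscaled, a feature with a large numeric range would dominate it on real data, which is one reason standardisation usually precedes KNN-based imputation.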