This book offers a general and comprehensible overview of Big Data preprocessing, together with a formal description of each problem it covers. It focuses on the main features of each problem and the most relevant solutions proposed for it. Additionally, it shows actual implementations of algorithms that help the reader deal with these problems. Big Data preprocessing refers to the scenario in which a huge volume of information is not suitable for use in a learning process and must be processed to extract quality, actionable data. There is no single challenge when preprocessing data: the presence of imperfect (noisy) data, redundant examples, high dimensionality and other issues each call for particular techniques, which are covered in this book. Because this is a very common scenario in real-life applications, the interest of researchers and practitioners in the topic has grown significantly in recent years.
This book stresses the gap that exists between big, raw data and the quality data that businesses demand. Such quality data is called Smart Data, and preprocessing is a key step in achieving it: this is where imperfections are handled and integration tasks and other processes are carried out to eliminate superfluous information. This book also presents the concept of Smart Data through data preprocessing in Big Data scenarios and connects it with the emerging paradigms of IoT and edge computing, where the end points will be able to generate Smart Data without completely relying on the cloud.
This book introduces the intrinsic characteristics of data that, added to the imperfections implicitly present in sampling the real world, are the main causes that truly hinder the performance of machine learning algorithms in Big Data. Then, algorithms and implementations for Big Data preprocessing are provided in order to illustrate the advantages of this type of approach. Finally, this book presents some novel areas of study that are attracting growing attention in Big Data preprocessing. Specifically, it considers the relation with Deep Learning (as a technique that also relies on large volumes of data), the difficulty of finding the appropriate selection and concatenation of preprocessing techniques, and some other open problems.