Recent years have seen immense growth in data, giving rise to Big Data. Handling it requires large computing infrastructure with high-performance processing capabilities. Getting large datasets ready for analysis and knowledge extraction is difficult: the raw data must first be preprocessed to improve its quality. Data representation and quality are among the most important facets of the data science process. Data preprocessing is a preliminary step in data science in which raw data are transformed into a format suitable for analysis and for modeling algorithms. It improves data quality by cleaning, normalizing, transforming, and reducing the raw data and by extracting relevant features from it. Data preprocessing significantly improves the performance of machine learning algorithms, which in turn yields more accurate models. Discovering knowledge from noisy, irrelevant, and redundant data is difficult, so accurately identifying outliers, imputing missing values, and reducing the data volume to what is useful are challenging problems in data science. The challenges in data preprocessing center on automating these techniques and making accurate decisions when they are used in combination, on adjusting them to handle complex data structures, and on adapting them to increase the reliability, fairness, and transparency of the models subsequently obtained by data science algorithms.
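As a rough illustration of three of the steps named above (imputing missing values, identifying outliers, and normalizing), the following minimal Python sketch may help. It is not this research line's method: the function names, the median imputation, the median-absolute-deviation outlier rule with a 3.5 cutoff, and the sample readings are all assumptions chosen for brevity.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation, MAD) exceeds `threshold`. The 3.5 cutoff is a common
    rule of thumb, used here as an assumption."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All values sit at the median; nothing can be flagged.
        return [False] * len(values)
    return [0.6745 * abs(v - center) / mad > threshold for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to 0.0 to avoid dividing by zero.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw readings with one missing value and one extreme value.
raw = [12.0, 15.0, None, 14.0, 13.0, 980.0, 16.0]

imputed = impute_missing(raw)
flags = flag_outliers(imputed)
cleaned = [v for v, is_outlier in zip(imputed, flags) if not is_outlier]
scaled = min_max_scale(cleaned)
```

The order of the steps is itself a design choice: here the missing value is imputed before outliers are flagged, so the extreme reading influences the fill value only through the median, which is robust to it. Choosing and ordering such steps automatically for a given dataset is exactly the kind of decision the paragraph above describes as an open challenge.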
Responsible: Salvador García López