
Data Quality Assessment Methodology for Improved Prognostics Modeling

by Yan Chen




Institution: University of Cincinnati
Department: Engineering and Applied Science: Mechanical Engineering
Degree: PhD
Year: 2012
Keywords: Industrial Engineering; Prognostics and Health Management; Data Quality; Laplacian Eigenmap; Diagnostic Modeling; Feature Ranking; Outlier Detection
Record ID: 1932821
Full text PDF: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1330024393


Abstract

Interest in Prognostics and Health Management (PHM) techniques has been growing in the automotive, renewable energy, and petrochemical industries. Considerable time and effort are spent acquiring large amounts of data from which information about system behavior is expected to be extracted. However, many data quality issues hinder the conversion of data into information, for example signal noise caused by hardware errors and disturbances, redundant or incomplete features, and outlier instances introduced during data preparation. Data sets with these issues not only waste time and money but also paralyze further PHM development. Although many data mining techniques have been developed to cope with similar issues in clinical research, image processing, and other areas, the PHM field has few systematic methods for ensuring that the collected data are sufficient to model multiple system failure modes or their degradation mechanisms. This motivates a systematic data quality evaluation and improvement methodology that builds on and extends data mining techniques. The goal of this dissertation is to establish methods for evaluating and improving the quality of the training data used for system health diagnostic modeling. Inspired by spectral graph clustering techniques, a set of methods is proposed to evaluate training data quality and to improve it by filtering out outlier instances and refining the feature selection process. In the proposed quality evaluation method, the inherent cluster structures of the data are first revealed. Then, because these structures are ideally to be used as data models of system behavior, the fitness of each structure as an independent cluster and its separation from the others are measured quantitatively by a set of selected metrics.
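The abstract does not name the metrics used to score cluster fitness and separation. As an illustration only, the sketch below uses the silhouette coefficient, a standard measure that combines within-cluster compactness with separation from the nearest other cluster. The function name `silhouette_scores`, the Euclidean distance, and the toy data are assumptions made for this example and are not taken from the dissertation.

```python
import math


def silhouette_scores(points, labels):
    """Return the silhouette value s = (b - a) / max(a, b) for each instance.

    a: mean distance to the other members of the instance's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Values near 1 suggest a compact, well separated cluster; values near 0
    or below suggest overlapping or badly assigned instances.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for a singleton cluster
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def mean_silhouette(points, labels):
    """Average silhouette over all instances, a single data-set level score."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two tight groups of three instances each.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(mean_silhouette(pts, [0, 0, 0, 1, 1, 1]))  # labels follow the groups
    print(mean_silhouette(pts, [0, 1, 0, 1, 0, 1]))  # labels ignore the groups
```

A labeling that follows the true groups should score markedly higher than one that mixes them, which is the kind of quantitative signal a cluster-based data quality check relies on.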
To improve data quality, two methods are proposed. First, a filtering method detects outliers by analyzing two graphical objects constructed over the data instances; Local Outlier Factors (LOFs) are then calculated for the discovered outlier candidates to quantify and rank their outlierness. Second, a feature-ranking-based optimization method selects the optimal feature set for the best data structure formulation. All of the proposed data improvement methods rely on the graph Laplacian, for example the data filtering method based on nonlinear Laplacian embedding and the Ratio-Laplacian score for feature ranking. In addition to typical data mining test data sets, two experimental data sets from real applications, provided by IMS member companies, were used to validate the proposed methods. Several popular methods are also compared with the proposed methods in terms of performance and accuracy. The study shows that, when handling nonlinear factors, the proposed methods have competitive advantages over Principal Component Analysis in terms of space embedding and over Information Gain as a feature ranking criterion.
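To make the Local Outlier Factor step concrete, the following is a minimal sketch of the standard LOF computation. It is not the dissertation's filtering pipeline: the graph-based candidate detection is omitted, so every instance is scored. The function name `local_outlier_factors`, the Euclidean distance, the choice k = 3, and the toy data are assumptions for illustration. As a simplification, each instance uses exactly its k nearest neighbors, with ties broken by index.

```python
import math


def local_outlier_factors(points, k=3):
    """Return the LOF score of every instance in `points`.

    Scores close to 1 mean the instance is about as dense as its neighbors;
    scores well above 1 mean it lies in a much sparser region.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors and k-distance (distance to the k-th neighbor).
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, o) = max(k_distance(o), dist(i, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[o], dist[i][o]) for o in neighbors[i]]
        mean_reach = sum(reach) / k
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicates

    # LOF: mean ratio of the neighbors' density to the instance's own density.
    return [sum(lrd[o] for o in neighbors[i]) / (k * lrd[i]) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical toy data: a 3 x 3 grid plus one distant instance.
    grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
    data = grid + [(10.0, 10.0)]
    scores = local_outlier_factors(data, k=3)
    ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    print("most outlying instance:", data[ranked[0]])
```

Sorting instances by descending score gives the kind of outlierness ranking the abstract describes; in practice the score would be computed only for the candidates flagged by the graph-based filter.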