Submission Type

Case Study


Electronic Health Records, Data Quality, Relational Databases, Applied Statistics


Objectives: Examine (1) the appropriateness of using a data quality (DQ) framework developed for relational databases as a data-cleaning tool for a dataset extracted from two EPIC databases; and (2) the differences in statistical parameter estimates between a dataset cleaned with the DQ framework and a dataset not cleaned with it.

Background: The use of data contained within electronic health records (EHRs) has the potential to open doors for a new wave of innovative research. Without adequate preparation of such large datasets for analysis, the results might be erroneous, which might affect clinical decision making or the results of Comparative Effectiveness Research studies.

Methods: Two Emergency Department (ED) datasets extracted from EPIC databases (adult ED and children's ED) were used as examples for examining the five concepts of DQ based on a DQ assessment framework designed for EHR databases. The first dataset contained 70,061 visits, and the second dataset contained 2,815,550 visits. SPSS Syntax examples as well as step-by-step instructions on how to apply the five key DQ concepts to these EHR database extracts are provided.

Conclusions: SPSS Syntax to address each of the DQ concepts proposed by Kahn et al. (2012) was developed. The dataset cleaned using Kahn’s framework yielded more accurate results than the dataset cleaned without this framework. Future plans involve creating functions in the R language for cleaning data extracted from the EHR as well as an R package that combines DQ checks with missing data analysis functions.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.