Data Management Methods

Methods for Maintaining Longitudinal Population Health Studies

*Hospital-level race/ethnic data quality DataBooks are listed under section 9, Hospital Crosswalk.

1. Basic Computing Environment 

Organize the computer and software
Prepare the Tools and Working (Project) Environments
Basic issues with Master and Confidential Environments

TOOLS_2019.ZIP contains FHOP macros and related files introduced in this volume

2. Standardizing Variables Over Time

Time variables
Demographic variables
Confidential data elements

3. Preparing Master Files

Setup activities
The RDYR macro
Check longitudinal consistency

4. Special Issues with Birth and Fetal Death Files

Steps to make master files
Check longitudinal consistency
Geographic classification
Data quality

BC_FORMATS_2019.ZIP contains FHOP's current format library for use with birth certificate and fetal death data (1989-2018)

11METH_WORK_2022.PDF describes steps to clean work-related variables in the California 2010-2017 birth (mother and father) and 2014-2017 death (decedent) files and to develop formats to classify those variables. Updated for 2018 and 2019 files.

WORK_FORMATS_2020.ZIP contains the resulting format library. No new formats needed for 2018 and 2019 files.

Industry and occupation in California birth certificates (1998–2019): Reporting disparities and classification codability

In California birth certificates, industry and occupation (I/O) missingness was systematically higher among parents who were male, Black or American Indian or Alaska Native (AIAN), or less than 20 years old, or who reported no college education. I/O codability is high when information is reported, with small percentage disparities. Improving data collection is vital to equitably describe the economic contexts that determine important family outcomes.

5. Special Issues with Death Files

Steps to make master files
Check longitudinal consistency
Cause of death
Geographic classification
Data quality

DT_FORMATS_2018.ZIP contains FHOP's current format library for use with death certificate and fetal death data (1980-2018)

SD_GEOCODE7.PDF summarizes work to evaluate the quality of address data in the California Death Statistical Master file in 2005 (before electronic death registration) and 2007 (after electronic death registration). It also compares the accuracy of two geocoding systems used in California at that time.

6. Maintaining Hospital Formats

Structure of formats program
OSHPD facility labels
Centers for Medicare and Medicaid Services
Clinical Classification System (CCS)
Injury Classification

OSH_FORMATS_2019.ZIP contains the SAS format library FHOP currently uses for OSHPD inpatient admissions (1983 to 2018) and emergency department and ambulatory care encounters (2005 to 2018). The files listed below are the source for the formats.

DXFH2018.ZIP contains the last cross-classified lists of ICD-9 diagnoses (1983 to Sep-2015). This file is the source for formats that variously classify ICD-9 diagnosis codes.

DXTFH2018.ZIP contains the cross-classified lists of ICD-10 diagnoses (Oct-2015 to Dec-2017). This file is the source for formats that variously classify ICD-10 diagnosis codes.

GEMI9I10.ZIP contains the longitudinal crosswalk between the ICD-9 and ICD-10 diagnosis codes. This file is the source for formats that back-classify ICD-9 codes to be consistent with current ICD-10 groupings.
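
As a rough illustration of what back-classifying through a crosswalk involves, the sketch below maps an ICD-9 code to its ICD-10 equivalents and then looks up the ICD-10-based group. It is written in Python rather than as a SAS format, and the dictionaries, group labels, and function name are hypothetical stand-ins; the layout of the actual crosswalk file is not shown here.

```python
# Hypothetical, hand-made fragments for illustration only. The real inputs
# would come from the crosswalk and classification files.
ICD9_TO_ICD10 = {
    "25000": ["E119"],  # illustrative ICD-9 -> ICD-10 mapping
    "4019": ["I10"],
}
ICD10_TO_GROUP = {
    "E119": "Diabetes",      # illustrative group labels, not official ones
    "I10": "Hypertension",
}


def back_classify(icd9_code):
    """Return the sorted ICD-10-based groups that an ICD-9 code maps to."""
    groups = set()
    for icd10_code in ICD9_TO_ICD10.get(icd9_code, []):
        group = ICD10_TO_GROUP.get(icd10_code)
        if group is not None:
            groups.add(group)
    return sorted(groups)
```

An ICD-9 code that maps to several ICD-10 codes can land in more than one group, which is why the sketch returns a list; how such cases are resolved in practice is a design decision this example does not address.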

ICD10_CONVERSION_2020.PDF describes the work to validate the longitudinal GEMs crosswalk between the ICD-9 and the ICD-10 diagnosis codes, with a focus on the Clinical Classification System and particularly on mental health (DXCH06) and conditions occurring during pregnancy, birth, and the puerperium (DXCH11).

PXAH2018.ZIP contains the last cross-classified lists of ICD-9 procedures (1983 to Sep-2015). This file is the source for the formats that variously classify ICD-9 procedure codes. CCS did not update ICD-9 procedure codes in 2015.

PXTFH2018.ZIP contains the cross-classified lists of ICD-10 procedures (Oct-2015 to Dec-2017). This file is the source for the formats that variously classify ICD-10 procedure codes.

7. Maintaining Geography Formats

The need for longitudinal geographic datasets
Standard administrative boundaries
Planning and policy geography
Data sets with geographic boundaries

GEOG_FORMAT_INPUT_2020.ZIP contains the SAS programs and the current input Excel file used to make the formats.

GEOG_FORMATS_2020.ZIP contains the full set of California geography formats in current use.

8. Annual Hospital Disclosure Report

Primary hospital data sets
Preparing AHDR data
Reconciling hospital events

9. Hospital Crosswalk

Why crosswalk is needed
Crosswalk methods and results
Crosswalk validation
Example: Hospital-level race/ethnic data quality
   The following files have longitudinal hospital-level results
   Birth Certificate
   Patient Discharge 
   Emergency Department

10. Population Master Files

Department of Finance
National Population Estimates
Intercensal Small Area Population Estimates


11. Geocoding Addresses

This document describes steps to pre-clean, geocode, and then select the best geocoded version of about 26 million addresses in California population health files from 2007 forward. The results are used mainly to link records across various data sources. The process included attaching Census sub-regions such as tract, block group, and block, which can be used for mapping or to merge with other datasets containing small-area measures such as the percent of single-parent households, crowding, or community education levels.
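
The pre-clean and select-best steps can be pictured with a minimal Python sketch. It assumes each geocoder returns a candidate with a numeric match score; the field names, the cleaning rules, and the "highest score wins" rule are illustrative assumptions, not the documented procedure.

```python
import re


def preclean(address):
    """Uppercase, replace punctuation (except '#') with spaces, collapse whitespace."""
    text = address.upper()
    text = re.sub(r"[^\w\s#]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def select_best(candidates):
    """Return the candidate with the highest match score, or None if there are none.

    Each candidate is a dict with a hypothetical 'score' key. On ties,
    the first candidate in the list is kept.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])
```

For example, `preclean("123  Main St., Apt. 4")` yields `"123 MAIN ST APT 4"`, and `select_best` would then choose among the versions of that address returned by different geocoding runs.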


12. Preparing Hospital Data for Linkage: Family Health History

For the population age 0 to 49, this document describes methods to classify all records from the period 1983 forward for people admitted to inpatient hospitals, and from 1995 forward for people admitted to hospital-based emergency departments and ambulatory surgery centers. Given the 2015 transition from ICD-9 to ICD-10, diagnoses before the transition were back-classified using methods recommended by the federal Agency for Healthcare Research and Quality. For records from 1990 forward, when social security numbers (SSN) became available, admissions of "people" based on SSN were linked longitudinally. For infants and pregnant women, additional population-specific variables were calculated to allow more specific analysis and also to facilitate linkage within the hospital files and over to birth, death, and fetal death files.
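
Linking admissions of the same "person" by SSN amounts to grouping records on the identifier and ordering them in time. The Python sketch below shows that idea under simple assumptions: records are dictionaries with hypothetical `ssn` and `admit_date` keys, dates are ISO strings (which sort chronologically), and records without an SSN are left unlinked. It does not reproduce the document's actual rules.

```python
from collections import defaultdict


def link_by_ssn(records):
    """Assign a person_id per SSN and a chronological admit_seq per admission."""
    by_ssn = defaultdict(list)
    for rec in records:
        ssn = rec.get("ssn")
        if ssn:  # records with a missing SSN cannot be linked this way
            by_ssn[ssn].append(rec)

    linked = []
    for person_id, ssn in enumerate(sorted(by_ssn), start=1):
        admissions = sorted(by_ssn[ssn], key=lambda r: r["admit_date"])
        for seq, rec in enumerate(admissions, start=1):
            linked.append({**rec, "person_id": person_id, "admit_seq": seq})
    return linked
```

The output keeps every input field and adds two illustrative variables, so a first admission can be told apart from later readmissions of the same person.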


13. Linking Hospital and Death Data

Using the population age 0 to 24, this document describes methods to classify and link injuries for the period 1994-1997 in the inpatient files and death files.

Issues and Decisions to be made on Collecting, Coding and Reporting Race and Ethnicity for Public Health Indicators

The “Race/Ethnicity Guidelines”, approved in 2003 by the California Directors of Public Health (CDPH) and Health and Human Services (CHHS) for use by all programs, explicitly did not address how to handle multi-race coding for trend analysis. Further, the National Center for Health Statistics (NCHS) had not yet provided guidance on what to do when the same groups are not available over time or when there is a mismatch between groups in the numerator and denominator. This document discusses issues related to developing a standardized approach to coding and reporting race and ethnicity for data sets maintained by CDPH. The focus is on using these data sets to explore race/ethnic differences in indicators of health status and outcomes over time. (September 2011)

Creating Longitudinal Hospital-Level Data Sets

Per California regulations, hospital licenses are based on a given physical location. When hospitals disappear from various data files, the explanation is not readily apparent. We must determine whether it is because the facility closed, merged, converted to consolidated reporting, or moved, resulting in a new license ID. Yet another possibility is that a new license ID was assigned to a facility at the same location. We developed a series of decision rules to resolve such issues in a longitudinally consistent manner. These included rules to handle changes in hospital identifiers, physical location, consolidated data reporting, ownership, organizational type, and structural capacity. This document provides a full discussion of the issues encountered in creating the hospital-level data sets, their resolution, and the creation of related analysis data sets and variables. (June 2004)

Methods to Prepare Hospital Discharge Data

OSHPD distributes Patient Discharge Data (PDD) to qualified researchers such as the Family Health Outcomes Project (FHOP). The FHOP human subjects protocols permit us to have the confidential PDD, for all discharges and ages, from 1983 forward. Currently we have processed all years through 2000 and are about to start with the 2001 and 2002 files. This document presents an overview of the methods we developed to create the core files we use as the source for the different PDD-based research and data products that FHOP distributes. (June 2004)