Estimation of Low-Density Lipoprotein Cholesterol Concentration Using Machine Learning
Abstract
Objective: Low-density lipoprotein cholesterol (LDL-C) can be estimated using the Friedewald and Martin-Hopkins formulas. We developed LDL-C prediction models using multiple machine learning methods and compared their validity with that of these established formulas.
Methods: Laboratory data (n = 59,415) on measured LDL-C, high-density lipoprotein cholesterol (HDL-C), triglycerides (TG), and total cholesterol were partitioned into training and test data sets. Linear regression, gradient-boosted trees, and artificial neural network (ANN) models were built on the training data. Paired-group comparisons were performed using a t-test and the Wilcoxon signed-rank test. We considered P values <.001 to be statistically significant.
Results: For TG ≥177 mg/dL, the Friedewald formula underestimated and the Martin-Hopkins formula overestimated LDL-C (P <.001), and this bias was more pronounced for LDL-C <70 mg/dL. The linear regression, gradient-boosted trees, and ANN models outperformed both formulas for TG ≥177 mg/dL and LDL-C <70 mg/dL, based on comparison with a homogeneous assay (P >.001 vs. P <.001) and on classification accuracy.
Conclusion: Linear regression, gradient-boosted trees, and ANN models offer more accurate alternatives to the Friedewald and Martin-Hopkins formulas, especially for TG 177 to 399 mg/dL and LDL-C <70 mg/dL.
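For orientation, the Friedewald formula named in the abstract estimates LDL-C (in mg/dL) as total cholesterol minus HDL-C minus TG/5. The sketch below is a minimal illustration of that baseline, not the study's code: the function names, the example values, and the rejection of TG at or above 400 mg/dL (a commonly cited validity limit for the formula) are assumptions made here. The Martin-Hopkins formula is omitted because it replaces the fixed divisor 5 with a table-derived factor that is not given in this text.

```python
def friedewald_ldl(total_chol, hdl_chol, triglycerides):
    """Estimate LDL-C in mg/dL as TC - HDL-C - TG/5 (Friedewald formula)."""
    # Assumption: treat TG >= 400 mg/dL as outside the formula's usual range.
    if triglycerides >= 400:
        raise ValueError("Friedewald estimate not applied for TG >= 400 mg/dL")
    return total_chol - hdl_chol - triglycerides / 5.0


def below_threshold(ldl_c, threshold=70.0):
    """Return True when LDL-C falls below a decision threshold (mg/dL)."""
    return ldl_c < threshold


# Hypothetical input values in mg/dL, chosen only for illustration.
estimate = friedewald_ldl(total_chol=200.0, hdl_chol=50.0, triglycerides=150.0)
print(estimate)                   # 120.0
print(below_threshold(estimate))  # False
```

A classification check such as `below_threshold` is one simple way to compare an estimate against a measured value around the 70 mg/dL cutoff discussed in the abstract.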