Classification and characterization of food with Random Forest approaches

Stephan Seifert

University of Hamburg, Hamburg/Germany

In order to comprehensively exploit the complex data generated for the authentication of food using spectroscopic and spectrometric techniques, multivariate chemometric methods are applied. These methods are divided into unsupervised approaches, which are applied without information about class memberships, and supervised approaches, by which classification models are trained using samples with known class memberships. The latter are machine learning approaches that are often applied as black boxes, meaning that only the class assignment is reported, while the reasoning that led to this decision remains unknown.

Random forest (RF) is a non-parametric machine learning approach that consists of a large number of individual binary decision trees and has many advantages, such as flexibility in terms of input and output variables and the possibility of internal validation.[1] Another advantage is the ability to generate variable importance measures that are used to select relevant features. However, the relationships between the predictor variables are usually not examined. We developed a novel RF-based variable selection approach called Surrogate Minimal Depth (SMD) that incorporates relations into the selection process of important variables.[2] This is achieved by exploiting surrogate variables, which were originally introduced to deal with missing values in predictor variables. In addition to improving variable selection, surrogate variables and their relationship to the primary split variables can also be utilized as a proxy for the relations between the different variables. This relation analysis goes beyond the investigation of ordinary correlation coefficients because it is based on the mutual impact on the outcome.
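The surrogate idea can be illustrated with a minimal, pure-Python sketch. It is not the SurrogateMinimalDepth implementation (which is an R package) and it covers only a single split, not a forest. The function names adjusted_agreement and best_surrogate, the synthetic data, and the threshold of 0 are assumptions made for illustration. The sketch uses the adjusted agreement as it is commonly defined for CART-style surrogate splits: the fraction of samples that a candidate split sends to the same side as the primary split, corrected for what the trivial majority rule would already achieve.

```python
import random


def adjusted_agreement(primary_left, candidate_left):
    """Adjusted agreement between a primary split and a candidate surrogate.

    Both arguments are lists of booleans (True = sample goes to the left child).
    Returns a value <= 1; values <= 0 mean the candidate is no better than
    sending every sample to the larger child (majority rule).
    """
    n = len(primary_left)
    agree = sum(p == c for p, c in zip(primary_left, candidate_left))
    # A surrogate may point in the opposite direction, so use the better orientation.
    agree = max(agree, n - agree)
    n_left = sum(primary_left)
    majority = max(n_left, n - n_left)
    if majority == n:  # degenerate primary split, nothing to improve on
        return 0.0
    return (agree - majority) / (n - majority)


def best_surrogate(x_primary, threshold, x_other):
    """Search all cut points of x_other for the best surrogate of
    the primary split 'x_primary <= threshold'.

    Returns (adjusted agreement, cut point); the cut point is None if no
    candidate beats the majority rule.
    """
    primary_left = [v <= threshold for v in x_primary]
    best = (0.0, None)
    for cut in sorted(set(x_other)):
        candidate_left = [v <= cut for v in x_other]
        adj = adjusted_agreement(primary_left, candidate_left)
        if adj > best[0]:
            best = (adj, cut)
    return best


# Hypothetical data: x1 is a noisy copy of x0, x2 is unrelated noise.
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(200)]
x1 = [v + rng.gauss(0.0, 0.3) for v in x0]
x2 = [rng.gauss(0.0, 1.0) for _ in range(200)]

adj_related, cut_related = best_surrogate(x0, 0.0, x1)
adj_unrelated, cut_unrelated = best_surrogate(x0, 0.0, x2)

print("adjusted agreement, related variable:  ", round(adj_related, 3))
print("adjusted agreement, unrelated variable:", round(adj_unrelated, 3))
```

In this toy setting the related variable should reach a clearly higher adjusted agreement than the unrelated one. Averaging such agreements over the splits of many trees is the kind of quantity that can serve as a relation measure between variables; the averaging step is omitted here.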

This talk will introduce RF and the SMD approach to open the black box for the classification of analytical data. In addition, their application to the comprehensive classification and characterization of food, such as asparagus, will be demonstrated.[3]

R Package: https://github.com/StephanSeifert/SurrogateMinimalDepth.

References

[1] L. Breiman et al., Classification and Regression Trees, Taylor & Francis, 1984.

[2] S. Seifert et al., Bioinformatics 2019, 35, 3663-3671.

[3] S. Wenck et al., Metabolites 2022, 12, 5.