Human Breathprints as Biometric Identifiers: Assessing Uniqueness and Reliability
Traditional biometrics (fingerprint, face, iris) are effective but can be intrusive or inconvenient in some healthcare contexts. This project explores breath as a non-invasive biometric: you simply exhale into a sensor, producing a volatile organic compound (VOC) “signature” that may be distinctive enough to identify an individual within a group.
Research question
Can VOC patterns in exhaled breath uniquely and reliably identify an individual (i.e., act like a fingerprint), despite natural variability caused by factors like diet, health, environment, and time?
- Assess uniqueness via patterns across individuals
- Reduce high-dimensional VOC data into informative components
- Prevent overfitting given limited samples
- Evaluate classification performance using standard metrics
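The overfitting and evaluation objectives above pair naturally with stratified cross-validation, so every fold contains samples from all subjects. A minimal sketch with scikit-learn, using synthetic stand-in data (the real VOC matrix is not reproduced in this document, and the simple scaler-plus-classifier model here is illustrative only):

```python
# Hedged sketch: stratified cross-validation with the metrics reported later.
# Synthetic stand-in for the breath data: 11 subjects, many features, few samples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=110, n_features=200, n_informative=30,
                           n_classes=11, n_clusters_per_class=1, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # each fold keeps all 11 classes
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision_weighted",
                                 "recall_weighted", "f1_weighted"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```

Averaging metrics over folds gives a less optimistic estimate than a single train/test split, which matters when each subject contributes only a handful of samples.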
Dataset & constraints
The dataset contains breath samples from 11 subjects collected across multiple days. VOCs were measured using mass spectrometry methods (SESI-MS / Q-TOF in the source study).
- High dimensionality: many VOC features
- Small sample size: risk of overfitting
- Variability: VOCs fluctuate over time
Pipeline (Iteration 1)
Goal: reduce dimensionality early, then select features.
- StandardScaler (normalise feature scales)
- PCA (retain most variance, reduce dimensions)
- LDA (supervised separation between subjects)
- Regression-based feature selection: Lasso / Ridge / ElasticNet
- Classifiers: Logistic Regression, SVM, Random Forest
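The Iteration 1 stages above can be chained as a single scikit-learn Pipeline. This is a sketch, not the study's exact configuration: the PCA variance threshold, LDA component count, Lasso alpha, and synthetic data are all placeholder assumptions.

```python
# Hedged sketch of Iteration 1: scale -> PCA -> LDA -> Lasso selection -> classifier.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=110, n_features=200, n_informative=30,
                           n_classes=11, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                           # normalise feature scales
    ("pca", PCA(n_components=0.95)),                       # keep 95% of the variance
    ("lda", LinearDiscriminantAnalysis(n_components=10)),  # at most n_classes - 1 dims
    ("select", SelectFromModel(Lasso(alpha=0.01))),        # regression-based selection
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipe.score(X_te, y_te):.3f}")
```

Swapping the `"clf"` step for `SVC()` or `RandomForestClassifier()` covers the other two rows of the results table below; keeping everything in one Pipeline ensures each stage is fitted only on training data.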
Pipeline (Iteration 2)
Goal: test whether ordering matters (feature selection first).
- StandardScaler
- Regression feature selection first (Lasso / Ridge / ElasticNet)
- LDA after selection (maximise class separability)
- Same classifiers: Logistic Regression, SVM, Random Forest
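Iteration 2 uses the same stages with only the order changed: Lasso-based selection now runs before LDA. As in the Iteration 1 sketch, the alpha value and synthetic data are illustrative assumptions.

```python
# Hedged sketch of Iteration 2: scale -> Lasso selection -> LDA -> classifier.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=110, n_features=200, n_informative=30,
                           n_classes=11, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.01))),  # selection happens first
    ("lda", LinearDiscriminantAnalysis()),           # supervised projection afterwards
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipe.score(X_te, y_te):.3f}")
print(f"features kept by Lasso: {pipe.named_steps['select'].get_support().sum()}")
```

The design difference is subtle but real: here LDA sees only the VOC features Lasso kept, whereas in Iteration 1 it operated on PCA components that mix all features together.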
Key finding: ordering matters
The core insight from this work is that changing the sequence of dimensionality reduction and feature selection can significantly affect classification outcomes. In my experiments, the second iteration (feature selection → LDA) produced a major improvement in Logistic Regression performance.
Iteration 1 — Classification metrics
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Logistic Regression | 25.64% | 28.61% | 25.64% | 24.46% |
| SVM | 30.77% | 40.72% | 30.77% | 33.03% |
| Random Forest | 33.33% | 32.46% | 33.33% | 32.01% |
Interpretation: performance is modest across the board; the best accuracy here comes from Random Forest (~33%), suggesting the problem is hard under this pipeline and/or the dataset itself is challenging (small and highly variable).
Iteration 2 — Classification metrics
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Logistic Regression | 56.41% | 65.81% | 56.41% | 55.99% |
| SVM | 41.03% | 61.05% | 41.03% | 44.74% |
| Random Forest | 38.46% | 32.46% | 33.33% | 32.01% |
Interpretation: Logistic Regression improves dramatically under the revised ordering, indicating that the “shape” of the feature space created by the pipeline can be more important than the classifier choice alone.
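A note on the metrics: in most rows of both tables, recall equals accuracy, which is exactly what support-weighted recall reduces to, so the figures appear to use weighted averaging. A toy sketch of that computation (the labels below are illustrative, not real breath data):

```python
# Weighted-average multiclass metrics: weighted recall collapses to accuracy.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2, 2]  # toy ground truth for three classes
y_pred = [0, 1, 1, 1, 2, 0, 2]  # toy predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# → accuracy=0.7143 precision=0.7619 recall=0.7143 f1=0.7143
```

Because weighted recall and accuracy coincide, the independently informative columns in the tables are effectively accuracy, weighted precision, and weighted F1.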
What the results suggest
- Promise: Breathprints appear to carry identifying information, but performance depends heavily on processing choices.
- Hard problem: High-dimensional signals + temporal variability make classification non-trivial.
- Representation matters: Better feature space can unlock stronger results even with simpler models.
Limitations & future work
- Scale: Larger labelled datasets to improve generalisation
- Stability: Evaluate repeatability across longer time windows
- Methods: Explore non-linear reduction and stronger ensembles
- Validation: Test on unseen cohorts / external datasets
Next steps would focus on improving robustness and validation to move from “promising” to “reliable in practice”.