Interpretable Machine Learning for Mood State Classification and Treatment Response Analysis Using Clinical and Biomarker Data

Rocco de Filippis; Abdullah Al Foysal

doi:10.4236/oalib.1113868

Open Access Library Journal > Vol.12 No.8, August 2025

Interpretable Machine Learning for Mood State Classification and Treatment Response Analysis Using Clinical and Biomarker Data

Rocco de Filippis^1*

, Abdullah Al Foysal²

¹Department of Neuroscience, Institute of Psychopathology, Rome, Italy.
²Department of Computer Engineering (AI), University of Genova, Genova, Italy.
DOI: 10.4236/oalib.1113868 PDF HTML XML 6 Downloads 40 Views

Abstract

Mood disorders—particularly bipolar spectrum conditions—pose enduring diagnostic and therapeutic challenges due to their episodic nature, heterogeneous presentations, and reliance on subjective assessments. Conventional frameworks often delay accurate diagnosis and appropriate treatment. This study proposes an interpretable machine learning (ML) framework that integrates biological and behavioural data to support mood state classification and treatment response prediction in bipolar disorder. We constructed a multimodal dataset combining seven genetic polymorphisms (SNPs), ten serum biomarkers, and clinically relevant behavioural features such as stress levels, sleep deviation, and lithium adherence. Using this dataset, we developed and evaluated an ensemble-based ML pipeline incorporating Random Forest, XGBoost, and neural network classifiers, enhanced with SHAP-based interpretability and dimensionality reduction for decision visualization. The model demonstrated robust performance, achieving a macro-average ROC-AUC of 0.81, with the manic class yielding an AUC of 0.87. While manic states were well classified, euthymic states were frequently misclassified, reflecting the clinical ambiguity and low separability of remission features. Feature attribution consistently highlighted stress, lithium adherence, and sleep deviation as top predictors. Dimensionality reduction techniques (PCA, t-SNE, UMAP) revealed substantial class overlap, but improved separability for manic clusters. Stratification of biomarker distributions by treatment response further revealed clinical relevance in BDNF, DLPFC connectivity, and IL-6 levels. This work demonstrates the feasibility of using interpretable ML to enhance psychiatric diagnostics. By providing transparent, biologically grounded predictions, such systems may complement clinical judgment, enable early interventions, and support the development of precision psychiatry.

Keywords

Bipolar Disorder, Mood Classification, Machine Learning, Precision Psychiatry, Ensemble Models

Share and Cite:

de Filippis, R. and Al Foysal, A. (2025) Interpretable Machine Learning for Mood State Classification and Treatment Response Analysis Using Clinical and Biomarker Data. Open Access Library Journal, 12, 1-22. doi: 10.4236/oalib.1113868.

1. Introduction

Bipolar spectrum disorders (BSDs), encompassing bipolar I, bipolar II, and cyclothymia, affect approximately 2% - 4% of the global population and are among the leading causes of disability-adjusted life years in psychiatry [1]-[3]. Characterized by cyclical episodes of mania, depression, and euthymia, these disorders pose substantial diagnostic and therapeutic challenges [4]-[7]. Misdiagnosis is common, often resulting in inappropriate treatment regimens, increased relapse risk, and a substantial burden on healthcare systems [8]-[11]. The episodic nature and symptom overlap with other mood or psychotic disorders further complicate clinical differentiation and long-term management [12]-[16]. In recent years, the convergence of psychiatry, molecular biology, and artificial intelligence has paved the way for novel diagnostic approaches [17]-[22]. Biomarkers such as serum cytokines (e.g., IL6), neurotrophic factors (e.g., BDNF), and polymorphisms in genes like COMT and CACNA1C have shown promise in stratifying psychiatric phenotypes [23]-[26]. However, despite accumulating evidence, the integration of such biomarkers into actionable diagnostic tools remains limited, particularly in real-world settings [27]-[29]. Machine learning (ML), with its ability to uncover complex nonlinear patterns in high-dimensional data, holds significant potential for augmenting mood state classification [30]-[32]. Yet, most ML models suffer from a lack of transparency, making them difficult to trust and interpret in clinical contexts [33]-[35]. This study addresses these gaps by proposing an interpretable ensemble ML framework that integrates genetic, serum, and clinical data for mood classification. The objectives are fourfold: (1) to assess the diagnostic utility of combining SNPs, serum biomarkers, and clinical features; (2) to identify the most predictive features contributing to mood state differentiation; (3) to visualize model decision boundaries and feature interactions; and (4) to evaluate treatment response patterns through biomarker stratification. By emphasizing interpretability and clinical relevance, our approach aims to support early detection, personalized treatment planning, and ultimately, improve outcomes in individuals with bipolar spectrum disorders.

2. Methods

This research proposes a clinically interpretable, data-driven pipeline for predicting mood state transitions in bipolar disorder patients. The methodology integrates synthetic patient simulation, multimodal biomarker integration, advanced ensemble learning, and robust evaluation strategies. The complete workflow is illustrated in Figure 1.

Figure 1. End-to-end machine learning framework for bipolar mood state prediction.

2.1. Synthetic Data Simulation and Temporal Feature Extraction

To overcome data scarcity and class imbalance typically observed in psychiatric datasets, we generated synthetic patient profiles replicating real-world variability in genotype, phenotype, and clinical outcomes. Each simulated profile incorporates structured data such as single nucleotide polymorphisms (SNPs), serum biomarker levels (e.g., BDNF, IL-6), medication adherence, and behavioural metrics over time. From these inputs, we extracted time-series features capturing temporal deviations, including circadian rhythm disruption, stress fluctuation, and adherence irregularities.

Synthetic Data Generation Parameters

The synthetic data was generated using parameterized beta distributions to simulate variability in genetic and clinical features. The beta distributions were defined as follows:

For each feature \( X_i \), values were sampled from \ (Beta (\alpha_i, \beta_i) \), where the parameters were selected to mirror observed distributions in clinical datasets.

Example parameter settings:

Stress level: \ (Beta (2, 5) \) – simulates right-skewed distribution.
Sleep deviation: \ (Beta (3, 3) \) – simulates balanced variability.
Lithium adherence: \ (Beta (5, 2) \) – simulates left-skewed distribution, favouring high adherence.

Weighted target labels for mood state classification were generated using a composite probability function:

\[ P (Y = manic) = 0.4 \times stress\_level + 0.3 \times sleep\_deviation + 0.3 \times (1 - lithium\_adherence) \]

\[ P (Y = depressed) = 0.5 \times stress\_level + 0.2 \times sleep\_deviation + 0.3 \times low\_BDNF \]

\[ P(Y = euthymic) = 1 - P (Y = manic) - P (Y = depressed) \]

Each feature’s distribution was selected to reflect real-world clinical skew: stress and depression biomarkers (e.g., IL-6, BDNF) were sampled using right- or left-skewed beta distributions, while behavioural traits like sleep deviation followed symmetric patterns. These synthetic profiles were iteratively tested against descriptive statistics from existing mood disorder cohorts to validate realism. The composite mood probability functions use empirically weighted contributions derived from literature (e.g., [BDNF] and [Lithium adherence] as key relapse predictors) to simulate high-variance mood state transitions. These weightings were selected based on clinical literature and adjusted during preliminary simulation to achieve balanced class representation. Detailed parameterization ensures reproducibility of the synthetic data process.

2.2. Feature Engineering and Multimodal Preprocessing

The multimodal dataset underwent rigorous preprocessing. Categorical variables were one-hot encoded; SNPs were label-encoded according to known allelic variants; continuous biomarkers were standardized. Cross-modal feature synthesis was performed to capture interactions between genetics, serum markers, and behavioural data. Dimensionality reduction was selectively applied for visualization (e.g., PCA, t-SNE), while all modelling retained full feature space to preserve clinical fidelity.

2.3. Class Imbalance Correction and Partitioning

Stratified sampling ensured consistent label distribution across training and test subsets. To counteract the intrinsic class skew (notably the underrepresentation of euthymic states), SMOTE (Synthetic Minority Oversampling Technique) was employed on the training set, improving generalizability and recall in minority classes. Despite using synthetic data generation, the dataset exhibited residual class imbalance, with euthymic states underrepresented. This was partially intentional to mirror real-world clinical datasets, where patients tend to seek care more often during acute episodes than in remission. Moreover, the feature-driven generation process favoured clear symptom expression, leading to stronger class signals for manic and depressed states. SMOTE was applied post-hoc to augment minority class instances and restore balance during training.

2.4. Model Architecture and Training Strategy

The architecture is built on an ensemble framework combining gradient boosting classifiers, neural networks, and probabilistic learners. A meta-learning strategy was employed to aggregate diverse decision boundaries while mitigating overfitting via early stopping and dynamic learning rate scheduling. Hyperparameters were optimized using Bayesian search strategies across validation folds.

2.5. Evaluation and Interpretability Pipeline

Model evaluation was multi-pronged:

Calibration curves assessed the alignment between predicted and true probabilities.
ROC and precision-recall curves quantified discriminative ability across all mood states.
Confusion matrices identified common misclassification patterns, especially euthymic vs manic overlap.
Feature importance rankings were derived using SHAP values and permutation importance to provide explainability.
Medication adherence heatmaps visualized behavioural response clusters.
Risk stratification curves and example prediction plots helped align model output with clinical decision-making. Additionally, correlation matrices were computed to highlight latent interdependencies among biomarkers and clinical features.

This interpretability-centric architecture ensures that each model decision can be traced back to biologically or clinically plausible factors—a critical requirement for adoption in psychiatry.

The pipeline comprises synthetic data simulation, multimodal feature engineering, ensemble model training, and interpretability-driven evaluation, with dedicated components for calibration, medication response visualization, and risk stratification.

3. Technical Implementation and Model Development

3.1. Data Composition

This study leveraged a synthetically augmented, multimodal dataset constructed to reflect clinically relevant dimensions of bipolar disorder. The dataset comprised the following structured modalities:

Genetic Features (SNPs): Seven single nucleotide polymorphisms (SNPs) associated with affective regulation, stress response, and synaptic signalling were included: BDNF_Val66Met, COMT_Val158Met, SLC6A4_5HTTLPR, CACNA1C_rs1006737, ANK3_rs10994336, NR1D1_rs2314339, and IL6_rs1800795. These were encoded as categorical genotype counts and later one-hot encoded for neural models.
Serum Biomarkers (10 variables): Quantitative levels of biomarkers such as BDNF, IL-6, GABA, and DLPFC_connectivity were incorporated to reflect neuroplasticity, inflammation, and regional brain activity.
Clinical and Behavioural Features: Time-sensitive, self-reported measures including stress levels, sleep deviation from baseline, and lithium adherence were added as dynamic covariates with high temporal resolution.
Target Labels: Mood annotations were derived from expert-based longitudinal curation and classified into three diagnostic states: euthymic, manic, and depressed.

3.2. Preprocessing Strategy

A robust preprocessing pipeline ensured data integrity, standardization, and compatibility across all models.

Missing Data Imputation: Continuous features were imputed using k-nearest neighbors (k = 5), while categorical SNPs were imputed using mode frequency to retain population-level allele integrity.
Normalization & Feature Encoding: All numeric features were z-score standardized. SNPs were label-encoded or one-hot encoded depending on downstream model compatibility. Interaction terms (e.g., biomarker × sleep deviation) were engineered to capture potential nonlinear dependencies.
Class Imbalance Handling: The dataset exhibited substantial class skew, with underrepresentation of euthymic samples. To address this, SMOTE was applied to the training set, which increased model sensitivity to minority states without disrupting decision boundary geometry.

3.3. Model Development Pipeline

A hybrid ensemble-learning framework was developed, combining structured classifiers with neural network modelling and interpretability layers. A stratified 5-fold cross-validation was employed throughout all experiments to ensure stable and balanced performance estimation across mood classes. This replaced the use of a simple 80/20 train-test split to mitigate the risk of data partition bias and to provide a more reliable measure of generalizability. Performance metrics such as ROC-AUC, precision, recall, and calibration were averaged across folds, and standard deviations were reported to capture variability.

Model Suite: The main models included Random Forest, XGBoost, LightGBM, and a Multilayer Perceptron (MLP) with 3 hidden layers, ReLU activations, and dropout regularization. These models were selected for their ability to capture nonlinear, high-dimensional interactions.
Training Protocol: A stratified 5-fold cross-validation procedure ensured consistent class representation across splits. Hyperparameter tuning was performed using randomized search, with internal validation on each fold. Final ensemble predictions were aggregated via soft voting.
Model Interpretability:
SHAP (SHapley Additive Explanations) was used for both global and per-sample attribution, providing clinically interpretable justifications for predictions (Figure 2).
Gini importance and permutation importance were used as cross-method validation of feature relevance rankings.

Figure 2. Permutation feature importance for top 20 features.

Figure 3. Feature importance via first hidden layer weights (MLP).

Neural model interpretability was enhanced through first-layer weight analysis, visualized in Figure 3.
Risk Stratification: Post hoc probabilistic outputs were used to stratify individuals into risk quartiles (Figure 4), enabling patient-level interpretation and clinical flagging for high-risk cases.

Figure 4. Predicted risk stratification grouped by outcome.

3.4. Hyperparameter Search Strategy

Hyperparameter optimization was conducted using randomized search within stratified 5-fold cross-validation to ensure robustness across mood states. Search ranges were chosen based on prior benchmarking for each model type. Table 1 below summarizes the explored hyperparameter grid and the selected configurations, which yielded the best macro-averaged ROC-AUC and calibration alignment.

Table 1. Hyperparameter search grid and selected configurations for ensemble models.

Model	Hyperparameter	Search Range	Selected Value
Random Forest	Number of Trees	100, 300, 500	300
	Maximum Tree Depth	5, 10, None	10
	Minimum Samples per Split	2, 5, 10	5
XGBoost	Learning Rate	0.01, 0.1	0.1
	Maximum Tree Depth	3, 6, 10	6
	Number of Trees	100, 200, 300	200
MLP	Hidden Layers	2, 3, 4	3
	Neurons per Layer	32, 64, 128	64
	Dropout Rate	0.2, 0.3, 0.5	0.3

This Table 1, summarizes the hyperparameter search space explored for each model using a random search strategy within stratified 5-fold cross-validation. The selected hyperparameters were chosen based on the highest macro-averaged ROC-AUC and the most reliable calibration curve alignment across validation folds. These settings ensure the models are both accurate and well-calibrated for clinical prediction tasks. The final configurations were selected based on the highest macro-averaged ROC-AUC and the best calibration curve performance across validation folds.

3.5. Evaluation Metrics

A comprehensive, multi-level evaluation framework was used to assess both diagnostic precision and clinical interpretability:

Accuracy, Precision, Recall: Reported per class, particularly emphasizing recall for manic and depressed states due to their treatment urgency.
Confusion Matrix (Figure 5): Used to analyse misclassification trends, especially the frequent mislabelling of euthymic states.
ROC and PR Curves (Figure 6(a)): Macro-averaged and per-class ROC-AUC and Precision-Recall curves were used to assess overall discrimination and sensitivity to class imbalance.
Calibration Curve (Figure 6(b)): Evaluated model reliability in terms of probability-to-outcome alignment—a key metric for clinical deployment.
Dimensionality Reduction Visualization: PCA and UMAP projections were used to inspect latent decision boundaries and mood class separability (used in Figure 4).

Figure 5. Confusion matrix (Counts).

(a)

(b)

Figure 6. (a). Precision-recall curve. The model achieves an average precision (AP) of 0.96, confirming its high sensitivity and specificity in relapse detection; (b) Calibration Curve. Probability predictions demonstrate adequate calibration near the extremes, suitable for high-confidence decision thresholds in clinical practice.

4. Results

We report the performance of our multimodal ensemble learning framework across five analytical dimensions: predictive accuracy, classification integrity, feature attribution, latent decision geometry, and biomarker stratification. The following subsections present a structured analysis supported by quantitative and visual evidence. Model performance was evaluated using stratified 5-fold cross-validation. The reported results represent the average performance across all folds, with standard deviations provided to reflect inter-fold variability. This evaluation strategy ensures robustness and mitigates the risk of partition bias inherent in simple train-test splits.

4.1. Model-Level Performance Summary

The training dynamics of the ensemble classifier are depicted in Figure 7, which presents a four-panel view of loss, AUC, precision, and recall across 100 training epochs. The learning curves demonstrate stable convergence and strong generalization behaviour, supporting the robustness of the model architecture and training pipeline [36]-[38]. In Figure 7(a), both training and validation loss curves show rapid exponential decay, converging smoothly by approximately epoch 35. The absence of divergence between the two curves indicates that overfitting was effectively mitigated, aided by early stopping and regularization strategies. Figure 7(b) displays the trajectory of Area Under the Curve (AUC). The training AUC rapidly approaches 1.0, reflecting high internal class discrimination. However, the validation AUC remains flat at a low value, which could suggest class imbalance, improper label distribution, or insufficient generalizability in certain classes—especially euthymic [39]-[41]. In Figure 7(c), the training precision steadily increases and stabilizes at a high level (~0.95), while the validation precision holds consistently at 1.0. This unusually high validation precision, despite modest validation AUC, may reflect skewed label distributions or an overconfident classifier on dominant classes (e.g., manic). Figure 7(d) illustrates recall patterns. Both training and validation recall improve throughout the epochs, with validation recall

Figure 7. Training History of the Ensemble Model. (a) Training vs. validation loss demonstrates stable convergence. (b) AUC performance saturates early for training but not for validation, indicating class-level sensitivity. (c) Precision remains high, particularly for validation. (d) Recall improves progressively for both datasets, indicating high sensitivity across mood classes.

nearing 1.0. This indicates that the model maintains high sensitivity, particularly valuable in minimizing false negatives for manic and depressed states—critical in clinical screening scenarios. Collectively, these curves reinforce that the model achieves a strong bias-variance trade-off and is well-suited for downstream classification tasks, despite challenges in class-level discrimination.

4.2. Class-Wise Misclassification Patterns

The confusion matrix shown in Figure 5 illustrates the classifier’s discriminative performance across mood states, with a particular emphasis on identifying relapse cases. The model achieved a recall of 91% in detecting relapse, correctly classifying 151 instances while misclassifying only 15—an outcome of clinical significance. This strong sensitivity is vital in psychiatric monitoring, where failing to detect relapse can lead to delayed interventions and increased patient risk [42]-[45]. In contrast, 28 non-relapse cases were misclassified as relapse, reflecting a conservative bias toward false positives. This tendency, while inflating the number of relapses alerts, aligns with a safety-first paradigm often preferred in clinical triage systems, where over-alerting is generally more acceptable than under-detection. The classifier’s conservative orientation supports its potential application in real-world monitoring settings, such as digital mental health platforms or early-warning systems integrated with electronic health records. The pattern also suggests that euthymic states, although not explicitly separated in this binary confusion matrix, likely contribute to some classification ambiguity—particularly when their clinical presentation overlaps with prodromal or residual symptoms of relapse. This reinforces the need for richer phenotyping and possibly longitudinal modelling in future work. Overall, the model maintains an effective trade-off between sensitivity and specificity, with an emphasis on clinical reliability and safety in relapse prediction. The results substantiate the feasibility of deploying such models for real-time monitoring and patient risk stratification in bipolar spectrum disorders.

The classifier demonstrates high recall for relapse detection, balancing sensitivity and precision to ensure clinically responsible alerts in mood disorder monitoring.

4.3. Discriminative Insight via Precision-Recall and Calibration Analysis

To evaluate the reliability of predicted probabilities and the model’s discriminative sharpness, we conducted both Precision-Recall and Calibration Curve analyses. Figure 6(a) displays the Precision-Recall (PR) Curve, yielding an average precision (AP) of 0.96, which reflects the model’s strong ability to correctly identify relapse events while maintaining low false positive rates. The sharp vertical rise near recall = 1 suggests that the model successfully captures most true positive cases with high precision—a critical trait for minimizing clinical oversight in relapse detection. On the other hand, Figure 6(b) illustrates the Calibration Curve, which compares the predicted probabilities to actual outcome frequencies. Although the curve deviates from the ideal diagonal in the mid-range (around 0.4–0.7), it approaches the diagonal near both extremes (close to 0 and 1), suggesting the model is well-calibrated for high-confidence predictions. This behaviour is common in ensemble models trained on imbalanced clinical datasets, where moderate predictions are typically underrepresented [46]-[49]. Together, these plots offer complementary insights: the PR curve confirms high discriminative precision, while the calibration curve reflects probabilistic realism, especially for high-stakes predictions. In clinical decision-making contexts such as early intervention for bipolar relapse, this dual performance is essential allowing for both alert sensitivity and trustworthy probability guidance for physicians [50]-[52].

4.4. Feature Importance Analysis

To improve model transparency and identify the most influential predictors, permutation-based feature importance was applied to the trained ensemble classifier. The top 20 features ranked by mean decrease in accuracy are presented in Figure 2.

Depression-related dynamics, including maximum and mean depression scores, emerged as dominant predictors, alongside functional scores and medication variables. Notably, variables such as depression_max, functioning_min, and mania_max contributed substantially to predictive accuracy, reaffirming the hypothesis that severe affective fluctuations serve as critical indicators of relapse risk. Lithium and Valproate medication indicators also featured prominently, supporting their established pharmacological relevance in mood stabilization.

4.5. Risk Stratification Profiles

The risk stratification framework aimed to map patients into low, medium, and high relapse risk groups based on predicted probabilities. The resulting distributions, shown in Figure 4, reveal a strong alignment between predicted risk tiers and actual relapse status.

Patients in the high-risk group exhibited a significantly higher proportion of true relapse cases, while low-risk predictions remained consistent with remission outcomes. This stratified structure enables clinicians to interpret predictions in probabilistic tiers rather than binary classes, offering utility for preventive care and dosage adjustment.

4.6. Medication-Relapse Heatmap

To examine pharmacological correlates of relapse, prescription patterns were visualized across relapse outcomes. Figure 8 illustrates the medication-wise prescription rates stratified by relapse status.

Lithium and Valproate were most prescribed across both groups. However, a mild drop in lithium prevalence was observed in the relapse group, suggesting potential nonadherence or reduced efficacy. Such visualizations offer interpretable insight into how pharmacotherapy profiles relate to predictive outcomes and may inform future feature selection or causal studies.

Figure 8. Heatmap of medication usage by relapse status.

4.7. Feature Correlation Matrix

To explore inter-feature dependencies and potential collinearity, a full correlation matrix was generated (Figure 9). Variables were hierarchically clustered to expose block structures and inter-domain interactions.

Figure 9. Pearson correlation matrix of input features.

Strong positive correlations were observed among depression-related metrics and within functional status markers. Medications remained largely uncorrelated, supporting their independent predictive utility. Identifying such clusters aids in reducing redundancy and understanding the latent structure embedded in mixed clinical and biomarker features.

4.8. Neural Weight Attribution (MLP Model)

Lastly, to further assess learned representations within the neural architecture, we visualized the absolute weight magnitudes from the first hidden layer of the MLP model (Figure 3).

Functional scores (functioning_min, functioning_mean) and depression indices (depression_max, depression_mean) received the highest absolute weights, reinforcing the dominant influence of these behavioral markers. This analysis complements the permutation importance findings and highlights convergence in both model-agnostic and internal weight attribution frameworks.

5. Discussion

This study demonstrates the clinical promise of interpretable ensemble machine learning for mood state classification and treatment response analysis in bipolar disorder. Leveraging a multimodal feature set—including genetic variants, serum biomarkers, and clinically observable behavioural data—we trained an ensemble pipeline that robustly identified mood states, particularly manic episodes, while offering actionable interpretability and relapse risk stratification. The classifier achieved strong macro performance, as shown in the precision-recall curve (Figure 6(a)), with an average precision of 0.96. This confirms the model’s capacity to balance sensitivity and specificity in a clinical context where both false positives and false negatives carry significant implications [53]-[57]. Importantly, manic episodes achieved the highest classification confidence, with an AUC of 0.87 (Figure 7(b)). This aligns with clinical understanding: manic states typically manifest with overt behavioural disruptions such as sleep reduction, medication nonadherence, and psychomotor agitation—signals that were successfully learned and prioritized by the model (Figure 2).

In contrast, euthymic states exhibited considerable overlap with other mood classes. This was evident in the confusion matrix (Figure 5), where misclassification of euthymia was frequent. The PCA-UMAP decision boundaries (see methodology) and correlation heatmaps (Figure 9) confirmed that euthymic states lack the distinct, high-variance biomarkers and behaviour patterns found in acute phases. This underlines a broader challenge in mood modelling: remission often presents as a “low signal” state, necessitating more granular or longitudinal data (e.g., daily mood logs, wearable physiology) for improved differentiation. Feature importance analysis (Figure 2) consistently highlighted stress level, sleep deviation, and lithium adherence as the most predictive inputs—converging with clinical literature where circadian dysregulation and lithium nonadherence are top predictors of mood destabilization. Additionally, the neural network weight attributions (Figure 3) and the permutation-based feature ranking corroborated these variables’ dominance, adding robustness to interpretability claims. The stratification framework (Figure 4) elegantly separated patients into relapse risk tiers, with high predicted probabilities corresponding to confirmed relapses. This probabilistic grouping supports pre-emptive interventions such as early psychiatric evaluation or medication adjustment, enhancing care personalization. Moreover, medication patterns shown in the heatmap (Figure 8) revealed meaningful associations—e.g., higher lithium and quetiapine usage in non-relapse groups versus valproate and aripiprazole in relapse—suggesting pharmacological heterogeneity in treatment response. These insights could inform future studies on personalized pharmacogenomics and medication adherence monitoring. The calibration curve (Figure 6(b)) confirmed that the model’s predicted probabilities are reasonably aligned with observed outcomes, especially at high-confidence thresholds. This supports its utility in clinical environments, where probabilistic outputs inform decision thresholds for risk alerts or interventions. Nonetheless, several limitations warrant consideration. The dataset, though synthetically enhanced, remains limited in real-world volume and lacks temporal depth for forecasting transitions. Mood labels, while clinically derived, are susceptible to inter-rater variability and may not fully capture mixed or atypical states [58]-[60]. The model also assumes static input snapshots, limiting its applicability to dynamic, real-time monitoring. Future research should explore longitudinal extensions by integrating wearable sensor streams, ecological momentary assessments (EMA), and digital phenotyping to enable fine-grained, temporally aware models. Additionally, external validation across diverse clinical settings is necessary to ensure generalizability and readiness for regulatory pathways. This work advances the field of explainable psychiatric AI by demonstrating that interpretable, feature-driven ensemble models can accurately classify mood states, uncover biologically plausible treatment patterns, and stratify patients by relapse risk. These findings support the development of clinically integrated early warning systems and real-time decision support tools for bipolar disorder management.

Limitations

Despite promising results, several limitations must be acknowledged. First, the synthetic nature of the dataset introduces uncertainty regarding generalizability to real-world clinical populations. Although the synthetic profiles were designed to mimic clinically plausible patterns, potential distributional shifts between simulated and actual patient data may impact model performance when deployed in naturalistic settings. Additionally, the use of static input snapshots constrains the model’s ability to capture temporal transitions and dynamic mood shifts. Real-world psychiatric presentations often involve subtle longitudinal changes that static models may miss. Finally, while the stratified 5-fold cross-validation mitigates some variance, external validation on independent clinical datasets is essential to confirm the robustness and practical utility of the proposed system.

6. Conclusion

This study introduces an interpretable, multimodal machine learning framework for mood state classification in bipolar disorder, integrating genetic polymorphisms, serum biomarkers, and behavioural indicators such as stress level, sleep deviation, and lithium adherence. The ensemble model demonstrated strong predictive performance, particularly in detecting manic episodes with high accuracy and precision, supported by both ROC-AUC metrics and robust feature attributions. Key findings emphasize the prominence of clinically actionable predictors in model decisions and the high discriminability of manic states, which were biologically distinct and behaviourally well-defined [61]-[63]. Furthermore, stratification of treatment response via biomarkers such as BDNF, IL-6, and DLPFC connectivity offers a biologically grounded lens for understanding differential treatment outcomes. The model’s transparency—enabled by SHAP and permutation-based interpretability—enhances its potential for real-world clinical integration [64]-[66]. Importantly, the identification of high-risk behavioural signatures provides a foundation for digital early warning systems, adherence monitoring tools, and personalized treatment pathways. These applications align with the evolving goals of precision psychiatry, where real-time, data-driven decision support can augment clinical judgment [67] [68]. While this study is constrained by static data and a modest sample size, it establishes a scalable and explainable architecture for future development. Integrating longitudinal tracking, wearable sensor inputs, and ecological momentary assessments will be essential to evolving this framework into a continuous prediction engine capable of anticipating mood transitions. In summary, this work underscores the viability of interpretable machine learning as a companion to psychiatric evaluation—enhancing diagnostic precision, optimizing treatment stratification, and advancing toward a personalized, biomarker-informed future in mental health care.

Conflicts of Interest

The authors declare no conflicts of interest.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1]	Pari, A. (2016) Health Economic Aspects in the Management of Bipolar Disorder. PhD Diss., University of Oxford.
[2]	Azevedo, J.M. (2021) The Ups and Downs of Assessment and Intervention in Bipolar Disorder. Psychiatry, 64, S619.
[3]	Engelbrecht, H., Dalvie, S., Agenbag, G., Stein, D.J. and Ramesar, R.S. (2020) Whole-exome Sequencing in an Afrikaner Family with Bipolar Disorder. Journal of Affective Disorders, 276, 69-75. https://doi.org/10.1016/j.jad.2020.06.045
[4]	Phillips, M.L. and Kupfer, D.J. (2013) Bipolar Disorder Diagnosis: Challenges and Future Directions. The Lancet, 381, 1663-1671. https://doi.org/10.1016/s0140-6736(13)60989-7
[5]	Akiskal, H.S. (2002) Classification, Diagnosis and Boundaries of Bipolar Disorders. Bipolar Disorder, 5, 1-95.
[6]	Perugi, G., Hantouche, E., Vannucchi, G. and Pinto, O. (2015) Cyclothymia Reloaded: A Reappraisal of the Most Misconceived Affective Disorder. Journal of Affective Disorders, 183, 119-133. https://doi.org/10.1016/j.jad.2015.05.004
[7]	Fountoulakis, K.N. (2014) Bipolar Disorder: An Evidence-Based Guide to Manic Depression. Springer.
[8]	Newman-Toker, D.E., Wang, Z., Zhu, Y., Nassery, N., Saber Tehrani, A.S., Schaffer, A.C., et al. (2020) Rate of Diagnostic Errors and Serious Misdiagnosis-Related Harms for Major Vascular Events, Infections, and Cancers: Toward a National Incidence Estimate Using the “Big Three”. Diagnosis, 8, 67-84. https://doi.org/10.1515/dx-2019-0104
[9]	Singh, H., Schiff, G.D., Graber, M.L., Onakpoya, I. and Thompson, M.J. (2016) The Global Burden of Diagnostic Errors in Primary Care. BMJ Quality & Safety, 26, 484-494. https://doi.org/10.1136/bmjqs-2016-005401
[10]	Matza, L.S., Rajagopalan, K.S., Thompson, C.L. and de Lissovoy, G. (2005) Misdiagnosed Patients with Bipolar Disorder: Comorbidities, Treatment Patterns, and Direct Treatment Costs. The Journal of Clinical Psychiatry, 66, 1432-1440. https://doi.org/10.4088/jcp.v66n1114
[11]	Ball, J.R., Miller, B.T. and Balogh, E.P. (2015) Improving Diagnosis in Healthcare.
[12]	Byrne, P. (2007) Managing the Acute Psychotic Episode. BMJ, 334, 686-692. https://doi.org/10.1136/bmj.39148.668160.80
[13]	Dubovsky, S.L., Ghosh, B.M., Serotte, J.C. and Cranwell, V. (2020) Psychotic Depression: Diagnosis, Differential Diagnosis, and Treatment. Psychotherapy and Psychosomatics, 90, 160-177. https://doi.org/10.1159/000511348
[14]	Giulio, P. (2020) Psychotic Spectrum Disorders: Definitions, Classifications, Neural Correlates and Clinical Profiles. Annals of Psychiatry and Treatment, 4, 70-84. https://doi.org/10.17352/apt.000023
[15]	Vasconcelos-Moreno, M.P., Pinto, J.V., Hasse-Sousa, M., de Souza Melo, I.M., Saviatto, N.G., Lucas, P.K., et al. (2025) Diagnosis, Clinical Features, Differential Diagnosis, and Psychiatric Comorbidities. In: Passos, I.C., Berk, M. and Kapczinski, F., Eds., Bipolar Disorder, Springer, 105-146. https://doi.org/10.1007/978-3-031-85519-1_6
[16]	Malhi, G.S., Green, M., Fagiolini, A., Peselow, E.D. and Kumari, V. (2008) Schizoaffective Disorder: Diagnostic Issues and Future Recommendations. Bipolar Disorders, 10, 215-230. https://doi.org/10.1111/j.1399-5618.2007.00564.x
[17]	Tanaka, M. (2025) From Serendipity to Precision: Integrating AI, Multi-Omics, and Human-Specific Models for Personalized Neuropsychiatric Care. Biomedicines, 13, Article No. 167. https://doi.org/10.3390/biomedicines13010167
[18]	Lin, E., Lin, C. and Lane, H. (2020) Precision Psychiatry Applications with Pharmacogenomics: Artificial Intelligence and Machine Learning Approaches. International Journal of Molecular Sciences, 21, Article No. 969. https://doi.org/10.3390/ijms21030969
[19]	Wang, L., Li, S. and Jin, X. (2024) Revolutionizing Brain Disease Diagnosis: The Convergence of AI, Genetic Screening, and Neuroimaging. Proceedings of the 2024 International Conference on Smart Healthcare and Wearable Intelligent Devices, Guangzhou, 18-20 October 2024, 10-17. https://doi.org/10.1145/3703847.3703850
[20]	Kalani, M. and Anjankar, A. (2024) Revolutionizing Neurology: The Role of Artificial Intelligence in Advancing Diagnosis and Treatment. Cureus, 16, e61706. https://doi.org/10.7759/cureus.61706
[21]	Onciul, R., Tataru, C., Dumitru, A.V., Crivoi, C., Serban, M., Covache-Busuioc, R., et al. (2025) Artificial Intelligence and Neuroscience: Transformative Synergies in Brain Research and Clinical Applications. Journal of Clinical Medicine, 14, Article No. 550. https://doi.org/10.3390/jcm14020550
[22]	Saraf, S., De, A. and Tripathy, B.K. (2024) Effective Use of Computational Biology and Artificial Intelligence in the Domain of Medical Oncology. In: Computational Intelligence for Oncology and Neurological Disorders, CRC Press, 228-252. https://doi.org/10.1201/9781003450153-17
[23]	Nedic Erjavec, G., Svob Strac, D., Tudor, L., Konjevod, M., Sagud, M. and Pivac, N. (2019) Genetic Markers in Psychiatry. In: Kim, Y.-K., Ed., Frontiers in Psychiatry: Artificial Intelligence, Precision Medicine, and Other Paradigm Shifts, Springer, 53-93. https://doi.org/10.1007/978-981-32-9721-0_4
[24]	Quinlan, E.B., Banaschewski, T., Barker, G.J., Bokde, A.L.W., Bromberg, U., Büchel, C., et al. (2019) Identifying Biological Markers for Improved Precision Medicine in Psychiatry. Molecular Psychiatry, 25, 243-253. https://doi.org/10.1038/s41380-019-0555-5
[25]	Frey, B.N., Andreazza, A.C., Houenou, J., Jamain, S., Goldstein, B.I., Frye, M.A., et al. (2013) Biomarkers in Bipolar Disorder: A Positional Paper from the International Society for Bipolar Disorders Biomarkers Task Force. Australian & New Zealand Journal of Psychiatry, 47, 321-332. https://doi.org/10.1177/0004867413478217
[26]	Scola, G., Duong, A. and Syed, B. (2015) Biomarkers for Bipolar Disorder: Current Insights. Current Biomarker Findings, 5, 79-92. https://doi.org/10.2147/cbf.s79138
[27]	Horgan, D., Ciliberto, G., Conte, P., Baldwin, D., Seijo, L., Montuenga, L.M., et al. (2020) Bringing Greater Accuracy to Europe’s Healthcare Systems: The Unexploited Potential of Biomarker Testing in Oncology. Biomedicine Hub, 5, 1-42. https://doi.org/10.1159/000511209
[28]	Zafar, S. (2024) Translating Biomarker Discovery into Clinical Practice: Challenges and Opportunities. Journal of Translational Research, 1, 19-30.
[29]	Serelli-Lee, V., Ito, K., Koibuchi, A., Tanigawa, T., Ueno, T., Matsushima, N., et al. (2022) A State-of-the-Art Roadmap for Biomarker-Driven Drug Development in the Era of Personalized Therapies. Journal of Personalized Medicine, 12, Article No. 669. https://doi.org/10.3390/jpm12050669
[30]	Razavi, M., Ziyadidegan, S., Jahromi, R., et al. (2023) Machine Learning, Deep Learning and Data Preprocessing Techniques for Detection, Prediction, and Monitoring of Stress and Stress-Related Mental Disorders: A Scoping Review.
[31]	Sükei, E., Norbury, A., Perez-Rodriguez, M.M., Olmos, P.M. and Artés, A. (2021) Predicting Emotional States Using Behavioral Markers Derived from Passively Sensed Data: Data-Driven Machine Learning Approach. JMIR mHealth and uHealth, 9, e24465. https://doi.org/10.2196/24465
[32]	Washington, P., Mutlu, C.O., Kline, A., et al. (2022) Challenges and Opportunities for Machine Learning Classification of Behavior and Mental State from Images.
[33]	Nasarian, E., Alizadehsani, R., Acharya, U.R. and Tsui, K. (2024) Designing Interpretable ML System to Enhance Trust in Healthcare: A Systematic Review to Proposed Responsible Clinician-AI-Collaboration Framework. Information Fusion, 108, Article ID: 102412. https://doi.org/10.1016/j.inffus.2024.102412
[34]	Rasheed, K., Qayyum, A., Ghaly, M., Al-Fuqaha, A., Razi, A. and Qadir, J. (2022) Explainable, Trustworthy, and Ethical Machine Learning for Healthcare: A Survey. Computers in Biology and Medicine, 149, Article ID: 106043. https://doi.org/10.1016/j.compbiomed.2022.106043
[35]	Quinn, T.P., Jacobs, S., Senadeera, M., Le, V. and Coghlan, S. (2022) The Three Ghosts of Medical AI: Can the Black-Box Present Deliver? Artificial Intelligence in Medicine, 124, Article ID: 102158. https://doi.org/10.1016/j.artmed.2021.102158
[36]	Viering, T. and Loog, M. (2023) The Shape of Learning Curves: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7799-7819. https://doi.org/10.1109/tpami.2022.3220744
[37]	Li, S. and Hoefler, T. (2021) Chimera. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York, 1-14. https://doi.org/10.1145/3458817.3476145
[38]	Allen-Zhu, Z. and Li, Y. (2022). Feature Purification: How Adversarial Training Performs Robust Deep Learning. 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), Denver, 7-10 February 2022. https://doi.org/10.1109/focs52979.2021.00098
[39]	Saffari, F. (2024) Investigating Novel Methods and Metrics in Consumer Neuroscience.
[40]	Baba, H., Kubota, Y., Suzuki, T. and Arai, H. (2007) Seven Cases of Late-Life Depression Treated with Cilostazol-Augmented Therapy. Journal of Clinical Psychopharmacology, 27, 727-728. https://doi.org/10.1097/jcp.0b013e31815a585e
[41]	Palanisamy, K., Krishnasamy, K. and Venkadasamy, P. (2024) Cardiovascular Disease Detection through Innovative Imbalanced Learning and AUC Optimization. Advances in Engineering and Intelligence Systems, 3, 68-84.
[42]	Birchwood, M. and Spencer, E. (2001) Early Intervention in Psychotic Relapse. Clinical Psychology Review, 21, 1211-1226. https://doi.org/10.1016/s0272-7358(01)00105-2
[43]	Adler, D.A., Ben-Zeev, D., Tseng, V.W., Kane, J.M., Brian, R., Campbell, A.T., et al. (2020) Predicting Early Warning Signs of Psychotic Relapse from Passive Sensing Data: An Approach Using Encoder-Decoder Neural Networks. JMIR mHealth and uHealth, 8, e19962. https://doi.org/10.2196/19962
[44]	Eisner, E., Drake, R. and Barrowclough, C. (2013) Assessing Early Signs of Relapse in Psychosis: Review and Future Directions. Clinical Psychology Review, 33, 637-653. https://doi.org/10.1016/j.cpr.2013.04.001
[45]	McGorry, P.D., Hickie, I.B., Yung, A.R., Pantelis, C. and Jackson, H.J. (2006) Clinical Staging of Psychiatric Disorders: A Heuristic Framework for Choosing Earlier, Safer and More Effective Interventions. Australian & New Zealand Journal of Psychiatry, 40, 616-622. https://doi.org/10.1080/j.1440-1614.2006.01860.x
[46]	Sampath, P., Elangovan, G., Ravichandran, K., Shanmuganathan, V., Pasupathi, S., Chakrabarti, T., et al. (2024) Robust Diabetic Prediction Using Ensemble Machine Learning Models with Synthetic Minority Over-Sampling Technique. Scientific Reports, 14, Article ID: 28984. https://doi.org/10.1038/s41598-024-78519-8
[47]	Owusu-Adjei, M., Ben Hayfron-Acquah, J., Frimpong, T. and Abdul-Salaam, G. (2023) Imbalanced Class Distribution and Performance Evaluation Metrics: A Systematic Review of Prediction Accuracy for Determining Model Performance in Healthcare Systems. PLOS Digital Health, 2, e0000290. https://doi.org/10.1371/journal.pdig.0000290
[48]	Ghavidel, A. and Pazos, P. (2023) Machine Learning (ML) Techniques to Predict Breast Cancer in Imbalanced Datasets: A Systematic Review. Journal of Cancer Survivorship, 19, 270-294. https://doi.org/10.1007/s11764-023-01465-3
[49]	Altalhan, M., Algarni, A. and Turki-Hadj Alouane, M. (2025) Imbalanced Data Problem in Machine Learning: A Review. IEEE Access, 13, 13686-13699. https://doi.org/10.1109/access.2025.3531662
[50]	Depp, C., Torous, J. and Thompson, W. (2016) Technology-Based Early Warning Systems for Bipolar Disorder: A Conceptual Framework. JMIR Mental Health, 3, e42. https://doi.org/10.2196/mental.5798
[51]	Stephenson, L.A., Gergel, T., Gieselmann, A., Scholten, M., Keene, A.R., Rifkin, L., et al. (2020) Advance Decision Making in Bipolar: A Systematic Review. Frontiers in Psychiatry, 11, Article ID: 538107. https://doi.org/10.3389/fpsyt.2020.538107
[52]	Goodwin, G., Haddad, P., Ferrier, I., Aronson, J., Barnes, T., Cipriani, A., et al. (2016) Evidence-Based Guidelines for Treating Bipolar Disorder: Revised Third Edition Recommendations from the British Association for Psychopharmacology. Journal of Psychopharmacology, 30, 495-553. https://doi.org/10.1177/0269881116636545
[53]	Trevethan, R. (2017) Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Frontiers in Public Health, 5, Article No. 307. https://doi.org/10.3389/fpubh.2017.00307
[54]	Lobo, J.M., Jiménez‐Valverde, A. and Real, R. (2007) AUC: A Misleading Measure of the Performance of Predictive Distribution Models. Global Ecology and Biogeography, 17, 145-151. https://doi.org/10.1111/j.1466-8238.2007.00358.x
[55]	Olowolayemo, A., Souag, A. and Sirlantzis, K. (2024) Evaluating the Impact of Machine Learning Platforms on Cancer Classification Model Performance: A Cross-Platform Comparative Study. International Journal on Advanced in Life Sciences, 16, 96-111.
[56]	Harpaz, R., DuMouchel, W., LePendu, P., Bauer-Mehren, A., Ryan, P. and Shah, N.H. (2013) Performance of Pharmacovigilance Signal-Detection Algorithms for the FDA Adverse Event Reporting System. Clinical Pharmacology & Therapeutics, 93, 539-546. https://doi.org/10.1038/clpt.2013.24
[57]	Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., Busam, K.J., et al. (2019) Clinical-Grade Computational Pathology Using Weakly Supervised Deep Learning on Whole Slide Images. Nature Medicine, 25, 1301-1309. https://doi.org/10.1038/s41591-019-0508-1
[58]	Furukawa, T.A. (2010) Assessment of Mood: Guides for Clinicians. Journal of Psychosomatic Research, 68, 581-589. https://doi.org/10.1016/j.jpsychores.2009.05.003
[59]	Benazzi, F. (2008) Defining Mixed Depression. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 32, 932-939. https://doi.org/10.1016/j.pnpbp.2007.12.019
[60]	Chouinard, G. and Margolese, H.C. (2005) Manual for the Extrapyramidal Symptom Rating Scale (ESRS). Schizophrenia Research, 76, 247-265. https://doi.org/10.1016/j.schres.2005.02.013
[61]	Abi‐Dargham, A., Moeller, S.J., Ali, F., DeLorenzo, C., Domschke, K., Horga, G., et al. (2023) Candidate Biomarkers in Psychiatric Disorders: State of the Field. World Psychiatry, 22, 236-262. https://doi.org/10.1002/wps.21078
[62]	McGraw, C.M., Ward, C.S. and Samaco, R.C. (2017) Genetic Rodent Models of Brain Disorders: Perspectives on Experimental Approaches and Therapeutic Strategies. American Journal of Medical Genetics Part C: Seminars in Medical Genetics, 175, 368-379. https://doi.org/10.1002/ajmg.c.31570
[63]	Dadi, K., Varoquaux, G., Houenou, J., Bzdok, D., Thirion, B. and Engemann, D. (2021) Population Modeling with Machine Learning Can Enhance Measures of Mental Health GigaScience, 10, giab071. https://doi.org/10.1093/gigascience/giab071
[64]	Liu, X.S., Xu, X., Xu, X., Li, X. and Xie, G.T. (2021) Representation Learning for Multi-Omics Data with Heterogeneous Gene Regulatory Network. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, 9-12 December 2021. https://doi.org/10.1109/bibm52615.2021.9669359
[65]	Sun, C. and Liu, Z. (2024) Discovering Explainable Biomarkers for Breast Cancer Anti-Pd1 Response via Network Shapley Value Analysis. Computer Methods and Programs in Biomedicine, 257, Article ID: 108481. https://doi.org/10.1016/j.cmpb.2024.108481
[66]	Chamma, A. (2024) Statistical Interpretation of High-Dimensional Complex Prediction Models for Biomedical Data. Ph.D. Dissertation, Université Paris-Saclay.
[67]	Omiyefa, S. (2025) Artificial Intelligence and Machine Learning in Precision Mental Health Diagnostics and Predictive Treatment Models. International Journal of Research Publication and Reviews, 6, 85-99. https://doi.org/10.55248/gengpi.6.0325.1107
[68]	Lemaitre, F., Florentin, V., Blin, O., Bayle, A., Benito, S., Chauny, J., et al. (2024) Can Precision Medicine Be Integrated into Routine Therapeutic Decisions at the Bedside of Patients? Therapies, 79, 13-22. https://doi.org/10.1016/j.therap.2023.11.007

Journals Menu

Follow SCIRP

	customer@scirp.org
	+86 18163351462(WhatsApp)
	1655362766

	Paper Publishing WeChat

Journals Menu

Home

About SCIRP

Service

Policies