Interpretable Machine Learning for Mood State Classification and Treatment Response Analysis Using Clinical and Biomarker Data ()
1. Introduction
Bipolar spectrum disorders (BSDs), encompassing bipolar I, bipolar II, and cyclothymia, affect approximately 2% - 4% of the global population and are among the leading causes of disability-adjusted life years in psychiatry [1]-[3]. Characterized by cyclical episodes of mania, depression, and euthymia, these disorders pose substantial diagnostic and therapeutic challenges [4]-[7]. Misdiagnosis is common, often resulting in inappropriate treatment regimens, increased relapse risk, and a substantial burden on healthcare systems [8]-[11]. The episodic nature and symptom overlap with other mood or psychotic disorders further complicate clinical differentiation and long-term management [12]-[16]. In recent years, the convergence of psychiatry, molecular biology, and artificial intelligence has paved the way for novel diagnostic approaches [17]-[22]. Biomarkers such as serum cytokines (e.g., IL6), neurotrophic factors (e.g., BDNF), and polymorphisms in genes like COMT and CACNA1C have shown promise in stratifying psychiatric phenotypes [23]-[26]. However, despite accumulating evidence, the integration of such biomarkers into actionable diagnostic tools remains limited, particularly in real-world settings [27]-[29]. Machine learning (ML), with its ability to uncover complex nonlinear patterns in high-dimensional data, holds significant potential for augmenting mood state classification [30]-[32]. Yet, most ML models suffer from a lack of transparency, making them difficult to trust and interpret in clinical contexts [33]-[35]. This study addresses these gaps by proposing an interpretable ensemble ML framework that integrates genetic, serum, and clinical data for mood classification. The objectives are fourfold: (1) to assess the diagnostic utility of combining SNPs, serum biomarkers, and clinical features; (2) to identify the most predictive features contributing to mood state differentiation; (3) to visualize model decision boundaries and feature interactions; and (4) to evaluate treatment response patterns through biomarker stratification. By emphasizing interpretability and clinical relevance, our approach aims to support early detection, personalized treatment planning, and ultimately, improve outcomes in individuals with bipolar spectrum disorders.
2. Methods
This research proposes a clinically interpretable, data-driven pipeline for predicting mood state transitions in bipolar disorder patients. The methodology integrates synthetic patient simulation, multimodal biomarker integration, advanced ensemble learning, and robust evaluation strategies. The complete workflow is illustrated in Figure 1.
Figure 1. End-to-end machine learning framework for bipolar mood state prediction.
2.1. Synthetic Data Simulation and Temporal Feature Extraction
To overcome data scarcity and class imbalance typically observed in psychiatric datasets, we generated synthetic patient profiles replicating real-world variability in genotype, phenotype, and clinical outcomes. Each simulated profile incorporates structured data such as single nucleotide polymorphisms (SNPs), serum biomarker levels (e.g., BDNF, IL-6), medication adherence, and behavioural metrics over time. From these inputs, we extracted time-series features capturing temporal deviations, including circadian rhythm disruption, stress fluctuation, and adherence irregularities.
Synthetic Data Generation Parameters
The synthetic data was generated using parameterized beta distributions to simulate variability in genetic and clinical features. The beta distributions were defined as follows:
For each feature \( X_i \), values were sampled from \ (Beta (\alpha_i, \beta_i) \), where the parameters were selected to mirror observed distributions in clinical datasets.
Example parameter settings:
Stress level: \ (Beta (2, 5) \) – simulates right-skewed distribution.
Sleep deviation: \ (Beta (3, 3) \) – simulates balanced variability.
Lithium adherence: \ (Beta (5, 2) \) – simulates left-skewed distribution, favouring high adherence.
Weighted target labels for mood state classification were generated using a composite probability function:
\[ P (Y = manic) = 0.4 \times stress\_level + 0.3 \times sleep\_deviation + 0.3 \times (1 - lithium\_adherence) \]
\[ P (Y = depressed) = 0.5 \times stress\_level + 0.2 \times sleep\_deviation + 0.3 \times low\_BDNF \]
\[ P(Y = euthymic) = 1 - P (Y = manic) - P (Y = depressed) \]
Each feature’s distribution was selected to reflect real-world clinical skew: stress and depression biomarkers (e.g., IL-6, BDNF) were sampled using right- or left-skewed beta distributions, while behavioural traits like sleep deviation followed symmetric patterns. These synthetic profiles were iteratively tested against descriptive statistics from existing mood disorder cohorts to validate realism. The composite mood probability functions use empirically weighted contributions derived from literature (e.g., [BDNF] and [Lithium adherence] as key relapse predictors) to simulate high-variance mood state transitions. These weightings were selected based on clinical literature and adjusted during preliminary simulation to achieve balanced class representation. Detailed parameterization ensures reproducibility of the synthetic data process.
2.2. Feature Engineering and Multimodal Preprocessing
The multimodal dataset underwent rigorous preprocessing. Categorical variables were one-hot encoded; SNPs were label-encoded according to known allelic variants; continuous biomarkers were standardized. Cross-modal feature synthesis was performed to capture interactions between genetics, serum markers, and behavioural data. Dimensionality reduction was selectively applied for visualization (e.g., PCA, t-SNE), while all modelling retained full feature space to preserve clinical fidelity.
2.3. Class Imbalance Correction and Partitioning
Stratified sampling ensured consistent label distribution across training and test subsets. To counteract the intrinsic class skew (notably the underrepresentation of euthymic states), SMOTE (Synthetic Minority Oversampling Technique) was employed on the training set, improving generalizability and recall in minority classes. Despite using synthetic data generation, the dataset exhibited residual class imbalance, with euthymic states underrepresented. This was partially intentional to mirror real-world clinical datasets, where patients tend to seek care more often during acute episodes than in remission. Moreover, the feature-driven generation process favoured clear symptom expression, leading to stronger class signals for manic and depressed states. SMOTE was applied post-hoc to augment minority class instances and restore balance during training.
2.4. Model Architecture and Training Strategy
The architecture is built on an ensemble framework combining gradient boosting classifiers, neural networks, and probabilistic learners. A meta-learning strategy was employed to aggregate diverse decision boundaries while mitigating overfitting via early stopping and dynamic learning rate scheduling. Hyperparameters were optimized using Bayesian search strategies across validation folds.
2.5. Evaluation and Interpretability Pipeline
Model evaluation was multi-pronged:
Calibration curves assessed the alignment between predicted and true probabilities.
ROC and precision-recall curves quantified discriminative ability across all mood states.
Confusion matrices identified common misclassification patterns, especially euthymic vs manic overlap.
Feature importance rankings were derived using SHAP values and permutation importance to provide explainability.
Medication adherence heatmaps visualized behavioural response clusters.
Risk stratification curves and example prediction plots helped align model output with clinical decision-making. Additionally, correlation matrices were computed to highlight latent interdependencies among biomarkers and clinical features.
This interpretability-centric architecture ensures that each model decision can be traced back to biologically or clinically plausible factors—a critical requirement for adoption in psychiatry.
The pipeline comprises synthetic data simulation, multimodal feature engineering, ensemble model training, and interpretability-driven evaluation, with dedicated components for calibration, medication response visualization, and risk stratification.
3. Technical Implementation and Model Development
3.1. Data Composition
This study leveraged a synthetically augmented, multimodal dataset constructed to reflect clinically relevant dimensions of bipolar disorder. The dataset comprised the following structured modalities:
Genetic Features (SNPs): Seven single nucleotide polymorphisms (SNPs) associated with affective regulation, stress response, and synaptic signalling were included: BDNF_Val66Met, COMT_Val158Met, SLC6A4_5HTTLPR, CACNA1C_rs1006737, ANK3_rs10994336, NR1D1_rs2314339, and IL6_rs1800795. These were encoded as categorical genotype counts and later one-hot encoded for neural models.
Serum Biomarkers (10 variables): Quantitative levels of biomarkers such as BDNF, IL-6, GABA, and DLPFC_connectivity were incorporated to reflect neuroplasticity, inflammation, and regional brain activity.
Clinical and Behavioural Features: Time-sensitive, self-reported measures including stress levels, sleep deviation from baseline, and lithium adherence were added as dynamic covariates with high temporal resolution.
Target Labels: Mood annotations were derived from expert-based longitudinal curation and classified into three diagnostic states: euthymic, manic, and depressed.
3.2. Preprocessing Strategy
A robust preprocessing pipeline ensured data integrity, standardization, and compatibility across all models.
Missing Data Imputation: Continuous features were imputed using k-nearest neighbors (k = 5), while categorical SNPs were imputed using mode frequency to retain population-level allele integrity.
Normalization & Feature Encoding: All numeric features were z-score standardized. SNPs were label-encoded or one-hot encoded depending on downstream model compatibility. Interaction terms (e.g., biomarker × sleep deviation) were engineered to capture potential nonlinear dependencies.
Class Imbalance Handling: The dataset exhibited substantial class skew, with underrepresentation of euthymic samples. To address this, SMOTE was applied to the training set, which increased model sensitivity to minority states without disrupting decision boundary geometry.
3.3. Model Development Pipeline
A hybrid ensemble-learning framework was developed, combining structured classifiers with neural network modelling and interpretability layers. A stratified 5-fold cross-validation was employed throughout all experiments to ensure stable and balanced performance estimation across mood classes. This replaced the use of a simple 80/20 train-test split to mitigate the risk of data partition bias and to provide a more reliable measure of generalizability. Performance metrics such as ROC-AUC, precision, recall, and calibration were averaged across folds, and standard deviations were reported to capture variability.
Model Suite: The main models included Random Forest, XGBoost, LightGBM, and a Multilayer Perceptron (MLP) with 3 hidden layers, ReLU activations, and dropout regularization. These models were selected for their ability to capture nonlinear, high-dimensional interactions.
Training Protocol: A stratified 5-fold cross-validation procedure ensured consistent class representation across splits. Hyperparameter tuning was performed using randomized search, with internal validation on each fold. Final ensemble predictions were aggregated via soft voting.
Model Interpretability:
SHAP (SHapley Additive Explanations) was used for both global and per-sample attribution, providing clinically interpretable justifications for predictions (Figure 2).
Gini importance and permutation importance were used as cross-method validation of feature relevance rankings.
Figure 2. Permutation feature importance for top 20 features.
Figure 3. Feature importance via first hidden layer weights (MLP).
Neural model interpretability was enhanced through first-layer weight analysis, visualized in Figure 3.
Risk Stratification: Post hoc probabilistic outputs were used to stratify individuals into risk quartiles (Figure 4), enabling patient-level interpretation and clinical flagging for high-risk cases.
Figure 4. Predicted risk stratification grouped by outcome.
3.4. Hyperparameter Search Strategy
Hyperparameter optimization was conducted using randomized search within stratified 5-fold cross-validation to ensure robustness across mood states. Search ranges were chosen based on prior benchmarking for each model type. Table 1 below summarizes the explored hyperparameter grid and the selected configurations, which yielded the best macro-averaged ROC-AUC and calibration alignment.
Table 1. Hyperparameter search grid and selected configurations for ensemble models.
Model |
Hyperparameter |
Search Range |
Selected Value |
Random Forest |
Number of Trees |
100, 300, 500 |
300 |
|
Maximum Tree Depth |
5, 10, None |
10 |
|
Minimum Samples per Split |
2, 5, 10 |
5 |
XGBoost |
Learning Rate |
0.01, 0.1 |
0.1 |
|
Maximum Tree Depth |
3, 6, 10 |
6 |
|
Number of Trees |
100, 200, 300 |
200 |
MLP |
Hidden Layers |
2, 3, 4 |
3 |
|
Neurons per Layer |
32, 64, 128 |
64 |
|
Dropout Rate |
0.2, 0.3, 0.5 |
0.3 |
This Table 1, summarizes the hyperparameter search space explored for each model using a random search strategy within stratified 5-fold cross-validation. The selected hyperparameters were chosen based on the highest macro-averaged ROC-AUC and the most reliable calibration curve alignment across validation folds. These settings ensure the models are both accurate and well-calibrated for clinical prediction tasks. The final configurations were selected based on the highest macro-averaged ROC-AUC and the best calibration curve performance across validation folds.
3.5. Evaluation Metrics
A comprehensive, multi-level evaluation framework was used to assess both diagnostic precision and clinical interpretability:
Accuracy, Precision, Recall: Reported per class, particularly emphasizing recall for manic and depressed states due to their treatment urgency.
Confusion Matrix (Figure 5): Used to analyse misclassification trends, especially the frequent mislabelling of euthymic states.
ROC and PR Curves (Figure 6(a)): Macro-averaged and per-class ROC-AUC and Precision-Recall curves were used to assess overall discrimination and sensitivity to class imbalance.
Calibration Curve (Figure 6(b)): Evaluated model reliability in terms of probability-to-outcome alignment—a key metric for clinical deployment.
Dimensionality Reduction Visualization: PCA and UMAP projections were used to inspect latent decision boundaries and mood class separability (used in Figure 4).
Figure 5. Confusion matrix (Counts).
(a)
(b)
Figure 6. (a). Precision-recall curve. The model achieves an average precision (AP) of 0.96, confirming its high sensitivity and specificity in relapse detection; (b) Calibration Curve. Probability predictions demonstrate adequate calibration near the extremes, suitable for high-confidence decision thresholds in clinical practice.
4. Results
We report the performance of our multimodal ensemble learning framework across five analytical dimensions: predictive accuracy, classification integrity, feature attribution, latent decision geometry, and biomarker stratification. The following subsections present a structured analysis supported by quantitative and visual evidence. Model performance was evaluated using stratified 5-fold cross-validation. The reported results represent the average performance across all folds, with standard deviations provided to reflect inter-fold variability. This evaluation strategy ensures robustness and mitigates the risk of partition bias inherent in simple train-test splits.
4.1. Model-Level Performance Summary
The training dynamics of the ensemble classifier are depicted in Figure 7, which presents a four-panel view of loss, AUC, precision, and recall across 100 training epochs. The learning curves demonstrate stable convergence and strong generalization behaviour, supporting the robustness of the model architecture and training pipeline [36]-[38]. In Figure 7(a), both training and validation loss curves show rapid exponential decay, converging smoothly by approximately epoch 35. The absence of divergence between the two curves indicates that overfitting was effectively mitigated, aided by early stopping and regularization strategies. Figure 7(b) displays the trajectory of Area Under the Curve (AUC). The training AUC rapidly approaches 1.0, reflecting high internal class discrimination. However, the validation AUC remains flat at a low value, which could suggest class imbalance, improper label distribution, or insufficient generalizability in certain classes—especially euthymic [39]-[41]. In Figure 7(c), the training precision steadily increases and stabilizes at a high level (~0.95), while the validation precision holds consistently at 1.0. This unusually high validation precision, despite modest validation AUC, may reflect skewed label distributions or an overconfident classifier on dominant classes (e.g., manic). Figure 7(d) illustrates recall patterns. Both training and validation recall improve throughout the epochs, with validation recall
![]()
Figure 7. Training History of the Ensemble Model. (a) Training vs. validation loss demonstrates stable convergence. (b) AUC performance saturates early for training but not for validation, indicating class-level sensitivity. (c) Precision remains high, particularly for validation. (d) Recall improves progressively for both datasets, indicating high sensitivity across mood classes.
nearing 1.0. This indicates that the model maintains high sensitivity, particularly valuable in minimizing false negatives for manic and depressed states—critical in clinical screening scenarios. Collectively, these curves reinforce that the model achieves a strong bias-variance trade-off and is well-suited for downstream classification tasks, despite challenges in class-level discrimination.
4.2. Class-Wise Misclassification Patterns
The confusion matrix shown in Figure 5 illustrates the classifier’s discriminative performance across mood states, with a particular emphasis on identifying relapse cases. The model achieved a recall of 91% in detecting relapse, correctly classifying 151 instances while misclassifying only 15—an outcome of clinical significance. This strong sensitivity is vital in psychiatric monitoring, where failing to detect relapse can lead to delayed interventions and increased patient risk [42]-[45]. In contrast, 28 non-relapse cases were misclassified as relapse, reflecting a conservative bias toward false positives. This tendency, while inflating the number of relapses alerts, aligns with a safety-first paradigm often preferred in clinical triage systems, where over-alerting is generally more acceptable than under-detection. The classifier’s conservative orientation supports its potential application in real-world monitoring settings, such as digital mental health platforms or early-warning systems integrated with electronic health records. The pattern also suggests that euthymic states, although not explicitly separated in this binary confusion matrix, likely contribute to some classification ambiguity—particularly when their clinical presentation overlaps with prodromal or residual symptoms of relapse. This reinforces the need for richer phenotyping and possibly longitudinal modelling in future work. Overall, the model maintains an effective trade-off between sensitivity and specificity, with an emphasis on clinical reliability and safety in relapse prediction. The results substantiate the feasibility of deploying such models for real-time monitoring and patient risk stratification in bipolar spectrum disorders.
The classifier demonstrates high recall for relapse detection, balancing sensitivity and precision to ensure clinically responsible alerts in mood disorder monitoring.
4.3. Discriminative Insight via Precision-Recall and Calibration
Analysis
To evaluate the reliability of predicted probabilities and the model’s discriminative sharpness, we conducted both Precision-Recall and Calibration Curve analyses. Figure 6(a) displays the Precision-Recall (PR) Curve, yielding an average precision (AP) of 0.96, which reflects the model’s strong ability to correctly identify relapse events while maintaining low false positive rates. The sharp vertical rise near recall = 1 suggests that the model successfully captures most true positive cases with high precision—a critical trait for minimizing clinical oversight in relapse detection. On the other hand, Figure 6(b) illustrates the Calibration Curve, which compares the predicted probabilities to actual outcome frequencies. Although the curve deviates from the ideal diagonal in the mid-range (around 0.4–0.7), it approaches the diagonal near both extremes (close to 0 and 1), suggesting the model is well-calibrated for high-confidence predictions. This behaviour is common in ensemble models trained on imbalanced clinical datasets, where moderate predictions are typically underrepresented [46]-[49]. Together, these plots offer complementary insights: the PR curve confirms high discriminative precision, while the calibration curve reflects probabilistic realism, especially for high-stakes predictions. In clinical decision-making contexts such as early intervention for bipolar relapse, this dual performance is essential allowing for both alert sensitivity and trustworthy probability guidance for physicians [50]-[52].
4.4. Feature Importance Analysis
To improve model transparency and identify the most influential predictors, permutation-based feature importance was applied to the trained ensemble classifier. The top 20 features ranked by mean decrease in accuracy are presented in Figure 2.
Depression-related dynamics, including maximum and mean depression scores, emerged as dominant predictors, alongside functional scores and medication variables. Notably, variables such as depression_max, functioning_min, and mania_max contributed substantially to predictive accuracy, reaffirming the hypothesis that severe affective fluctuations serve as critical indicators of relapse risk. Lithium and Valproate medication indicators also featured prominently, supporting their established pharmacological relevance in mood stabilization.
4.5. Risk Stratification Profiles
The risk stratification framework aimed to map patients into low, medium, and high relapse risk groups based on predicted probabilities. The resulting distributions, shown in Figure 4, reveal a strong alignment between predicted risk tiers and actual relapse status.
Patients in the high-risk group exhibited a significantly higher proportion of true relapse cases, while low-risk predictions remained consistent with remission outcomes. This stratified structure enables clinicians to interpret predictions in probabilistic tiers rather than binary classes, offering utility for preventive care and dosage adjustment.
4.6. Medication-Relapse Heatmap
To examine pharmacological correlates of relapse, prescription patterns were visualized across relapse outcomes. Figure 8 illustrates the medication-wise prescription rates stratified by relapse status.
Lithium and Valproate were most prescribed across both groups. However, a mild drop in lithium prevalence was observed in the relapse group, suggesting potential nonadherence or reduced efficacy. Such visualizations offer interpretable insight into how pharmacotherapy profiles relate to predictive outcomes and may inform future feature selection or causal studies.
Figure 8. Heatmap of medication usage by relapse status.
4.7. Feature Correlation Matrix
To explore inter-feature dependencies and potential collinearity, a full correlation matrix was generated (Figure 9). Variables were hierarchically clustered to expose block structures and inter-domain interactions.
Figure 9. Pearson correlation matrix of input features.
Strong positive correlations were observed among depression-related metrics and within functional status markers. Medications remained largely uncorrelated, supporting their independent predictive utility. Identifying such clusters aids in reducing redundancy and understanding the latent structure embedded in mixed clinical and biomarker features.
4.8. Neural Weight Attribution (MLP Model)
Lastly, to further assess learned representations within the neural architecture, we visualized the absolute weight magnitudes from the first hidden layer of the MLP model (Figure 3).
Functional scores (functioning_min, functioning_mean) and depression indices (depression_max, depression_mean) received the highest absolute weights, reinforcing the dominant influence of these behavioral markers. This analysis complements the permutation importance findings and highlights convergence in both model-agnostic and internal weight attribution frameworks.
5. Discussion
This study demonstrates the clinical promise of interpretable ensemble machine learning for mood state classification and treatment response analysis in bipolar disorder. Leveraging a multimodal feature set—including genetic variants, serum biomarkers, and clinically observable behavioural data—we trained an ensemble pipeline that robustly identified mood states, particularly manic episodes, while offering actionable interpretability and relapse risk stratification. The classifier achieved strong macro performance, as shown in the precision-recall curve (Figure 6(a)), with an average precision of 0.96. This confirms the model’s capacity to balance sensitivity and specificity in a clinical context where both false positives and false negatives carry significant implications [53]-[57]. Importantly, manic episodes achieved the highest classification confidence, with an AUC of 0.87 (Figure 7(b)). This aligns with clinical understanding: manic states typically manifest with overt behavioural disruptions such as sleep reduction, medication nonadherence, and psychomotor agitation—signals that were successfully learned and prioritized by the model (Figure 2).
In contrast, euthymic states exhibited considerable overlap with other mood classes. This was evident in the confusion matrix (Figure 5), where misclassification of euthymia was frequent. The PCA-UMAP decision boundaries (see methodology) and correlation heatmaps (Figure 9) confirmed that euthymic states lack the distinct, high-variance biomarkers and behaviour patterns found in acute phases. This underlines a broader challenge in mood modelling: remission often presents as a “low signal” state, necessitating more granular or longitudinal data (e.g., daily mood logs, wearable physiology) for improved differentiation. Feature importance analysis (Figure 2) consistently highlighted stress level, sleep deviation, and lithium adherence as the most predictive inputs—converging with clinical literature where circadian dysregulation and lithium nonadherence are top predictors of mood destabilization. Additionally, the neural network weight attributions (Figure 3) and the permutation-based feature ranking corroborated these variables’ dominance, adding robustness to interpretability claims. The stratification framework (Figure 4) elegantly separated patients into relapse risk tiers, with high predicted probabilities corresponding to confirmed relapses. This probabilistic grouping supports pre-emptive interventions such as early psychiatric evaluation or medication adjustment, enhancing care personalization. Moreover, medication patterns shown in the heatmap (Figure 8) revealed meaningful associations—e.g., higher lithium and quetiapine usage in non-relapse groups versus valproate and aripiprazole in relapse—suggesting pharmacological heterogeneity in treatment response. These insights could inform future studies on personalized pharmacogenomics and medication adherence monitoring. The calibration curve (Figure 6(b)) confirmed that the model’s predicted probabilities are reasonably aligned with observed outcomes, especially at high-confidence thresholds. This supports its utility in clinical environments, where probabilistic outputs inform decision thresholds for risk alerts or interventions. Nonetheless, several limitations warrant consideration. The dataset, though synthetically enhanced, remains limited in real-world volume and lacks temporal depth for forecasting transitions. Mood labels, while clinically derived, are susceptible to inter-rater variability and may not fully capture mixed or atypical states [58]-[60]. The model also assumes static input snapshots, limiting its applicability to dynamic, real-time monitoring. Future research should explore longitudinal extensions by integrating wearable sensor streams, ecological momentary assessments (EMA), and digital phenotyping to enable fine-grained, temporally aware models. Additionally, external validation across diverse clinical settings is necessary to ensure generalizability and readiness for regulatory pathways. This work advances the field of explainable psychiatric AI by demonstrating that interpretable, feature-driven ensemble models can accurately classify mood states, uncover biologically plausible treatment patterns, and stratify patients by relapse risk. These findings support the development of clinically integrated early warning systems and real-time decision support tools for bipolar disorder management.
Limitations
Despite promising results, several limitations must be acknowledged. First, the synthetic nature of the dataset introduces uncertainty regarding generalizability to real-world clinical populations. Although the synthetic profiles were designed to mimic clinically plausible patterns, potential distributional shifts between simulated and actual patient data may impact model performance when deployed in naturalistic settings. Additionally, the use of static input snapshots constrains the model’s ability to capture temporal transitions and dynamic mood shifts. Real-world psychiatric presentations often involve subtle longitudinal changes that static models may miss. Finally, while the stratified 5-fold cross-validation mitigates some variance, external validation on independent clinical datasets is essential to confirm the robustness and practical utility of the proposed system.
6. Conclusion
This study introduces an interpretable, multimodal machine learning framework for mood state classification in bipolar disorder, integrating genetic polymorphisms, serum biomarkers, and behavioural indicators such as stress level, sleep deviation, and lithium adherence. The ensemble model demonstrated strong predictive performance, particularly in detecting manic episodes with high accuracy and precision, supported by both ROC-AUC metrics and robust feature attributions. Key findings emphasize the prominence of clinically actionable predictors in model decisions and the high discriminability of manic states, which were biologically distinct and behaviourally well-defined [61]-[63]. Furthermore, stratification of treatment response via biomarkers such as BDNF, IL-6, and DLPFC connectivity offers a biologically grounded lens for understanding differential treatment outcomes. The model’s transparency—enabled by SHAP and permutation-based interpretability—enhances its potential for real-world clinical integration [64]-[66]. Importantly, the identification of high-risk behavioural signatures provides a foundation for digital early warning systems, adherence monitoring tools, and personalized treatment pathways. These applications align with the evolving goals of precision psychiatry, where real-time, data-driven decision support can augment clinical judgment [67] [68]. While this study is constrained by static data and a modest sample size, it establishes a scalable and explainable architecture for future development. Integrating longitudinal tracking, wearable sensor inputs, and ecological momentary assessments will be essential to evolving this framework into a continuous prediction engine capable of anticipating mood transitions. In summary, this work underscores the viability of interpretable machine learning as a companion to psychiatric evaluation—enhancing diagnostic precision, optimizing treatment stratification, and advancing toward a personalized, biomarker-informed future in mental health care.
Conflicts of Interest
The authors declare no conflicts of interest.