1. Introduction
Despite the efforts of recent decades, malaria remains a major public health issue in sub-Saharan Africa. Caused mainly by the parasite Plasmodium falciparum, this infection is responsible for considerable morbidity and mortality. In 2023, there were about 263 million cases, resulting in 597,000 deaths, most of which are in children under 5 years of age [1]. Despite the use of coping strategies consisting of vector control, chemoprophylaxis and antimalarial treatments, the region continues to bear a disproportionate share of the global malaria burden [2].
This epidemiology is accompanied by a high prevalence of sickle cell anemia, a common genetic in Africa [3]. This hereditary hemoglobinopathy, caused by a mutation responsible for the synthesis of hemoglobin S, is characterized by partial protection of the heterozygous HbAS carrier against severe forms of parasitosis [3].
This case of balanced natural selection has led to the hypothesis of the existence of reciprocal evolutionary pressures between the human host and the P. falciparum parasite [4].
While the host-protective effect of the HbS allele is well documented, its impact on the genetic diversity of P. falciparum and its structuring are relatively poorly understood. Few studies have examined the impacts resulting from the parasite’s adaptation to the particular intracellular environment of HbS-containing red blood cells. This adaptation would impose particular constraints on the parasites due to the various pathophysiological aspects, in particular oxidative stress, membrane rigidity caused by HbS, and hypoxia [5].
In this study, we postulated that the hemoglobin genotype of the normal (HbAA), heterozygous (HbAS) or homozygous sickle cell (HbSS) subject, could contribute to modulating the profile of the P. falciparum parasite at the genomic level. Using next-generation sequencing data from 30 clinical isolates, he was identified SNPs and adopted a population genomics approach to track the genetic structuring of the parasite. To do this, the PCA [6] [7], Jost’D’s index [8] and PERMANOVA [9] have been applied.
The present work highlights a genetic signature of host HbAS strains, characterized by a weak to moderate allelic differentiation compared to those infecting HbAA and HbSS hosts. The results are supported by multivariate analysis and identification of the most discriminating SNPs, located in genes functionally related to erythrocyte invasion and immune escape, indicating adaptive defense mechanisms specific to heterozygous blood environments.
This study is organized into three sections: (i) materials and methods, describing the experimental protocols, the bioinformatics tools used and the statistical approaches implemented; (ii) results and discussion, presenting and responding to the results obtained and implemented in the context of the existing literature; (iii) finally, conclusion and perspectives synthesizing the contribution of the present work to the understanding of host-parasite co-evolutions and observing the perspectives opened up in the context of future research and malaria control in areas with high prevalence rates of hemoglobinopathies.
2. Materials and Methods
2.1. Study Area and Sample Collection
The biological samples used in this study were collected from two sites in Abidjan. The first corresponds to the clinical hematology department of the University Hospital Center (CHU) of Yopougon. It is a reference center for the care of patients with sickle cell disease in Abidjan. The second is the Community-Based Urban Health Training (FSUCOM) of Anonkoua-kouté located in the commune of Abobo. This site was chosen for the recruitment of sickle cell trait (HbAS) carriers and the control group of healthy individuals with normal hemoglobin (HbAA). These two sites were selected because of their central role in the medical management and follow-up of patients with sickle cell disease (Figure 1).
The data used in this study are from a preliminary study conducted as part of the DREPAL project. The collection phase took place from May 2017 to January 2018. A total of 30 patients with febrile episodes suggestive of malaria had participated in this exploratory phase. The participants were divided into three groups of 10 individuals. These include patients with major sickle cell disease: HbSS, sickle cell trait: HbAS, and finally the non-sickle cell control population: HbAA. Sickle cell status was initially determined by hemoglobin electrophoresis and confirmed by the FRET (Fluorescence Resonance Energy Transfer) method. The strains of Plasmodium falciparum used were those purified from blood samples with EDTA tubes and stored at −80˚C.
Figure 1. Study area.
2.2. DNA Extraction and Library Preparation for Illumina
Sequencing
Genomic DNA was extracted from whole blood samples from patients infected with Plasmodium falciparum. To improve the yield of the parasitic DNA and reduce host contamination, the samples were previously enriched by saponin treatment followed by selective lysis of uninfected erythrocytes and successive washes with PBS 1X. DNA extraction from the samples was achieved with the DNA Mini Kit, QIAamp DNA Mini Kit (Qiagen). This protocol was carried out like the manufacturer’s protocol, however, the information received was that it was necessary to expand the protocol for samples with low parasitaemia. Then, 200 μL of blood was incubated in AL and proteinase K solution at 56˚C for 30 min. The genomic DNA was precipitated and purified in silica columns and different buffer solutions. Finally, genomic DNA was eluted in 100 μL of nuclease-free ultrapure water and stored at −20˚C for further analyses. DNA quality was assessed by spectrophotometry with the A260/A280 between 1.8 and 2.0. Quantitation was done in the InvitrogenTM QubitTM 4 Fluorometer by the dsDNA High Sensitivity Assay Kit (ThermoFisherTM). Integrity by 1% agarose gel electrophoresis. The libraries were prepared for sequencing with 100 ng of total DNA per sample with the Nextera XT DNA Library Preparation Kit (Illumina®). This protocol had a tagging strategy linked to technology. The complex DNA fragment index is added with i5 and i7 adapters at the ends to identify each sample.
After purification by AMPure XP magnetic beads (Beckman Coulter) and verification of fragment sizes between 350 - 500 bp by capillary electrophoretic measurement (Agilent Bioanalyzer® or Tapestation®), the libraries were quantized QubitTM. The libraries were eventually assembled into a pool and standardized with 4 nM and then sent to the Institut Pasteur in Paris for sequencing.
The sequencing was carried out with the Illumina MiSeq in 2 × 300 bp paired-end mode. The data generated was then used for downstream bioinformatics analyses, including variant detection and comparative genomic evaluations.
2.3. Bioinformatics Analysis
The analysis began with quality control of the raw sequences using FastQC [10] to evaluate parameters such as Phred scores, the distribution of read lengths, and the presence of contaminants. The results were synthesized and visualized with MultiQC [11], providing an overview of sequence quality and facilitating the identification of anomalies to be corrected before subsequent analyses. The sequences were cleaned using Trim Galore (https://github.com/FelixKrueger/TrimGalore) which removes low-quality adapters and bases. This ensures that the data is reliable for the analyzed.
The process of extracting the parasitic sequences was done in two steps via BWA [12] and Samtools [13]: an alignment was performed on the human reference genome (HG19) to remove the human sequences, then a second alignment was made with the genome of Plasmodium falciparum (3D7) to retain only the parasitic sequences.
The BAM files (alignment file), have been marked with Picard [14], to correct the distortions associated with duplicate reads.
SNPs were then detected with FreeBayes [15], and then filtered with VCFtools [16] by applying strict quality criteria (DEP > 5, QUAL > 5). The results were recorded in a VCF (Variant Call Format) file containing detailed information about the extracted SNPs.
Finally, the functional annotation of the variants was executed with VEP (Variant Effect Predictor) [17], determining the potential impact of mutations on genes and proteins.
2.4. Genotypes Based SNP Matrix
The multivariate statistical analyses performed in this study are based on the construction of a coded matrix of SNPs. Genotypes extracted from generated bioinformatics VCF files were organized to obtain a matrix in which each row represents an individual while each column represents a SNP. In order to simplify the use of this matrix for downstream analyses, the genotypes, usually represented by traits such as (0/0) for the major homozygous genotype, (0/1) or (1/0) for the heterozygous genotype, and (1/1) for the minor homozygous genotype, were transformed into numerical values: 1 for (0/0), 2 for (0/1) or (1/0), and 3 for (1/1).
Let G be this coded matrix of genotypes of dimension n × m, where n represents the number of individuals in the study, m represents the number of SNPs (Single Nucleotide Polymorphisms) analyzed and represents the coded genotype of individual i for SNPj, with
.
(1)
2.5. Multivariate Statistical Analysis
Among the multivariate statistical methods used in our study, we should mention:
Principal Component Analysis (PCA), performed on a matrix of coded SNPs for which an additional variable, the sickle cell status of the individual.
PERMANOVA, used to verify the statistical significance of genetic differences between groups represented by sickle cell status.
finally, the pair Jost D-index was used to detract significant SNPs and then evaluate the genetic differentiation between sickle cell status pairs.
All of these methods have been implemented with R, version 4.0.3 and RStudio, version 1.3.1056.
2.5.1. Principal Component Anaysis (PCA)
Prior to the analysis, the data were standardized to ensure that the variance of each SNP was zero (0) and the mean was one (1).
(2)
where
is the mean of the SNP j,
the standard deviation of the SNP j and
the standardized matrix.
The PCA was carried out using Σ, the covariance matrix calculated on normalized data, where n is the number of individuals, and
is the transpose of the standardized matrix of G.
(3)
This method simply consists of a decomposition of the covariance matrix into eigenvalues and eigenvectors, carried out via spectral decomposition, that is to say, the resolution of the following equation:
(4)
where
is the eigenvector which is associated with the eigenvalue
of the jith principal component.
This leads to the diagonal matrix of eigenvalues Λ and the matrix V of eigenvectors:
(5)
where
PCA reduces the dimensionality of the data, but maintains its genetic variability among groups based on:
(6)
It thus facilitates the visualization of individuals and their structuring in the factorial space. In addition, it contributes to the identification of atypical individuals and clarifies the links between genotypes and clinical status.
2.5.2. PERMANOVA
PERMANOVA (Multivariate Analysis of Variance by Permutation) tests for significant differences between groups. It does this in multidimensional space using data permutations and does not require the use of assumptions such as normalities and homogeneities of variances. By applying random permutations of groups, this technique based on distance matrices makes it possible to evaluate the observed differences between groups in a more robust and flexible way.
Using a Euclidean distance, the first step is to calculate the distance matrix D between individuals.
(7)
where each element
represents the distance between individuals i and j.
(8)
The first step of the PERMANOVA involves the calculation of two inter and intra group variances, respectively.
The within-group variance for each pair of groups k and l is calculated as follows:
(9)
where:
is the within-group variance for group k,
is the number of individuals in group k and
is the distance between individuals i and j.
The inter-group variance is calculated as follows:
(10)
where:
and
are the number of individuals in groups k and l and
is the distance between individuals i (from group k) and j (from group l).
The F statistic of this test is defined by the ratio of inter-group variance to within-group variance.
(11)
PERMANOVA tests the significance of F by performing random permutations and recalculating the previous statement each time. The p-value is then gleaned from the percentage of permutations for which F is equal to or greater than the observed value.
If the p-value is less than an arbitrarily chosen value, it is customary to reject the assumption that there are no significant genetic differences between the study groups.
2.5.3. Jost’s D
Jost’s D measures genetic differentiation between two groups based on the allele frequencies of SNPs. To calculate this index, we need to know the allele frequencies of each SNP in each group.
Let
be the frequency of the major allele for the j SNP in group A, and
the frequency of the major allele for the j SNP in the B group. The Jost’s D for each SNP is expressed as follows:
(12)
where
is the genetic differentiation for SNPj, measured on a scale of 0 to 1.
The Jost’s D average is calculated for all the m SNPs analyzed to calculate this index.
(13)
The overall Jost’s D and SNPs Jost’s D evaluate genetic differentiation between groups and SNPs. A high overall Jost’s D, close to 1, indicates strong differentiation while values between 0.05 and 0.30 suggest moderate to substantial differentiation. SNPs with Jost’s D ≥ 0.15 are needed to identify the SNPs responsible for the difference between groups.
3. Results and discussion
3.1. Results
3.1.1. Descriptive Analysis of SNPs
Variant detection carried out as part of the bioinformatics analysis identified a total of 376,320 SNPs among the 30 Plasmodium falciparum strains studied.
Figure 2 shows a Venn diagram illustrating the distribution of SNPs according to the genotypic groups AA, AS, and SS. From this diagram, it can be seen that, of the 376,320 SNPs, 41,997 are specific to group AA, 18,289 to group AS and 17,276 to group SS. In addition, 11,873 SNPs are shared between the AA and AS groups, 13,464 between AA and SS, and 3878 between SA and SS, while 16,269 SNPs are common to all three groups. These results show a strong genetic heterogeneity. They underline on the one hand the intrinsic variability of each genotype and, on the other hand, the existence of a genetic basis common to the three groups.
Figure 2. Venn diagram showing the distribution of SNPs according to the genotypes (AA, AS, and SS).
3.1.2. Results of Principal Component Analysis (PCA)
The Principal Component Analysis was performed on the matrix of coded SNPs, obtained from the data from this study, including sickle cell status as an additional variable. Applying the Kaiser criteria, the slope graph and the proportion of variance explained, the first seven components were kept (Figure 3).
Figure 3. Graph of the decrease in variances.
Visualization in the first three factorial planes shows clear differentiations between the genotypic groups. The two individuals in group AA, particularly separated from their group, seem to be atypical. Two major genetic structures could be distinguished, one composed exclusively of AS individuals and the other of the group grouping AA and SS with a clear genetic differentiation between AS and other genotypes (Figures 4-6).
Figure 4. Representation of individuals in the first factorial plane.
Figure 5. Representation of individuals in the second factorial plane.
Figure 6. Representation of individuals in the third factorial plane.
3.1.3. PERMANOVA and JOST’D Results
The PERMANOVA method, with a p-value of 0.0436 (less than 0.05) and an F-statistic of 2.4, indicates that the HbAA, HbAS, and HbSS genotypes affect the genetic divergence between the groups of individuals studied in this analysis.
The analysis of the pairwise Jost’s D moderated between the genotypic groups HbAA and HbAS, as well as between HbAS and HbSS. Indeed, the Jost’s D levels between HbAA and HbAS, and between HbAS and HbSS are 0.19 and 0.17, respectively. Which implies an intermediate status for the HbAS group, which is different compared to the homozygous AA and SS groups. Also, it is low between HbAA and HbSS (0.04), which means that HbAA is close to SS, but is different from HbAS. The analysis of Jost’s D thus confirms that genotypic status here is an important factor in the genetic structuring of individuals.
Comparative analysis using Jost’D by SNP for HbAS and HbAA strains revealed 241 SNPs with significant genetic differentiation and associated medium to high Jost’s D values > 0.15. A subset of these non-synonymous SNPs is located in genes functionally related to erythrocyte invasion and immune response modulation (PF3D7_0300200 – clag3.2, PF3D7_0100100 – acs7, PF3D7_1200600 – var2csa, PF3D7_0800100 – var1csa, PF3D7_0400400 – var).
On the other hand, the SS and AA analysis identified two significant SNPs, located in intergenic regions. Even if they are not present in coding sequences, the two variants could play a regulatory role, regulating the expression of adjacent genes involved in the virulence, metabolism and adaptation of the parasite. These results draw attention to the potential role of non-coding elements in adaptive processes, particularly in the context of SS red blood cells, which are marked by high oxidative stress and disruption of membrane structure.
3.2. Discussion
This study sought to further investigate the effect of the hemoglobin status of the human host on the genetic structuring of Plasmodium falciparum in a malaria-endemic context, using a population genomics approach. The conclusions obtained provide solid new evidence concerning the adaptive dynamics of the parasite in response to the erythrocyte pressures imposed by sickle cell anemia.
The joint use of PCA, PERMANOVA and Jost’s differentiation index led to the discovery of internal genetic patterning patterns. This combination of multivariate methods was essential to estimate the effect of host hemoglobin status on P. falciparum diversity. PCA showed a partial but very clear separation of HbAS-related strains and PERMANOVA confirmed the important contribution of host genotype to genomic variance. Such a methodological choice was, therefore, consistent with that currently used in the analysis of population genomics data to explore adaptation signals under complex epidemiological conditions [18]-[22].
All the discriminating mutations identified in this study, including those affecting genes known for their role in erythrocyte invasion and modulation of the immune response of the host clag3.2, acs7, rh5, var2csa, var1csa, PfEMP1, conclude that host-parasite co-evolution occurs during a Plasmodium falciparum - human host interaction. This adaptive theme is supported by an evolutionary drift between Plasmodium falciparum and the human host. These processes evolve over the long term in endemic areas where the parasite is subject to specific selection pressures that exclusively concern hemoglobin status. In addition, a significant proportion of SNPs could also be found in intergenic vegetation and can possibly affect the regulation of transcription in the parasite [6] [23]-[26].
In the context of molecular surveillance of malaria, these observations have important practical implications. They imply the need to take into account the genetic diversity of human populations in the design of malaria control strategies. In addition, practices confirm the value of integrating data into common databases, such as MalariaGEN [27] or PlasmoDB [28], for longitudinal monitoring of adaptive variants. This is to be more foreseeable in relation to the risks of vaccine or therapeutic escape. Notably, the variability of diversity in the multigene families of var and rifin could reduce the effect of some interventions based on antigenic recognition.
Although the present work is not directly in line with the specific initial objectives of the DREPAL project, this study can prove to be a relevant valorization of the material and biological resources that have been generated in this framework. As a result of this additional focus, access to the data has been partially restricted. However, it can be noted that even if the sample size obtained is relatively small (n = 30), the main objectives of the study have nevertheless been achieved: it is to develop an accessible, reproducible methodological approach capable of generating exploratory data on mutations in the Plasmodium falciparum genome in sickle cell hosts. At this stage, the work is mainly functionally descriptive. The biological implications of the identified SNPs, in particular their potential effects on protein expression, structure or function, will be investigated by means of experimental analyses in subsequent work. In this context, the integration of multiomics data (transcriptomics, proteomics, metabolomics), coupled with molecular dynamics simulation methods, seems very promising in refining the functional characterization of the identified mutations. This type of approach has already been successfully used to understand the mechanisms of adaptation of the parasite to different hostile cellular environments. In addition, evolutionary-specific experimental models with HbAA, HbAS, and HbSS erythrocytes are needed to test the adaptation hypothesis. Finally, better integration of epigenomic and immune data would be richer in information on the underlying mechanisms of host-parasite coevolution.
4. Conclusion
Overall, this work revealed a strong genetic structuring of Plasmodium falciparum as a function of the host’s hemoglobin state, identifying signatures of adaptation to disturbed intracellular environments related to sickle cell disease. Through the use of rigorous statistical methods such as PCA, PERMANOVA and Jost’s D, coupled with a focus on coding and non-coding SNP data, a number of differentiating mutations in relevant functional genes were found. This suggests that genetic plasticity in the parasite is influenced by selective pressures resulting from the high levels of HbAS and HbSS red blood cells, in conjunction with a host/parasite co-evolution model. In addition to the fundamental aspect, this information could have direct applications in the field of malaria genomics and the improvement of control strategies, particularly in regions heavily affected by haemoglobinopathies. In addition, the incorporation of this knowledge into global molecular monitoring programs and the deployment of large-scale functional and multi-omics approaches will be essential to further identify adaptive mechanisms. In the long term, these initiatives could pave the way for more targeted control strategies that take into account the genetic variability of the parasite and its host.
Acknowledgements
We would like to thank the Paludology Unit and the Molecular Genetics Platform (PFGM) of the Institut Pasteur de Côte d’Ivoire for their methodological support and the provision of their study resources. We extend our sincere thanks to Drs. Diplo Bernadette and Akoguhi N’guessan Patrice for their comments and constructive criticisms that have strengthened this book. Our thanks also go to the National Computing Center of Côte d’Ivoire (CNCCI) for providing the IT infrastructure that simplified the analysis of NGS data.
Financing
This study received funding from Rotary International.
Authors’ Contribution
SEA designed the bioinformatics analysis pipelines, analyzed the data, wrote and approved the final version of the manuscript. SYO contributed to the design of the pipelines and data analysis. ABA, BMK, and AAG analyzed the data and reviewed drafts of the manuscript. OAT, KJA, and RJ reviewed the drafts of the manuscript and approved its final version. Pipeline Data and Analysis The data and analysis pipelines generated for this study are available upon request to authors.