Bioinformatics Analysis Revealed Potential Tumor Suppressors (KLF4/CGN), Oncogenes (SHH/LIF) and Biomarkers of Asian Stomach Adenocarcinoma ()
1. Introduction
More than one million people worldwide are diagnosed with gastric adenocarcinoma or stomach adenocarcinoma (STAD) each year. STAD is the fifth most prevalent cancer and the third leading cause of cancer-related death in the world. Gastric cancer is more common in Asia than in most Western countries. The incidence in East Asia is about 50 cases per 100,000 people, about 10 times more than that in North America [1] [2]. There is an urgent need to identify potential novel oncogenes and tumor suppressor genes, and biomarkers of diagnosis, therapy and prognosis specifically for Asian STAD.
Weighted correlation network analysis (WGCNA) [3] is a data reduction and unsupervised classification method. It simplifies the interpretation of many gene responses to multiple synthetic genomes (or modules). The net establishes a link between genes whose expression is related. There will be connections between genes, depending on the value of the correlation (weight). Connectivity between genes is then interpreted as distance, with which genes are grouped into modules. This is the way to reduce many genes to several clusters, the expression of which is quantified by Eigengenes (the first principal component in the module). It assumes that highly related genes in the module are involved in a common biological process.
At present, there are only 4 articles in PubMed in which WGCNA was used to study the gastric cancer transcriptome and none of the 4 articles specifically studied the Asian human STAD transcriptome [4] [5] [6] [7]. Here we used WGCNA together with other bioinformatic tools to identify potential novel oncogenes, tumor suppressor genes, and biomarkers of diagnosis, therapy and prognosis specifically for Asian STAD.
2. Materials and Methods
2.1. Installation of R and Perl
R x64 4.03 language installation package were downloaded from https://www.r-project.org/, Perl installation package downloaded from https://www.perl.org/, and they were installed as instructed.
2.2. Datasets from TCGA
With filters of Cases (Stomach, TCGA, TCGA-STAD, Adenomas and Adeno Carcinomas and Asian) and Files (Transcriptome Profiling, Gene Expression Quantification, HTSeq-Counts and Txt), 74 samples (7 normal tissues and 67 tumors) were filtered out from the TCGA database (https://portal.gdc.cancer.gov/) and the required files were exported via Cart. The exported files were decompressed, and the 74 samples were processed through Perl to merge them together and calculated to get a Matrix file. The matrix file ENSG ID then was converted into Symbol ID by R language.
With filters of Cases (Stomach, TCGA, TCGA-STAD, Adenomas and Adeno Carcinomas and Asian) and Files (Clinical and Bcr xml), then clinical datasets of 443 samples were obtained from TCGA. Clinical information of gene ID, futime, fustat, age, gender, grade, stage, T, M and N was obtained through Perl.
2.3. Datasets from GEO
The Expression Profiling by Array (Series GSE 54129) was obtained from GEO (https://www.ncbi.nlm.nih.gov/). We contained information on 111 human gastric cancer tissues and 21 non-cancerous gastric tissues, which were collected in Ruijin Hospital, SJTU, China. Through Perl processing of the GEO file, we obtained one probe, one matrix file, and one annotation file.
2.4. Differential Expression Analysis
The differential expression analysis environment was constructed with four packages including limma, edgeR, pheatmap and ggplot2 through R software to output the analysis results. Take logFC filter conditions as logfcfilter = 1; After correction, the P value filtering condition was set as fdrFilter = 0.05. TCGA datasets and GEO datasets were analyzed, respectively.
2.5. Construction of WGCNA and Identification of Important Modules
Data was processed using R 4.0.3 software. To ensure the reliability of the network construction, the abnormal samples were deleted. Pearson correlation coefficient was calculated to assess the similarity of gene expression profiles, and then the correlation coefficient between genes was weighted by a power function to obtain a scale-free network. In terms of co-expression, gene modules are densely interconnected gene clusters. WGCNA uses hierarchical clustering to identify gene modules and colors to indicate modules. Dynamic tree cutting was used to identify different modules. In the module selection process, the adjacency matrix (a measure of topological similarity) was converted into a topological covering matrix (TOM, no graph was output due to the limited function of the computer), and the modules were detected by clustering analysis.
The WGCNA and limma packages of R were used. The minimum gene module size was set to 50 to obtain modules of appropriate size and the threshold for merging similar modules was set to 0.25. TCGA datasets and GEO datasets were analyzed, respectively.
2.6. Venn Diagram
TCGA differential expression analysis results, GEO differential expression analysis results, TCGA turquoise module and GEO black module were made into Venn Diagram by Venn Diagram package in R, and the intersection gene data text of the four kinds of data were output at the same time.
2.7. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) Enrichment Analysis
Intersection gene data text showed gene by symbol ID. We translated it into entrezID with R’s org.Hs.eg.db package. Then the enrichment analysis was done by R to library clusterProfiler, org.Hs.eg.db, enrichplot and ggplot2 packages. p value filter condition was set as p value filter = 0.05; the adjusted p-value filter condition was set as q value Filter = 1.
2.8. PPI (Protein Interaction Network) Analysis
On the website https://string-db.org/, we input the intersection genes of Venn, selected species, and then got the network diagram of protein interaction. We highlighted the connection of these genes by setting the minimum required interaction score. It can also set medium confidence (0.400) or hide disconnected nodes in the network. We output proteins interactions file and protein interaction diagram.
2.9. Cytoscape
We installed java, then installed Cytoscape. We imported the files obtained from PPI processing into Cytoscape to obtain hub genes.
2.10. HPA
The obtained ten hub genes were introduced into the http://www.proteinatlas.org/, to obtain immunohistochemical images related to the target genes and diseases.
2.11. Survival and Clinical Analysis
Overall survival of STAD patients was analyzed using the Kaplan Meier plot (http://kmplot.com).
3. Result
3.1. 6652 Differentially Expressed Genes Were Identified between STAD and Normal Samples
A total of 4859 differentially expressed genes from TCGA and 1793 differentially expressed genes from GEO were identified between STAD and normal samples as shown in Figure 1(a) and Figure 1(b) (TCGA) and Figure 1(c) and Figure 1(d) (GEO).
3.2. 13 key Modules Were Identified in STAD by WGCNA Analysis
The first 25% differentially expressed genes in TCGA and GEO data were separately used for cluster analysis through WGCNA package. Hierarchical clustering trees were constructed using the gene expression data with the height threshold limit and screen out outliers. The rest of the data was used to construct and weight the co-expression network. To determine the optimal value of soft threshold (power), the analysis needs were made in a certain range and the scale-free condition. When the power value was set to 1 and 8 (Figure 2(a)), the connectivity between genes in the network satisfied the scale-free network distribution. By combining the modules with higher similarity of characteristic
memes (MEME, a WGCNA term for a module with the same characteristics) using the dynamic mixing shearing method, 5 and 8 MEME modules were finally obtained for TCGA (Figure 2(b)) and GEO (Figure 2(c)) data sets, respectively. We found that the expression of the genes in turquoise modules of TCGA samples and in black modules of GEO samples have the greatest correlation with the tumor tissues (Figure 2(d)).
3.3. 293 Potential STAD Associated Genes Were Identified from Intersection by Venn Diagram
To reduce the false positive rate of the results, the Venn diagram of the above four data sets (TCGA turquoise module, GEO black module, TCGA differentially expressed genes and GEO differentially expressed genes) was used for intersection. There are 293 genes that are intersected in all 4 groups, and pictures and text files were output (Figure 3).
![]()
Figure 3. 293 STAD associated genes were identified from intersection by Venn Diagram.
3.4. The 293 Intersected Genes Were Enriched in Cell Cortex and Infection by GO and KEGG Analysis
GO and KEGG analyses were carried out to explore their biological function of the 293 intersected genes identified by Venn Diagram. GO analysis showed that these genes are mainly involved in the processes of cell cortex, secretory granule lumen, carbohydrate binding and actin binding. KEGG analysis showed that these genes were highly correlated with Hepatitis C, pathogenic Escherichia coli infection, amino sugar and nucleotide sugar metabolism, complement and coastal cascades, and Histidine metabolism (Table 1 and Figure 4).
3.5. 10 Hub Genes Were Identified from PPI and Cytoscape Analyses of the Intersected Genes
PPI analysis of the 293 intersected genes identified by Venn Diagram sorted out 354 pairwise links and deleted the genes that were not connected with the subject network (Figure 5). Then we imported the interactive file obtained from the PPI analysis into Cytoscape to construct a gene network and screened out 10 hub genes (KLF4, CGN, SHH, LIF, GATA6, FOXA2, OCLN, FOXA1, CLDN1 and NQO1) according to their degree of associations (Table 2 and Table 3, Figure 5). Among the 10 hub genes, KLF4 and CGN expression was decreased while other 8 gene expression was increased in STAD tumors compared to normal tissues (Table 3). When we imported into the HPA official website, immunohistochemical images of malignant gastric cancer tissues verified the expression of these hub genes.
3.6. KLF4/CGN Low and SHH/LIF High Expression Were Associated with Short Overall Survival of Asian STAD Patients
Interestingly, using the Kaplan Meier plot, we found that SHH/LIF high expression among tumors was associated with short overall survival of Asian STAD patients, while low expression of KLF4/CGN and the other 6 hub genes among tumors was associated with short overall survival of Asian STAD patients (Table 4 and Figure 6). The association of the gene expression among tumors and
![]()
![]()
Figure 4. The 293 intersected genes were enriched in cell cortex and infection by GO (left) and KEGG (right) analysis.
![]()
Table 1. The most significant biological function of the 293 STAD associated Venn intersection gene from KEGG.
![]()
![]()
Figure 5. 10 hub genes were identified from PPI and Cytoscape analyses of the intersected genes.
![]()
Figure 6. KLF4/CGN low and SHH/LIF high expression were associated with short overall survival of Asian STAD patients. The 10 genes shown in the figure are negatively (LIF/SHH) or positively (the other 8 genes) correlated with the survival time of STAD patients.
![]()
Table 2. The 10 hub genes filtered based on MCC information.
Note: maximal clique centrality (MCC); density of maximum neighborhood component (DMNC); maximum neighborhood component (MNC); edge percolated component (EPC).
![]()
Table 3. Differential expression of the 10 hub genes in STAD tumors verse normal tissues (GEOdiff).
LogFC means log (expression in cancer/expression in normal tissue). The higher the value, the higher the expression levels in cancer tissues are. Minus of LogFC means the expression is lower in cancer tissues compared to that in normal tissues from GEO datasets series GSE 54129.
![]()
Table 4. The survival times of the 10 hub genes.
overall survival are consistent with the KLF4/CGN low expression or SHH/LIF high expression in the STAD tumors compared to normal tissues (Table 3). Our findings support that KLF4/CGN is potential tumor suppressor and SHH/LIF is potential oncogene for STAD of Asian patients.
4. Discussion and Conclusion
STAD is a kind of high incidence and mortality tumor in adults, especially in Asian, and there is an urgent need to identify potential novel oncogenes, tumor suppressor genes, and biomarkers of diagnosis, therapy and prognosis, specifically for Asian STAD. In this study, the gene expression data of the STAD transcriptome were analyzed, and a total of 4859 differentially expressed genes from TCGA and 1793 genes from GEO were identified in STAD, followed by identification of the five modules of TCGA and eight modules of GEO. These differentially expressed genes and the tumor’s associated module genes were subjected to Venn intersection, and 293 intersected genes were obtained. These intersected genes are enriched in a few biological processes and biological functions, such as chemotaxis, inflammatory reactions, angiogenesis, cell cycle, etc., which may be related to the occurrence and development of the cancer. Ten hub genes were screened out from these 293 intersected genes for association analysis with survival in patients with STAD. Our results showed that KLF4/CGN low and SHH/LIF high expression were associated with short overall survival of Asian STAD patients. Our bioinformatics analysis revealed potential novel tumor suppressors (KLF4/CGN) and oncogenes (SHH/LIF) and biomarkers for diagnosis, therapy and prognosis of STAD, specifically for Asian patients [8] [9] [10].
In our study, the system biology-based methods, including WGCNA, were used to identify the 10 network hub genes related to STAD, namely KLF4, CGN, LIF, SHH, GATA6, FOXA2, OCLN, FOXA1, CLDN1 and NQO1. While KLF4/CGN low and SHH/LIF high expression were associated with short overall survival of Asian STAD patients, the other 6 hub genes (GATA6, FOXA2, OCLN, FOXA1, CLDN1 and NQO1) were all expressed higher in STAD tumors compared to the normal tissues, but higher expression in tumors showed longer overall survival of STAD patients. The inconsistence may suggest that the 6 hub genes may be passenger genes or expressed as an active compensation to suppress tumor progression. It has been shown that KLF4, CGN, GATA6, FOXA2, OCLN, FOXA1, CLDN1 and NQO1 can inhibit the occurrence and development of tumors [11] - [17], while SHH and LIF promote tumor occurrence and development [18] [19]. LIF promotes tumorigenesis and metastasis of breast cancer, Shh Bladder promotes cancer stemness and tumorigenesis. At present, experimental studies on CGN are not found. KLF4 can inhibit proliferation and invasion of breast cancer [20], reduce the impact of chemotherapy on colon cancer cells [21]. As an important signal transduction pathway for the occurrence and development of cancer, SHH has become an important target gene for the treatment of medulloblastoma and pancreatic cancer [22] [23]. LIF can be induced by HIF-2α and promotes tumor progression, metastasis, and chemical resistance in a variety of solid tumors [24] [25]. The potential novel tumor suppressors and oncogenes and biomarkers identified here need to be further validated.
Some limitations of our study should be mentioned. First, this was a retrospective design study, not a prospective cohort study. In addition, a large sample size was required to verify our findings. Thirdly, these results may be validated experimentally in future.
Funding
This work was supported by National Natural Science Foundation of China (81872412).
NOTES
*Authors contributed equally.
#Corresponding authors.