1. Introduction
Kidney cancer is among the ten most common cancers worldwide (1), (2); unfortunately, it is difficult to detect early by routine clinical means. It is not a single disease; rather, it comprises several histologically and genetically distinct types of cancer, each with its own clinical course and response to therapy (3), (4). The Cancer Genome Atlas (TCGA) Research Network has conducted a series of comprehensive molecular characterizations of the distinct histologic types of kidney cancer (5).
                  
               
Kidney cancer is the sixth most frequent cancer in males and the tenth in females, representing 5% and 3% of all new cases, respectively (6). Gender disparities in kidney cancer incidence have been reported, with a higher incidence and worse outcomes in males (7). Over half of all people aged 50 and older have kidney cysts, which are fluid-filled, usually benign (noncancerous), and do not require treatment (8). Solid tumors of the kidney are rare; however, approximately three-quarters of these tumors are cancerous, with the potential to spread (9). According to the Centers for Disease Control and Prevention (CDC), in the United States in 2014, black men were the most likely to get kidney cancer (24.7 per 100,000), followed by white men (22.0 per 100,000). Among women, African-American women were the most likely to get kidney and renal pelvis cancers (12.4 per 100,000), followed by Hispanic women (11.9 per 100,000) (10). Studies suggest that the distribution of kidney cancer subtypes differs between racial groups (11), (12). Race and ethnicity contribute to inter-tumoral heterogeneity in cancer, from disease incidence, morbidity, and mortality rates to treatment outcomes (13), (14). Therefore, the identification of population-specific molecular biomarkers is essential (15).
                  
               
Identifying genes that contribute to the prognosis of cancer patients is one of the challenges in providing appropriate treatment. Critical challenges in bioinformatics include finding biomarkers that represent the state of a patient and predicting the prognosis of cancer patients. The number of genes is enormous compared with the number of patients, which makes the data challenging to analyze. To address this, significant genes that represent the state of patients must be extracted. In addition, developing a classification model from the extracted genes may be helpful for early diagnosis and for predicting the prognosis of cancer patients. Cancer is caused by genetic variation that damages the genes regulating orderly cell replication; as a result, cells multiply without limit, invade adjacent normal tissues, and spread throughout the body. Because cancer stems from mutated genes, it is regarded as a genetic disorder, although only a small number of cancers are hereditary. When a mutation occurs in a germ cell, it is passed down through generations and is present in every somatic cell (16). To predict the state of patients, researchers have applied deep learning techniques to analyze mutations in sequences, and studies have accurately predicted major mutations that cause diseases such as spinal muscular atrophy, hereditary nonpolyposis colon cancer, and autism (17).
                  
               
Kidney cancer is a primary tumor arising from the kidney; renal cell carcinoma is the malignant type that accounts for over 90% of cases (2), (18). Because kidney cancer has no symptoms in the early stages, the disease has often progressed by the time it is discovered. According to the national cancer registration statistics published in 2020, among the 243,837 cases of cancer in 2018, 5,456 were attributed to kidney cancer, accounting for 2.2% of all cancer cases. By gender, kidney cancer ranked eighth, with 3.0% (3,806 cases) of all male cancers (19). In addition, the symptoms and treatment of kidney cancer decrease patients' quality of life by increasing the disease burden and medical costs. Risk factors for kidney cancer include environmental and lifestyle factors, genetic factors, and pre-existing kidney disease; among the lifestyle factors, smoking, obesity, high blood pressure, and eating habits are associated causes (20). Recently, researchers extracted features from the genetic data of kidney cancer patients and applied classification algorithms based on neighborhood component analysis (21). Furthermore, in previous work, we used big data from a large cohort (the KOTCC database) of kidney cancer patients collected from eight domestic medical institutions to extract variables affecting kidney cancer recurrence, and we applied a machine learning algorithm to predict recurrence within five years of surgery (22).
                  
               
In this study, we propose a method to extract genes that affect prognosis prediction in kidney cancer patients using deep learning algorithms, and we apply classification algorithms based on the extracted genes to predict patient prognosis. We combined gene expression data and clinical data of kidney cancer patients obtained from the TCGA data portal to extract genes that contribute to patient prognosis, and we applied classification techniques to demonstrate their utility (23). We selected gender (male, female), sample type (primary tumor, normal), and race (white, black, Asian) as the target variables for analysis. Notably, we extracted genes from kidney cancer patients based on gender, sample type, and race to overcome heterogeneity and to obtain genetic biomarkers that could allow more accurate prognosis prediction. After testing the functionality of the genes, we presented their applicability and developed the optimal prediction model by comparing and analyzing classification algorithms using the extracted genes.
               
             
            
                  2. Related works
Machine learning and deep learning algorithms are being applied to various analyses of biological data. Some studies predicted the risk of 20 cancers by applying machine learning techniques and artificial intelligence methods to genetic big data analysis (24). The Bayesian classifier has been applied to the problem of classifying proteins from sequence and structural information, and studies have also used Bayesian networks to combine various types of protein- and gene-related information, improving the predictive performance for gene function (25). As such, different machine learning technologies are being applied to the analysis of biological data. In a study utilizing the TCGA-KIRC database, CT and MRI scans and clinical data of 227 kidney cancer patients were used with deep learning to classify cancer stage (26). In another study, significant genes were extracted from kidney cancer data in TCGA using a deep autoencoder and compared with traditional methods such as the least absolute shrinkage and selection operator (LASSO); the resulting classification accuracies were compared against conventional state-of-the-art classification methods (27). Researchers have also integrated data from various cancer types, analyzed them using AE structures, demonstrated their applicability in clinical settings, and suggested ways to perform efficient posterior inference on large datasets with continuous latent variables and intractable posterior distributions via stochastic variational inference and learning algorithms (28). An AE is a neural network whose target output is its input x, used to extract features. By learning to reconstruct its input, the AE extracts basic or abstract properties that facilitate accurate prediction. In principle, a linear AE with a single hidden layer in a multi-layer perceptron is equivalent to principal component analysis (PCA) (29), (30). More generally, nonlinear autoencoders have been studied to extract key properties, including high-level features and Gabor-filter-like features (31), (32).
                  
               
The variational autoencoder (VAE) is a generative model that, given training data, produces new data by sampling from a distribution learned to match the actual distribution of the training data. It compresses high-dimensional input data into a smaller representation in stochastic form. Unlike the conventional AE, which maps inputs to latent vectors, the VAE maps input data to the parameters of a probability distribution, namely the mean and variance of a Gaussian distribution. This produces structured latent spaces and is therefore helpful for tasks such as image generation (33-35).
                  
               
A supervised autoencoder (SAE) is a neural network that jointly predicts the targets and the inputs (reconstruction). For a single hidden layer, this simply means that a classification loss is added to the output layer. For a deeper AE, the classification loss is added to the innermost layer, whose representation is usually handed off to the supervised learner after the AE is trained. The SAE uses the unsupervised auxiliary task to improve generalization performance (36-38).
                  
               
The CVAE is a modification of the existing VAE structure that enables supervised learning by adding a class label y to the encoder and decoder, so that category information is considered when learning the data distribution. The CVAE is a deep conditional generative model for structured output prediction using Gaussian latent variables. The model is trained efficiently in the stochastic gradient variational Bayes framework and allows fast prediction using stochastic feed-forward inference (39-41).
                  
               
We validated the performance of the proposed framework by comparing it with traditional data mining and classification methods. The proposed framework employs various AE-based deep learning techniques, taking advantage of pre-training and fine-tuning strategies. The experimental results show that the AE-based deep learning methods outperform combinations of traditional data mining and classification methods.
               
             
            
                  3. Methodologies
               
                     3.1 Architecture
The main challenge in analyzing genetic data is that the number of gene expression values far exceeds the number of samples. We propose a novel deep learning–based framework that combines various AE-based techniques for cancer analysis, compare it with the existing feature extraction methods PCA and NMF, and demonstrate its superiority. The following section describes a pre-training method for autoencoder-based feature extraction. Proper training of neural networks requires a large amount of learning data; however, we often have a small quantity of labeled learning data and a large amount of unlabeled learning data. In this case, the unlabeled learning data are used to pre-train each layer of the neural network, a process called unsupervised pre-training. AE and VAE have only a reconstruction loss in pre-training, whereas SAE and CVAE also include a classification loss. Once the parameters of each layer have been determined to some extent, the classification performance can be improved through fine-tuning using the labeled learning data.
                     
                  
In particular, feature extraction was first performed to compare traditional classification algorithms with deep learning techniques. We used conventional dimension reduction techniques, namely principal component analysis (PCA) and non-negative matrix factorization (NMF), followed by state-of-the-art classification algorithms, and we used deep learning techniques, namely the autoencoder (AE), variational autoencoder (VAE), supervised autoencoder (SAE; denoted CAE in the result tables), and conditional variational autoencoder (CVAE), followed by a neural network classifier. For significant gene selection using traditional classification algorithms, we used PCA and NMF, addressed the data imbalance problem, applied various classification techniques, and compared and analyzed the results. For deep learning–based significant gene selection, we used the improved algorithms based on the AE. We compared the extracted genes, and the classification accuracy was analyzed using a multi-layer perceptron (MLP).
                  
                  
                        3.1.1 Autoencoder
The AE is a deep learning structure for efficiently coding data. Coding here refers to compressing data: dimensionality reduction transforms data from a high-dimensional space into a low-dimensional space that represents the data efficiently. The neural network architecture of the AE has identical input and output layers and is constructed symmetrically, as shown in Fig. 1. Because dimensionality reduction is the goal of our study, we take input data X and obtain the hidden-layer value Z through weighted sums and activation functions; this part of the network is called the encoder.
                     
                     
                        
                        
                              
                              
Fig. 1. Architecture of autoencoder
                            
                        
                     
The AE has the same structure as an MLP, except that the input and output layers have the same number of neurons. Because the AE reconstructs the input, the output is called the reconstruction, and the loss function is calculated from the difference between the input and the reconstruction. Training an AE is unsupervised, and the loss can be derived from the maximum likelihood (ML) principle. Once the parameters producing the hidden layer Z have been determined to some extent, the labeled learning data can be used for supervised fine-tuning.
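As a concrete illustration, the following is a minimal PyTorch sketch of such a symmetric AE. The 5,000-gene input and 100-dimensional latent code Z match the setup used later in our experiments, while the 1,000-unit intermediate layer, MSE loss, and learning rate are illustrative assumptions rather than our tuned configuration.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Symmetric AE: the encoder compresses X to Z, the decoder reconstructs X from Z."""
    def __init__(self, n_genes=5000, n_hidden=1000, n_latent=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_genes),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent representation Z
        return self.decoder(z)   # reconstruction of the input X

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 5000)                     # dummy batch of expression profiles
loss = nn.functional.mse_loss(model(x), x)   # unsupervised reconstruction loss
loss.backward()
optimizer.step()
```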
                     
                   
                  
                        3.1.2 Variational autoencoder
A VAE encodes the input data X into two vector outputs, the mean (μ) and variance (σ²), which define a normal distribution. Sampling from this distribution creates a latent vector Z that passes through the decoder to produce new data similar to the existing input data. The VAE is therefore a generative model developed to generate new data using probability distributions. The structure of the VAE is shown in Fig. 2.
                        
                     
We use the ideal sampling function, the posterior $q_{\Phi}(z|x)$, for sampling, allowing the generator to learn the input data well. Equation (1) is used to make the value generated from the sampled latent vector equal to the input value. The maximum likelihood estimation that maximizes (1) measures how well the reconstruction restores data resembling the input when the latent vector Z drawn from the ideal sampling function $q_{\Phi}(z|x)$ is given:

$E_{z\sim q_{\Phi}(z|x)}\left[\log p_{\theta}(x|z)\right]$ (1)
                     
                     
                        
                        
                              
                              
Fig. 2. Architecture of a variational autoencoder
                                 
                              
                            
                        
                     
                     
                        
                        
                        
                        
                        
                        
                     
Finding an optimal formula that satisfies these conditions yields the evidence lower bound (ELBO) decomposition when X is given to the network as evidence:

$\log p(x) = E_{q_{\Phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z)\right) + D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z|x)\right)$ (2)
                     
The first term of (2) is the reconstruction term, indicating how well the input is restored from the ideal sampling function. The second term is a regularization term, which forces the ideal sampling function to be as close as possible to the prior; this condition encourages the sampled values to resemble draws from the prior. The third term is the distance between two probability distributions: the ideal sampling function $q_{\Phi}(z|x)$ and the true posterior $p(z|x)$. Because this third term is intractable but non-negative, dropping it leaves the first two terms as a lower bound that can be maximized.
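The following minimal PyTorch sketch shows how the first two terms of (2) are optimized in practice; the intractable third term is dropped, which is exactly why the remaining terms form a lower bound. The layer sizes, the Gaussian decoder implied by the MSE reconstruction term, and the hyperparameters are illustrative assumptions, not our tuned configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """VAE sketch: the encoder outputs the mean and log-variance of q_phi(z|x)."""
    def __init__(self, n_genes=5000, n_hidden=1000, n_latent=100):
        super().__init__()
        self.enc = nn.Linear(n_genes, n_hidden)
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_genes))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Negative ELBO from (2): reconstruction term plus KL(q_phi(z|x) || p(z)),
    # with the prior p(z) = N(0, I); the intractable third term is dropped.
    recon_term = F.mse_loss(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term
```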
                     
                   
                  
                        3.1.3 Supervised autoencoder
An SAE is an AE with a classification loss added to the representation layer. For a single hidden layer, this means that a classification loss is added to the output layer. For a deeper AE, the classification loss is added to the innermost layer, whose representation is usually handed off to the supervised learner after the AE is trained, as illustrated in Fig. 3.
                     
                     
                        
                        
                              
                              
Fig. 3. Architecture of the supervised autoencoder
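A minimal sketch of this joint objective follows, using the same illustrative layer sizes as the AE sketch in Section 3.1.1; the weighting factor alpha between the two losses is a hypothetical knob, not a value from our experiments.

```python
import torch.nn as nn
import torch.nn.functional as F

class SupervisedAutoencoder(nn.Module):
    """AE whose innermost (latent) layer also feeds a classification head."""
    def __init__(self, n_genes=5000, n_latent=100, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 1000), nn.ReLU(),
                                     nn.Linear(1000, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 1000), nn.ReLU(),
                                     nn.Linear(1000, n_genes))
        self.head = nn.Linear(n_latent, n_classes)   # classification loss on the latent layer

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.head(z)

def sae_loss(recon, x, logits, y, alpha=1.0):
    # joint objective: reconstruction (unsupervised auxiliary task) + classification
    return F.mse_loss(recon, x) + alpha * F.cross_entropy(logits, y)
```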
                            
                        
                     
                   
                  
                        3.1.4 Conditional variational autoencoder
The CVAE is a modification of the existing VAE structure that enables supervised learning. The CVAE adds a class label y to the encoder and decoder, considering the category information when learning the data distribution. Thus, in the CVAE, a particular condition is given and added to the encoder and decoder when the label information is known. The y-value is given along with x to find the latent vector z in the encoder, and the decoder likewise receives the y-value together with z to generate the data. Therefore, the loss function consists of the reconstruction loss and the classification loss. The structure is shown in Fig. 4.
                     
                     
                        
                        
                              
                              
Fig. 4. Architecture of the conditional variational autoencoder
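Continuing the VAE sketch of Section 3.1.2, the following illustrates the conditioning: a one-hot label y is concatenated to the encoder input and to the latent vector before decoding (layer sizes are again illustrative assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """CVAE sketch: the encoder and decoder both receive the one-hot class label y."""
    def __init__(self, n_genes=5000, n_hidden=1000, n_latent=100, n_classes=2):
        super().__init__()
        self.enc = nn.Linear(n_genes + n_classes, n_hidden)
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent + n_classes, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_genes))

    def forward(self, x, y_onehot):
        h = F.relu(self.enc(torch.cat([x, y_onehot], dim=1)))   # encoder sees [x; y]
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.dec(torch.cat([z, y_onehot], dim=1))       # decoder sees [z; y]
        return recon, mu, logvar
```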
                            
                        
                     
                   
                
               
                     3.2 Classifier
To establish a classification model, we employ a multilayer perceptron (MLP) classifier on top of the various autoencoder-based techniques. A multilayer perceptron is a neural network that connects multiple layers in a directed graph, meaning that the signal path through the nodes goes only one way. Each node, apart from the input nodes, has a nonlinear activation function. An MLP uses backpropagation as its supervised learning technique.
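For example, such a classifier can be instantiated with Scikit-Learn as follows; the hidden-layer size and iteration limit are illustrative choices, not the tuned values from our experiments.

```python
from sklearn.neural_network import MLPClassifier

# One-hidden-layer MLP trained by backpropagation on the latent features Z.
clf = MLPClassifier(hidden_layer_sizes=(64,), activation="relu", max_iter=500)
# clf.fit(Z_train, y_train); y_pred = clf.predict(Z_test)
```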
                     	
                  
                
               
                     3.3 Training
                  
                        3.3.1 Generative pre-training
                     The number of samples for a given phenotype prediction task is generally small; however,
                        many other gene expression profiles unrelated to this phenotype are available. These
                        profiles were grouped to form a large dataset of samples without labels. This unlabeled
                        dataset cannot predict the phenotype but helps construct a hierarchical representation
                        of gene expressions in the neural network. The idea is to find nonlinear combinations
                        of inputs that provide functional patterns for gene expression analysis. The unlabeled
                        dataset is used to initialize the weights of the MLP before supervised learning. We
                        pre-trained the AE models iteratively for each hidden layer to learn a denoising AE
                        that reconstructs the previous layer’s output.
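A sketch of one such layer-wise step, under stated assumptions (Gaussian corruption with an illustrative noise level, and illustrative layer widths):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One layer-wise denoising step: corrupt the previous layer's output and
# train an encoder/decoder pair to reconstruct the clean version.
encode = nn.Linear(1000, 100)
decode = nn.Linear(100, 1000)
opt = torch.optim.Adam(list(encode.parameters()) + list(decode.parameters()), lr=1e-3)

def denoising_step(h_prev):
    noisy = h_prev + 0.1 * torch.randn_like(h_prev)   # Gaussian corruption
    recon = decode(torch.relu(encode(noisy)))
    loss = F.mse_loss(recon, h_prev)                  # reconstruct the clean activations
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```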
                        
                     
In the current setting, a generative approach is one that models the training dataset, that is, the empirical distribution, so that it can generate synthetic observations exhibiting the essential structural properties observed in the empirical distribution. The VAE and CVAE generative training strategies ultimately result in a pre-trained model with a good representation of the data, one that can generate the characteristic features of the given data well.
                     
                   
                  
                        3.3.2 Fine-tuning for classification
Fine-tuning involves tuning parameters pre-trained on large-scale data using small-scale data. We fine-tuned the encoders of the VAE and CVAE that had been pre-trained on a large amount of imbalanced data. We added a supervised neural network classifier after the encoder of the VAE and CVAE, ignoring the decoder part. Using the cross-entropy loss, we trained the model with the Adam optimizer to update its weights.
                     
                   
                
             
            
                  4. Experiments
               
                     4.1 Dataset
TCGA has collected cancer data from various platforms worldwide and has produced a dataset of immeasurable value using standardized analysis methods. These data were obtained through TCGA's data portal. In this study, we collected 1,157 kidney cancer samples from TCGA. We used data in the transcriptome profiling format, which comprises both case files and clinical information files for the samples. We then combined the clinical, expression, and case data into a single file based on the case ID and file name using Python. The resulting dataset comprised 1,157 samples with 60,483 gene expression values per patient. The frequencies of the target variables are shown in Table 1 below. To address the class imbalance problem, we applied AE-based nonlinear data transformation and generation techniques during training.
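The merging step itself is a pair of key-based joins; a sketch with pandas follows, where the file names and column names are hypothetical placeholders for the actual TCGA download layout.

```python
import pandas as pd

# Hypothetical file names; the actual layout of a TCGA download differs per query.
expr = pd.read_csv("expression.tsv", sep="\t")      # file_name + 60,483 gene columns
cases = pd.read_csv("cases.tsv", sep="\t")          # maps file_name -> case_id
clinical = pd.read_csv("clinical.tsv", sep="\t")    # case_id, gender, race, sample type

# Join expression to cases on the file name, then attach clinical data on the case ID.
merged = expr.merge(cases, on="file_name").merge(clinical, on="case_id")
merged.to_csv("kidney_cancer_merged.csv", index=False)
```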
                     
                  
The samples classified by gender were 407 women (35.2%) and 750 men (64.8%). The samples classified by race were as follows: 940 white (81.2%), 150 black or African-American (13.0%), and 17 Asian (1.5%); the race of the remaining 50 (4.3%) was not reported. Among the sample types, 1,010 cases were primary tumors (87.3%) and 139 were solid tissue normal (12.0%); the remainder had missing values.
                  
                  
                     
                     
                     
                     
                           
                           
Table 1. Frequency according to class label (tumor, gender, race)

| Class label | Frequency | Percentage (%) | Cumulative (%) |
| Primary Tumor | 1010 | 87.9 | 87.9 |
| Solid Tissue Normal | 139 | 12.1 | 100.0 |
| Total | 1149 | 100.0 |   |
| Female | 407 | 35.2 | 35.2 |
| Male | 750 | 64.8 | 100.0 |
| Total | 1157 | 100.0 |   |
| Asian | 17 | 1.5 | 1.5 |
| Black or African-American | 150 | 13.6 | 15.1 |
| White | 940 | 84.9 | 100.0 |
| Total | 1107 | 100.0 |   |
                              
                           
                        
                      
                     
                  
                
               
                     4.2 Overall analytical structure
We leveraged the integrated data obtained from TCGA to conduct classification analysis on gene expression data using traditional classification techniques and a deep learning–based MLP. Fig. 5 shows the overall framework used with the traditional classification algorithms. We calculated the interquartile range (IQR) for outlier detection, a widely used technique for finding outliers in continuously distributed data. We used the IQR in our preprocessing because it is a reasonably robust measure of variability: it is not affected by extreme values, since it uses the middle 50% of the distribution, and it is computationally cheap. First, we eliminated noise and outliers from the genetic data of kidney cancer and extracted 5,000 genes via chi-square tests. We performed 5-fold cross-validation (train 80%, test 20%) on the data with the 5,000 selected genes and used PCA and NMF as data transformation methods. Subsequently, we utilized the SMOTE algorithm to address the data imbalance for the gender, race, and sample type variables. Finally, the classification accuracy for these variables was compared and analyzed by applying classification algorithms: k-nearest neighbors (KNN), support vector machine (SVM), decision tree (DT), random forest (RF), AdaBoost (AB), naive Bayes (NB), and MLP.
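A condensed sketch of this traditional pipeline is shown below; the dummy matrix stands in for the real 1,157 × 60,483 data, SMOTE comes from the imbalanced-learn package, and the choice of 100 PCA components and an SVM classifier is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

# Dummy stand-in; the real matrix is 1,157 samples x 60,483 genes (non-negative).
X, y = np.random.rand(300, 6000), np.random.randint(0, 2, 300)

X_sel = SelectKBest(chi2, k=5000).fit_transform(X, y)    # keep 5,000 genes

for tr, te in StratifiedKFold(n_splits=5, shuffle=True).split(X_sel, y):
    pca = PCA(n_components=100).fit(X_sel[tr])           # fit the transform on the train fold only
    X_tr, X_te = pca.transform(X_sel[tr]), pca.transform(X_sel[te])
    X_bal, y_bal = SMOTE().fit_resample(X_tr, y[tr])     # over-sample the minority class
    clf = SVC().fit(X_bal, y_bal)
    print("fold accuracy:", clf.score(X_te, y[te]))
```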
                  
                  
                     
                     
                           
                           
Fig. 5. Traditional classification for gene expression of kidney cancer
                         
                     
                  
The AE-based deep learning pipeline used to predict race, gender, and sample type is shown in Fig. 6. In the AE-based techniques, we eliminated noise and outliers, and 5,000 genes were extracted via chi-square tests. We performed 5-fold cross-validation (80% for training and 20% for testing) on the selected 5,000 features and the corresponding data samples, followed by AE, VAE, SAE, and CVAE during the pre-training and training phases. Finally, we extracted 100 latent variables. We also addressed the imbalanced data problem by fine-tuning the generative pre-trained encoder on the highly imbalanced data. The encoder and MLP were combined as the classifier to predict race, gender, and sample type.
                     
                  
Compared to Fig. 5, these experiments assess two different approaches to training the classification model: fine-tuning the entire network, and embedding the AEs into the classification network by importing only the encoding layers. Unsupervised pre-training on the gene expression data and fine-tuning on specific tasks affect the classification performance. The experimental results show that the autoencoder-based approaches achieved higher classification performance than the traditional classification approaches, as reported in the next section.
                  
                  
                     
                     
                           
                           
Fig. 6. Autoencoder-based classification for gene expression data of kidney cancer
                         
                     
                  
                
               
                     4.3 Evaluation measures
To evaluate the model's classification performance, we utilized precision, recall, and F1-score computed from a confusion matrix. Precision represents the proportion of predicted positives that are truly positive, and recall represents the proportion of actual positives that are correctly predicted. The F1-score is the harmonic mean of precision and recall; it is pulled toward the smaller of the two values, so imbalanced performance is penalized. We compared the macro-average and micro-average because our target data had an imbalance problem. The macro-average is used to verify whether a classifier works well across all classes and treats every class as equally important. The micro-average is used when the sizes of the classes differ, that is, when the supports of the independently measured confusion matrices differ; it can therefore be used more effectively on datasets with class-imbalance problems. The abbreviations used in the confusion matrix are true positive (TP), false positive (FP), false negative (FN), and true negative (TN). The micro-average pools the counts over all classes; for example, if the number of class labels is 2, micro-precision, micro-recall, and micro-F1-score can be expressed as in equations (3) to (5).

Micro-precision $=\dfrac{TP_{1}+TP_{2}}{TP_{1}+TP_{2}+FP_{1}+FP_{2}}$ (3)

Micro-recall $=\dfrac{TP_{1}+TP_{2}}{TP_{1}+TP_{2}+FN_{1}+FN_{2}}$ (4)

Micro-F1-score $=2\times\dfrac{\text{Micro-precision}\times\text{Micro-recall}}{\text{Micro-precision}+\text{Micro-recall}}$ (5)
                  
Macro-averaging averages each metric over the classes and therefore does not consider the number of instances in each class. With the per-class precision and recall of each of the $n$ classes defined in (6) and (7), macro-precision, macro-recall, and macro-F1-score can be expressed using equations (6) to (10).

$\text{Precision}_{i} =\dfrac{TP_{i}}{TP_{i}+FP_{i}}$ (6)

$\text{Recall}_{i} =\dfrac{TP_{i}}{TP_{i}+FN_{i}}$ (7)

Macro-precision $=\dfrac{1}{n}\sum_{i=1}^{n}\text{Precision}_{i}$ (8)

Macro-recall $=\dfrac{1}{n}\sum_{i=1}^{n}\text{Recall}_{i}$ (9)

Macro-F1-score $=2\times\dfrac{\text{Macro-precision}\times\text{Macro-recall}}{\text{Macro-precision}+\text{Macro-recall}}$ (10)
                  
All experiments were executed on an Intel Xeon E5-2698 v4 @ 2.20 GHz with 256 GB RAM (CPU), an NVIDIA Tesla V100 with 32 GB (GPU), and the Ubuntu 18.04 operating system. We used the Scikit-Learn (42) and PyTorch (43) libraries with the Python programming language for all analyses.
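As a usage sketch, the micro- and macro-averaged metrics above can be obtained directly from Scikit-Learn; note that with average="macro" Scikit-Learn averages the per-class F1-scores, which can differ slightly from the harmonic-mean form in (10).

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 2]   # toy labels for a 3-class task
y_pred = [0, 1, 1, 1, 2, 2]

micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")   # pools TP/FP/FN counts over all classes
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")   # unweighted mean over per-class scores
print(micro_p, micro_r, micro_f1)
print(macro_p, macro_r, macro_f1)
```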
                  
                
             
            
                  5. Results
This section extensively evaluates our approach and compares it with other unsupervised feature extraction techniques followed by over-sampling and state-of-the-art classifiers. We also report an ablation study conducted to explore the 20 most significant genes for each clinical variable.
                  
               
Tables 2-4 show the performance comparison among all methods according to gender, race, and sample type, respectively. The classification performance measured by the micro-average is higher than that measured by the macro-average. The AE-based methods achieved higher performance than the conventional feature extraction methods, suggesting that AE-based methods better capture the complexity of cancer and produce more meaningful features. Because of their neural network structure, we used only an MLP classifier on the features extracted by the AE-based methods, and we applied no sampling to them. Even when the data imbalance problem was addressed for the traditional algorithms by SMOTE over-sampling, the generative AE-based methods still achieved higher performance.
                  
               
Table 2 presents the classification results for gender. The results show that VAE achieved a macro-F1-score of 0.958 and a micro-F1-score of 0.962, indicating higher classification performance than the other methods. It offers results comparable with the other AE-based methods and improves on the best conventional result, PCA+SVM with SMOTE over-sampling, by 0.021 macro- and 0.020 micro-F1-score, respectively. A gender disparity exists in the incidence of kidney carcinomas, with a higher incidence reported in men (44). Men are at a higher risk of developing kidney cancer and usually have more aggressive disease at the time of diagnosis. Females generally present more favorable kidney cancer histology and have better oncological outcomes than males (45). Extracting valuable features by VAE or other AE-based methods may provide deeper insight into gender-related differences in kidney cancer therapy.
                  
               
               
                  
                  
                  
                  
                        
                        
Table 2. Classification performance evaluation according to gender

| Feature Extraction | Sampling (SMOTE) | Classifier | Micro-Precision | Micro-Recall | Micro-F1-score | Macro-Precision | Macro-Recall | Macro-F1-score |
| AE | FALSE | MLP | 0.953 | 0.952 | 0.953 | 0.945 | 0.952 | 0.948 |
| VAE | FALSE | MLP | 0.963 | 0.962 | 0.962 | 0.958 | 0.960 | 0.958 |
| CAE | FALSE | MLP | 0.950 | 0.950 | 0.950 | 0.945 | 0.946 | 0.945 |
| CVAE | FALSE | MLP | 0.958 | 0.957 | 0.957 | 0.952 | 0.955 | 0.953 |
| NMF | FALSE | AB | 0.908 | 0.907 | 0.907 | 0.896 | 0.901 | 0.898 |
| NMF | FALSE | DT | 0.894 | 0.893 | 0.893 | 0.884 | 0.881 | 0.882 |
| NMF | FALSE | KNN | 0.657 | 0.671 | 0.659 | 0.630 | 0.615 | 0.616 |
| NMF | FALSE | MLP | 0.835 | 0.836 | 0.835 | 0.822 | 0.814 | 0.818 |
| NMF | FALSE | NB | 0.804 | 0.798 | 0.784 | 0.808 | 0.740 | 0.753 |
| NMF | FALSE | RF | 0.910 | 0.910 | 0.909 | 0.909 | 0.892 | 0.899 |
| NMF | FALSE | SVM | 0.781 | 0.777 | 0.759 | 0.785 | 0.710 | 0.722 |
| NMF | TRUE | AB | 0.909 | 0.908 | 0.908 | 0.897 | 0.903 | 0.900 |
| NMF | TRUE | DT | 0.868 | 0.867 | 0.867 | 0.854 | 0.856 | 0.854 |
| NMF | TRUE | KNN | 0.647 | 0.603 | 0.612 | 0.603 | 0.613 | 0.594 |
| NMF | TRUE | MLP | 0.847 | 0.847 | 0.847 | 0.834 | 0.829 | 0.831 |
| NMF | TRUE | NB | 0.815 | 0.813 | 0.805 | 0.814 | 0.768 | 0.780 |
| NMF | TRUE | RF | 0.916 | 0.915 | 0.915 | 0.908 | 0.908 | 0.907 |
| NMF | TRUE | SVM | 0.777 | 0.754 | 0.758 | 0.743 | 0.759 | 0.743 |
| PCA | FALSE | AB | 0.867 | 0.868 | 0.866 | 0.860 | 0.846 | 0.852 |
| PCA | FALSE | DT | 0.736 | 0.737 | 0.736 | 0.711 | 0.709 | 0.710 |
| PCA | FALSE | KNN | 0.737 | 0.744 | 0.732 | 0.725 | 0.689 | 0.697 |
| PCA | FALSE | MLP | 0.943 | 0.943 | 0.943 | 0.939 | 0.936 | 0.937 |
| PCA | FALSE | NB | 0.657 | 0.677 | 0.639 | 0.640 | 0.585 | 0.578 |
| PCA | FALSE | RF | 0.845 | 0.833 | 0.822 | 0.858 | 0.778 | 0.797 |
| PCA | FALSE | SVM | 0.940 | 0.939 | 0.939 | 0.936 | 0.932 | 0.933 |
| PCA | TRUE | AB | 0.867 | 0.864 | 0.865 | 0.850 | 0.858 | 0.853 |
| PCA | TRUE | DT | 0.762 | 0.759 | 0.759 | 0.738 | 0.739 | 0.737 |
| PCA | TRUE | KNN | 0.736 | 0.692 | 0.699 | 0.692 | 0.710 | 0.685 |
| PCA | TRUE | MLP | 0.940 | 0.940 | 0.940 | 0.934 | 0.934 | 0.934 |
| PCA | TRUE | NB | 0.756 | 0.764 | 0.748 | 0.746 | 0.708 | 0.712 |
| PCA | TRUE | RF | 0.857 | 0.858 | 0.856 | 0.851 | 0.834 | 0.841 |
| PCA | TRUE | SVM | 0.943 | 0.942 | 0.942 | 0.936 | 0.938 | 0.937 |
                           
                        
                     
                   
                  
               
Table 3 shows that when the target variable is race, with the labels white, black or African-American, and Asian, the class imbalance is very severe. Our data included 940 (81.2%) white, 150 (13.0%) African-American, and 17 (1.5%) Asian samples. Clearly, race prediction is more challenging than the other clinical prediction tasks, showing much lower macro-averaged performance. The results show that CVAE achieved a macro-F1-score of 0.763 and a micro-F1-score of 0.959, indicating higher classification performance than the other methods. It offers a micro-F1-score comparable with the other AE-based methods and improves on the best conventional result, PCA+SVM with SMOTE over-sampling, by 0.121 macro- and 0.018 micro-F1-score, respectively. It also improves on the best macro-F1-score among the other AE-based methods by 0.076. We can conclude that CVAE captures the complexity of cancer well and works better on more complex tasks than the other AE-based methods. Surveillance and epidemiology data indicate that kidney cancer incidence and mortality rates are higher among African-American patients than among white patients (46).
                  
               
White and Asian patients (ages 63.9 and 62.6 years, respectively) had a slightly older age of onset than Black and Native American patients (ages 60.7 and 60.3 years) (47). Although feature extraction for racial information is challenging, we can achieve a macro-F1-score higher than 70% using the CVAE method.
                  
               
Table 4 presents the breakdown of results by sample type. The labels for the sample type were 1,010 primary tumors (87.3%) and 139 solid tissue normal (12%); the remainder had missing values. The results show that all AE-based methods achieved comparable results, with a macro-F1-score of 0.996 and a micro-F1-score of 0.998, a higher classification performance than the other methods. All AE-based methods improve on the
               
               
                  
                  
                  
                  
                        
                        
Table 3. Classification performance evaluation according to race

| Feature Extraction | Sampling (SMOTE) | Classifier | Micro-Precision | Micro-Recall | Micro-F1-score | Macro-Precision | Macro-Recall | Macro-F1-score |
| AE | FALSE | MLP | 0.953 | 0.961 | 0.956 | 0.743 | 0.660 | 0.687 |
| VAE | FALSE | MLP | 0.956 | 0.964 | 0.958 | 0.767 | 0.663 | 0.685 |
| CAE | FALSE | MLP | 0.955 | 0.960 | 0.956 | 0.720 | 0.662 | 0.678 |
| CVAE | FALSE | MLP | 0.961 | 0.959 | 0.958 | 0.832 | 0.753 | 0.763 |
| NMF | FALSE | AB | 0.857 | 0.849 | 0.848 | 0.515 | 0.479 | 0.489 |
| NMF | FALSE | DT | 0.874 | 0.880 | 0.877 | 0.530 | 0.517 | 0.523 |
| NMF | FALSE | KNN | 0.790 | 0.844 | 0.809 | 0.480 | 0.402 | 0.416 |
| NMF | FALSE | MLP | 0.845 | 0.873 | 0.852 | 0.505 | 0.456 | 0.468 |
| NMF | FALSE | NB | 0.807 | 0.589 | 0.657 | 0.384 | 0.409 | 0.355 |
| NMF | FALSE | RF | 0.889 | 0.914 | 0.889 | 0.576 | 0.508 | 0.516 |
| NMF | FALSE | SVM | 0.772 | 0.867 | 0.812 | 0.373 | 0.385 | 0.369 |
| NMF | TRUE | AB | 0.866 | 0.845 | 0.853 | 0.512 | 0.507 | 0.506 |
| NMF | TRUE | DT | 0.871 | 0.842 | 0.855 | 0.508 | 0.515 | 0.509 |
| NMF | TRUE | KNN | 0.814 | 0.622 | 0.685 | 0.421 | 0.535 | 0.405 |
| NMF | TRUE | MLP | 0.847 | 0.857 | 0.851 | 0.608 | 0.518 | 0.540 |
| NMF | TRUE | NB | 0.800 | 0.594 | 0.656 | 0.376 | 0.413 | 0.349 |
| NMF | TRUE | RF | 0.901 | 0.921 | 0.905 | 0.586 | 0.537 | 0.550 |
| NMF | TRUE | SVM | 0.810 | 0.601 | 0.671 | 0.416 | 0.475 | 0.388 |
| PCA | FALSE | AB | 0.819 | 0.852 | 0.827 | 0.468 | 0.415 | 0.424 |
| PCA | FALSE | DT | 0.822 | 0.836 | 0.828 | 0.457 | 0.433 | 0.442 |
| PCA | FALSE | KNN | 0.826 | 0.858 | 0.816 | 0.510 | 0.378 | 0.387 |
| PCA | FALSE | MLP | 0.933 | 0.946 | 0.937 | 0.626 | 0.590 | 0.604 |
| PCA | FALSE | NB | 0.797 | 0.834 | 0.809 | 0.456 | 0.402 | 0.414 |
| PCA | FALSE | RF | 0.847 | 0.860 | 0.807 | 0.575 | 0.364 | 0.364 |
| PCA | FALSE | SVM | 0.935 | 0.947 | 0.939 | 0.629 | 0.594 | 0.608 |
| PCA | TRUE | AB | 0.847 | 0.849 | 0.846 | 0.506 | 0.488 | 0.492 |
| PCA | TRUE | DT | 0.825 | 0.799 | 0.811 | 0.448 | 0.472 | 0.456 |
| PCA | TRUE | KNN | 0.817 | 0.669 | 0.722 | 0.412 | 0.478 | 0.408 |
| PCA | TRUE | MLP | 0.940 | 0.950 | 0.945 | 0.625 | 0.610 | 0.616 |
| PCA | TRUE | NB | 0.891 | 0.883 | 0.885 | 0.543 | 0.533 | 0.534 |
| PCA | TRUE | RF | 0.886 | 0.900 | 0.878 | 0.601 | 0.471 | 0.503 |
| PCA | TRUE | SVM | 0.938 | 0.948 | 0.941 | 0.691 | 0.622 | 0.642 |
                           
                        
                     
                   
                  
               
best conventional results, namely PCA+KNN and PCA+MLP without over-sampling and PCA+MLP with SMOTE over-sampling, by 0.002 macro- and 0.001 micro-F1-score, respectively. Survival in patients with kidney cancer can be correlated with the expression of various genes based solely on the expression profile of the primary kidney tumor (48). Compared to the other tasks, the extracted features are more easily distinguished, and predicting sample type is much easier. For sample types, classifiers based on both traditional techniques and deep learning performed well, with the AE-based pre-training algorithms performing slightly better overall than the other compared methods. There are several methods that predict cancer subtypes or sample types using deep learning techniques on gene expression data (49-51). To the best of our knowledge, methods identifying kidney cancer biomarkers by combining AE-based methods and model interpretation techniques are still lacking.
                  
               
In general, unsupervised learning algorithms applied to gene expression data extract both biological and technical signals present in the input samples. It is best to compress gene expression data using several algorithms and many different latent-space dimensionalities. These compressed gene expression features represent important biological signals, including gender, race, and the presence of a tumor. Through several experiments tracking low-dimensional gene expression representations and supervised learning performance, we showed that optimal biological features are learned using a variety of latent-space dimensionalities and different compression algorithms.
               
               
                  
                  
                  
                  
                        
                        
Table 4. Classification performance evaluation according to tumor type

| Feature Extraction | Sampling (SMOTE) | Classifier | Micro-Precision | Micro-Recall | Micro-F1-score | Macro-Precision | Macro-Recall | Macro-F1-score |
| AE | FALSE | MLP | 0.998 | 0.998 | 0.998 | 0.996 | 0.996 | 0.996 |
| VAE | FALSE | MLP | 0.998 | 0.998 | 0.998 | 0.996 | 0.996 | 0.996 |
| CAE | FALSE | MLP | 0.998 | 0.998 | 0.998 | 0.996 | 0.996 | 0.996 |
| CVAE | FALSE | MLP | 0.998 | 0.998 | 0.998 | 0.996 | 0.996 | 0.996 |
| NMF | FALSE | AB | 0.989 | 0.989 | 0.989 | 0.970 | 0.978 | 0.974 |
| NMF | FALSE | DT | 0.984 | 0.983 | 0.984 | 0.951 | 0.975 | 0.962 |
| NMF | FALSE | KNN | 0.976 | 0.974 | 0.974 | 0.929 | 0.957 | 0.941 |
| NMF | FALSE | MLP | 0.991 | 0.990 | 0.990 | 0.983 | 0.973 | 0.977 |
| NMF | FALSE | NB | 0.995 | 0.995 | 0.995 | 0.982 | 0.994 | 0.988 |
| NMF | FALSE | RF | 0.995 | 0.995 | 0.995 | 0.991 | 0.984 | 0.988 |
| NMF | FALSE | SVM | 0.964 | 0.963 | 0.960 | 0.963 | 0.863 | 0.897 |
| NMF | TRUE | AB | 0.992 | 0.991 | 0.991 | 0.974 | 0.986 | 0.980 |
| NMF | TRUE | DT | 0.977 | 0.975 | 0.975 | 0.929 | 0.961 | 0.943 |
| NMF | TRUE | KNN | 0.959 | 0.943 | 0.948 | 0.845 | 0.955 | 0.887 |
| NMF | TRUE | MLP | 0.992 | 0.992 | 0.992 | 0.984 | 0.980 | 0.981 |
| NMF | TRUE | NB | 0.995 | 0.995 | 0.995 | 0.985 | 0.991 | 0.988 |
| NMF | TRUE | RF | 0.994 | 0.994 | 0.994 | 0.987 | 0.984 | 0.986 |
| NMF | TRUE | SVM | 0.981 | 0.978 | 0.979 | 0.931 | 0.978 | 0.952 |
| PCA | FALSE | AB | 0.993 | 0.993 | 0.993 | 0.987 | 0.980 | 0.983 |
| PCA | FALSE | DT | 0.980 | 0.979 | 0.979 | 0.962 | 0.942 | 0.950 |
| PCA | FALSE | KNN | 0.997 | 0.997 | 0.997 | 0.993 | 0.995 | 0.994 |
| PCA | FALSE | MLP | 0.997 | 0.997 | 0.997 | 0.992 | 0.995 | 0.994 |
| PCA | FALSE | NB | 0.985 | 0.984 | 0.984 | 0.961 | 0.966 | 0.964 |
| PCA | FALSE | RF | 0.987 | 0.987 | 0.987 | 0.989 | 0.949 | 0.968 |
| PCA | FALSE | SVM | 0.996 | 0.996 | 0.996 | 0.988 | 0.991 | 0.990 |
| PCA | TRUE | AB | 0.991 | 0.991 | 0.991 | 0.980 | 0.979 | 0.979 |
| PCA | TRUE | DT | 0.986 | 0.985 | 0.985 | 0.968 | 0.963 | 0.965 |
| PCA | TRUE | KNN | 0.995 | 0.995 | 0.995 | 0.983 | 0.994 | 0.988 |
| PCA | TRUE | MLP | 0.997 | 0.997 | 0.997 | 0.992 | 0.995 | 0.994 |
| PCA | TRUE | NB | 0.969 | 0.967 | 0.968 | 0.914 | 0.941 | 0.925 |
| PCA | TRUE | RF | 0.993 | 0.993 | 0.993 | 0.993 | 0.974 | 0.983 |
| PCA | TRUE | SVM | 0.996 | 0.996 | 0.996 | 0.988 | 0.991 | 0.990 |
                           
                        
                     
                   
                  
               
             
            
                  6. Conclusions
We combined kidney cancer clinical data and gene expression data collected from the TCGA database to extract genes significant for gender, race, and sample type, and we conducted a classification analysis based on these data. We compared traditional classification techniques against deep learning–based approaches built on pre-training processes such as AE, VAE, SAE, and CVAE. For feature extraction of significant genes in the traditional pipeline, PCA and NMF were employed, while in our proposed deep learning–based techniques, important genes were extracted through pre-training (AE, VAE, SAE, and CVAE) followed by fine-tuning. As a result, the deep learning–based gene extraction methods performed better.
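As a sketch of the traditional branch of this comparison (again with a placeholder matrix and dimensionality, not the study's settings), the expression data can be reduced with PCA or NMF and a standard scikit-learn classifier cross-validated on the reduced features:

import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder (samples x genes) matrix; NMF requires non-negative input,
# which raw expression values already satisfy.
rng = np.random.default_rng(0)
X = rng.random((600, 2000))
y = rng.integers(0, 2, 600)

reducers = [("PCA", PCA(n_components=100)),
            ("NMF", NMF(n_components=100, init="nndsvda", max_iter=500))]
for name, reducer in reducers:
    Z = reducer.fit_transform(X)       # unsupervised feature extraction
    scores = cross_val_score(RandomForestClassifier(random_state=0), Z, y, cv=5)
    print(f"{name}: mean 5-fold CV accuracy = {scores.mean():.3f}")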
                  
               
There are several methods that predict cancer subtypes or sample types by applying deep learning techniques to gene expression data. To the best of our knowledge, however, there is a lack of methods for identifying kidney cancer biomarkers that combine AE-based models with model interpretation techniques. As shown in Tables 2-4, extracting race-related features is the most challenging task, whereas sample-type feature extraction is much easier than the other tasks. On the challenging tasks, CVAE outperforms the other methods.
                  
               
Furthermore, we compared micro- and macro-averaged measures according to the number of class labels of the target variables; the micro-averaged measures exhibited better performance. In future work, experimental verification can confirm the functions of the extracted genes and help predict the prognosis of kidney cancer patients. We will also consider other data types, such as clinical records, RNA expression, and DNA methylation.
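To make the micro versus macro distinction concrete, the toy example below (labels purely illustrative) computes both averages with scikit-learn. Micro-averaging pools all samples before scoring, so the majority class dominates; macro-averaging scores each class first and takes an unweighted mean, so rare classes count equally, which is why macro values tend to be lower on imbalanced targets such as race.

from sklearn.metrics import precision_recall_fscore_support

# Imbalanced three-class toy labels (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]

micro = precision_recall_fscore_support(y_true, y_pred, average="micro")
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
print("micro (P, R, F1):", micro[:3])  # pooled over all samples
print("macro (P, R, F1):", macro[:3])  # unweighted mean over classes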
               
             
          
         
            
                  Acknowledgements
               
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, under Grants No. 2019R1F1A1051569, No. 2020R1I1A1A01065199, and No. 2020R1A6A1A12047945.
                  
               
             
            
                  
                     References
                  
                     
                        
                        V. M. G. Olivares, L. M. G. Torres, G. H. Cuartas, M. C. N. De la Hoz, 2019, Immunohistochemical
                           profile of renal cell tumours, Revista Espanola De Patologia, Vol. 52, No. 4, pp.
                           214-221

 
                     
                        
                        J. J. Hsieh, M. P. Purdue, S. Signoretti, C. Swanton, L. Albiges, M. Schmidinger,
                           D. Y. Heng, J. Larkin, V. Ficarra, 2017, Renal cell carcinoma, Nat. Rev. Dis. Primers,
                           Vol. 3, No. 17010, pp. 1-19

 
                     
                        
W. M. Linehan, M. M. Walther, B. Zbar, 2003, The genetic basis of cancer of the kidney, J. Urol., Vol. 170, pp. 2163-2172

 
                     
                        
                        W. M. Linehan, B. Zbar, 2004, Focus on kidney cancer, Cancer Cell, Vol. 6, No. 3,
                           pp. 223-228

 
                     
                        
                         Cancer Genome Atlas Research Network, 2016, Comprehensive molecular characterization
                           of papillary renal-cell carcinoma, N. Engl. J. Med., Vol. 374, No. 2, pp. 135-145

 
                     
                        
                        L. A. Torre, B. Trabert, C. E. DeSantis, K. D. Miller, G. Samimi, C. D. Runowicz,
                           M. M. Gaudet, A. Jemal, R. L. Siegel, 2018, Ovarian cancer statistics, 2018, CA Cancer
                           J. Clin., Vol. 68, pp. 284-296

 
                     
                        
                        A. J. Peired, R. Campi, M. L. Angelotti, G. Antonelli, C. Conte, E. Lazzeri, F. Becherucci,
                           L. Calistri, S. Serni, P. Romagnani, 2021, Sex and Gender Differences in Kidney Cancer:
                           Clinical and Experimental Evidence, Cancers, Vol. 13, No. 18, pp. 4588

 
                     
                        
                        Y. Zhan, C. Pan, Y. Zhao, J. Li, B. Wu, S. Bai, 2021, Systematic Analysis of the Global,
                           Regional and National Burden of Kidney Cancer from 1990 to 2017: Results from the
                           Global Burden of Disease Study 2017, Eur. Urol. Focus, Vol. 8, No. 1, pp. 302-319

 
                     
                        
                        N. Chowdhury, C. G. Drake, 2020, Kidney cancer: an overview of current therapeutic
                           approaches, Urol. Clin., Vol. 47, No. 4, pp. 419-431

 
                     
                        
D. A. Siegel, S. J. Henley, J. Li, L. A. Pollack, E. A. Van Dyne, A. White, 2017, Rates and trends of pediatric acute lymphoblastic leukemia—United States, 2001–2014, Morb. Mortal. Wkly. Rep., Vol. 66, No. 36, pp. 950-954

 
                     
                        
A. F. Olshan, Y. M. Kuo, A. M. Meyer, M. E. Nielsen, M. P. Purdue, W. K. Rathmell, 2013, Racial difference in histologic subtype of renal cell carcinoma, Cancer Med., Vol. 2, No. 5, pp. 744-749

 
                     
                        
                        L. Lipworth, A. K. Morgans, T. L. Edwards, D. A. Barocas, S. S. Chang, S. D. Herrell,
                           D. F. Penson, M. J. Resnick, J. A. Smith, P. E. Clark, 2016, Renal cell cancer histological
                           subtype distribution differs by race and sex, BJU Int., Vol. 117, No. 2, pp. 260-265

 
                     
                        
                        T. R. Rebbeck, 2018, Prostate cancer disparities by race and ethnicity: from nucleotide
                           to neighborhood, Cold Spring Harbor Persp. Med., Vol. 8, No. 9, pp. a030387

 
                     
                        
S. J. O. Nomura, Y. T. Hwang, S. L. Gomez, T. T. Fung, S. L. Yeh, C. Dash, L. Allen, S. Philips, L. Hilakivi-Clarke, Y. L. Zheng, J. H. Y. Wang, 2017, Dietary intake of soy and cruciferous vegetables and treatment-related symptoms in Chinese-American and non-Hispanic White breast cancer survivors, Breast Cancer Res. Treat., Vol. 168, No. 2, pp. 467-479

 
                     
                        
P. Mamoshina, K. Kochetov, E. Putin, F. Cortese, A. Aliper, W. S. Lee, S. M. Ahn, L. Uhn, N. Skjodt, O. Kovalchuk, M. Scheibye-Knudsen, 2018, Population specific biomarkers of human aging: a big data study using South Korean, Canadian, and Eastern European patient populations, J. Gerontol. Ser. A, Vol. 73, No. 11, pp. 1482-1490

 
                     
                        
                        H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y.
                           Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes, Q. Morris, Y. Barash, A. R. Krainer,
                           N. Jojic, S. W. Scherer, B. J. Blencowe, B. J. Frey, 2015, RNA splicing. The human
                           splicing code reveals new insights into the genetic determinants of disease, Science,
                           Vol. 347, No. 6218, pp. 1-20

 
                     
                        
                        M. Amgad, H. Elfandy, H. Hussein, L. A. Atteya, M. A. T. Elsebaie, L. S. A. Elnasr,
                           R. A. Sakr, H. S. E. Salem, A. F. Ismail, A. M. Saad, J. Ahmed, M. A. T. Elsebaie,
                           M. Rahman, I. A. Ruhban, N. M. Elgazar, Y. Alagha, M. H. Osman, A. M. Alhusseiny,
                           M. M. Khalaf, A. F. Younes, A. Abdulkarim, D. M. Younes, A. M. Gadallah, A. M. Elkashash,
                           S. Y. Fala, B. M. Zaki, J. Beezley, D. R. Chittajallu, D. Manthey, D. A. Gutman, L.
                           A. D. Cooper, 2019, Structured crowdsourcing enables convolutional segmentation of
                           histology images, Bioinformatics, Vol. 35, No. 18, pp. 3461-3467

 
                     
                        
                        V. M. G. Olivares, L. M. G. Torres, G. H. Cuartas, M. C. N. De la Hoz, 2019, Immunohistochemical
                           profile of renal cell tumours, Rev. Esp. Patol., Vol. 52, No. 4, pp. 214-221

 
                     
                        
National Cancer Center. Available online: https://ncc.re.kr/index (accessed on 17 August 2021)

 
                     
                        
                        B. H. Chi, I. H. Chang, 2018, The overdiagnosis of kidney cancer in Koreans and the
                           active surveillance on small renal mass, Korean J. Urol. Oncol., Vol. 16, No. 1, pp.
                           15-24

 
                     
                        
                        A. M. Ali, H. Zhuang, A. Ibrahim, O. Rehman, M. Huang, A. Wu, 2018, A machine learning
                           approach for the classification of kidney cancer subtypes using miRNA genome data,
                           Appl. Sci., Vol. 8, No. 12, pp. 1-14

 
                     
                        
                        H. M. Kim, S. J. Lee, S. J. Park, I. Y. Choi, S. H. Hong, 2021, Machine learning approach
                           to predict the probability of recurrence of renal cell carcinoma after surgery: Prediction
                           model development study, JMIR Med. Inform., Vol. 9, No. 3, pp. e25635

 
                     
                        
Genomic Data Commons. Available online: https://portal.gdc.cancer.gov (accessed on 17 August 2021)

 
                     
                        
                        B. J. Kim, S. H. Kim, 2018, Prediction of inherited genomic susceptibility to 20 common
                           cancer types by a supervised machine-learning method, PNAS USA, Vol. 115, No. 6, pp.
                           1322-1327

 
                     
                        
                        O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, D. Botstein, 2003, A Bayesian
                           framework for combining heterogeneous data sources for gene function prediction (in
                           Saccharomyces cerevisiae), PNAS USA, Vol. 100, No. 14, pp. 8348-8353

 
                     
                        
                        N. Hadjiyski, 2020, Kidney cancer staging: Deep learning neural network based approach,
                           2020 International Conference on E-Health and Bioengineering (EHB 2020)

 
                     
                        
H. S. Shon, E. Batbaatar, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of kidney cancer data using cost-sensitive hybrid deep learning approach, Symmetry, Vol. 12, No. 1, pp. 154

 
                     
                        
N. Simidjievski, C. Bodnar, I. Tariq, P. Scherer, H. A. Terre, Z. Shams, M. Jamnik, P. Liò, 2019, Variational autoencoders for cancer data integration: Design principles and computational practice, Front. Genet., Vol. 10, No. 1205

 
                     
                        
P. Baldi, K. Hornik, 1989, Neural networks and principal component analysis: Learning from examples without local minima, Neural Netw., Vol. 2, pp. 53-58

 
                     
                        
                        M. Mohri, A. Rostamizadeh, A. Talwalkar, 2012, Foundations of Machine Learning, MIT
                           Press

 
                     
                        
P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. A. Manzagol, 2010, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res., Vol. 11, pp. 3371-3408

 
                     
                        
                        M. A. Ranzato, C. S. Poultney, S. Chopra, Y. LeCun, 2007, Efficient learning of sparse
                           representations with an energy-based model, Adv. Neural Inf. Process. Syst., Vol.
                           19, pp. 1137-1144

 
                     
                        
                        D. P. Kingma, M. Welling, 2014, Auto-encoding variational bayes, Proceedings of the
                           2nd International Conference on Learning Representations

 
                     
                        
                        Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, L. Carin, 2016, Variational autoencoder
                           for deep learning of images, labels and captions, 30th Conference on Neural Information
                           Processing Systems (NIPS 2016)

 
                     
                        
                        K. Simonyan, A. Zisserman, 2015, Very deep convolutional networks for large-scale
                           image recognition, The 3rd International Conference on Learning Representations (ICLR)

 
                     
                        
                        L. Le, A. Patterson, M. White, 2018, Supervised autoencoders: Improving generalization
                           performance with unsupervised regularizers, 32nd Conference on Neural Information
                           Processing Systems (NIPS 2018)

 
                     
                        
                        M. Mohri, A. Rostamizadeh, D. Storcheus, 2015, Generalization bounds for supervised
                           dimensionality reduction, JMLR: Workshop and Conf. Proc., Vol. 44, pp. 226-241

 
                     
                        
                        L. A. Gottlieb, A. Kontorovich, R. Krauthgamer, 2016, Adaptive metric dimensionality
                           reduction, Theor. Comput. Sci., Vol. 620, pp. 105-118

 
                     
                        
                        K. Sohn, H. Lee, X. Yan, 2015, Learning structured output representation using deep
                           conditional generative models, Proceedings of the 28th International Conference on
                           Neural Information Processing Systems, Vol. 2, pp. 3483-3491

 
                     
                        
                        S. Belharbi, R. Hérault, C. Chatelain, S. Adam, 2018, Deep neural networks regularization
                           for structured output prediction, Neurocomputing, Vol. 281, pp. 169-177

 
                     
                        
Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014, Deep generative stochastic networks trainable by backprop, Proceedings of the 31st International Conference on Machine Learning, Vol. 32, pp. 226-234

 
                     
                        
                        F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
                           P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, 2011, Scikit-learn: Machine
                           learning in Python, J. Mach. Learn. Res., Vol. 12, pp. 2825-2830

 
                     
                        
                        A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
                           N. Gimelshein, L. Antiga, A. Desmaison, 2019, Pytorch: An imperative style, high-performance
                           deep learning library, Proceedings of the 33rd International Conference on Neural
                           Information Processing Systems, pp. 8026-8037

 
                     
                        
                        I. Lucca, T. Klatte, H. Fajkovic, M. De Martino, S. F. Shariat, 2015, Gender differences
                           in incidence and outcomes of urothelial and kidney cancer, Nat. Rev. Urol., Vol. 12,
                           No. 12, pp. 585-592

 
                     
                        
                        M. Mancini, M. Righetto, G. Baggio, 2020, Gender-related approach to kidney cancer
                           management: Moving forward, Int. J. Mol. Sci., Vol. 21, No. 9, pp. 3378

 
                     
                        
                        D. Hepps, A. Chernoff, 2006, Risk of renal insufficiency in African-Americans after
                           radical nephrectomy for kidney cancer, Urologic Oncology: Seminars and Original Investigations,
                           Vol. 24, No. 5, pp. 391-395

 
                     
                        
                        B. Shuch, S. Vourganti, C. J. Ricketts, L. Middleton, J. Peterson, M. J. Merino, A.
                           R. Metwalli, R. Srinivasan, W. M. Linehan, 2014, Defining early-onset kidney cancer:
                           implications for germline and somatic mutation testing and clinical management, J.
                           Clin. Oncol., Vol. 32, No. 5, pp. 431-437

 
                     
                        
J. R. Vasselli, J. H. Shih, S. R. Iyengar, J. Maranchie, J. Riss, R. Worrell, C. Torres-Cabala, R. Tabios, A. Mariotti, R. Stearman, M. Merino, W. M. Linehan, 2003, Predicting survival in patients with metastatic kidney cancer by gene-expression profiling in the primary tumor, PNAS USA, Vol. 100, No. 12, pp. 6958-6963

 
                     
                        
M. Mostavi, Y. C. Chiu, Y. Huang, Y. Chen, 2020, Convolutional neural network models for cancer type prediction based on gene expression, BMC Med. Genom., Vol. 13, No. 44, pp. 1-13

 
                     
                        
                        N. E. M. Khalifa, M. H. N. Taha, D. E. Ali, A. Slowik, A. E. Hassanien, 2020, Artificial
                           intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized
                           deep learning approach, IEEE Access, Vol. 8, pp. 22874-22883

 
                     
                        
R. Tabares-Soto, S. Orozco-Arias, V. Romero-Cano, V. S. Bucheli, J. L. Rodríguez-Sotelo, C. F. Jiménez-Varón, 2020, A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data, PeerJ Comput. Sci., Vol. 6, No. e270

 
                   
                
             
About the Authors
             
             
             
            
            2010 : Ph.D in Computer Science, Chungbuk National University, Korea.
               
            
            2012 to present : Visiting professor in Medical Research Institute, School of Medicine,
               Chungbuk National University, Korea.
            
            
            2019 : Ph.D in Computer Science, Chungbuk National University, Korea
               
            
            2021 to present : Researcher in Electronics and Telecommunications Research Institute,
               Korea.
            
            
            1987 : Ph.D in Biomedical Engineering,  University of Southern California, U. S. A.
               
            
            1988 to present : Professor in Department of Biomedical Engineering, School of Medicine,
               Chungbuk National University, Korea.
            
            
2000 : Ph.D in Industrial Engineering, Dongguk University, Korea
               
            
            2021 to present : Research professor in Institute for Trauma Research, College of
               Medicine, Korea University, Korea.
            
            
2004 : Ph.D, Information and Communications University, Korea
               
            
            2004 to present : Professor in College of Electrical and Computer Engineering, Chungbuk
               National University, Korea.
            
            
            2001 : Ph.D in Biomedical Engineering,  Chungbuk National University, Korea.
               
            
            2005 to present : Professor in Department of Biomedical Engineering, School of Medicine,
               Chungbuk National University, Korea.