Cancer-Classifier-ML-Model/CancerClassifier.Rmd at main · isha2106/Cancer-Classifier-ML-Model · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
---
title: "Dec"
author: "Isha Parikh"
date: "March 31, 2024"
output:
  pdf_document:
    latex_engine: xelatex
  html_document:
    df_print: paged
  word_document: default
---

# Data Understanding
This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set, it is a random extraction of gene expressions of patients having different types of tumor: BRCA (Breast Cancer), KIRC (Kidney Renal Clear Cell Carcinoma), COAD (Colon Adenocarcinoma), LUAD(Lung Adenocarcinoma) and PRAD(Prostate Adenocarcinoma). Samples (instances) are stored row-wise. Variables (attributes) of each sample are RNA-Seq gene expression levels measured by illumina HiSeq platform. Some of the data points are removed(missing values) as per the the project requirement.


## Data Acquisition
Loading and exploring the data and merging the actual data with the labels for better data understanding.
```{r loadData}
# Load the data
data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSx3gPbBa0-_s2mPqlXij2sTVdoF5hcp7zb5NIAnaeMuwlLZ5PXmijMaOWSk0nVDw/pub?gid=1571246026&single=true&output=csv")
#str(data)

# Load the labels
label_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQvYplBm1JbPw-VSFB79N01SeR4lWTCh9yl3WUiFtiqHVenbYTijM5ryqJ5ELzAew/pub?gid=1180405259&single=true&output=csv")
#str(label_data)

# Merge the data with the labels
cancer_data <- merge(data, label_data, by = "X")

# Explore the data
head(cancer_data)
str(cancer_data)
summary(cancer_data)
```

There are a total of 801 instances(patients)(rows/observations) and 20531 features (gene expression)(columns/variables) or 257 different genes. Here our target variable will be the cancer type (Class). Also, here we do not use X which is an additional unused feature. Thus, we drop it.

```{r}
if (!require("dplyr", character.only = TRUE)) {
  install.packages("dplyr", dependencies = TRUE)
}
library(dplyr)
```

```{r}
# Dropping column X (samplename)
cancer_data <- select(cancer_data, -X)
```


## Data Exploration

### Exploratory data plots
Plotting a histogram to understand the number of cases for each type of cancer

```{r echo=FALSE}
# Load the library
if (!require("ggplot2", character.only = TRUE)) {
  install.packages("ggplot2", dependencies = TRUE)
}
library(ggplot2)
```

```{r histogramPlot}
# Plot a histogram
ggplot(cancer_data, aes(x = Class)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Number of cases for different types of Cancer", x = "Class", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The graph indicates that there is significant class imbalance in the dataset. BRCA has the majority of samples. COAD has the least which may impact the training of the machine learning models as they might become biased towards overrepresented classes. KIRC, LUAD and PRAD cancers have similar amount of cases. For machine learning, such imbalances makes it important for us to use strategies like resampling or weighted class handling to ensure that the model generalizes well across all the cancer types.

### Identification of missing values
I check if there are any missing values in the data, if so I impute the mean of each gene column in it's place which does not affect the data.
```{r missingVals}
# Checking for missing values
sum(is.na(cancer_data))
```

### Data imputation of missing data
```{r}
# Impute missing values with mean of each column
cancer_data[] <- lapply(cancer_data, function(x) {
  if(is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    }
  return(x)
})

# Check again for missing values
sum(is.na(cancer_data))
```

### Detection of Outliers
It is important to identify anomalies that can skew data analysis and to study the underlying patterns in the data points. I perform identification of outliers using z scores. The threshold used is 2.5 which will help to detect significant deviations while minimizing false positives.

```{r outlierData}
# Function for finding outliers
outliers <- function(data, column) {

  # Calculate mean and standard deviation
  mu <- mean(data[[column]], na.rm = TRUE)
  sd <- sd(data[[column]], na.rm = TRUE)

  # Calculate z-score
  z_scores <- (data[[column]] - mu) / sd

  # Identify z-scores above the threshold
  outlier_threshold <- which(abs(z_scores) > 2.5)

  return(outlier_threshold)
}

# Call the function for each gene
cols <- names(cancer_data)[-c(1, ncol(cancer_data))]
outliers_list <- list()

for(column in cols) {
  outliers_list[[column]] <- outliers(cancer_data, column)
}
#outliers_list

# Total number of outliers
total_outliers <- sum(lengths(outliers_list))
print(paste("The total number of outliers are:", total_outliers))
```

There are 3611 outliers in the data. Thus, the data needs to be normalized. Before, normalization we check the correlation between our features and the particular class of cancer.

```{r}
# Visualize the outlier counts
outlier_counts <- sapply(outliers_list, length)
barplot(outlier_counts, main = "Outlier Counts per Gene", xlab = "Genes", ylab = "Number of Outliers", las = 2)
```

```{r}
# Let's visualize genes with outlier counts above a certain threshold, e.g., greater than 10
high_outlier_genes <- names(which(outlier_counts > 20))
if (length(high_outlier_genes) > 0) {
  boxplot(cancer_data[, high_outlier_genes], las = 2, main = "Boxplots for Genes with High Outlier Counts", xlab = "Genes", ylab = "Expression Levels")
} else {
  print("No genes have more than 10 outliers.")
}
```

you can either keep them, especially for rare disease subtypes, or mitigate their influence using robust statistical methods or transformations (like log transformation)

### Correlation Analysis
It is important to find the correlation between the genes and each unique class to select all the relevant features. I find the least correlated genes for each class to explore the data better. Here I use one hot encoding method to represent categorical variables as numerical values and use dummy variables.

```{r echo=FALSE}
if (!require("reshape2", character.only = TRUE)) {
  install.packages("reshape2", dependencies = TRUE)
}
library(reshape2)
```

Encode the categorical variable to factor using one hot encoding method
```{r}
# One hot encoding
cancer_data$Class <- as.factor(cancer_data$Class)
```

Finding the least correlated genes to each class
```{r corAnalysis, warning=FALSE}

# Create dummy variables
dummy_class <- model.matrix(~ Class - 1, data = cancer_data)

# Convert all data to numeric
gene_data <- select(cancer_data, -Class)
gene_data <- as.data.frame(lapply(gene_data, as.numeric))

# Calculate correlations
cor_analysis <- (cor(cbind(gene_data, dummy_class), use = "complete.obs"))
cor_genes <- ncol(gene_data)
cor_class <- ncol(dummy_class)
final_cor <- cor_analysis[1:cor_genes, (cor_genes+1):(cor_genes+cor_class)]

# Melt and format the matrix
melt_cor <- melt(final_cor, varnames = c("Gene", "Class"))
melt_cor$Class <- gsub("Class", "", melt_cor$Class)

# Find the least 10 correlated genes for each class
order_data <- melt_cor[order(melt_cor$Class, melt_cor$value), ]
group_data <- group_by(order_data, Class)
least_cor_genes <- summarise(group_data, least_genes = Gene[1:5], least_val = value[1:5], .groups = 'drop')

print(least_cor_genes)

# Find common least correlated genes between the classes
genes_per_class <- by(least_cor_genes, least_cor_genes$Class, function(df) {
  as.character(df$least_genes)
})
common_genes <- Reduce(intersect, genes_per_class)
print(common_genes)
```
There are no common genes that show least correlation between the classes and that we can eliminate further during feature selection.

Now, I plot a correlation graph using UMAPs used for visualizing high-dimensional data in a lower dimensional space to see the correlation between the genes for each class.
```{r echo=FALSE}
if (!require("umap", character.only = TRUE)) {
  install.packages("umap", dependencies = TRUE)
}
library(umap)
```

```{r plotUMAP}
set.seed(3434)

# UMAP dimensionality reduction
umap_plot <- umap(gene_data)

# Convert for ggplot
umap_df <- as.data.frame(umap_plot$layout)
umap_df$Class <- cancer_data$Class

# Plot a UMAP
ggplot(umap_df, aes(x = V1, y = V2, color = Class)) +
  geom_point(alpha = 0.7) +
  labs(title = "UMAP for Correlation between Genes of each Class", x = "Genes", y = "Genes") +
  theme_minimal() +
  theme(legend.position = "right")
```


```{r , warning=FALSE}
# Check correlation between genes
gene_cor_matrix <- cor(gene_data, use = "complete.obs")
gene_cor_matrix <- melt(gene_cor_matrix)
```


```{r}
ggplot(gene_cor_matrix, aes(x = value)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Correlation Values", x = "Correlation Value", y = "Frequency") +
  theme_minimal()
```
The histogram shows a normal-like distribution of correlation values. It indicates that most gene expression relationships are near zero, suggesting little to no linear correlation between them.
Linear correlation may not capture all types of relationships. Look for non-linear patterns or interactions between genes using methods like decision trees, random forests, or neural networks that can model complex relationships.


The UMAP plot shows distinct clusters corresponding to different classes of cancer which indicates patterns of gene correlation unique to each class. It can be seen that some genes of BRCA form clusters which might possibly be outliers. Hence, we need to normalize the data. Also, the separation between the clusters suggests that the gene expression profiles for these classes are quite different.


# Data Preparation
## Data Cleaning & Shaping
There are a number of outliers detected. Thus, the data needs to be normalized for better results. Before, normalization I log2 transform the data to reduce the skewness introduced by outliers.

### Transformation of features to adjust distribution
```{r logTransform, warning=FALSE}
# Perform log transform
log_tranform_data <- select(cancer_data, -Class)
log_tranform_data <- log2(log_tranform_data + 1)
log_tranform_data$Class <- cancer_data$Class

summary(log_tranform_data)
```


### Normalization of feature values
Here, I normalize using Z score normalization method.
```{r normalize}
# Function for normalization
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x))) }

# Call the function
norm_data <- select(log_tranform_data, -Class)
norm_data <- as.data.frame(lapply(norm_data, normalize))
norm_data$Class <- log_tranform_data$Class

summary(norm_data)
```

After normalization, we need to check if there are still any outliers.
```{r outliers}
# Function for finding outliers
outliers <- function(data, column) {

  # Calculate mean and standard deviation
  mu <- mean(data[[column]], na.rm = TRUE)
  sd <- sd(data[[column]], na.rm = TRUE)

  # Calculate z-score
  z_scores <- (data[[column]] - mu) / sd

  # Identify z-scores above the threshold
  outlier_threshold <- which(abs(z_scores) > 2.5)

  return(outlier_threshold)
}

# Call the function for each gene
cols <- names(norm_data)[-c(1, ncol(norm_data))]
outliers_list <- list()

for(column in cols) {
  outliers_list[[column]] <- outliers(norm_data, column)
}
#outliers_list

# Total number of outliers
total_outliers <- sum(lengths(outliers_list))
print(paste("The total number of outliers are:", total_outliers))
```

There was a decrease in the number of outliers to 3530. There still are outliers but these will be kept in the dataset since these outliers might show some extreme true positive values which might help during training the model.

Filtering out columns that only have Zero values
```{r filterZero}
# Remove columns with only 0 values
normalized_data <- select(norm_data, -Class)
zero_cols <- sapply(normalized_data, function(x) length(unique(x)) == 1)
constant_column_names <- names(normalized_data)[zero_cols]
normalized_data <- normalized_data[, !zero_cols]

print(paste(constant_column_names, "removed because they only have zero values"))
```

### Identification of principal components (PCA)
```{r pcaComp}
# Apply PCA
pca_results <- prcomp(normalized_data)
str(pca_results)
summary(pca_results)

# Plot PCA graph
plot(pca_results, type = "l")
```

The exact choice depends on the level of variance you want to capture and the trade-off you're willing to make between dimensionality reduction and information retention.

Now, I make a new dataset with PCA features.
```{r pcaData}
# Making a new dataset
pc_scores <- as.data.frame(pca_results$x)
pc_scores$Class <- norm_data$Class

# Components that reach 90% variance
pc_comp <- cumsum(pca_results$sdev^2 / sum(pca_results$sdev^2))
num_components <- which(pc_comp >= 0.90)[1]
pca_dataset <- pc_scores[, 1:num_components]
pca_dataset$Class <- norm_data$Class
```
By comparing both the results of pca_dataset and normalized_data we can try and get a better model.


```{r}
# Normalized Dataset
normalized_data$Class <- norm_data$Class
str(normalized_data)

# PCA Dataset
str(pca_dataset)
```

# Modeling
## Model Construction
### Creation of training & validation subsets for normalized_data
```{r echo=FALSE}
if (!require("caret", character.only = TRUE)) {
  install.packages("caret", dependencies = TRUE)
}
library(caret)
```

```{r subsetNorm}
set.seed(3434)
# Subsets based on index and classes
norm_train_index <- createDataPartition(normalized_data$Class, p = 0.8, list = FALSE)

# Training set
norm_train_subset <- normalized_data[norm_train_index, ]

# Testing set
norm_test_subset <- normalized_data[-norm_train_index, ]
```

### Creation of training & validation subsets for pca_dataset
```{r subsetPCA}
set.seed(3434)
# Subsets based on index and classes
pca_train_index <- createDataPartition(pca_dataset$Class, p = 0.8, list = FALSE)

# Training set
pca_train_subset <- pca_dataset[pca_train_index, ]

# Testing set
pca_test_subset <- pca_dataset[-pca_train_index, ]
```


### Creation of model A: SVM
My model A is SVM. I train my model on both normalized_data and pca_dataset.
```{r echo=FALSE}
if (!require("kernlab", character.only = TRUE)) {
  install.packages("kernlab", dependencies = TRUE)
}
library(kernlab)
```

```{r svmModel_norm}
# SVM model for norm_train_subset
set.seed(3434)
svm_model_norm <- ksvm(Class ~ ., data = norm_train_subset, kernel = "rbfdot")
```

```{r svmModel_pca}
# SVM model for pca_train_subset
set.seed(3434)
svm_model_pca <- ksvm(Class ~ ., data = pca_train_subset, kernel = "rbfdot")
```


### Creation of model B: Gradient Boosting
My model B is Gradient Boosting. I train my model on both normalized_data and pca_dataset.
```{r echo=FALSE}
if (!require("gbm", character.only = TRUE)) {
  install.packages("gbm", dependencies = TRUE)
}
library(gbm)
```

```{r gbModel_norm}
# Gradient Boosting model for normalized_data
set.seed(3434)
gbm_model_norm <- train(Class ~ ., data = norm_train_subset, method = "gbm", verbose = FALSE)
```

```{r gbModel_pca}
# Gradient Boosting model for normalized_data
set.seed(3434)
gbm_model_pca <- train(Class ~ ., data = pca_train_subset, method = "gbm", verbose = FALSE)
```


### Creation of model C: Random Forest
My model C is Random Forest. I train my model on both normalized_data and pca_dataset.

```{r}
if (!require("randomForest", character.only = TRUE)) {
  install.packages("randomForest", dependencies = TRUE)
}
library(randomForest)
```

```{r rfModel_norm}
# Random Forest model for normalized_data
set.seed(3434)
rf_model_norm <- randomForest(Class ~ ., data = norm_train_subset, verbose = FALSE)
```

```{r rfModel_pca}
# Random Forest model for pca_train_subset
set.seed(3434)
rf_model_pca <- randomForest(Class ~ ., data = pca_train_subset, verbose = FALSE)
```

```{r}
importance(rf_model_norm)

# Plot variable importance
varImpPlot(rf_model_norm)
varImpPlot(rf_model_pca)
```
This plot displays feature importance scores from a Random Forest model, where variables are possibly principal components (PC1, PC2, ..., PC19) from PCA. The x-axis shows the Mean Decrease Gini score, indicating the importance of each component in the model.
# Evaluation
## Model Evaluation
### Evaluation of fit of models with holdout method
For SVM model
```{r svm_predict_norm}
# Evaluate SVM model for norm_test_subset
set.seed(3434)
svm_predict_norm_test <- predict(svm_model_norm, norm_test_subset)
confusionMatrix(svm_predict_norm_test, norm_test_subset$Class)
```
```{r svm_predict_pca}
# Evaluate SVM model for pca_test_subset
set.seed(3434)
svm_predict_pca_test <- predict(svm_model_pca, pca_test_subset)
confusionMatrix(svm_predict_pca_test, pca_test_subset$Class)
```

For Gradient Boosting model
```{r gb_predict_norm}
# Evaluate Gradient Boosting model for norm_test_subset
set.seed(3434)
gb_predict_norm_test <- predict(gbm_model_norm, norm_test_subset)
confusionMatrix(gb_predict_norm_test, norm_test_subset$Class)
```
```{r gb_predict_pca}
# Evaluate Gradient Boosting model for pca_test_subset
set.seed(3434)
gb_predict_pca_test <- predict(gbm_model_pca, pca_test_subset)
confusionMatrix(gb_predict_pca_test, pca_test_subset$Class)
```

For Random Forest model
```{r rf_predict_norm}
# Evaluate Random Forest model for norm_test_subset
set.seed(3434)
rf_predict_norm_test <- predict(rf_model_norm, norm_test_subset)
confusionMatrix(rf_predict_norm_test, norm_test_subset$Class)
```
```{r rf_predict_pca}
# Evaluate Random Forest model for pca_test_subset
set.seed(3434)
rf_predict_pca_test <- predict(rf_model_pca, pca_test_subset)
confusionMatrix(rf_predict_pca_test, pca_test_subset$Class)
```


### Evaluation with k-fold cross-validation and tuning of the model
Using cross validation with 5 folds so that the data trains well for all the different classes.
```{r trainCtrl}
# Creating train control
set.seed(3434)
train_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE, savePredictions = "final",                               classProbs = TRUE, summaryFunction = multiClassSummary)
```

```{r}
if (!require("caret", character.only = TRUE)) {
  install.packages("caret", dependencies = TRUE)
}
library(caret)
```

```{r svm_crossTune, warning=FALSE}
# SVM grid
svm_grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.001, 0.01, 0.05))

# k-fold cross-validation SVM model for normalized_data
set.seed(3434)
cross_svm_model_norm <- train(Class ~ ., data = normalized_data, method = "svmRadial", trControl =
                          train_control, metric = "Accuracy", tuneGrid = svm_grid)


# k-fold cross-validation SVM model for pca_dataset
set.seed(3434)
cross_svm_model_pca <- train(Class ~ ., data = pca_dataset, method = "svmRadial", trControl =
                            train_control, metric = "Accuracy", tuneGrid = svm_grid)
```

```{r gbm_crossTune}
# Gradient Boosting grid
gbm_grid <- expand.grid(n.trees = c(50, 100), interaction.depth = c(1, 3), shrinkage = c(0.1),  n.minobsinnode = c(5, 10))

# k-fold cross-validation Gradient Boosting model for normalized_data
set.seed(3434)
cross_gbm_model_norm <- train(Class ~ ., data = normalized_data, method = "gbm", trControl =
                                train_control, metric = "Accuracy", verbose = FALSE, tuneGrid = gbm_grid)

# k-fold cross-validation Gradient Boosting model for pca_dataset
set.seed(3434)
cross_gbm_model_pca <- train(Class ~ ., data = pca_dataset, method = "gbm", trControl = train_control,
                       metric = "Accuracy", verbose = FALSE, tuneGrid = gbm_grid)

```

```{r rf_crossTune, warning=FALSE}
# Random forest grid
rf_grid <- expand.grid(mtry = c(2, 4, 6))

# k-fold cross-validation Gradient Boosting model for normalized_data
set.seed(3434)
cross_rf_model_norm <- train(Class ~ ., data = normalized_data, method = "rf", trControl = train_control,
                             metric = "Accuracy", verbose = FALSE, tuneGrid = rf_grid)

# k-fold cross-validation Gradient Boosting model for pca_dataset
set.seed(3434)
cross_rf_model_pca <- train(Class ~ ., data = pca_dataset, method = "rf", trControl = train_control,
                       metric = "Accuracy", verbose = FALSE, tuneGrid = rf_grid)
```

```{r}
create_ensemble <- function(rf_model, gbm_model, svm_model, test_data) {

  # Predictions for Random Forest
  rf_predictions <- predict(rf_model, test_data)

  # Predictions for Gradient Boosting
  gbm_predictions <- predict(gbm_model, newdata = test_data)

  # Predictions for SVM
  svm_predictions <- predict(svm_model, test_data)

  # Combine predictions into a matrix
  combined_predictions <- cbind(as.character(rf_predictions),
                                as.character(gbm_predictions),
                                as.character(svm_predictions))

  # Perform majority vote with random selection in case of a tie
  ensemble_predictions <- apply(combined_predictions, 1, function(row) {
    freq <- table(row)
    return(names(which.max(freq)))
  })
  return(ensemble_predictions)
  }

norm_ensemble <- create_ensemble(cross_rf_model_norm, cross_gbm_model_norm, cross_svm_model_norm, norm_test_subset)

pca_ensemble <- create_ensemble(cross_rf_model_pca, cross_gbm_model_pca, cross_svm_model_pca, pca_test_subset)
```


```{r}
norm_ensemble_accuracy <- sum(norm_ensemble == norm_test_subset$Class) / length(norm_test_subset$Class)
print(paste("Ensemble Accuracy:", norm_ensemble_accuracy))

pca_ensemble_accuracy <- sum(pca_ensemble == pca_test_subset$Class) / length(pca_test_subset$Class)
print(paste("Ensemble Accuracy:", pca_ensemble_accuracy))

```