Introduction
This vignette will provide a brief introduction into the basic
usage of reclanc
. If you’re interested in how
reclanc
works, I’d recommend reading the blog post I
wrote and the original
paper by Alan Dabney, who created the original ClaNC
algorithm.
Fitting
To create new centroids from existing expression data, use the
clanc
function.
reclanc
provides some synthetic expression data that we
can work with:
lapply(synthetic_expression, head)
#> $expression
#> sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
#> gene1 8.097529 7.119188 7.304400 7.554689 7.953206 7.714925 7.512700 8.597547
#> gene2 8.641837 9.400416 8.500865 8.878687 8.318438 8.728683 7.812591 7.638167
#> gene3 3.436236 4.317915 3.435193 3.515755 3.024976 4.762209 5.048956 2.006646
#> gene4 4.368008 5.212750 4.618249 4.201365 3.195294 4.707750 5.126769 6.178658
#> gene5 2.423974 3.563816 4.062362 2.163278 2.021435 2.813873 0.000000 4.652358
#> gene6 5.371205 5.919809 4.366915 4.805534 4.834856 5.622157 3.883531 3.593082
#> sample9 sample10 sample11 sample12
#> gene1 6.475641 7.648858 8.637526 7.345038
#> gene2 8.110285 7.906104 7.424728 7.927039
#> gene3 2.739211 3.111668 3.161077 4.306611
#> gene4 5.170265 4.259578 5.872855 6.159023
#> gene5 1.532242 3.399823 3.691250 1.932937
#> gene6 4.246205 4.637316 3.575837 2.730452
#>
#> $classes
#> [1] A A A A A A
#> Levels: A B
These data include 12 samples, 6 of class “A” and 6 of class “B”, with 100 genes.
reclanc
is able to ingest a variety of common formats
for expression data. Objects like SummarizedExperiments
and
ExpressionSets
are frequently used in bioinformatic
analyses and arrange their data so samples are in columns and genes are
in rows. This is in conflict with the expected formula input in base R,
where predictors (genes) and outcomes (classes) are in columns.
reclanc
eases this friction by expecting input in its most
common format, abstracting away the wrangling aspect of analysis.
Because of this, there are two broad categories of input - ‘wide data’ and ‘tall data’.
Wide inputs
Wide inputs require data that has both predictors and outcomes as
columns, together, in a single data.frame
.
Formula
form_data <- cbind(
class = synthetic_expression$classes,
as.data.frame(t(synthetic_expression$expression))
)
head(form_data[1:5])
#> class gene1 gene2 gene3 gene4
#> sample1 A 8.097529 8.641837 3.436236 4.368008
#> sample2 A 7.119188 9.400416 4.317915 5.212750
#> sample3 A 7.304400 8.500865 3.435193 4.618249
#> sample4 A 7.554689 8.878687 3.515755 4.201365
#> sample5 A 7.953206 8.318438 3.024976 3.195294
#> sample6 A 7.714925 8.728683 4.762209 4.707750
clanc(class ~ ., form_data, active = 5)
#> <clanc>
#> $centroids
#> class gene expression pooled_sd active prior
#> 1 A gene12 7.514718 0.4779155 5 0.5
#> 2 A gene2 8.744821 0.3147537 5 0.5
#> 3 A gene13 8.936462 0.3418472 5 0.5
#> 4 A gene21 6.584681 0.5279636 5 0.5
#> 5 A gene24 4.307301 0.7214700 5 0.5
#> 6 A gene74 4.028507 0.4940783 5 0.5
#> 7 A gene41 4.328516 0.6317005 5 0.5
#> 8 A gene95 6.873184 0.4462475 5 0.5
#> 9 A gene52 3.743798 0.5173769 5 0.5
#> 10 A gene66 7.008174 0.5883218 5 0.5
#> 11 B gene12 8.072284 0.4779155 5 0.5
#> 12 B gene13 9.938137 0.3418472 5 0.5
#> 13 B gene2 8.273987 0.3147537 5 0.5
#> 14 B gene24 3.370467 0.7214700 5 0.5
#> 15 B gene21 5.789423 0.5279636 5 0.5
#> 16 B gene41 5.518354 0.6317005 5 0.5
#> 17 B gene74 3.226598 0.4940783 5 0.5
#> 18 B gene52 2.438579 0.5173769 5 0.5
#> 19 B gene95 6.288173 0.4462475 5 0.5
#> 20 B gene66 7.891588 0.5883218 5 0.5
recipe
reclanc
also supports tidymodels
workflows:
discrim_linear() |>
set_engine("clanc", active = 5) |>
fit(class ~ ., data = form_data)
#> parsnip model object
#>
#> <clanc>
#> $centroids
#> class gene expression pooled_sd active prior
#> 1 A gene12 7.514718 0.4779155 5 0.5
#> 2 A gene2 8.744821 0.3147537 5 0.5
#> 3 A gene13 8.936462 0.3418472 5 0.5
#> 4 A gene21 6.584681 0.5279636 5 0.5
#> 5 A gene24 4.307301 0.7214700 5 0.5
#> 6 A gene74 4.028507 0.4940783 5 0.5
#> 7 A gene41 4.328516 0.6317005 5 0.5
#> 8 A gene95 6.873184 0.4462475 5 0.5
#> 9 A gene52 3.743798 0.5173769 5 0.5
#> 10 A gene66 7.008174 0.5883218 5 0.5
#> 11 B gene12 8.072284 0.4779155 5 0.5
#> 12 B gene13 9.938137 0.3418472 5 0.5
#> 13 B gene2 8.273987 0.3147537 5 0.5
#> 14 B gene24 3.370467 0.7214700 5 0.5
#> 15 B gene21 5.789423 0.5279636 5 0.5
#> 16 B gene41 5.518354 0.6317005 5 0.5
#> 17 B gene74 3.226598 0.4940783 5 0.5
#> 18 B gene52 2.438579 0.5173769 5 0.5
#> 19 B gene95 6.288173 0.4462475 5 0.5
#> 20 B gene66 7.891588 0.5883218 5 0.5
Tall inputs
Tall inputs require genes as rows and samples as columns
data.frame
/matrix
It is often convenient to supply a data.frame
,
particularly after some data-munging has been done. Both
data.frame
and matrix
inputs require
expression with genes as column names and sample IDs as rownames, as
well as a factor vector of classes:
clanc(
synthetic_expression$expression,
classes = synthetic_expression$classes,
active = 5
)
#> <clanc>
#> $centroids
#> class gene expression pooled_sd active prior
#> 1 A gene12 7.514718 0.4779155 5 0.5
#> 2 A gene2 8.744821 0.3147537 5 0.5
#> 3 A gene13 8.936462 0.3418472 5 0.5
#> 4 A gene21 6.584681 0.5279636 5 0.5
#> 5 A gene24 4.307301 0.7214700 5 0.5
#> 6 A gene74 4.028507 0.4940783 5 0.5
#> 7 A gene41 4.328516 0.6317005 5 0.5
#> 8 A gene95 6.873184 0.4462475 5 0.5
#> 9 A gene52 3.743798 0.5173769 5 0.5
#> 10 A gene66 7.008174 0.5883218 5 0.5
#> 11 B gene12 8.072284 0.4779155 5 0.5
#> 12 B gene13 9.938137 0.3418472 5 0.5
#> 13 B gene2 8.273987 0.3147537 5 0.5
#> 14 B gene24 3.370467 0.7214700 5 0.5
#> 15 B gene21 5.789423 0.5279636 5 0.5
#> 16 B gene41 5.518354 0.6317005 5 0.5
#> 17 B gene74 3.226598 0.4940783 5 0.5
#> 18 B gene52 2.438579 0.5173769 5 0.5
#> 19 B gene95 6.288173 0.4462475 5 0.5
#> 20 B gene66 7.891588 0.5883218 5 0.5
SummarizedExperiment
Some common formats for expression are
SummarizedExperiment
s and ExpressionSet
s:
se <- SummarizedExperiment(
synthetic_expression$expression,
colData = DataFrame(class = synthetic_expression$classes)
)
se
#> class: SummarizedExperiment
#> dim: 100 12
#> metadata(0):
#> assays(1): ''
#> rownames(100): gene1 gene2 ... gene99 gene100
#> rowData names(0):
#> colnames(12): sample1 sample2 ... sample11 sample12
#> colData names(1): class
We can specify the name of the colData
(pData
for ExpressionSet
s) column that
contains the classes with the classes
argument:
fit <- clanc(
se,
classes = "class",
active = 20,
assay = 1 # Index of assay - SummarizedExperiments only
)
fit
#> <clanc>
#> $centroids
#> class gene expression pooled_sd active prior
#> 1 A gene11 2.2992343 1.2044848 20 0.5
#> 2 A gene2 8.7448209 0.3147537 20 0.5
#> 3 A gene13 8.9364621 0.3418472 20 0.5
#> 4 A gene20 2.1925558 1.3104010 20 0.5
#> 5 A gene10 4.9557850 0.8571716 20 0.5
#> 6 A gene21 6.5846813 0.5279636 20 0.5
#> 7 A gene100 5.6455200 0.6175104 20 0.5
#> 8 A gene22 6.1650079 0.4699756 20 0.5
#> 9 A gene46 6.7344030 0.8233370 20 0.5
#> 10 A gene24 4.3073008 0.7214700 20 0.5
#> 11 A gene15 2.4254020 1.1910158 20 0.5
#> 12 A gene25 5.0353875 0.7498139 20 0.5
#> 13 A gene17 2.9424148 0.6628466 20 0.5
#> 14 A gene4 4.3839026 0.7144711 20 0.5
#> 15 A gene56 6.3441126 0.4078736 20 0.5
#> 16 A gene41 4.3285163 0.6317005 20 0.5
#> 17 A gene57 4.2237139 0.9531773 20 0.5
#> 18 A gene7 5.5545202 0.7875124 20 0.5
#> 19 A gene58 5.6162919 0.8161951 20 0.5
#> 20 A gene12 7.5147181 0.4779155 20 0.5
#> 21 A gene6 5.1534126 0.6194184 20 0.5
#> 22 A gene51 6.6256136 0.7737520 20 0.5
#> 23 A gene60 4.7434923 1.2945446 20 0.5
#> 24 A gene52 3.7437977 0.5173769 20 0.5
#> 25 A gene63 8.9293980 0.5635262 20 0.5
#> 26 A gene53 4.3774614 0.8370528 20 0.5
#> 27 A gene66 7.0081742 0.5883218 20 0.5
#> 28 A gene83 3.6532038 0.8444393 20 0.5
#> 29 A gene67 6.1384613 0.3677756 20 0.5
#> 30 A gene85 5.2179679 0.5930857 20 0.5
#> 31 A gene88 4.6008044 1.0603007 20 0.5
#> 32 A gene70 1.3073340 1.1264747 20 0.5
#> 33 A gene47 9.4528373 0.2030726 20 0.5
#> 34 A gene90 0.9794695 1.3272423 20 0.5
#> 35 A gene74 4.0285071 0.4940783 20 0.5
#> 36 A gene94 7.7773183 0.5375914 20 0.5
#> 37 A gene78 2.1763395 1.6805560 20 0.5
#> 38 A gene95 6.8731844 0.4462475 20 0.5
#> 39 A gene79 3.7138831 1.0587367 20 0.5
#> 40 A gene98 4.5710407 0.6798799 20 0.5
#> 41 B gene10 4.2378889 0.8571716 20 0.5
#> 42 B gene2 8.2739866 0.3147537 20 0.5
#> 43 B gene100 5.0435040 0.6175104 20 0.5
#> 44 B gene20 3.4781598 1.3104010 20 0.5
#> 45 B gene46 7.0200767 0.8233370 20 0.5
#> 46 B gene11 1.2780748 1.2044848 20 0.5
#> 47 B gene12 8.0722841 0.4779155 20 0.5
#> 48 B gene22 6.4609169 0.4699756 20 0.5
#> 49 B gene51 5.8920005 0.7737520 20 0.5
#> 50 B gene13 9.9381374 0.3418472 20 0.5
#> 51 B gene15 1.6008569 1.1910158 20 0.5
#> 52 B gene25 4.5015558 0.7498139 20 0.5
#> 53 B gene17 2.5005839 0.6628466 20 0.5
#> 54 B gene4 4.9225469 0.7144711 20 0.5
#> 55 B gene56 6.1067832 0.4078736 20 0.5
#> 56 B gene41 5.5183538 0.6317005 20 0.5
#> 57 B gene57 3.1175271 0.9531773 20 0.5
#> 58 B gene7 5.3367575 0.7875124 20 0.5
#> 59 B gene21 5.7894231 0.5279636 20 0.5
#> 60 B gene47 9.5903798 0.2030726 20 0.5
#> 61 B gene6 4.4655748 0.6194184 20 0.5
#> 62 B gene74 3.2265977 0.4940783 20 0.5
#> 63 B gene24 3.3704670 0.7214700 20 0.5
#> 64 B gene52 2.4385792 0.5173769 20 0.5
#> 65 B gene63 8.3234317 0.5635262 20 0.5
#> 66 B gene53 3.8479638 0.8370528 20 0.5
#> 67 B gene66 7.8915875 0.5883218 20 0.5
#> 68 B gene83 4.2757218 0.8444393 20 0.5
#> 69 B gene67 6.0190764 0.3677756 20 0.5
#> 70 B gene85 5.8877225 0.5930857 20 0.5
#> 71 B gene79 4.1894417 1.0587367 20 0.5
#> 72 B gene58 4.7194615 0.8161951 20 0.5
#> 73 B gene88 5.5945405 1.0603007 20 0.5
#> 74 B gene70 1.5987845 1.1264747 20 0.5
#> 75 B gene90 1.4036889 1.3272423 20 0.5
#> 76 B gene60 5.2336968 1.2945446 20 0.5
#> 77 B gene78 1.6625207 1.6805560 20 0.5
#> 78 B gene95 6.2881728 0.4462475 20 0.5
#> 79 B gene98 4.1346296 0.6798799 20 0.5
#> 80 B gene94 8.4222554 0.5375914 20 0.5
Predicting
The fit can then be used to predict classes of new samples from new
data. The new data can come in the form of a matrix
,
data.frame
, SummarizedExperiment
, or
ExpressionSet
, with the same expected input
Using type = "class"
will predict classes using the
metric provided by Alan Dabney in the original ClaNC paper.
predict(fit, new_data = se, type = "class")
#> 40/40 (100%) genes in centroids found in data
#> # A tibble: 12 × 1
#> .pred_class
#> <fct>
#> 1 A
#> 2 A
#> 3 A
#> 4 A
#> 5 A
#> 6 A
#> 7 B
#> 8 B
#> 9 B
#> 10 B
#> 11 B
#> 12 B
However, particularly if comparing across datasets that may have been transformed differently, it may be more accurate to use a correlation based metric:
predict(fit, new_data = se, type = "numeric", method = "spearman")
#> 40/40 (100%) genes in centroids found in data
#> # A tibble: 12 × 2
#> .pred_A .pred_B
#> <dbl> <dbl>
#> 1 0.901 0.811
#> 2 0.929 0.849
#> 3 0.932 0.840
#> 4 0.912 0.829
#> 5 0.862 0.770
#> 6 0.932 0.869
#> 7 0.776 0.904
#> 8 0.824 0.931
#> 9 0.828 0.924
#> 10 0.855 0.946
#> 11 0.805 0.915
#> 12 0.750 0.869