Skip to contents

Introduction

This vignette will provide a brief introduction into the basic usage of reclanc. If you’re interested in how reclanc works, I’d recommend reading the blog post I wrote and the original paper by Alan Dabney, who created the original ClaNC algorithm.

Fitting

To create new centroids from existing expression data, use the clanc function.

reclanc provides some synthetic expression data that we can work with:

lapply(synthetic_expression, head)
#> $expression
#>        sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
#> gene1 8.097529 7.119188 7.304400 7.554689 7.953206 7.714925 7.512700 8.597547
#> gene2 8.641837 9.400416 8.500865 8.878687 8.318438 8.728683 7.812591 7.638167
#> gene3 3.436236 4.317915 3.435193 3.515755 3.024976 4.762209 5.048956 2.006646
#> gene4 4.368008 5.212750 4.618249 4.201365 3.195294 4.707750 5.126769 6.178658
#> gene5 2.423974 3.563816 4.062362 2.163278 2.021435 2.813873 0.000000 4.652358
#> gene6 5.371205 5.919809 4.366915 4.805534 4.834856 5.622157 3.883531 3.593082
#>        sample9 sample10 sample11 sample12
#> gene1 6.475641 7.648858 8.637526 7.345038
#> gene2 8.110285 7.906104 7.424728 7.927039
#> gene3 2.739211 3.111668 3.161077 4.306611
#> gene4 5.170265 4.259578 5.872855 6.159023
#> gene5 1.532242 3.399823 3.691250 1.932937
#> gene6 4.246205 4.637316 3.575837 2.730452
#> 
#> $classes
#> [1] A A A A A A
#> Levels: A B

These data include 12 samples, 6 of class “A” and 6 of class “B”, with 100 genes.

reclanc is able to ingest a variety of common formats for expression data. Objects like SummarizedExperiments and ExpressionSets are frequently used in bioinformatic analyses and arrange their data so samples are in columns and genes are in rows. This is in conflict with the expected formula input in base R, where predictors (genes) and outcomes (classes) are in columns. reclanc eases this friction by expecting input in its most common format, abstracting away the wrangling aspect of analysis.

Because of this, there are two broad categories of input - ‘wide data’ and ‘tall data’.

Wide inputs

Wide inputs require data that has both predictors and outcomes as columns, together, in a single data.frame.

Formula

form_data <- cbind(
  class = synthetic_expression$classes,
  as.data.frame(t(synthetic_expression$expression))
)
head(form_data[1:5])
#>         class    gene1    gene2    gene3    gene4
#> sample1     A 8.097529 8.641837 3.436236 4.368008
#> sample2     A 7.119188 9.400416 4.317915 5.212750
#> sample3     A 7.304400 8.500865 3.435193 4.618249
#> sample4     A 7.554689 8.878687 3.515755 4.201365
#> sample5     A 7.953206 8.318438 3.024976 3.195294
#> sample6     A 7.714925 8.728683 4.762209 4.707750
clanc(class ~ ., form_data, active = 5)
#> <clanc> 
#> $centroids
#>    class   gene expression pooled_sd active prior
#> 1      A gene12   7.514718 0.4779155      5   0.5
#> 2      A  gene2   8.744821 0.3147537      5   0.5
#> 3      A gene13   8.936462 0.3418472      5   0.5
#> 4      A gene21   6.584681 0.5279636      5   0.5
#> 5      A gene24   4.307301 0.7214700      5   0.5
#> 6      A gene74   4.028507 0.4940783      5   0.5
#> 7      A gene41   4.328516 0.6317005      5   0.5
#> 8      A gene95   6.873184 0.4462475      5   0.5
#> 9      A gene52   3.743798 0.5173769      5   0.5
#> 10     A gene66   7.008174 0.5883218      5   0.5
#> 11     B gene12   8.072284 0.4779155      5   0.5
#> 12     B gene13   9.938137 0.3418472      5   0.5
#> 13     B  gene2   8.273987 0.3147537      5   0.5
#> 14     B gene24   3.370467 0.7214700      5   0.5
#> 15     B gene21   5.789423 0.5279636      5   0.5
#> 16     B gene41   5.518354 0.6317005      5   0.5
#> 17     B gene74   3.226598 0.4940783      5   0.5
#> 18     B gene52   2.438579 0.5173769      5   0.5
#> 19     B gene95   6.288173 0.4462475      5   0.5
#> 20     B gene66   7.891588 0.5883218      5   0.5

recipe

reclanc also supports tidymodels workflows:

discrim_linear() |>
  set_engine("clanc", active = 5) |>
  fit(class ~ ., data = form_data)
#> parsnip model object
#> 
#> <clanc> 
#> $centroids
#>    class   gene expression pooled_sd active prior
#> 1      A gene12   7.514718 0.4779155      5   0.5
#> 2      A  gene2   8.744821 0.3147537      5   0.5
#> 3      A gene13   8.936462 0.3418472      5   0.5
#> 4      A gene21   6.584681 0.5279636      5   0.5
#> 5      A gene24   4.307301 0.7214700      5   0.5
#> 6      A gene74   4.028507 0.4940783      5   0.5
#> 7      A gene41   4.328516 0.6317005      5   0.5
#> 8      A gene95   6.873184 0.4462475      5   0.5
#> 9      A gene52   3.743798 0.5173769      5   0.5
#> 10     A gene66   7.008174 0.5883218      5   0.5
#> 11     B gene12   8.072284 0.4779155      5   0.5
#> 12     B gene13   9.938137 0.3418472      5   0.5
#> 13     B  gene2   8.273987 0.3147537      5   0.5
#> 14     B gene24   3.370467 0.7214700      5   0.5
#> 15     B gene21   5.789423 0.5279636      5   0.5
#> 16     B gene41   5.518354 0.6317005      5   0.5
#> 17     B gene74   3.226598 0.4940783      5   0.5
#> 18     B gene52   2.438579 0.5173769      5   0.5
#> 19     B gene95   6.288173 0.4462475      5   0.5
#> 20     B gene66   7.891588 0.5883218      5   0.5

Tall inputs

Tall inputs require genes as rows and samples as columns

data.frame/matrix

It is often convenient to supply a data.frame, particularly after some data-munging has been done. Both data.frame and matrix inputs require expression with genes as column names and sample IDs as rownames, as well as a factor vector of classes:

clanc(
  synthetic_expression$expression,
  classes = synthetic_expression$classes,
  active = 5
)
#> <clanc> 
#> $centroids
#>    class   gene expression pooled_sd active prior
#> 1      A gene12   7.514718 0.4779155      5   0.5
#> 2      A  gene2   8.744821 0.3147537      5   0.5
#> 3      A gene13   8.936462 0.3418472      5   0.5
#> 4      A gene21   6.584681 0.5279636      5   0.5
#> 5      A gene24   4.307301 0.7214700      5   0.5
#> 6      A gene74   4.028507 0.4940783      5   0.5
#> 7      A gene41   4.328516 0.6317005      5   0.5
#> 8      A gene95   6.873184 0.4462475      5   0.5
#> 9      A gene52   3.743798 0.5173769      5   0.5
#> 10     A gene66   7.008174 0.5883218      5   0.5
#> 11     B gene12   8.072284 0.4779155      5   0.5
#> 12     B gene13   9.938137 0.3418472      5   0.5
#> 13     B  gene2   8.273987 0.3147537      5   0.5
#> 14     B gene24   3.370467 0.7214700      5   0.5
#> 15     B gene21   5.789423 0.5279636      5   0.5
#> 16     B gene41   5.518354 0.6317005      5   0.5
#> 17     B gene74   3.226598 0.4940783      5   0.5
#> 18     B gene52   2.438579 0.5173769      5   0.5
#> 19     B gene95   6.288173 0.4462475      5   0.5
#> 20     B gene66   7.891588 0.5883218      5   0.5

SummarizedExperiment

Some common formats for expression are SummarizedExperiments and ExpressionSets:

se <- SummarizedExperiment(
  synthetic_expression$expression,
  colData = DataFrame(class = synthetic_expression$classes)
)
se
#> class: SummarizedExperiment 
#> dim: 100 12 
#> metadata(0):
#> assays(1): ''
#> rownames(100): gene1 gene2 ... gene99 gene100
#> rowData names(0):
#> colnames(12): sample1 sample2 ... sample11 sample12
#> colData names(1): class

We can specify the name of the colData (pData for ExpressionSets) column that contains the classes with the classes argument:

fit <- clanc(
  se,
  classes = "class",
  active = 20,
  assay = 1 # Index of assay - SummarizedExperiments only
)
fit
#> <clanc> 
#> $centroids
#>    class    gene expression pooled_sd active prior
#> 1      A  gene11  2.2992343 1.2044848     20   0.5
#> 2      A   gene2  8.7448209 0.3147537     20   0.5
#> 3      A  gene13  8.9364621 0.3418472     20   0.5
#> 4      A  gene20  2.1925558 1.3104010     20   0.5
#> 5      A  gene10  4.9557850 0.8571716     20   0.5
#> 6      A  gene21  6.5846813 0.5279636     20   0.5
#> 7      A gene100  5.6455200 0.6175104     20   0.5
#> 8      A  gene22  6.1650079 0.4699756     20   0.5
#> 9      A  gene46  6.7344030 0.8233370     20   0.5
#> 10     A  gene24  4.3073008 0.7214700     20   0.5
#> 11     A  gene15  2.4254020 1.1910158     20   0.5
#> 12     A  gene25  5.0353875 0.7498139     20   0.5
#> 13     A  gene17  2.9424148 0.6628466     20   0.5
#> 14     A   gene4  4.3839026 0.7144711     20   0.5
#> 15     A  gene56  6.3441126 0.4078736     20   0.5
#> 16     A  gene41  4.3285163 0.6317005     20   0.5
#> 17     A  gene57  4.2237139 0.9531773     20   0.5
#> 18     A   gene7  5.5545202 0.7875124     20   0.5
#> 19     A  gene58  5.6162919 0.8161951     20   0.5
#> 20     A  gene12  7.5147181 0.4779155     20   0.5
#> 21     A   gene6  5.1534126 0.6194184     20   0.5
#> 22     A  gene51  6.6256136 0.7737520     20   0.5
#> 23     A  gene60  4.7434923 1.2945446     20   0.5
#> 24     A  gene52  3.7437977 0.5173769     20   0.5
#> 25     A  gene63  8.9293980 0.5635262     20   0.5
#> 26     A  gene53  4.3774614 0.8370528     20   0.5
#> 27     A  gene66  7.0081742 0.5883218     20   0.5
#> 28     A  gene83  3.6532038 0.8444393     20   0.5
#> 29     A  gene67  6.1384613 0.3677756     20   0.5
#> 30     A  gene85  5.2179679 0.5930857     20   0.5
#> 31     A  gene88  4.6008044 1.0603007     20   0.5
#> 32     A  gene70  1.3073340 1.1264747     20   0.5
#> 33     A  gene47  9.4528373 0.2030726     20   0.5
#> 34     A  gene90  0.9794695 1.3272423     20   0.5
#> 35     A  gene74  4.0285071 0.4940783     20   0.5
#> 36     A  gene94  7.7773183 0.5375914     20   0.5
#> 37     A  gene78  2.1763395 1.6805560     20   0.5
#> 38     A  gene95  6.8731844 0.4462475     20   0.5
#> 39     A  gene79  3.7138831 1.0587367     20   0.5
#> 40     A  gene98  4.5710407 0.6798799     20   0.5
#> 41     B  gene10  4.2378889 0.8571716     20   0.5
#> 42     B   gene2  8.2739866 0.3147537     20   0.5
#> 43     B gene100  5.0435040 0.6175104     20   0.5
#> 44     B  gene20  3.4781598 1.3104010     20   0.5
#> 45     B  gene46  7.0200767 0.8233370     20   0.5
#> 46     B  gene11  1.2780748 1.2044848     20   0.5
#> 47     B  gene12  8.0722841 0.4779155     20   0.5
#> 48     B  gene22  6.4609169 0.4699756     20   0.5
#> 49     B  gene51  5.8920005 0.7737520     20   0.5
#> 50     B  gene13  9.9381374 0.3418472     20   0.5
#> 51     B  gene15  1.6008569 1.1910158     20   0.5
#> 52     B  gene25  4.5015558 0.7498139     20   0.5
#> 53     B  gene17  2.5005839 0.6628466     20   0.5
#> 54     B   gene4  4.9225469 0.7144711     20   0.5
#> 55     B  gene56  6.1067832 0.4078736     20   0.5
#> 56     B  gene41  5.5183538 0.6317005     20   0.5
#> 57     B  gene57  3.1175271 0.9531773     20   0.5
#> 58     B   gene7  5.3367575 0.7875124     20   0.5
#> 59     B  gene21  5.7894231 0.5279636     20   0.5
#> 60     B  gene47  9.5903798 0.2030726     20   0.5
#> 61     B   gene6  4.4655748 0.6194184     20   0.5
#> 62     B  gene74  3.2265977 0.4940783     20   0.5
#> 63     B  gene24  3.3704670 0.7214700     20   0.5
#> 64     B  gene52  2.4385792 0.5173769     20   0.5
#> 65     B  gene63  8.3234317 0.5635262     20   0.5
#> 66     B  gene53  3.8479638 0.8370528     20   0.5
#> 67     B  gene66  7.8915875 0.5883218     20   0.5
#> 68     B  gene83  4.2757218 0.8444393     20   0.5
#> 69     B  gene67  6.0190764 0.3677756     20   0.5
#> 70     B  gene85  5.8877225 0.5930857     20   0.5
#> 71     B  gene79  4.1894417 1.0587367     20   0.5
#> 72     B  gene58  4.7194615 0.8161951     20   0.5
#> 73     B  gene88  5.5945405 1.0603007     20   0.5
#> 74     B  gene70  1.5987845 1.1264747     20   0.5
#> 75     B  gene90  1.4036889 1.3272423     20   0.5
#> 76     B  gene60  5.2336968 1.2945446     20   0.5
#> 77     B  gene78  1.6625207 1.6805560     20   0.5
#> 78     B  gene95  6.2881728 0.4462475     20   0.5
#> 79     B  gene98  4.1346296 0.6798799     20   0.5
#> 80     B  gene94  8.4222554 0.5375914     20   0.5

Predicting

The fit can then be used to predict classes of new samples from new data. The new data can come in the form of a matrix, data.frame, SummarizedExperiment, or ExpressionSet, with the same expected input

Using type = "class" will predict classes using the metric provided by Alan Dabney in the original ClaNC paper.

predict(fit, new_data = se, type = "class")
#> 40/40 (100%) genes in centroids found in data
#> # A tibble: 12 × 1
#>    .pred_class
#>    <fct>      
#>  1 A          
#>  2 A          
#>  3 A          
#>  4 A          
#>  5 A          
#>  6 A          
#>  7 B          
#>  8 B          
#>  9 B          
#> 10 B          
#> 11 B          
#> 12 B

However, particularly if comparing across datasets that may have been transformed differently, it may be more accurate to use a correlation based metric:

predict(fit, new_data = se, type = "numeric", method = "spearman")
#> 40/40 (100%) genes in centroids found in data
#> # A tibble: 12 × 2
#>    .pred_A .pred_B
#>      <dbl>   <dbl>
#>  1   0.901   0.811
#>  2   0.929   0.849
#>  3   0.932   0.840
#>  4   0.912   0.829
#>  5   0.862   0.770
#>  6   0.932   0.869
#>  7   0.776   0.904
#>  8   0.824   0.931
#>  9   0.828   0.924
#> 10   0.855   0.946
#> 11   0.805   0.915
#> 12   0.750   0.869