Compiled on Jan 6, 2021 with scanpy v.1.6

Doublet identification in 10k PBMC dataset using scrublet

In this notebook, we start with a counts matrix that was processed with SoupX. The processing is described in a separate R notebook available in the same repository. The overall suggested workflow is described on ReadTheDocs.

First, let's import the libraries we're going to use.

Then, read in the counts matrix generated by SoupX.

Let's take a look at the dataset. The full matrix contains 10194 cells and 36601 genes.

For the next step, we need the expected doublet formation rate, which depends on the number of recovered cells. These values can be found here. Since our dataset has approximately 10,000 (recovered) cells, we take the highest value given in the table.
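As a rough sketch of this lookup, the snippet below picks the table row closest to the number of recovered cells. The table values are an assumption: they follow the rule of thumb of roughly 0.8% doublets per 1,000 recovered cells given in 10x Genomics user guides, and should be replaced with the values from the table referenced above. The names `DOUBLET_RATE_TABLE` and `expected_doublet_rate` are hypothetical helpers, not part of any library.

```python
# Illustrative table: recovered cells -> expected doublet rate.
# ASSUMPTION: values follow the ~0.8% per 1,000 recovered cells rule of thumb;
# replace them with the table that applies to your chemistry and protocol.
DOUBLET_RATE_TABLE = {
    500: 0.004, 1000: 0.008, 2000: 0.016, 3000: 0.023, 4000: 0.031,
    5000: 0.039, 6000: 0.046, 7000: 0.054, 8000: 0.061, 9000: 0.069,
    10000: 0.076,
}


def expected_doublet_rate(n_recovered_cells, table=DOUBLET_RATE_TABLE):
    """Return the rate listed for the table row closest to n_recovered_cells."""
    closest = min(table, key=lambda n: abs(n - n_recovered_cells))
    return table[closest]


# Our matrix has 10194 cells, so the closest row is the last (highest) one.
rate = expected_doublet_rate(10194)
print(f"expected doublet rate: {rate}")
```

With 10194 cells the closest row is the 10,000-cell one, which is why we end up with the highest value in the table.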

Calling the main function, scrub_doublets, calculates a doublet score for each cell and assigns a binary (doublet/singlet) prediction based on this score. We can store both as metadata in the adata object.
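A minimal sketch of this step could look as follows. It assumes `adata` is an AnnData object holding the raw counts in `adata.X`; the function name `annotate_doublets`, the column names `doublet_scores` and `predicted_doublets`, and the default rate of 0.076 are our own choices for illustration, not values confirmed by this notebook. `scrub_doublets` is called with its default parameters here.

```python
def annotate_doublets(adata, expected_doublet_rate=0.076):
    """Run Scrublet on adata.X and store scores and calls in adata.obs.

    ASSUMPTION: adata.X contains raw (unnormalized) counts, cells x genes.
    """
    # Imported lazily because scrublet is a third-party dependency.
    import scrublet as scr

    scrub = scr.Scrublet(adata.X, expected_doublet_rate=expected_doublet_rate)

    # Returns one score per cell and one boolean call per cell. The calls can
    # be None if Scrublet fails to find a threshold automatically.
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    # Hypothetical column names; pick whatever fits your other notebooks.
    adata.obs["doublet_scores"] = doublet_scores
    adata.obs["predicted_doublets"] = predicted_doublets

    # Return the Scrublet object so it can be reused for diagnostic plots.
    return scrub
```

Returning the Scrublet object keeps the simulated doublet scores around, which the diagnostic plots rely on.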

The following histogram is an important diagnostic plot. The doublet score threshold should separate the two shoulders of the bimodal distribution. If it does not, adjust the doublet score cutoff accordingly.
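To see what adjusting the cutoff does, here is a small pure-Python illustration with made-up scores (they are not taken from our dataset). A cell is called a doublet when its score exceeds the threshold, so raising the threshold yields fewer predicted doublets. In Scrublet itself, the cutoff can be reset on the Scrublet object with `call_doublets(threshold=...)`; the helper `call_by_threshold` below is only a hypothetical stand-in for that logic.

```python
def call_by_threshold(scores, threshold):
    """Return True for every cell whose doublet score exceeds the threshold."""
    return [score > threshold for score in scores]


# Made-up scores for illustration only.
scores = [0.02, 0.05, 0.11, 0.48, 0.63]

for threshold in (0.10, 0.25):
    calls = call_by_threshold(scores, threshold)
    print(f"threshold={threshold:.2f}: {sum(calls)} of {len(calls)} cells called doublets")
```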

Let's visualize the doublet predictions in a 2-D embedding (e.g., UMAP or t-SNE). Predicted doublets should mostly co-localize (possibly in multiple clusters). If they do not, you may need to adjust the doublet score threshold, or change the pre-processing parameters to better resolve the cell states present in your data. In our case, everything appears to be in order.

Finally, let's save a tab-separated file of doublet calls and doublet scores; it will be used in other notebooks. Alternatively, this data can be used directly for simple filtering (doublet removal).
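Both uses can be sketched with the standard library alone. The column names, file name, and barcodes below are hypothetical placeholders; in practice the barcodes, scores, and calls would come from the adata object.

```python
import csv
import os
import tempfile


def write_doublet_calls(path, barcodes, scores, calls):
    """Write one row per cell: barcode, doublet score, doublet call."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["barcode", "doublet_score", "predicted_doublet"])
        writer.writerows(zip(barcodes, scores, calls))


def singlet_barcodes(path):
    """Read the file back and keep only barcodes not called as doublets."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Booleans are written as the strings "True" / "False".
        return [row["barcode"] for row in reader if row["predicted_doublet"] == "False"]


# Made-up example data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doublet_calls.tsv")
    write_doublet_calls(
        path,
        barcodes=["AAAC-1", "AAAG-1", "AACT-1"],
        scores=[0.03, 0.61, 0.08],
        calls=[False, True, False],
    )
    print(singlet_barcodes(path))
```

Keeping the scores in the file alongside the binary calls lets downstream notebooks apply a different cutoff without rerunning the doublet detection.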