Discussion notes · Machine learning & cell biology

Self-Supervised Learning, ViT, and Cell Painting

A summary of concepts covered: self-supervised learning, masked autoencoders for vision transformers, the CPJUMP1 dataset, performance metrics, practical applications.

Self-supervised learning

Self-supervised learning trains a model using unlabeled data by having the data itself generate the supervision signal. Instead of requiring human annotations, the model is given a task where part of the input is hidden, and it must predict the missing piece from the rest.

Example — language

Given the sentence "The cat sat on the ___", a model must predict the masked word (mat) from context. No human label is needed — the act of masking creates an automatic training signal. This is the core idea behind large language models.
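The masking trick can be sketched in a few lines of Python. This is a toy illustration: the whitespace tokenisation and the "<mask>" token are simplifying assumptions, not how production models tokenise.

```python
# Toy sketch: unlabeled text generates its own (input, target) pairs.
# Whitespace tokenisation and the "<mask>" token are simplifying
# assumptions; real models use subword tokenisers.

MASK = "<mask>"

def make_masked_examples(sentence):
    """Mask each position in turn, yielding (masked_text, target_word)."""
    tokens = sentence.split()
    examples = []
    for i, word in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((" ".join(masked), word))
    return examples

pairs = make_masked_examples("The cat sat on the mat")
# The final pair is ("The cat sat on the <mask>", "mat"): a training
# example created with no human annotation involved.
```

One short sentence thus yields six supervised examples for free; scaling the same trick to billions of sentences is what makes pre-training data effectively unlimited.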

Why it matters

Labeled data is expensive and slow to collect. Unlabeled data — text, images, audio — is essentially unlimited. Models trained this way learn rich, general-purpose representations that transfer well to downstream tasks.

This works through a two-phase structure that is fundamental to modern machine learning. In the first phase, the model is pre-trained on a large unlabeled dataset using a self-supervised task — such as predicting missing parts of the input. Model components used solely for pre-training are then discarded. In the second phase, the remaining encoder is fine-tuned on a much smaller labeled dataset for a specific task. Because the encoder has already learned to understand the structure of the data, far less labeled data is needed than if you were training from scratch. This is precisely why self-supervised learning is so valuable for domains like cell biology, where labeled data is scarce and expensive to produce.

Two families of self-supervised learning

There are two major approaches. The first — covered in detail in the next section — is masked prediction: hide part of the input and train the model to reconstruct it. The second is contrastive learning: rather than predicting missing content, the model learns by comparing pairs of samples. Two different views of the same data point — for example, two differently cropped or augmented versions of the same image — are treated as positives and their embeddings are pulled together in the representation space; embeddings of different data points are pushed apart. The model learns which features are consistent across views and therefore meaningful, without any labels.
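The pull-together/push-apart objective can be made concrete with the widely used InfoNCE loss. This is a minimal NumPy sketch of the general idea, not any specific paper's implementation.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss sketch. z1[i] and z2[i] are embeddings
    of two views of sample i; the diagonal of the similarity matrix holds
    the positives, and every other pair in the batch acts as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature               # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # pull the diagonal up

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 32))
aligned = info_nce_loss(views, views)        # positives are identical views
shuffled = info_nce_loss(views, views[::-1]) # positives are mismatched
```

Minimising this loss is a softmax classification per row: the model must pick its own second view out of the whole batch, which is exactly the "which features are consistent across views" question posed above.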

DINO, a prominent ViT-based example, refines this idea using a teacher-student approach called self-distillation rather than explicit positive/negative pairs — but produces embeddings with the same desirable property: representations that cluster meaningfully by biological similarity. In recent benchmarks on cell microscopy data, DINO outperformed both classical hand-engineered features and masked prediction approaches on retrieval tasks, suggesting that contrastive-style training may be particularly well suited to the compound-gene matching problem — a point we return to in the practical applications section.

Masked autoencoders for ViT

Vision Transformers (ViTs)[3] divide an image into a grid of fixed-size patches. Masked Autoencoders (MAE, He et al. 2022)[2] apply self-supervised learning directly to this patch structure.

Figure: the MAE pre-training pipeline in four steps. (1) The original image is split into patches; (2) 75% of the patches are masked; (3) the visible tokens are processed jointly through the ViT encoder, transformer, and decoder stages, with each stage spanning the full width of all tokens to show they are processed collectively rather than individually; (4) the pixels of the masked patches are reconstructed. The reconstruction loss compares predicted pixels against the original masked patches. After pre-training the decoder is discarded; the ViT encoder has learned rich visual representations and can be fine-tuned on any downstream task (classification, detection, segmentation).

Step 3 shows all visible tokens entering at the top together, flowing down through Encoder → Transformer → Decoder — each stage spans the full width of all tokens, reflecting that every token attends to every other token at each layer.

The key insight is that masking 75% of patches is deliberately aggressive. Unlike language, where neighboring words give strong hints, a missing image patch cannot easily be inferred from its immediate neighbors alone — the encoder must build a global understanding to fill in the gaps.
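The masking step itself is simple. Below is a minimal NumPy sketch; the patch count and dimensions are illustrative assumptions based on a standard 224×224 input with 16×16 patches.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """patches: (num_patches, patch_dim) array. Keeps a random 25% for
    the encoder and records which indices the decoder must reconstruct."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])     # patches the encoder sees
    mask_idx = np.sort(perm[n_keep:])     # patches to be predicted
    return patches[keep_idx], keep_idx, mask_idx

# A 224×224 image cut into 16×16 patches gives 196 tokens; at a 75%
# mask ratio only 49 of them ever reach the encoder.
patches = np.zeros((196, 16 * 16 * 3))
visible, keep_idx, mask_idx = random_masking(patches)
```

Because the encoder processes only the visible quarter of the tokens, pre-training is also much cheaper per image than running a full ViT, which is part of why MAE scales so well.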

After pre-training

The decoder (used only during training) is discarded. The ViT encoder has learned compact, semantically rich representations and can be fine-tuned on any downstream task — classification, detection, segmentation — with very little labeled data.

CPJUMP1 — the Cell Painting dataset

The output of the trained ViT encoder is itself an image-based profile — a compact numerical summary of cell morphology learned directly from pixels. Morphological profiling is the long-established practice of quantifying cell appearance from microscopy images to compare biological states; image-based profiling is the modern, high-throughput version of that practice, typically using software like CellProfiler to extract thousands of hand-engineered measurements per cell. The ViT embedding represents a newer approach to the same goal — letting the model discover what to measure rather than specifying it in advance. CPJUMP1 was designed precisely at this inflection point: it benchmarks classical image-based profiles while making the case that learned representations could do substantially better.

Chandrasekaran et al. (Nature Methods, 2024)[1] introduce CPJUMP1, a benchmark dataset from the JUMP Cell Painting Consortium. It contains approximately 3 million images of cells, profiling 75 million single cells treated with matched chemical and genetic perturbations.

The Cell Painting assay

In the Cell Painting assay,[9] cells are stained with six fluorescent dyes and imaged across five channels (a key difference from the three-channel RGB images standard ViTs expect):

Mito (mitochondria)
AGP (actin, Golgi, plasma membrane)
RNA (nucleoli & cytoplasmic RNA)
ER (endoplasmic reticulum)
DNA (nucleus)
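Handling five channels instead of three changes only the width of each patch token. A sketch, assuming a standard 224×224 image and 16×16 patches (the channel order shown is illustrative):

```python
import numpy as np

def patchify(image, patch_size=16):
    """image: (channels, H, W). Returns (num_patches, patch_size² × channels).
    Works unchanged for 3-channel RGB or 5-channel Cell Painting images."""
    c, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 2, 4, 0)                   # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

# Five Cell Painting channels: Mito, AGP, RNA, ER, DNA (order assumed)
cp_image = np.zeros((5, 224, 224))
tokens = patchify(cp_image)
# Same 196 patches as RGB, but each token is 1280-dim instead of 768:
# only the ViT's first linear projection needs resizing.
```

Everything downstream of that first projection (attention layers, MAE masking, the decoder) is agnostic to the number of input channels.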

Three perturbation modalities

The dataset pairs 303 compounds and 160 genes, with each gene targeted by at least two compounds and two types of genetic perturbation:

Compound
Small molecule drugs (from the Drug Repurposing Hub) applied at 5 µM. Tend to produce the strongest, most distinguishable morphological phenotypes.
CRISPR
Two guide RNAs per gene — CRISPR-Cas9 knockout reduces the amount of a gene's protein product. Two guides per gene allows within-modality matching.
ORF
Open reading frame overexpression — increases the amount of a gene's protein product. One reagent per gene; produces the weakest/noisiest signal.

Benchmark retrieval tasks

The paper defines three progressively harder retrieval tasks, evaluated using mean average precision (mAP):

Replicate detection
Can a perturbation's profile retrieve its own replicates? Easy: ~75–95% mAP for compounds.
Sister matching
Do two compounds targeting the same protein match each other? Medium: 5–25% retrieved.
Compound-gene matching
Does compound X match the CRISPR/ORF profile for its target gene? Hard: ~3–8%, barely above chance.

Surprising finding

CRISPR knockouts and ORF overexpressions targeting the same gene are slightly positively correlated — not anti-correlated as simple logic would predict (reduce ↔ increase should oppose). Biology is messier: many compounds activate rather than inhibit, and overexpressed genes can have dominant-negative or compensatory effects.

Why this matters for ML

Cross-modality compound-gene matching is barely above chance with classical CellProfiler features. Better representation learning — such as applying MAE-trained ViTs to the 5-channel images — could meaningfully improve this rate, accelerating drug mechanism-of-action discovery. Even a small improvement reduces the experimental search space for biologists.

Performance metrics

mAP measures how well a retrieval system ranks relevant items above irrelevant ones. It rewards finding correct matches early in the ranked list.

Precision@k

Of the top-k results returned, what fraction are true matches? e.g. 3 correct out of 10 retrieved = precision@10 of 0.3.

Average precision

For a single query, take the mean of precision values at each rank where a true match appears. Early matches score higher. AP = 1.0 is perfect.

mAP

Mean of AP scores across all queries (all perturbations). Summarises retrieval quality across the whole dataset in a single number.

In CPJUMP1, each perturbation profile is a query. Profiles are ranked by cosine similarity. The relevant items are replicates (for replicate detection) or sister/matched perturbations (for the harder tasks). mAP near 1.0 means matches reliably appear at the top; mAP near 0 means the ranking is essentially random.
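These metrics are straightforward to compute. A minimal sketch over ranked boolean relevance lists (True marks a true match at that rank):

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k ranked items that are true matches."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of precision@k at each rank k where a true match appears.
    `relevant` is a ranked list of booleans (True = correct match)."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Average AP over all queries (all perturbations)."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Matches at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6 ≈ 0.833
ap = average_precision([True, False, True, False])
```

Note how the same two matches placed at ranks 3 and 4 instead would score far lower; this early-rank sensitivity is exactly why mAP suits retrieval benchmarks.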

The paper also reports fraction retrieved — the share of perturbations whose average precision passes a significance threshold (q < 0.05 after permutation testing and Benjamini–Hochberg correction) — as a more interpretable summary statistic.

Practical applications of CPJUMP1

Mechanism-of-action discovery

The most direct application is identifying what a compound is doing inside a cell. Suppose a compound shows promising activity in a phenotypic screen — it kills cancer cells, for example — but its molecular target is unknown. Conventional target identification relies on techniques like affinity chromatography or mass spectrometry proteomics, which require significant experimental setup and can take months to yield even a candidate.

A researcher using image-based profiling instead images cells treated with the compound, extracts a morphological profile (a numerical fingerprint of the cells' size, shape, texture, and staining intensity across all five channels), and then queries the CPJUMP1 database: which genetic perturbations produce the most similar cell appearance? If CRISPR knockout of gene X consistently produces cells that look like cells treated with the compound, that is evidence the compound may work by inhibiting the protein product of gene X. The result is a ranked shortlist of candidate targets in hours rather than months — not a confirmed answer, but a prioritised set of hypotheses to test experimentally.

With classical CellProfiler features, compound-gene matching rates are barely above chance; a ViT encoder pre-trained on cell images could produce richer embeddings that make this shortlist meaningfully more reliable. This has already been demonstrated: Kim et al. (2025) trained SSL models including DINO and MAE on the broader JUMP Cell Painting dataset (of which CPJUMP1 is the pilot) and benchmarked them against CPJUMP1's CellProfiler baseline — their best model surpassed CellProfiler by 16–29% in perturbation matching mAP and 22% in drug target classification.[8]
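The query described above reduces to ranking genetic perturbations by cosine similarity to the compound's profile. A sketch, where the profile vectors and gene names are purely illustrative, not real CPJUMP1 data:

```python
import numpy as np

def rank_candidate_targets(compound_profile, gene_profiles):
    """gene_profiles: {gene_name: genetic-perturbation profile vector}.
    Returns (gene, cosine similarity) pairs, most similar first: a
    shortlist of candidate targets for the compound."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(g, cosine(compound_profile, p))
              for g, p in gene_profiles.items()]
    return sorted(scored, key=lambda gv: gv[1], reverse=True)

# Hypothetical 4-dim profiles for illustration (real ones have
# hundreds or thousands of dimensions)
compound = np.array([1.0, 0.2, 0.0, 0.5])
genes = {
    "GENE_A": np.array([0.9, 0.1, 0.1, 0.4]),    # similar morphology
    "GENE_B": np.array([-1.0, 0.0, 0.5, -0.3]),  # dissimilar
}
shortlist = rank_candidate_targets(compound, genes)
```

Swapping CellProfiler features for ViT embeddings leaves this ranking machinery untouched; only the quality of the profile vectors changes.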

The reverse query is equally useful. Starting from a gene of interest — say, a newly validated disease target — a researcher asks which compounds in the library produce a morphology resembling CRISPR knockout of that gene. This is called virtual screening, and it can narrow a library of hundreds of compounds down to a handful worth testing experimentally, saving significant time and cost.

Example from CPJUMP1

The PLK1 inhibitor BI-2536 causes dramatic cell death. In CPJUMP1, CRISPR knockout of PLK1 produces a morphologically similar phenotype — a clean positive match. By contrast, the BRD4-specific inhibitor PFI-1 produces no clear phenotype in this assay, and BRD4 knockout likewise looks like a negative control. This illustrates both the promise of the approach (when it works, the match is visually striking) and a key complication: if a compound's annotated target doesn't actually drive its phenotype, the profile won't match the corresponding genetic perturbation.

Polypharmacology as a complication

Many compounds bind more than one protein target — a property known as polypharmacology. BI-2536 is a good example: it is annotated in the compound library as a BRD4 inhibitor, yet its dominant phenotype in cells is driven by PLK1 inhibition. When a compound hits multiple targets simultaneously, its morphological profile reflects a mixture of all those effects, making it difficult to match cleanly to any single gene's CRISPR profile. The CPJUMP1 dataset was explicitly designed to expose this problem, and the low cross-modality matching rates reported partly reflect how common polypharmacology is among drug-like compounds.

Broader applications

Beyond MoA discovery, Cell Painting profiles have been applied to predict compound toxicity, identify disease-relevant morphological signatures for patient stratification, and cluster genetically perturbed cells to map functional relationships between genes. A 2024 systematic review covering a decade of Cell Painting research documented more than 90 studies spanning drug discovery, toxicology, and functional genomics.[11] The current challenge — and the reason CPJUMP1 was created as a benchmark — is that existing classical CellProfiler-based methods leave most of this potential untapped. Improved representation learning, of the kind that MAE-trained vision transformers could provide, is widely seen as the most promising path forward.[4]

Glossary

Cosine similarity
A way of measuring how alike two vectors are, regardless of their length. It computes the cosine of the angle between them: 1 = identical direction (matching profiles), 0 = perpendicular (unrelated), −1 = opposite directions (anti-correlated). It is the standard metric used in CPJUMP1 to compare morphological profiles.
Contrastive learning
A family of self-supervised learning methods that train a model by comparing pairs of samples. Two different views of the same data point (e.g. two cropped or augmented versions of the same image) are treated as positives and their embeddings are pulled together in representation space; embeddings of different data points are pushed apart. The model learns which features are invariant across views — and therefore meaningful — without any labels. DINO, a prominent ViT-based example, uses a teacher-student variant called self-distillation rather than explicit positive/negative pairs, but produces embeddings with the same property: representations that cluster by semantic or biological similarity. Contrastive methods tend to produce more globally discriminative embeddings than masked prediction approaches, which matters for retrieval tasks like compound-gene matching.
Decoder
The complementary part that reconstructs the original input from the encoder's compressed representation. In MAE, its sole job during pre-training is to predict the masked patches — forcing the encoder to learn good representations. Once pre-training is complete the decoder is discarded, much like a scaffold removed after a building is complete.
Encoder
The part of a neural network that compresses an input (e.g. an image) into a compact representation — the "feature extractor." In the MAE framework, the encoder only sees visible (unmasked) patches. It is the component kept after pre-training and used for all downstream tasks.
Fine-tuning
Taking a pre-trained model and continuing to train it on a smaller, labeled dataset for a specific task. Fine-tuning is much cheaper than training from scratch because the model already understands the structure of the data; usually only the final layers need significant adjustment.
Image-based profiling & morphological profiling
Morphological profiling is the long-established practice of quantifying cell appearance — size, shape, texture, staining intensity — to compare biological states. Image-based profiling is the modern, high-throughput version: microscopy images of cells are processed by software (typically CellProfiler) to extract thousands of hand-engineered numerical measurements per cell, producing a profile that acts as a quantitative fingerprint of the sample's state. A pre-trained ViT encoder does the same thing differently — instead of hand-crafted measurements, it produces a dense learned embedding directly from pixels. Both approaches yield a numerical profile suitable for comparison; the key question CPJUMP1 is designed to answer is whether learned profiles can outperform classical ones at matching chemical and genetic perturbations.
Patch (in vision transformers)
A small, fixed-size square region of an image — typically 16×16 pixels. ViTs divide every image into a regular grid of patches and treat each as a single token, analogous to a word in a text model. The model learns relationships between patches across the whole image rather than processing pixels one at a time.
Pre-training
An initial training phase on a large, often unlabeled dataset to give the model general-purpose knowledge — in this case, learning to reconstruct masked image patches across millions of cell images. The resulting model is not yet specialised for any particular task; that comes later via fine-tuning.
Representation, embedding & embedding space
These terms are closely related and often used interchangeably. A representation refers to the abstract idea of encoding a data point as a vector of numbers that captures its meaningful structure — in this context, a compact numerical summary of a cell sample's morphology. An embedding is the concrete realisation of that idea: the specific vector a trained model produces for a given sample. The embedding space is the high-dimensional space in which all those vectors live. The distinction in emphasis is subtle: representation describes the learned mapping from raw data to numbers; embedding describes the output of applying that mapping to one sample. In practice, saying a model "learned a good representation" and "produces informative embeddings" mean the same thing. A good representation/embedding places biologically similar samples close together in the embedding space, making downstream comparisons straightforward. Visualisations like UMAP compress this high-dimensional space to two dimensions so clustering can be seen by eye.

References

Grouped by topic for easier navigation.

Primary paper
  1. Chandrasekaran, S.N., Cimini, B.A., Goodale, A., et al. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nature Methods 21, 1114–1121 (2024). https://doi.org/10.1038/s41592-024-02241-6
Core ML methods
  2. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF CVPR, 16000–16009 (2022). https://arxiv.org/abs/2111.06377
  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. An image is worth 16×16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR) (2021). https://arxiv.org/abs/2010.11929
  4. Moshkov, N., Bornholdt, M., Benoit, S., Smith, M., McQuin, C., Goodman, A., Senft, R.A., Han, Y., Babadi, M., Horvath, P., Cimini, B.A., Carpenter, A.E., Singh, S., & Caicedo, J.C. Learning representations for image-based profiling of perturbations. Nature Communications 15, 1594 (2024). https://doi.org/10.1038/s41467-024-45999-1
  8. Kim, V., Adaloglou, N., Osterland, M., Morelli, F.M., Halawa, M., König, T., Gnutt, D., & Marin Zapata, P.A. Self-supervision advances morphological profiling by unlocking powerful image representations. Scientific Reports 15, 4876 (2025). https://doi.org/10.1038/s41598-025-88825-4
Background & further reading
  9. Bray, M.-A., Singh, S., Han, H., et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature Protocols 11, 1757–1774 (2016). https://doi.org/10.1038/nprot.2016.105
  10. Chandrasekaran, S.N., Ceulemans, H., Boyd, J.D., & Carpenter, A.E. Image-based profiling for drug discovery: due for a machine-learning upgrade? Nature Reviews Drug Discovery 20, 145–159 (2021). https://doi.org/10.1038/s41573-020-00117-w
  11. Seal, S., Trapotsi, M.-A., Spjuth, O., Singh, S., Genheden, S., Greene, N., Engkvist, O., Bender, A., Carpenter, A.E., & Faber, J. Cell Painting: a decade of discovery and innovation in cellular imaging. Nature Methods (2024). https://doi.org/10.1038/s41592-024-02528-8