Discussion notes · Machine learning & cell biology

Drug Discovery, CPJUMP1,
and Self-Supervised Learning

Three concepts that together describe a new approach to one of drug discovery's oldest problems: figuring out what a compound actually does inside a cell.

A summary of concepts covered: Cell Painting & CPJUMP1, performance metrics, better shortlists with SSL, self-supervised learning, masked autoencoders for ViT.

Every drug has a molecular story that describes which protein it binds, which pathway it disrupts, which cellular process it redirects. Figuring out that story is one of the hardest problems in drug discovery. A compound might kill cancer cells reliably in the lab, yet its actual target inside the cell remains unknown. Conventional approaches to answering that question — affinity chromatography, mass spectrometry proteomics — are slow, expensive, and require educated guesses about where to look.

Image-based profiling offers a different path. Cells treated with a compound are photographed under a fluorescence microscope, and software extracts thousands of measurements from each image — the size, shape, texture, and staining intensity of every cell across five fluorescent channels. The result is a numerical fingerprint of what the compound does to a cell. Compounds with similar mechanisms tend to produce similar fingerprints. So do genetic perturbations that hit the same target.

Classical image profiling works well enough for the most obvious cases — detecting strong phenotypes, clustering compounds whose shared mechanism is written clearly in cell morphology. But most drug discovery problems aren't obvious. The real prize is cross-modality matching: determining whether a small molecule and a genetic perturbation, two completely different ways of interfering with a cell, produce similar enough morphologies to suggest they share a target. If they do, a compound's fingerprint becomes a key into the genome. CPJUMP1 was built specifically to enable and benchmark this harder task — pairing 303 compounds with 160 genes (each perturbed by both CRISPR knockout and ORF overexpression), with each gene's protein product a known target of at least two compounds in the dataset.

Initial applications of CPJUMP1 were humbling. With classical hand-engineered features, compound-gene matching rates are barely above chance. This will surprise no one familiar with the history of machine learning: hand-crafted features have a well-known ceiling. Telling software what to measure works up to a point; beyond that point, you need a model that can discover what to measure for itself.

That is precisely what self-supervised learning offers. Trained on millions of unlabeled cell images, a vision transformer learns to encode cell morphology into compact, discriminating representations — not by following rules written by a biologist or an engineer, but by learning the structure of the images directly. The question CPJUMP1 was designed to answer is whether those learned representations can crack cross-modality matching where classical features could not.

CPJUMP1 — the Cell Painting dataset

Chandrasekaran et al. (Nature Methods, 2024)[1] created CPJUMP1 as a benchmark dataset from the JUMP Cell Painting Consortium, specifically designed to test cross-modality matching at scale. It contains approximately 3 million images, yielding morphological profiles of roughly 75 million single cells treated with matched chemical and genetic perturbations.

The Cell Painting assay

In the Cell Painting assay, cells are stained with six fluorescent dyes that are imaged across five channels:[6]

Mito (mitochondria)
AGP (actin, Golgi, plasma membrane)
RNA (nucleoli & cytoplasmic RNA)
ER (endoplasmic reticulum)
DNA (nucleus)

Three perturbation modalities

The dataset pairs 303 compounds with 160 genes; each gene's protein product is a known target of at least two compounds in the dataset:

Compound
Small molecule drugs (from the Drug Repurposing Hub) applied at 5 µM. Tend to produce the strongest, most distinguishable morphological phenotypes.
CRISPR
CRISPR-Cas9 knockout reduces the amount of a gene's protein product; two guide RNAs per gene allow within-modality matching.
ORF
Open reading frame overexpression — increases the amount of a gene's protein product. One reagent per gene; produces the weakest/noisiest signal.

Benchmark retrieval tasks

The paper defines three progressively harder retrieval tasks, evaluated using mean average precision (mAP):

Task · Question asked · Difficulty · Result
Replicate detection · Can a perturbation's profile retrieve its own replicates? · easy · ~75–95% mAP for compounds
Sister matching · Do two compounds targeting the same protein match each other? · medium · 5–25% retrieved
Compound-gene matching · Does compound X match the CRISPR/ORF profile for its target gene? · hard · ~3–8%, barely above chance

Surprising finding

CRISPR knockouts and ORF overexpressions targeting the same gene are slightly positively correlated, not anti-correlated as simple logic would predict (reduce ↔ increase should oppose). Biology is messier: overexpressed genes can have dominant-negative or compensatory effects, and on the compound side, many drugs activate rather than inhibit their targets.

Polypharmacology case study: the good and the bad

The PLK1 inhibitor BI-2536 illustrates cross-modality matching at its best and most complicated. When queried against CRISPR knockouts, its profile matches PLK1 knockout cleanly, a visually striking positive result. But BI-2536 is also annotated as a BRD4 inhibitor, and against BRD4 knockout it fails entirely. For the ML researcher this is a label noise problem: the compound's dominant phenotype is driven by its primary target, PLK1, not by BRD4, the annotation under test. A model working from the BRD4 label will fail, but for the right reasons.

Performance metrics

mAP measures how well a retrieval system ranks relevant items above irrelevant ones. It rewards finding correct matches early in the ranked list.

Precision@k

Of the top-k results returned, what fraction are true matches? e.g. 3 correct out of 10 retrieved = precision@10 of 0.3.

Average precision

For a single query, take the mean of precision values at each rank where a true match appears. Early matches score higher. AP = 1.0 is perfect.

mAP

Mean of AP scores across all queries (all perturbations). Summarises retrieval quality across the whole dataset in a single number.

In CPJUMP1, each perturbation profile is a query. Profiles are ranked by cosine similarity. The relevant items are replicates (for replicate detection) or sister/matched perturbations (for the harder tasks). mAP near 1.0 means matches reliably appear at the top; mAP near 0 means the ranking is essentially random.
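To make the metric concrete, here is a minimal NumPy sketch of the pipeline just described: unit-normalise the profiles, rank every other profile by cosine similarity, and score each query by average precision. The toy data, shapes, and function names are illustrative rather than taken from the paper's codebase.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k at each rank k where a true match appears."""
    hits, precisions = 0, []
    for k, is_match in enumerate(ranked_relevance, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(profiles, labels):
    """mAP: every profile queries the rest, ranked by cosine similarity."""
    X = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)  # unit-normalise rows
    sims = X @ X.T                                                  # pairwise cosine similarities
    aps = []
    for i in range(len(X)):
        order = np.argsort(-sims[i])               # most similar first
        order = order[order != i]                  # a query never retrieves itself
        aps.append(average_precision(labels[order] == labels[i]))
    return float(np.mean(aps))

# Toy data: three perturbations, two replicate profiles each
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
centers = rng.normal(size=(3, 128))                        # one direction per perturbation
profiles = 3 * centers[labels] + rng.normal(size=(6, 128)) # replicates cluster around it
print(f"mAP: {mean_average_precision(profiles, labels):.3f}")  # near 1.0 for this easy toy case
```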

The paper also reports fraction retrieved — the share of perturbations whose average precision passes a significance threshold (q < 0.05 after permutation testing and Benjamini–Hochberg correction) — as a more interpretable summary statistic.
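The same sketch extends to fraction retrieved. The snippet below shows one plausible implementation of the permutation null and the Benjamini–Hochberg adjustment; the q < 0.05 threshold follows the paper, but the function names and toy numbers are illustrative.

```python
import numpy as np

def average_precision(relevance):
    """Vectorised AP: mean of precision@k at each rank where a true match appears."""
    relevance = np.asarray(relevance, dtype=bool)
    hits = np.cumsum(relevance)
    ranks = np.arange(1, len(relevance) + 1)
    return float((hits[relevance] / ranks[relevance]).mean()) if relevance.any() else 0.0

def permutation_pvalue(observed_ap, n_ranked, n_relevant, n_perm=1000, seed=0):
    """How often does a randomly shuffled ranking score an AP at least this high?"""
    rng = np.random.default_rng(seed)
    relevance = np.zeros(n_ranked, dtype=bool)
    relevance[:n_relevant] = True
    null = np.array([average_precision(rng.permutation(relevance)) for _ in range(n_perm)])
    return (1 + (null >= observed_ap).sum()) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """BH-adjusted q-values across all queries."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    q = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotone adjusted values
    out = np.empty(n)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

# Five queries' observed APs; each ranked list is 100 long with 3 true matches
aps = [0.90, 0.45, 0.12, 0.08, 0.05]
qvals = benjamini_hochberg([permutation_pvalue(ap, 100, 3) for ap in aps])
print("fraction retrieved:", float((qvals < 0.05).mean()))
```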

Better shortlists with self-supervised representations

The reliability of both queries, forward and reverse, depends entirely on the quality of the morphological profile used as the query. With classical CellProfiler features, compound-gene matching rates are barely above chance, meaning the shortlist of candidate targets is not yet reliable enough to meaningfully reduce experimental effort. Kim et al. (2025) demonstrate a concrete improvement: by training SSL models, including DINO and MAE, directly on the broader JUMP Cell Painting dataset, they produced embeddings that surpassed CellProfiler by 16–29% in perturbation matching and 22% in drug target classification.[5] For a biologist running the MoA query workflow, this translates directly into fewer dead ends: more correct candidate targets appear near the top of the ranked list, and fewer experiments are wasted chasing false leads. Notably, DINO achieved this without any fine-tuning on labeled data and generalised to cell images it had never seen during training, suggesting the improvement is robust and not specific to CPJUMP1's experimental conditions.

Self-supervised learning

The case for self-supervised learning in cell biology rests on a familiar asymmetry: unlabeled images are essentially unlimited; labeled ones are scarce and expensive to produce. Self-supervised learning trains a model using unlabeled data by having the data itself generate the supervision signal. Instead of requiring human annotations, the model is given a task where part of the input is hidden, and it must predict the missing piece from the rest.

Example — language

Given the sentence "The cat sat on the ___", a model must predict the masked word (mat) from context. No human label is needed — the act of masking creates an automatic training signal. This is the core idea behind large language models.

Why it matters

Labeled data is expensive and slow to collect. Unlabeled data — text, images, audio — is essentially unlimited. Models trained this way learn rich, general-purpose representations that transfer well to downstream tasks.

This works through a two-phase structure that is fundamental to modern machine learning. In the first phase, the model is pre-trained on a large unlabeled dataset using a self-supervised task — such as predicting missing parts of the input. Model components used solely for pre-training are then discarded. In the second phase, the remaining encoder is fine-tuned on a much smaller labeled dataset for a specific task. Because the encoder has already learned to understand the structure of the data, far less labeled data is needed than if you were training from scratch. This is precisely why self-supervised learning is so valuable for domains like cell biology, where labeled data is scarce and expensive to produce.
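The skeleton below sketches this two-phase structure in PyTorch, with random tensors standing in for the real datasets; the tiny MLP, the pixel-masking pretext task, and all dimensions are placeholders rather than any specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

unlabeled = [torch.rand(32, 64 * 64) for _ in range(50)]    # "unlimited" unlabeled images
labeled = [(torch.rand(32, 64 * 64), torch.randint(0, 10, (32,))) for _ in range(5)]

# Phase 1: pre-train encoder + decoder with a masked-reconstruction pretext task
encoder = nn.Sequential(nn.Linear(64 * 64, 256), nn.ReLU(), nn.Linear(256, 128))
decoder = nn.Linear(128, 64 * 64)                           # exists only for pre-training
opt = torch.optim.AdamW([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for x in unlabeled:
    masked = x * (torch.rand_like(x) > 0.5)                 # hide about half of each input
    loss = F.mse_loss(decoder(encoder(masked)), x)          # predict the hidden part from the rest
    opt.zero_grad(); loss.backward(); opt.step()
del decoder                                                 # discard the pre-training-only component

# Phase 2: fine-tune the pre-trained encoder plus a small task head on scarce labels
head = nn.Linear(128, 10)
opt = torch.optim.AdamW([*encoder.parameters(), *head.parameters()], lr=1e-4)
for x, y in labeled:
    loss = F.cross_entropy(head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```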

Two families of self-supervised learning

There are two major approaches. The first — covered in detail in the next section — is masked prediction: hide part of the input and train the model to reconstruct it. The second is contrastive learning: rather than predicting missing content, the model learns by comparing pairs of samples. Two different views of the same data point — for example, two differently cropped or augmented versions of the same image — are treated as positives and their embeddings are pulled together in the representation space; embeddings of different data points are pushed apart. The model learns which features are consistent across views and therefore meaningful, without any labels.
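In code, the contrastive objective can be as small as the sketch below, written in the style of the InfoNCE loss used by methods such as SimCLR; real implementations add learned augmentations, projection heads, and a full similarity matrix over both views, all omitted here.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Row i of z1 and row i of z2 are two views of the same sample (positives);
    every other row in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # pairwise cosine similarities
    targets = torch.arange(len(z1))           # the positives sit on the diagonal
    return F.cross_entropy(logits, targets)   # pull positives together, push negatives apart

# Stand-ins for "two augmented views of the same batch, passed through an encoder"
batch = torch.randn(16, 128)
view1 = batch + 0.1 * torch.randn_like(batch)
view2 = batch + 0.1 * torch.randn_like(batch)
print(info_nce_loss(view1, view2))            # low loss: each row's best match is its twin
```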

DINO, a prominent ViT-based example, refines this idea using a teacher-student approach called self-distillation rather than explicit positive/negative pairs — but produces embeddings with the same desirable property: representations that cluster meaningfully by biological similarity. In recent benchmarks on cell microscopy data, DINO outperformed both classical hand-engineered features and masked prediction approaches on retrieval tasks, suggesting that contrastive-style training may be particularly well suited to the compound-gene matching problem — a point demonstrated concretely by Kim et al. (2025) in the better shortlists section above.
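For a flavour of the self-distillation mechanic, here is a heavily simplified sketch: the student is trained to match the teacher's softened output on a different view, and the teacher's weights track an exponential moving average of the student's. Real DINO adds projection heads, output centering, and multi-crop augmentation, none of which appear below.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(128, 64)
teacher = copy.deepcopy(student)              # the teacher is never trained by backprop
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
momentum = 0.996                              # EMA coefficient for the teacher update

for _ in range(100):
    x = torch.randn(32, 128)
    view1 = x + 0.1 * torch.randn_like(x)     # two augmented views of the same batch
    view2 = x + 0.1 * torch.randn_like(x)
    # Student (sharper temperature) must match the teacher's output on the other view
    loss = F.cross_entropy(student(view1) / 0.1,
                           F.softmax(teacher(view2) / 0.04, dim=1))
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                     # teacher follows an EMA of the student
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_((1 - momentum) * s)
```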

Masked autoencoders for ViT

Vision Transformers (ViTs)[3] divide an image into a grid of fixed-size patches. Masked Autoencoders (MAE, He et al. 2022)[2] apply self-supervised learning directly to this patch structure.

[Figure: MAE pre-training pipeline in four steps. 1. Original image; 2. patch and mask 75%; 3. ViT encoder processes the visible tokens; 4. decoder reconstructs the pixels, with the reconstruction loss comparing predicted pixels against the original masked patches. Within each stage every token attends to every other token, so tokens are processed collectively rather than individually. After pre-training the decoder is discarded; the encoder has learned rich visual representations and can be fine-tuned on any downstream task (classification, detection, segmentation).]

The key insight is that masking 75% of patches is deliberately aggressive. Unlike language, where each word carries dense meaning, images are spatially redundant: if only a few patches were hidden, they could be filled in from their immediate neighbors without any real understanding. Removing most of the image eliminates that shortcut, so the encoder must build a global understanding to fill in the gaps.
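A single forward pass of this recipe fits in a few lines. Everything below is a toy stand-in (MLPs instead of transformer blocks, no positional embeddings, invented dimensions; patch_dim assumes 16×16 patches across Cell Painting's five channels), but the three load-bearing choices follow He et al.: the encoder sees only visible patches, masked positions are filled with a shared learnable token before decoding, and the loss is computed on masked patches only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_patches, patch_dim, mask_ratio = 196, 16 * 16 * 5, 0.75   # 14x14 grid, five channels
encoder = nn.Sequential(nn.Linear(patch_dim, 256), nn.ReLU(), nn.Linear(256, 256))
decoder = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, patch_dim))
mask_token = nn.Parameter(torch.zeros(1, 256))              # shared placeholder for masked slots

patches = torch.rand(n_patches, patch_dim)                  # one image as a grid of patch vectors
perm = torch.randperm(n_patches)
n_keep = int(n_patches * (1 - mask_ratio))
visible_idx, masked_idx = perm[:n_keep], perm[n_keep:]

latent = encoder(patches[visible_idx])                      # encoder sees visible patches only

tokens = torch.zeros(n_patches, 256)                        # reassemble the full token grid
tokens[visible_idx] = latent
tokens[masked_idx] = mask_token                             # broadcast into every masked slot

recon = decoder(tokens)                                     # predict pixels for every position
loss = F.mse_loss(recon[masked_idx], patches[masked_idx])   # scored on masked patches only
```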

After pre-training

The decoder (used only during training) is discarded. The ViT encoder has learned compact, semantically rich representations and can be fine-tuned on any downstream task — classification, detection, segmentation, or compound-gene matching — with very little labeled data.

Glossary

Cosine similarity
A way of measuring how alike two vectors are, regardless of their length. It computes the cosine of the angle between them: 1 = identical direction (matching profiles), 0 = perpendicular (unrelated), −1 = opposite directions (anti-correlated). It is the standard metric used in CPJUMP1 to compare morphological profiles.
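A one-function illustration, assuming plain NumPy vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = a.b / (|a| |b|): compares direction only, never length
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))   #  1.0: same direction, different magnitude
print(cosine_similarity(a, -a))      # -1.0: anti-correlated profiles
```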
Cross-modality matching
The task of determining whether a chemical perturbation (a small molecule compound) and a genetic perturbation (a CRISPR knockout or gene overexpression) produce sufficiently similar morphological profiles to suggest they share a mechanism — that is, that the compound works by targeting the protein the gene encodes. "Cross-modality" reflects the fact that these are fundamentally different ways of interfering with a cell, yet both leave a fingerprint in the Cell Painting assay. It is the hardest retrieval task in CPJUMP1, and the one where classical CellProfiler features fall furthest short.
Contrastive learning
A family of self-supervised learning methods that train a model by comparing pairs of samples. Two different views of the same data point (e.g. two cropped or augmented versions of the same image) are treated as positives and their embeddings are pulled together in representation space; embeddings of different data points are pushed apart. The model learns which features are invariant across views — and therefore meaningful — without any labels. DINO, a prominent ViT-based example, uses a teacher-student variant called self-distillation rather than explicit positive/negative pairs, but produces embeddings with the same property: representations that cluster by semantic or biological similarity. Contrastive methods tend to produce more globally discriminative embeddings than masked prediction approaches, which matters for retrieval tasks like compound-gene matching.
Decoder
The complementary part that reconstructs the original input from the encoder's compressed representation. In MAE, its sole job during pre-training is to predict the masked patches — forcing the encoder to learn good representations. Once pre-training is complete the decoder is discarded, much like a scaffold removed after a building is complete.
Encoder
The part of a neural network that compresses an input (e.g. an image) into a compact representation — the "feature extractor." In the MAE framework, the encoder only sees visible (unmasked) patches. It is the component kept after pre-training and used for all downstream tasks.
Fine-tuning
Taking a pre-trained model and continuing to train it on a smaller, labeled dataset for a specific task. Fine-tuning is much cheaper than training from scratch because the model already understands the structure of the data; usually only the final layers need significant adjustment.
Image-based profiling & morphological profiling
Morphological profiling is the long-established practice of quantifying cell appearance — size, shape, texture, staining intensity — to compare biological states. Image-based profiling is the modern, high-throughput version: microscopy images of cells are processed by software (typically CellProfiler) to extract thousands of hand-engineered numerical measurements per cell, producing a profile that acts as a quantitative fingerprint of the sample's state. A pre-trained ViT encoder does the same thing differently — instead of hand-crafted measurements, it produces a dense learned embedding directly from pixels. Both approaches yield a numerical profile suitable for comparison; the key question CPJUMP1 is designed to answer is whether learned profiles can outperform classical ones at matching chemical and genetic perturbations.
Patch (in vision transformers)
A small, fixed-size square region of an image — typically 16×16 pixels. ViTs divide every image into a regular grid of patches and treat each as a single token, analogous to a word in a text model. The model learns relationships between patches across the whole image rather than processing pixels one at a time.
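A quick NumPy illustration of the layout, assuming a 224×224 RGB image and 16×16 patches:

```python
import numpy as np

image = np.random.rand(224, 224, 3)                 # H x W x channels
P = 16
patches = (image.reshape(224 // P, P, 224 // P, P, 3)
                .transpose(0, 2, 1, 3, 4)           # group the 14x14 grid of patches
                .reshape(-1, P * P * 3))            # flatten each patch into one token
print(patches.shape)                                # (196, 768): 196 tokens of 768 values
```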
Pre-training
An initial training phase on a large, often unlabeled dataset to give the model general-purpose knowledge — in this case, learning to reconstruct masked image patches across millions of cell images. The resulting model is not yet specialised for any particular task; that comes later via fine-tuning.
Representation, embedding & embedding space
These terms are closely related and often used interchangeably. A representation refers to the abstract idea of encoding a data point as a vector of numbers that captures its meaningful structure — in this context, a compact numerical summary of a cell sample's morphology. An embedding is the concrete realisation of that idea: the specific vector a trained model produces for a given sample. The embedding space is the high-dimensional space in which all those vectors live. The distinction in emphasis is subtle: representation describes the learned mapping from raw data to numbers; embedding describes the output of applying that mapping to one sample. In practice, saying a model "learned a good representation" and "produces informative embeddings" mean the same thing. A good representation/embedding places biologically similar samples close together in the embedding space, making downstream comparisons straightforward. Visualisations like UMAP compress this high-dimensional space to two dimensions so clustering can be seen by eye.

References

Grouped by topic for easier navigation.

Primary paper
  1. Chandrasekaran, S.N., Cimini, B.A., Goodale, A., et al. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nature Methods 21, 1114–1121 (2024). https://doi.org/10.1038/s41592-024-02241-6
Core ML methods
  2. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF CVPR, 16000–16009 (2022). https://arxiv.org/abs/2111.06377
  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. An image is worth 16×16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR) (2021). https://arxiv.org/abs/2010.11929
  4. Moshkov, N., Bornholdt, M., Benoit, S., Smith, M., McQuin, C., Goodman, A., Senft, R.A., Han, Y., Babadi, M., Horvath, P., Cimini, B.A., Carpenter, A.E., Singh, S., & Caicedo, J.C. Learning representations for image-based profiling of perturbations. Nature Communications 15, 1594 (2024). https://doi.org/10.1038/s41467-024-45999-1
  5. Kim, V., Adaloglou, N., Osterland, M., Morelli, F.M., Halawa, M., König, T., Gnutt, D., & Marin Zapata, P.A. Self-supervision advances morphological profiling by unlocking powerful image representations. Scientific Reports 15, 4876 (2025). https://doi.org/10.1038/s41598-025-88825-4
Background & further reading
  6. Bray, M.-A., Singh, S., Han, H., et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature Protocols 11, 1757–1774 (2016). https://doi.org/10.1038/nprot.2016.105
  7. Chandrasekaran, S.N., Ceulemans, H., Boyd, J.D., & Carpenter, A.E. Image-based profiling for drug discovery: due for a machine-learning upgrade? Nature Reviews Drug Discovery 20, 145–159 (2021). https://doi.org/10.1038/s41573-020-00117-w
  8. Seal, S., Trapotsi, M.-A., Spjuth, O., Singh, S., Genheden, S., Greene, N., Engkvist, O., Bender, A., Carpenter, A.E., & Faber, J. Cell Painting: a decade of discovery and innovation in cellular imaging. Nature Methods (2024). https://doi.org/10.1038/s41592-024-02528-8