Three concepts that together describe a new approach to one of drug discovery's oldest problems: figuring out what a compound actually does inside a cell.
A summary of concepts covered: Cell Painting & CPJUMP1, performance metrics, better shortlists with SSL, self-supervised learning, masked autoencoders for ViT.
Every drug has a molecular story that describes which protein it binds, which pathway it disrupts, which cellular process it redirects. Figuring out that story is one of the hardest problems in drug discovery. A compound might kill cancer cells reliably in the lab, yet its actual target inside the cell remains unknown. Conventional approaches to answering that question — affinity chromatography, mass spectrometry proteomics — are slow, expensive, and require educated guesses about where to look.
Image-based profiling offers a different path. Cells treated with a compound are photographed under a fluorescence microscope, and software extracts thousands of measurements from each image — the size, shape, texture, and staining intensity of every cell across five fluorescent channels. The result is a numerical fingerprint of what the compound does to a cell. Compounds with similar mechanisms tend to produce similar fingerprints. So do genetic perturbations that hit the same target.
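As a toy sketch of the fingerprint idea (all data here is synthetic and the helper names are illustrative, not any real profiling pipeline): per-cell measurements are aggregated into a well-level profile, and profiles of different treatments are compared by cosine similarity.

```python
import numpy as np

def profile(per_cell_features):
    """Aggregate single-cell measurements (cells x features) into one profile."""
    return np.median(per_cell_features, axis=0)  # median is robust to outlier cells

def cosine(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

rng = np.random.default_rng(0)
mech_a = rng.normal(size=50)                        # latent phenotype of mechanism A
mech_b = rng.normal(size=50)                        # latent phenotype of mechanism B
well_1 = mech_a + 0.1 * rng.normal(size=(200, 50))  # compound 1, mechanism A
well_2 = mech_a + 0.1 * rng.normal(size=(200, 50))  # compound 2, same mechanism
well_3 = mech_b + 0.1 * rng.normal(size=(200, 50))  # compound 3, different mechanism

same = cosine(profile(well_1), profile(well_2))     # high: shared mechanism
diff = cosine(profile(well_1), profile(well_3))     # low: unrelated mechanism
```

Compounds sharing a mechanism end up with nearly identical fingerprints, while unrelated mechanisms land far apart, which is the whole premise the matching tasks below build on.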
Classical image profiling works well enough for the most obvious cases — detecting strong phenotypes, clustering compounds whose shared mechanism is written clearly in cell morphology. But most drug discovery problems aren't obvious. The real prize is cross-modality matching: determining whether a small molecule and a genetic perturbation, two completely different ways of interfering with a cell, produce similar enough morphologies to suggest they share a target. If they do, a compound's fingerprint becomes a key into the genome. CPJUMP1 was built specifically to enable and benchmark this harder task — pairing 303 compounds with 160 genes (each perturbed by both CRISPR knockout and ORF overexpression), with each gene's protein product a known target of at least two compounds in the dataset.
Initial applications of CPJUMP1 were humbling. With classical hand-engineered features, compound-gene matching rates were barely above chance. This will surprise no one familiar with the history of machine learning: hand-crafted features have a well-known ceiling. Telling software what to measure works up to a point; beyond that point, you need a model that can discover what to measure for itself.
That is precisely what self-supervised learning offers. Trained on millions of unlabeled cell images, a vision transformer learns to encode cell morphology into compact, discriminating representations — not by following rules written by a biologist or an engineer, but by learning the structure of the images directly. The question CPJUMP1 was designed to answer is whether those learned representations can crack cross-modality matching where classical features could not.
Chandrasekaran et al. (Nature Methods, 2024)[1] created CPJUMP1 as a benchmark dataset from the JUMP Cell Painting Consortium, specifically designed to test cross-modality matching at scale. It contains approximately 3 million images of cells, profiling 75 million single cells treated with matched chemical and genetic perturbations.
In the Cell Painting assay, cells are stained with six fluorescent dyes that are imaged across five channels.[9]
The dataset pairs 303 compounds with 160 genes; each gene's protein product is a known target of at least two compounds in the dataset.
The paper defines three progressively harder retrieval tasks, evaluated using mean average precision (mAP):
| Task | Question asked | Difficulty | Result |
|---|---|---|---|
| Replicate detection | Can a perturbation's profile retrieve its own replicates? | easy | ~75–95% mAP for compounds |
| Sister matching | Do two compounds targeting the same protein match each other? | medium | 5–25% retrieved |
| Compound-gene matching | Does compound X match the CRISPR/ORF profile for its target gene? | hard | ~3–8% — barely above chance |
CRISPR knockouts and ORF overexpressions targeting the same gene are slightly positively correlated — not anti-correlated, as simple logic would predict (a knockout reduces a gene's product and an overexpression increases it, so their effects should oppose). Biology is messier: many compounds activate rather than inhibit their targets, and overexpressed genes can have dominant-negative or compensatory effects.
The PLK1 inhibitor BI-2536 illustrates cross-modality matching at its best and most complicated. When queried against CRISPR knockouts, its profile matches PLK1 knockout cleanly — a visually striking positive result. But BI-2536 is also annotated as a BRD4 inhibitor, and against BRD4 knockout it fails entirely. For the ML researcher this is a label noise problem: the compound's dominant phenotype is driven by PLK1, a secondary target, not the annotated one. A model working from the BRD4 label will fail — but for the right reasons.
mAP measures how well a retrieval system ranks relevant items above irrelevant ones. It rewards finding correct matches early in the ranked list.
- **Precision@k:** of the top-k results returned, what fraction are true matches? e.g. 3 correct out of 10 retrieved = precision@10 of 0.3.
- **Average precision (AP):** for a single query, the mean of the precision values at each rank where a true match appears. Early matches score higher; AP = 1.0 is perfect.
- **Mean average precision (mAP):** the mean of the AP scores across all queries (all perturbations), summarising retrieval quality across the whole dataset in a single number.
In CPJUMP1, each perturbation profile is a query. Profiles are ranked by cosine similarity. The relevant items are replicates (for replicate detection) or sister/matched perturbations (for the harder tasks). mAP near 1.0 means matches reliably appear at the top; mAP near 0 means the ranking is essentially random.
The paper also reports fraction retrieved — the share of perturbations whose average precision passes a significance threshold (q < 0.05 after permutation testing and Benjamini–Hochberg correction) — as a more interpretable summary statistic.
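The path from profiles to mAP can be sketched in a few lines. This is a minimal illustration, not the paper's implementation (which adds permutation testing and Benjamini–Hochberg correction on top):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k at every rank k holding a true match."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    ranks = np.flatnonzero(rel) + 1            # 1-indexed ranks of the true matches
    hits = np.arange(1, len(ranks) + 1)        # cumulative match count at those ranks
    return float(np.mean(hits / ranks))        # precision@rank, averaged over hits

def mean_average_precision(profiles, labels):
    """Each profile is a query; items sharing its label (e.g. replicates) are relevant."""
    X = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = X @ X.T                              # cosine similarity between all profiles
    aps = []
    for i in range(len(X)):
        order = np.argsort(-sim[i])
        order = order[order != i]              # a query never retrieves itself
        aps.append(average_precision(labels[order] == labels[i]))
    return float(np.mean(aps))
```

For example, a query whose true matches sit at ranks 1 and 3 scores AP = (1/1 + 2/3) / 2 ≈ 0.83; averaging such scores over every perturbation gives the dataset-level mAP.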
The reliability of both queries — forward and reverse — depends entirely on the quality of the morphological profile used as the query. With classical CellProfiler features, compound-gene matching rates are barely above chance, meaning the shortlist of candidate targets is not yet reliable enough to meaningfully reduce experimental effort. Kim et al. (2025) demonstrate a concrete improvement: by training SSL models — including DINO and MAE — directly on the broader JUMP Cell Painting dataset, they produced embeddings that surpassed CellProfiler by 16–29% in perturbation matching and 22% in drug target classification.[8] For a biologist running the MoA query workflow, this translates directly into fewer dead ends: more correct candidate targets appear near the top of the ranked list, and fewer experiments are wasted chasing false leads. Notably, DINO achieved this without any fine-tuning on labeled data and generalised to cell images it had never seen during training — suggesting the improvement is robust and not specific to CPJUMP1's experimental conditions.
The case for self-supervised learning in cell biology rests on a familiar asymmetry: unlabeled images are essentially unlimited; labeled ones are scarce and expensive to produce. Self-supervised learning trains a model using unlabeled data by having the data itself generate the supervision signal. Instead of requiring human annotations, the model is given a task where part of the input is hidden, and it must predict the missing piece from the rest.
Given the sentence "The cat sat on the ___", a model must predict the masked word (mat) from context. No human label is needed — the act of masking creates an automatic training signal. This is the core idea behind large language models.
Labeled data is expensive and slow to collect. Unlabeled data — text, images, audio — is essentially unlimited. Models trained this way learn rich, general-purpose representations that transfer well to downstream tasks.
This works through a two-phase structure that is fundamental to modern machine learning. In the first phase, the model is pre-trained on a large unlabeled dataset using a self-supervised task — such as predicting missing parts of the input. Model components used solely for pre-training are then discarded. In the second phase, the remaining encoder is fine-tuned on a much smaller labeled dataset for a specific task. Because the encoder has already learned to understand the structure of the data, far less labeled data is needed than if you were training from scratch. This is precisely why self-supervised learning is so valuable for domains like cell biology, where labeled data is scarce and expensive to produce.
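The two-phase structure can be sketched end to end on toy data. A linear "encoder" and "decoder" stand in for the deep networks a real system would use; everything below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 512, 16, 4
X = rng.normal(size=(1, d)) + 0.3 * rng.normal(size=(n, d))  # unlabeled data

# Phase 1: self-supervised pre-training by mask-and-reconstruct.
W_enc = 0.1 * rng.normal(size=(d, h))  # encoder weights (kept after phase 1)
W_dec = 0.1 * rng.normal(size=(h, d))  # decoder weights (discarded afterwards)
lr, losses = 0.5, []
for _ in range(300):
    mask = rng.random(size=X.shape) < 0.75     # hide 75% of every input
    X_vis = X * ~mask                          # the model sees only what's visible
    Z = X_vis @ W_enc                          # encode the visible content
    X_hat = Z @ W_dec                          # reconstruct the full input
    err = (X_hat - X) * mask                   # score only the hidden entries
    losses.append(float((err ** 2).sum() / mask.sum()))
    G = 2 * err / mask.sum()                   # gradient of the masked MSE
    g_dec = Z.T @ G
    g_enc = X_vis.T @ (G @ W_dec.T)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# Phase 2: the decoder is thrown away; the encoder's outputs are the compact
# representations a small labeled dataset can fine-tune for a downstream task.
embeddings = X @ W_enc
```

The reconstruction loss falls as the encoder learns the data's structure, and only the encoder survives into phase 2, exactly mirroring the pre-train-then-fine-tune recipe described above.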
There are two major approaches. The first — covered in detail in the next section — is masked prediction: hide part of the input and train the model to reconstruct it. The second is contrastive learning: rather than predicting missing content, the model learns by comparing pairs of samples. Two different views of the same data point — for example, two differently cropped or augmented versions of the same image — are treated as positives and their embeddings are pulled together in the representation space; embeddings of different data points are pushed apart. The model learns which features are consistent across views and therefore meaningful, without any labels.
DINO, a prominent ViT-based example, refines this idea using a teacher-student approach called self-distillation rather than explicit positive/negative pairs — but produces embeddings with the same desirable property: representations that cluster meaningfully by biological similarity. In recent benchmarks on cell microscopy data, DINO outperformed both classical hand-engineered features and masked prediction approaches on retrieval tasks, suggesting that contrastive-style training may be particularly well suited to the compound-gene matching problem — a point demonstrated concretely by Kim et al. (2025) in the better shortlists section above.
Vision Transformers (ViTs)[3] divide an image into a grid of fixed-size patches. Masked Autoencoders (MAE, He et al. 2022)[2] apply self-supervised learning directly to this patch structure.
Step 3 shows all visible tokens entering at the top together and flowing down through the encoder's transformer layers to the decoder; each stage spans the full width of the token set, reflecting that every token attends to every other token at each layer.
The key insight is that masking 75% of patches is deliberately aggressive. Unlike language, where neighboring words give strong hints, a missing image patch cannot easily be inferred from its immediate neighbors alone — the encoder must build a global understanding to fill in the gaps.
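The patch-and-mask step can be sketched directly. The sizes below are illustrative assumptions chosen to match a typical ViT setup (224×224 input, 16×16 patches) with five channels as in Cell Painting:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened fixed-size patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    grid = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def random_mask(n_patches, mask_ratio=0.75, seed=None):
    """MAE-style masking: shuffle patch indices, keep the first 25% visible."""
    perm = np.random.default_rng(seed).permutation(n_patches)
    n_keep = int(n_patches * (1 - mask_ratio))
    return perm[:n_keep], perm[n_keep:]          # (visible, masked) indices

img = np.zeros((224, 224, 5))                    # e.g. a five-channel Cell Painting crop
tokens = patchify(img)                           # 14 x 14 grid -> 196 patch tokens
visible, masked = random_mask(len(tokens), seed=0)
# Only the 49 visible tokens enter the encoder; the lightweight decoder later
# reconstructs pixel values for the 147 masked ones.
```

Because the encoder processes only a quarter of the tokens, pre-training is also substantially cheaper than running the full image through the network.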
The decoder (used only during training) is discarded. The ViT encoder has learned compact, semantically rich representations and can be fine-tuned on any downstream task — classification, detection, segmentation, or compound-gene matching — with very little labeled data.
Grouped by topic for easier navigation.