A summary of concepts covered: self-supervised learning, masked autoencoders for vision transformers, the CPJUMP1 dataset, performance metrics, practical applications.
Self-supervised learning trains a model using unlabeled data by having the data itself generate the supervision signal. Instead of requiring human annotations, the model is given a task where part of the input is hidden, and it must predict the missing piece from the rest.
Given the sentence "The cat sat on the ___", a model must predict the masked word (mat) from context. No human label is needed — the act of masking creates an automatic training signal. This is the core idea behind large language models.
Labeled data is expensive and slow to collect. Unlabeled data — text, images, audio — is essentially unlimited. Models trained this way learn rich, general-purpose representations that transfer well to downstream tasks.
This works through a two-phase structure that is fundamental to modern machine learning. In the first phase, the model is pre-trained on a large unlabeled dataset using a self-supervised task — such as predicting missing parts of the input. Model components used solely for pre-training are then discarded. In the second phase, the remaining encoder is fine-tuned on a much smaller labeled dataset for a specific task. Because the encoder has already learned to understand the structure of the data, far less labeled data is needed than if you were training from scratch. This is precisely why self-supervised learning is so valuable for domains like cell biology, where labeled data is scarce and expensive to produce.
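The two-phase structure can be sketched end to end. In this toy numpy example, a PCA autoencoder stands in for the self-supervised pre-training task, and the data is synthetic with hidden low-dimensional structure; every name and dimension here is illustrative, not the pipeline of any paper discussed below:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 4                                  # input dim, latent dim
basis = rng.normal(size=(k, d))               # hidden low-dimensional structure
w_true = rng.normal(size=k)

# The SSL setting: plenty of unlabeled data, very few labels.
X_unlab = rng.normal(size=(2000, k)) @ basis
Z_lab, Z_test = rng.normal(size=(20, k)), rng.normal(size=(200, k))
X_lab, y_lab = Z_lab @ basis, Z_lab @ w_true
X_test, y_test = Z_test @ basis, Z_test @ w_true

# Phase 1: self-supervised pre-training on the unlabeled pool. A PCA
# autoencoder stands in for MAE-style training: the reconstruction
# "decoder" is discarded afterwards; only the encoder is kept.
_, _, Vt = np.linalg.svd(X_unlab, full_matrices=False)
encoder = Vt[:k].T                            # (d, k): the part we keep

# Phase 2: fine-tune a small head on frozen features of the labeled set.
feats = X_lab @ encoder
head, *_ = np.linalg.lstsq(feats, y_lab, rcond=None)

mse = np.mean(((X_test @ encoder) @ head - y_test) ** 2)
print(f"test MSE with pre-trained encoder: {mse:.2e}")
```

Twenty labeled examples are far too few to fit a 32-dimensional model from scratch, but they are plenty to fit a 4-parameter head on top of an encoder that already captures the data's structure.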
There are two major approaches. The first — covered in detail in the next section — is masked prediction: hide part of the input and train the model to reconstruct it. The second is contrastive learning: rather than predicting missing content, the model learns by comparing pairs of samples. Two different views of the same data point — for example, two differently cropped or augmented versions of the same image — are treated as positives and their embeddings are pulled together in the representation space; embeddings of different data points are pushed apart. The model learns which features are consistent across views and therefore meaningful, without any labels.
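The pull-together/push-apart objective is commonly implemented as the InfoNCE loss. A minimal numpy sketch, with a batch of paired views and illustrative shapes:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE contrastive loss: row i of z_a and row i of z_b are two
    views of the same sample (positives); all other pairings are negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / tau                    # temperature-scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                 # cross-entropy, positives on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # views agree
loss_random = info_nce(z, rng.normal(size=z.shape))              # views unrelated
print(loss_aligned, loss_random)
```

When the two views of each sample embed close together, the positive sits at the top of each row's softmax and the loss is low; unrelated views give a near-random ranking and a much higher loss.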
DINO, a prominent ViT-based example, refines this idea using a teacher-student approach called self-distillation rather than explicit positive/negative pairs — but produces embeddings with the same desirable property: representations that cluster meaningfully by biological similarity. In recent benchmarks on cell microscopy data, DINO outperformed both classical hand-engineered features and masked prediction approaches on retrieval tasks, suggesting that contrastive-style training may be particularly well suited to the compound-gene matching problem — a point we return to in the practical applications section.
Vision Transformers (ViTs)[3] divide an image into a grid of fixed-size patches. Masked Autoencoders (MAE, He et al. 2022)[2] apply self-supervised learning directly to this patch structure.
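The patch-grid step can be written in a few lines of numpy; the 224×224 size, 16-pixel patches, and 5 channels are illustrative (5 channels echoing the Cell Painting images discussed later):

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into a sequence of flattened p x p patches,
    the token sequence a ViT embeds. H and W must be divisible by p."""
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)     # group the two grid axes first
    return patches.reshape(-1, p * p * c)          # (num_patches, patch_dim)

# A 224x224 image with 5 channels (sizes illustrative).
img = np.zeros((224, 224, 5))
tokens = patchify(img)
print(tokens.shape)   # (196, 1280): a 14x14 grid of 16x16x5 patches
```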
Step 3 shows all visible tokens entering the transformer encoder together; mask tokens are then appended, and the combined sequence flows through the lightweight decoder. Each stage spans the full width of its token sequence, reflecting that every token attends to every other token at each layer.
The key insight is that masking 75% of patches is deliberately aggressive. Unlike language, where neighboring words give strong hints, a missing image patch cannot easily be inferred from its immediate neighbors alone — the encoder must build a global understanding to fill in the gaps.
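The aggressive masking itself is simple: sample a random permutation of patch indices and keep only the first 25%. A sketch following the MAE recipe, with illustrative shapes:

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random (1 - mask_ratio) fraction of
    patch tokens. Only the kept tokens are passed to the encoder; the rest
    must be reconstructed by the decoder."""
    rng = rng if rng is not None else np.random.default_rng()
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    ids_keep, ids_mask = perm[:n_keep], perm[n_keep:]
    return tokens[ids_keep], ids_keep, ids_mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 1280))              # one image's patch tokens
visible, ids_keep, ids_mask = random_mask(tokens, rng=rng)
print(visible.shape)   # (49, 1280): the encoder sees only 25% of the patches
```

A practical side effect: because the encoder processes only a quarter of the tokens, pre-training is substantially cheaper than running the full sequence through every layer.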
The decoder (used only during training) is discarded. The ViT encoder has learned compact, semantically rich representations and can be fine-tuned on any downstream task — classification, detection, segmentation — with very little labeled data.
The output of the trained ViT encoder is itself an image-based profile — a compact numerical summary of cell morphology learned directly from pixels. Morphological profiling is the long-established practice of quantifying cell appearance from microscopy images to compare biological states; image-based profiling is the modern, high-throughput version of that practice, typically using software like CellProfiler to extract thousands of hand-engineered measurements per cell. The ViT embedding represents a newer approach to the same goal — letting the model discover what to measure rather than specifying it in advance. CPJUMP1 was designed precisely at this inflection point: it benchmarks classical image-based profiles while making the case that learned representations could do substantially better.
Chandrasekaran et al. (Nature Methods, 2024)[1] introduce CPJUMP1, a benchmark dataset from the JUMP Cell Painting Consortium. It contains approximately 3 million images of cells, profiling 75 million single cells treated with matched chemical and genetic perturbations.
In the Cell Painting assay,[9] cells are stained with six fluorescent dyes imaged across five channels (a key difference from the three-channel RGB images standard ViTs expect):

- DNA (Hoechst 33342)
- Endoplasmic reticulum (concanavalin A)
- Nucleoli and cytoplasmic RNA (SYTO 14)
- F-actin, Golgi apparatus, and plasma membrane (phalloidin and wheat germ agglutinin, sharing one channel)
- Mitochondria (MitoTracker)
The dataset pairs 303 compounds and 160 genes, with each gene targeted by at least two compounds and two types of genetic perturbation:

- CRISPR knockout, which reduces or eliminates the gene's function
- ORF overexpression, which increases it
The paper defines three progressively harder retrieval tasks, evaluated using mean average precision (mAP):
| Task | Question asked | Difficulty | Result |
|---|---|---|---|
| Replicate detection | Can a perturbation's profile retrieve its own replicates? | easy | ~75–95% mAP for compounds |
| Sister matching | Do two compounds targeting the same protein match each other? | medium | 5–25% retrieved |
| Compound-gene matching | Does compound X match the CRISPR/ORF profile for its target gene? | hard | ~3–8% — barely above chance |
CRISPR knockouts and ORF overexpressions targeting the same gene are slightly positively correlated, not anti-correlated as simple logic would predict (a knockout reduces a gene's function while overexpression increases it, so their phenotypes "should" oppose). Biology is messier: many compounds activate rather than inhibit their targets, and overexpressed genes can have dominant-negative or compensatory effects.
Cross-modality compound-gene matching is barely above chance with classical CellProfiler features. Better representation learning — such as applying MAE-trained ViTs to the 5-channel images — could meaningfully improve this rate, accelerating drug mechanism-of-action discovery. Even a small improvement reduces the experimental search space for biologists.
mAP measures how well a retrieval system ranks relevant items above irrelevant ones. It rewards finding correct matches early in the ranked list.
- **Precision@k:** of the top-k results returned, what fraction are true matches? E.g. 3 correct out of 10 retrieved gives precision@10 = 0.3.
- **Average precision (AP):** for a single query, the mean of the precision values at each rank where a true match appears. Early matches score higher; AP = 1.0 is perfect.
- **Mean average precision (mAP):** the mean of AP scores across all queries (all perturbations), summarising retrieval quality across the whole dataset in a single number.
In CPJUMP1, each perturbation profile is a query. Profiles are ranked by cosine similarity. The relevant items are replicates (for replicate detection) or sister/matched perturbations (for the harder tasks). mAP near 1.0 means matches reliably appear at the top; mAP near 0 means the ranking is essentially random.
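Putting the definitions together, a minimal numpy implementation (an illustrative sketch, not the paper's evaluation code):

```python
import numpy as np

def average_precision(relevant):
    """AP for one query. `relevant` is a boolean sequence in ranked order
    (best match first); AP is the mean of precision@rank at each hit."""
    relevant = np.asarray(relevant, dtype=bool)
    ranks = np.flatnonzero(relevant) + 1            # 1-based ranks of true matches
    if ranks.size == 0:
        return 0.0
    return float(np.mean(np.arange(1, ranks.size + 1) / ranks))

def mean_ap(queries, database, rel):
    """mAP: rank database rows by cosine similarity to each query.
    rel[i, j] marks database item j as a true match for query i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    order = np.argsort(-(q @ db.T), axis=1)         # best match first
    return float(np.mean([average_precision(rel[i][order[i]])
                          for i in range(len(queries))]))

# Worked example: matches at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision([True, False, True]))
```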
The paper also reports fraction retrieved — the share of perturbations whose average precision passes a significance threshold (q < 0.05 after permutation testing and Benjamini–Hochberg correction) — as a more interpretable summary statistic.
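Assuming the permutation-test p-values are already in hand, the Benjamini–Hochberg step and the fraction-retrieved summary can be sketched as follows (the p-values are hypothetical):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    q = p[order] * m / np.arange(1, m + 1)        # raw BH ratios on sorted p-values
    q = np.minimum.accumulate(q[::-1])[::-1]      # enforce monotone q-values
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out

pvals = [0.001, 0.02, 0.03, 0.8]                  # hypothetical permutation-test p-values
q = bh_qvalues(pvals)
print("fraction retrieved:", np.mean(q < 0.05))   # share passing q < 0.05
```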
The most direct application is identifying what a compound is doing inside a cell. Suppose a compound shows promising activity in a phenotypic screen — it kills cancer cells, for example — but its molecular target is unknown. Conventional target identification relies on techniques like affinity chromatography or mass spectrometry proteomics, which require significant experimental setup and can take months to yield even a candidate.

A researcher using image-based profiling instead images cells treated with the compound, extracts a morphological profile (a numerical fingerprint of the cells' size, shape, texture, and staining intensity across all five channels), and then queries the CPJUMP1 database: which genetic perturbations produce the most similar cell appearance? If CRISPR knockout of gene X consistently produces cells that look like cells treated with the compound, that is evidence the compound may work by inhibiting the protein product of gene X. The result is a ranked shortlist of candidate targets in hours rather than months — not a confirmed answer, but a prioritised set of hypotheses to test experimentally.

With classical CellProfiler features, compound-gene matching rates are barely above chance; a ViT encoder pre-trained on cell images could produce richer embeddings that make this shortlist meaningfully more reliable. This has already been demonstrated: Kim et al. (2025) trained SSL models including DINO and MAE on the broader JUMP Cell Painting dataset (of which CPJUMP1 is the pilot) and benchmarked them against CPJUMP1's CellProfiler baseline — their best model surpassed CellProfiler by 16–29% in perturbation matching mAP and 22% in drug target classification.[8]
The reverse query is equally useful. Starting from a gene of interest — say, a newly validated disease target — a researcher asks which compounds in the library produce a morphology resembling CRISPR knockout of that gene. This is called virtual screening, and it can narrow a library of hundreds of compounds down to a handful worth testing experimentally, saving significant time and cost.
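Both directions of query reduce to the same ranking operation. A sketch with hypothetical profiles; the embeddings, library size, and planted match are all synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical profiles: each row is a morphological embedding from some
# featurizer (CellProfiler features or a ViT encoder); numbers are synthetic.
gene_ko = rng.normal(size=128)                        # CRISPR-knockout profile of the target
compounds = rng.normal(size=(300, 128))               # embeddings for a compound library
compounds[42] = gene_ko + 0.3 * rng.normal(size=128)  # one compound mimics the knockout

def shortlist(query, library, k=5):
    """Rank library profiles by cosine similarity to the query and return
    the top-k indices: a prioritised list to test experimentally."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    return np.argsort(-(lib @ q))[:k]

top = shortlist(gene_ko, compounds)
print(top)   # the planted mimic (index 42) should rank first
```

Swapping the roles of query and library gives target identification (compound in, genes out) or virtual screening (gene in, compounds out) with no other change.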
The PLK1 inhibitor BI-2536 causes dramatic cell death. In CPJUMP1, CRISPR knockout of PLK1 produces a morphologically similar phenotype — a clean positive match. By contrast, the BRD4-specific inhibitor PFI-1 produces no clear phenotype in this assay, and BRD4 knockout likewise looks like a negative control. This illustrates both the promise of the approach (when it works, the match is visually striking) and a key complication: if a compound's annotated target doesn't actually drive its phenotype, the profile won't match the corresponding genetic perturbation.
Many compounds bind more than one protein target — a property known as polypharmacology. BI-2536 is a good example: it is annotated as a BRD4 inhibitor, but its dominant phenotype in cells is driven by PLK1 inhibition, a secondary target. When a compound hits multiple targets simultaneously, its morphological profile reflects a mixture of all those effects, making it difficult to match cleanly to any single gene's CRISPR profile. The CPJUMP1 dataset was explicitly designed to expose this problem, and the low cross-modality matching rates reported partly reflect how common polypharmacology is among drug-like compounds.
Beyond MoA discovery, Cell Painting profiles have been applied to predict compound toxicity, identify disease-relevant morphological signatures for patient stratification, and cluster genetically perturbed cells to map functional relationships between genes. A 2024 systematic review covering a decade of Cell Painting research documented more than 90 studies spanning drug discovery, toxicology, and functional genomics.[11] The current challenge — and the reason CPJUMP1 was created as a benchmark — is that existing classical CellProfiler-based methods leave most of this potential untapped. Improved representation learning, of the kind that MAE-trained vision transformers could provide, is widely seen as the most promising path forward.[4]
References are grouped by topic for easier navigation.