Speaker
Description
Learning counterfactual representations for cellular perturbations is a central challenge in representation learning, and it is hindered by the inherently unpaired nature of interventional data. Current state-of-the-art generative approaches (e.g., GEARS) circumvent this by relying heavily on domain-specific heuristics, such as restricting the input space to a subset of highly variable features or injecting external knowledge graphs. These pre-processing steps discard information about the global data manifold and can mask subtle, rare distributional shifts. Here, we introduce REGINA, a fully data-driven framework that formulates perturbation modeling as an unpaired distribution matching problem. By coupling a Regularized Encoder with a Latent Cycle-GAN architecture, REGINA operates directly on the complete, unmasked high-dimensional data vector. Our approach projects observations into a structured latent space where interventional shifts are simulated via conditional prompting, eliminating the need for paired ground truth or artificial dimensionality reduction. Empirical evaluations show that REGINA achieves competitive local target precision (Top-K Pearson correlation and directional error) against established baselines. More importantly, by retaining the full feature landscape, REGINA outperforms these baselines in global distribution matching, significantly improving both Wasserstein distance and Maximum Mean Discrepancy (MMD) while preserving the signatures of rare subpopulations.
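As an illustration of the global distribution-matching metrics named above, the following is a minimal sketch of a squared MMD estimate between two unpaired sample sets. It is not the evaluation code used for REGINA: the function name `rbf_mmd2`, the RBF kernel, the default bandwidth `gamma = 1 / n_features`, and the synthetic Gaussian data are all assumptions made for the example.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=None):
    """Biased estimate of squared MMD between sample sets x and y (RBF kernel).

    x, y: arrays of shape (n_samples, n_features); no pairing between rows is needed.
    gamma: kernel bandwidth; defaults to 1 / n_features (an assumption, not a tuned value).
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]

    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 against round-off.
        sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Synthetic stand-ins for observed and predicted perturbed populations.
rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, size=(200, 50))
predicted_close = rng.normal(0.0, 1.0, size=(200, 50))    # same distribution
predicted_shifted = rng.normal(0.5, 1.0, size=(200, 50))  # mean-shifted distribution

mmd_close = rbf_mmd2(observed, predicted_close)
mmd_shifted = rbf_mmd2(observed, predicted_shifted)
print(f"MMD^2 (matching distribution): {mmd_close:.4f}")
print(f"MMD^2 (shifted distribution):  {mmd_shifted:.4f}")
```

Because the estimate compares whole populations rather than matched rows, it applies to unpaired control and perturbed samples; a lower value indicates that the predicted population lies closer to the observed one under the chosen kernel.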