P-glycoprotein efflux can strongly constrain oral absorption, brain penetration, and intracellular drug exposure. Computational substrate prediction is therefore an important early filter for molecules likely to face transporter-mediated disposition liabilities. Most transporter models are trained directly on small sets of endpoint-specific labeled assay measurements, which ignores the broader chemical information contained in large collections of unlabeled molecular structures. This MDL article proposes a self-supervised molecular model for P-glycoprotein substrate prediction. The model would first be pre-trained on large unlabeled chemical databases and then adapted to a limited set of validated transporter assay labels. Specifically, a molecular encoder would be pre-trained using contrastive and masked-structure objectives over graph or SMILES representations. The pre-trained encoder would then be coupled to a lightweight classifier for binary substrate prediction using curated P-glycoprotein assay labels.
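As a rough illustration of the two pre-training objectives named above, the following minimal sketch shows (i) random masking of SMILES tokens, which yields the prediction targets for a masked-structure objective, and (ii) an InfoNCE-style contrastive loss over two embedded views of the same molecules. The article does not specify a loss, a tokenizer, or a masking rate, so the function names `mask_tokens` and `info_nce`, the `[MASK]` token, the 15% default masking rate, and the temperature value are all assumptions made for illustration, and the plain Python lists stand in for encoder outputs.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token; the article does not name one


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted token list and a dict mapping each masked
    position to the original token the model would have to recover.
    """
    rng = rng or random.Random(0)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are embeddings of two views of molecule i
    (the positive pair); every view_b[j] with j != i acts as a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n
```

In this sketch the loss is small when each molecule's two views are embedded close together and far from the other molecules in the batch, which is the property a contrastive pre-training stage would try to encourage before the classifier is fitted to the labeled assay data.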
Conceptually, the self-supervised model would be expected to offer better data efficiency than a model trained only on limited labeled transporter data. Attribution methods could also highlight molecular features associated with P-glycoprotein recognition. Self-supervised molecular learning could thus make transporter prediction more practical when labeled assay data are scarce, and may support earlier design of molecules with more favorable absorption and distribution profiles.
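The attribution step could take many forms, and the article does not name a specific method. One simple option, sketched below purely as an assumption, is occlusion: each token is masked in turn and the drop in the predicted substrate score is recorded as that token's contribution. The names `occlusion_attribution` and `toy_score` are hypothetical, and `toy_score` is an arbitrary stand-in for a trained classifier that exists only so the example runs; it makes no chemical claim about P-glycoprotein recognition.

```python
def occlusion_attribution(tokens, score_fn, mask_token="[MASK]"):
    """Score drop caused by masking each token in turn.

    A larger positive value means the prediction relied more on that token.
    """
    baseline = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(baseline - score_fn(occluded))
    return attributions


def toy_score(tokens):
    """Hypothetical scorer: fraction of tokens equal to "N".

    This rule is arbitrary and is NOT a P-glycoprotein model; a real
    pipeline would call the fine-tuned classifier here instead.
    """
    return sum(1 for t in tokens if t == "N") / len(tokens)


if __name__ == "__main__":
    tokens = ["C", "C", "N", "C"]
    for tok, value in zip(tokens, occlusion_attribution(tokens, toy_score)):
        print(tok, value)
```

Occlusion needs only forward passes, so it works with any encoder, at the cost of one extra prediction per token; gradient-based attribution would be an alternative when the model is differentiable end to end.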