Identifying the proteins that a drug interacts with is essential for understanding its efficacy, selectivity, and safety. Experimental profiling, however, cannot feasibly screen all possible drug–target pairs across the proteome, which motivates scalable computational prediction. Current in silico models often rely on either ligand-based or protein-based features in isolation and can therefore miss the complementary information that emerges when molecular structure and target biology are modeled jointly. To address this, we propose a conceptual multimodal deep learning model that learns from molecular graphs on the ligand side and protein sequence embeddings on the target side, and that predicts both binding affinity and binary interaction status. A graph attention network encodes the molecular graph of each compound, and a pre-trained protein language model encodes the target sequence; a bilinear attention mechanism then fuses the ligand and protein representations into a joint embedding for downstream affinity regression and interaction classification. We expect this approach to deliver strong predictive performance on drug–target interaction benchmarks, to generalize to unseen targets through protein language representations, and to improve interpretability via attention maps that highlight pharmacophoric substructures and relevant protein regions. By combining predictive modeling, biological generalization, and interpretable ligand–target reasoning, this multimodal framework has the potential to accelerate drug repurposing, selectivity profiling, and virtual screening.
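To make the fusion step concrete, the following is a minimal, dependency-free sketch of one possible bilinear attention layer, not the definitive design of the proposed model. It assumes that per-atom ligand features (standing in for graph attention network outputs) and per-residue protein features (standing in for protein language model embeddings) are already available as plain lists of floats. The specific form of the attention (a softmax over all atom-residue pairs), the projection matrices `U` and `V`, the two linear heads, and all function names are illustrative assumptions; a practical implementation would use a tensor library and learned weights.

```python
import math
import random


def vec_mat(x, M):
    """Return x^T M, where M is a list of len(x) rows."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def bilinear_scores(atoms, residues, W):
    """Pairwise scores S[i][j] = atoms[i]^T W residues[j]."""
    return [[dot(vec_mat(a, W), r) for r in residues] for a in atoms]


def softmax_2d(S):
    """Softmax over every (atom, residue) pair, so all weights sum to 1."""
    m = max(max(row) for row in S)  # subtract the max for numerical stability
    exps = [[math.exp(s - m) for s in row] for row in S]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def fuse(atoms, residues, W, U, V):
    """Return (joint embedding, attention map) for one ligand-protein pair.

    joint[c] = sum_ij attn[i][j] * (atoms[i]^T U)[c] * (residues[j]^T V)[c]
    """
    attn = softmax_2d(bilinear_scores(atoms, residues, W))
    proj_a = [vec_mat(a, U) for a in atoms]      # n_atoms x k
    proj_r = [vec_mat(r, V) for r in residues]   # n_residues x k
    k = len(U[0])
    joint = [0.0] * k
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            for c in range(k):
                joint[c] += weight * proj_a[i][c] * proj_r[j][c]
    return joint, attn


def predict(joint, w_reg, b_reg, w_cls, b_cls):
    """Two linear heads: affinity regression and interaction probability."""
    affinity = dot(w_reg, joint) + b_reg
    probability = 1.0 / (1.0 + math.exp(-(dot(w_cls, joint) + b_cls)))
    return affinity, probability


def rand_matrix(rng, rows, cols):
    """Hypothetical helper: small random weights or features for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy example: 4 atoms with 3 features each, 6 residues with 5 features each.
rng = random.Random(0)
atoms = rand_matrix(rng, 4, 3)
residues = rand_matrix(rng, 6, 5)
W = rand_matrix(rng, 3, 5)   # bilinear interaction weights (assumed)
U = rand_matrix(rng, 3, 2)   # ligand projection into a 2-d joint space
V = rand_matrix(rng, 5, 2)   # protein projection into the same space

joint, attn = fuse(atoms, residues, W, U, V)
affinity, probability = predict(joint, [0.7, -0.3], 0.1, [1.2, 0.4], -0.2)
print("joint embedding:", joint)
print("predicted affinity:", affinity, "interaction probability:", probability)
```

In this sketch the attention map `attn` has one entry per atom-residue pair, so its rows and columns can be summed to obtain per-atom and per-residue importance scores, which is the kind of map the interpretability claim refers to. Sharing a single joint embedding between the regression and classification heads is one simple way to realize the two prediction tasks together.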