At CSAIL, I worked on improving how ML models predict molecular docking poses for drug discovery. Traditional docking tools lean heavily on simplified physics, whereas newer generative models such as DiffDock and HarmonicFlow take a data-driven approach to predicting how molecules bind to proteins. My focus was on understanding why these models, despite generating physically plausible structures, often produced binding poses that were overly compact and lacked accurate long-range atomic information. We found that the newest model, HarmonicFlow, consistently underestimated the radius of gyration: its predicted molecules were more tightly folded than the real structures, which ultimately hurt docking accuracy.
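For readers unfamiliar with the metric, the radius of gyration is the root-mean-square distance of a molecule's atoms from their centroid, so a smaller value means a more compact pose. The snippet below is a minimal sketch of that calculation, not the project's actual analysis code. It assumes unweighted atoms (no mass weighting) and an N x 3 coordinate array; the function name `radius_of_gyration` is hypothetical.

```python
import numpy as np

def radius_of_gyration(coords):
    """Unweighted radius of gyration for an (N, 3) array of atom positions.

    Assumption: every atom counts equally (no mass weighting).
    """
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)
    # Squared distance of each atom from the centroid, averaged over atoms.
    mean_sq_dist = ((coords - centroid) ** 2).sum(axis=1).mean()
    return float(np.sqrt(mean_sq_dist))

# Toy coordinates (made up for illustration): the same four "atoms"
# laid out in an extended chain versus a tightly folded arrangement.
extended = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [4.5, 0, 0]])
compact = np.array([[0.0, 0, 0], [1.5, 0, 0], [1.5, 1.5, 0], [0.0, 1.5, 0]])

rg_extended = radius_of_gyration(extended)
rg_compact = radius_of_gyration(compact)
```

Comparing this value between a predicted pose and its reference structure is one simple way to flag predictions that are too folded-in: here `rg_compact` comes out smaller than `rg_extended`.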
We then designed two new structural priors for the model: one incorporating full 3D conformer data from RDKit, and another blending conformer data with controlled noise. The goal was to give the model richer long-range information about the molecule's structure and to test whether more chemically informed starting points improved performance. Even with better structural information upfront, the model's limitations persisted, suggesting that the neural network itself needs to integrate long-range patterns more effectively.
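As a rough illustration of the second prior, the sketch below perturbs a conformer's coordinates with Gaussian noise of a chosen scale. This is an assumption about what "blending conformer data with controlled noise" could look like, not the formulation used in the project. The function name `noisy_conformer_prior` and the `noise_scale` parameter are hypothetical, and the input is assumed to be an N x 3 coordinate array, such as the positions of an RDKit-embedded conformer.

```python
import numpy as np

def noisy_conformer_prior(conformer_coords, noise_scale=0.5, seed=None):
    """Sample a starting structure: centered conformer plus Gaussian noise.

    Assumption: additive isotropic noise; noise_scale (same length units as
    the coordinates) controls how far the sample drifts from the conformer.
    """
    rng = np.random.default_rng(seed)
    coords = np.asarray(conformer_coords, dtype=float)
    # Center at the origin so the prior carries shape, not absolute position.
    centered = coords - coords.mean(axis=0)
    noise = rng.normal(scale=noise_scale, size=centered.shape)
    return centered + noise

# Toy conformer coordinates (made up for illustration).
conformer = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [4.5, 0, 0]])
sample = noisy_conformer_prior(conformer, noise_scale=0.5, seed=0)
```

Setting `noise_scale` to zero recovers the pure conformer prior, while larger values move the starting point toward unstructured noise, which makes the parameter a convenient knob for testing how much upfront structural information helps.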