Journal of Chemical Information and Modeling vol:50 issue:9 pages:1660-1668
Molecular graphs are a compact representation of molecules but may be too concise for graph-based machine learning algorithms to reach optimal generalization performance. Over centuries, chemists have learned which functional groups in molecules are important. This knowledge is normally not explicit in molecular graphs. In this paper, we introduce a simple method to incorporate this type of background knowledge: we insert additional vertices with corresponding edges for each functional group and ring structure identified in the molecule. We present experimental evidence that, on a wide range of ligand-based tasks and data sets, the proposed augmentation method improves the predictive performance over several graph kernel-based quantitative structure-activity relationship models. When the augmentation technique is used with the recent pairwise maximal common subgraphs kernel, we achieve a significant improvement over the current state of the art on the NCI-60 cancer data set in 28 out of 60 cell lines, with the other 32 cell lines showing no significant difference in accuracy. Finally, on the Bursi mutagenicity data set, we obtain near-optimal predictions.
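The augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_graph`, the list-and-set graph encoding, and the choice to connect each new vertex to every atom of its group are assumptions made for illustration, and the functional groups and rings are assumed to have been identified beforehand by some other tool.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def augment_graph(
    atom_labels: List[str],
    bonds: Set[Edge],
    groups: List[Tuple[str, List[int]]],
) -> Tuple[List[str], Set[Edge]]:
    """Add one labeled vertex per identified group (hypothetical helper).

    atom_labels: label of each atom vertex, indexed by vertex id.
    bonds: undirected edges between atom vertices.
    groups: (group label, member atom ids) pairs, assumed precomputed.
    """
    n_atoms = len(atom_labels)
    labels = list(atom_labels)
    # Store undirected edges in a canonical (small, large) order.
    edges = {(min(a, b), max(a, b)) for a, b in bonds}

    for name, members in groups:
        for atom in members:
            if not 0 <= atom < n_atoms:
                raise ValueError(f"group {name!r} refers to unknown atom {atom}")
        new_vertex = len(labels)  # next free vertex id
        labels.append(name)
        # Assumption: the group vertex is linked to each of its member atoms.
        for atom in members:
            edges.add((atom, new_vertex))

    return labels, edges


# Acetic acid, heavy atoms only: C0-C1, C1=O2, C1-O3 (bond orders omitted).
labels, edges = augment_graph(
    ["C", "C", "O", "O"],
    {(0, 1), (1, 2), (1, 3)},
    [("carboxyl", [1, 2, 3])],
)
print(labels)
print(sorted(edges))
```

The augmented graph keeps the original atoms and bonds untouched, so any graph kernel that accepts labeled graphs can in principle be applied to it without modification; only the input representation changes.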