The objective of this study was to assess how well misclassification measurements obtained from a 'pre-survey' calibration exercise reflect validation scores obtained under 'field' conditions. Validation data were collected from the 'Smile for Life' project, an oral health intervention study in Flemish children. A calibration exercise was organized under 'pre-survey' conditions (32 age-matched children examined by eight examiners and the benchmark scorer). In addition, using a pre-determined sampling scheme blinded to the examiners, the benchmark scorer re-examined between six and 11 children screened by each of the dentists during the survey. Factors influencing sensitivity and specificity for scoring caries experience (CE) were investigated, including examiner, tooth type, surface type, tooth position (upper/lower jaw, right/left side) and validation setting (pre-survey versus field). To account for the clustering effect in the data, a generalized estimating equations approach was applied. Sensitivity was influenced not only by the validation setting (lower sensitivity under field conditions, p < 0.01), but also by examiner, tooth type (lower sensitivity in molar teeth, p < 0.01) and tooth position (lower sensitivity in the lower jaw, p < 0.01). Specificity was influenced by examiner, tooth type (lower specificity in molar teeth, p < 0.01) and surface type (lower specificity on the occlusal surface than on other surfaces), but not by the validation setting. Misclassification measurements for scoring CE are thus influenced by several factors. In this study, the validation setting influenced sensitivity, with lower scores obtained when data validity was measured under 'field' conditions. Results obtained in a pre-survey calibration setting should therefore be interpreted with caution, as they do not always reflect examiners' actual performance during field work.