Data obtained from calibration exercises are used to assess the level of agreement between examiners (and a benchmark examiner) and/or between repeated examinations by the same examiner in epidemiological surveys or large-scale clinical studies. Agreement can be measured using different techniques: the kappa statistic, percentage agreement, the Dice coefficient, and sensitivity and specificity. Each of these methods has specific characteristics and its own shortcomings. The aim of this contribution is to critically review techniques for the measurement and analysis of examiner agreement and to illustrate them using data from a recent survey in young children, the Smile for Life project. The above-mentioned agreement measures are influenced (in differing ways and to differing extents) by the unit of analysis (subject, tooth or surface level) and by the disease level in the validation sample. These effects are more pronounced for percentage agreement and kappa than for sensitivity and specificity. It is therefore important to report the unit of analysis and the disease level of the validation sample alongside the agreement measures. Confidence intervals should also be reported, since they indicate the reliability of the estimate. When dependency among observations is present [as is the case in caries experience data sets with their typical hierarchical structure (surface-tooth-subject)], this dependency influences the width of the confidence interval and should not be ignored; in this situation, the use of multilevel modelling is necessary. This review clearly shows that there is a need for the development of guidelines for the measurement, interpretation and reporting of examiner reliability in caries experience surveys.
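To make the measures named above concrete, the following minimal sketch computes percentage agreement, Cohen's kappa, the Dice coefficient, and sensitivity/specificity from a single 2×2 validation table (examiner vs. benchmark examiner, caries present/absent per surface). The counts are hypothetical, chosen only to illustrate the formulas; they are not from the Smile for Life data.

```python
def agreement_measures(tp, fp, fn, tn):
    """Agreement statistics from a 2x2 table, with the benchmark
    examiner taken as the reference standard.

    tp: both score caries     fp: examiner yes, benchmark no
    fn: examiner no, benchmark yes     tn: both score sound
    """
    n = tp + fp + fn + tn
    # Observed (percentage) agreement: proportion of identical calls
    percent_agreement = (tp + tn) / n
    # Chance-expected agreement from the marginal totals
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    # Cohen's kappa: observed agreement corrected for chance
    kappa = (percent_agreement - p_e) / (1 - p_e)
    # Dice coefficient: agreement on positive calls only,
    # ignoring the (usually dominant) sound-sound cell
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return percent_agreement, kappa, dice, sensitivity, specificity

# Hypothetical surface-level counts from a calibration exercise
pa, k, d, se, sp = agreement_measures(tp=40, fp=10, fn=5, tn=145)
```

Note how the sound-sound cell (tn) inflates percentage agreement but does not enter the Dice coefficient at all: this is one way the disease level in the validation sample affects the different measures to different extents.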