Planktonic foraminifera are widely used in biostratigraphic,
palaeoceanographic and evolutionary studies, but the strength of many
study conclusions could be weakened if taxonomic identifications are not
reproducible by different workers. In this study, to assess the relative
importance of a range of possible reasons for among-worker disagreement in
identification, 100 specimens of 26 species of macroperforate planktonic
foraminifera were selected from a core-top site in the subtropical Pacific
Ocean. Twenty-three scientists at different career stages – including some
with only a few days' experience of planktonic foraminifera – were asked to
identify each specimen to species level, and to indicate their confidence in each
identification. The participants were provided with a species list and had
access to additional reference materials. We use generalised linear
mixed-effects models to test the relevance of three sets of factors in
identification accuracy: participant-level characteristics (including
experience), species-level characteristics (including a participant's
knowledge of the species) and specimen-level characteristics (size,
confidence in identification). The 19 less experienced scientists achieve a
median accuracy of 57 %, which rises to 75 % for specimens they are
confident in. For the 4 most experienced participants, overall accuracy is
79 %, rising to 93 % when they are confident. To obtain maximum
comparability and ease of analysis, everyone used a standard microscope with
only 35
The taxonomy of planktonic foraminifera is the foundation for understanding many geochemical proxy measurements, biostratigraphic analyses and evolutionary studies. Taxonomic disagreements are particularly problematic in studies that combine data from multiple sources (e.g. Kučera et al., 2005a; Rutherford et al., 1999; Siccha and Kučera, 2017) because such studies implicitly assume that the different researchers used the same taxonomic concepts. Experienced participants in the field are often assumed to be accurate and consistent in any taxonomic identification they perform. Incorrect identifications could lead to the propagation of errors through any further analysis (see Al-Sabouni et al., 2018). Disagreements in identifications (or characterisations of a community of individuals) between scientists can come about for a range of reasons: disagreements over the species list to be used for a study, differences in sampling protocols or choices on how to apply the agreed taxonomic concepts. Each of these could produce differences in the list of species described by a study and some are easier to address than others. However, separating out the relative importance of these different factors has rarely been attempted.
One major reason for disagreements in identifications by different scientists
depends on the list of species they recognise and the descriptions they are
using to describe specimens. For some species, e.g.
A study using planktonic foraminifera, the El Kef blind test (Lipps, 1997),
was explicitly set up to investigate the implications of different species
lists and taxonomic concepts on the interpretation of diversity patterns. The
taxonomy of the study interval chosen, the Cretaceous–Paleogene boundary, was
known to be particularly unstable with no consensus amongst foraminifera
workers (e.g. Canudo et al., 1991; Olsson et al., 1999). Four participants
from different taxonomic schools produced species lists which showed large
differences (mean correlation among participants for taxa identified by at
least two workers was 0.478; Keller, 1997). The participants clearly had
very different taxonomic concepts, although they inferred relatively similar
diversity patterns. To investigate the implications of different taxonomic
concepts for accuracy in modern planktonic foraminifera, Al-Sabouni et
al. (2018) asked 21 planktonic foraminifera workers to identify sets of
300 specimens. Although they were all told to follow a specific taxonomy they
came from a number of different taxonomic schools, leading to differences in
their taxonomic concepts. Fewer than one-quarter of specimens had agreement
from more than 50 % of participants, and the average agreement of
participants' identifications with the consensus was 77 % of specimens
when sieving at > 150
When comparing results between studies which intend to characterise the
planktonic foraminifer community of a site, it is important to make sure that
sampling protocols were identical, as, for example, sieving at different
sizes will produce different communities (Al-Sabouni et al., 2007; Weinkauf
and Milker, 2018). Additionally, smaller specimens tend to be more
challenging to identify. With recent tropical planktonic foraminifera,
samples should be sieved at > 125
Even with agreement on a species list with its associated taxonomic concepts
and a standardised sampling protocol, some disagreements are likely among
scientists. Taxonomy is based on types, or typical examples of the
morphospecies concept, but assigning specimens to these types is not always
easy. If the specimen is poorly preserved, or a juvenile, or has an atypical
morphology, then it may not fit any taxonomic type. Additionally, the
preservation of the type itself or the quality of images of it can be very
variable making some species concepts more open to interpretation. In such
cases, how a person chooses to assign the specimen is likely to vary. That
variation can be studied in relation to an individual's identifications
over time, or by comparing consistency among a set of foraminiferal workers.
To investigate the consistency of identifications of a single researcher over
time, Zachariasse et al. (1978) used sets of 200 specimens from a lower
Pliocene subtropical sample sieved at > 63
Previous studies on repeatability have mostly conflated multiple causes of disagreement, combining differences in species lists, sampling protocols and the application of concepts (Bé, 1959; Ginsburg, 1997; Al-Sabouni et al., 2018). They indicate that agreement is greater within taxonomic schools, where species concepts are expected to be more similar, but because taxonomic disagreement is combined with other factors, it is not clear what level of consistency could be expected when scientists are using an agreed set of taxonomic concepts. In this study, we investigate how training in a set of taxonomic concepts relates to the accuracy of the identification of specimens. Participants were taught a standard taxonomy and provided with a species list. By modelling a set of factors thought to be important in the accuracy of taxonomic identifications, we aim to identify the relative contributions of scientist-level characteristics (such as their experience), species-level characteristics (such as whether the species had been taught) and specimen-level characteristics (such as its size) to the accuracy of identifications of planktonic foraminifera. We also investigate whether a person's confidence in their identification is reflected in its accuracy.
The specimens in this study were taken from Ocean Drilling Program (ODP) Site 872 in the west Pacific
gyre. Specifically, they were from core 144-872C-1H-1W 80–82 cm which is
located at 10.1
This analysis was run as part of a NERC-funded advanced training short course on “Taxonomy and Biostratigraphy of Cenozoic Planktonic Foraminifera”, taught at the Natural History Museum, London, in February 2017. This course was aimed at PhD and early-career researchers who wished to acquire or enhance their understanding of the taxonomy of Cenozoic planktonic foraminifera and its applications. The attendees of this course made up the majority of the less experienced participants, whilst the four course conveners who also took part made up the more experienced group. Some of these attendees had never worked with planktonic foraminifera before, whilst others already had some experience in their taxonomy. The study focuses on the macroperforate species, which were the main group taught during the course, although a few examples of microperforate and benthic foraminifera were included to assess whether they could be distinguished from macroperforate species. The species list (Supplement Sect. S1) was developed based on Kučera et al. (2005b) and Aze et al. (2011), supplemented with the newly described species in Darling et al. (2006), Aurahs et al. (2011), Weiner et al. (2015) and Spezzaferri et al. (2015).
A set of 25 four-well slides was numbered to receive the specimens. The selected specimens were then placed randomly into these slides (but not stuck down). Random sampling without replacement of a sequence of 1–100 determined the order in which they should be placed. Using randomisation prevented second-guessing of the identifications, and meant that any loss of specimens would not alter the validity of the conclusions. All specimens were then imaged and measured (using Image-Pro Premier), to obtain their mean diameter.
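The random placement described above can be illustrated with a minimal sketch. It is not the authors' code: the function name, the seed and the slide/well numbering scheme are assumptions made only for illustration. It draws the sequence 1–100 without replacement and fills 25 four-well slides in that order.

```python
import random

def assign_wells(n_specimens=100, wells_per_slide=4, seed=2017):
    """Return {specimen_id: (slide_number, well_number)} for a random placement.

    The seed is arbitrary and only makes the illustration reproducible.
    """
    rng = random.Random(seed)
    # Sampling the whole sequence without replacement gives a random order.
    order = rng.sample(range(1, n_specimens + 1), n_specimens)
    layout = {}
    for position, specimen in enumerate(order):
        slide = position // wells_per_slide + 1   # slides numbered from 1
        well = position % wells_per_slide + 1     # wells numbered 1..4
        layout[specimen] = (slide, well)
    return layout

layout = assign_wells()
print(layout[1])  # slide and well holding specimen 1
```

Because every specimen receives a unique slide and well, a lost specimen leaves an empty well without shifting the position of any other specimen.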
Everyone who undertook the identifications was first asked to fill in a
checklist of the extant species that they thought they could identify with
confidence (see Sect. S1). They then worked their way through the specimens
(in no particular order), assigning a species name and a level of confidence
in their identification (confident,
Obtaining a “correct” identification is challenging (Al-Sabouni et al., 2018). In this analysis a definitive identification for each specimen was
obtained using only the results of the course conveners (i.e. the more
experienced participants). Where there was complete consensus between these
participants, identifications were taken as correct. Where there was
disagreement, a more powerful microscope (Olympus SZX10, with 63
The results were then compiled for analysis in R v. 3.0.5 (R Core Team,
2015). Where confidence was originally marked as between two levels (e.g.
“yes” – “maybe”) it was changed to the lower of the two levels (i.e.
“maybe”). In the few cases (3.2 %) where no taxonomic identification
was given, the specimen was scored as “UnIDd” with a confidence of “
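The rule of taking the lower of two marked confidence levels can be sketched as follows. The level names and their ordering ("no" below "maybe" below "yes") are assumptions for illustration; the source only confirms that "yes" – "maybe" becomes "maybe".

```python
# Assumed ordering of confidence levels, lowest first (hypothetical labels).
LEVELS = ["no", "maybe", "yes"]

def resolve_confidence(marked):
    """Return the lower level when a confidence is marked between two levels."""
    # Accept either an en dash or a hyphen as the separator.
    parts = [p.strip().lower() for p in marked.replace("–", "-").split("-")]
    return min(parts, key=LEVELS.index)

print(resolve_confidence("yes – maybe"))
```

An unrecognised level raises a `ValueError`, which flags entries that need checking by hand.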
Each identification was scored as correct (if it agreed with the definitive
identification) or incorrect. Only the specimens that had been lost were
excluded from all the analyses; by the end of the identification process 14
specimens had been lost. Initially the median percentage accuracy was
calculated (the median rather than mean was used so it is not biased by the
extreme values). The accuracy was then calculated separately for the more
experienced and less experienced participants; the former being the course
conveners and the latter including the course students. As the confidence of
the participant is expected to be correlated with accuracy, we determined the
influence of both their species-level confidence and their specimen-level
confidence. For the species-level confidence estimates, non-macroperforate
specimens were not included as a species-level confidence is not meaningful
for these. In this study we used relatively low-powered microscopes, so
smaller specimens are likely to have been more challenging to identify
accurately; we therefore additionally split accuracy by mean diameter
(125–200, 200–400, > 400
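As a rough sketch of the accuracy summaries described in this paragraph, the example below computes each participant's percentage of correct identifications and then takes the median across participants, optionally after filtering by experience group or size class. The records are invented, the field layout is hypothetical, and the handling of diameters that fall exactly on a class boundary is an assumption.

```python
from statistics import median

# Hypothetical records: (participant, group, mean diameter, identified correctly?)
RECORDS = [
    ("p1", "less", 150, False), ("p1", "less", 300, True), ("p1", "less", 450, True),
    ("p2", "less", 150, False), ("p2", "less", 300, False), ("p2", "less", 450, True),
    ("p3", "more", 150, True), ("p3", "more", 300, True), ("p3", "more", 450, True),
]

def size_class(diameter):
    """Assign a size class; boundary handling is an assumption."""
    if diameter > 400:
        return "> 400"
    if diameter >= 200:
        return "200–400"
    return "125–200"

def median_accuracy(rows):
    """Median across participants of each participant's percentage correct."""
    per_participant = {}
    for participant, _group, _diameter, correct in rows:
        per_participant.setdefault(participant, []).append(correct)
    return median(100 * sum(v) / len(v) for v in per_participant.values())

less_experienced = [r for r in RECORDS if r[1] == "less"]
largest = [r for r in RECORDS if size_class(r[2]) == "> 400"]
print(median_accuracy(RECORDS), median_accuracy(less_experienced), median_accuracy(largest))
```

Taking the median of per-participant percentages, rather than pooling all identifications, keeps one unusually accurate or inaccurate participant from dominating the summary.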
The identification of each specimen by each participant was then used to
create a confusion matrix, or error matrix, using the package “caret
6.0–80” (Kuhn, 2016). For each species, this calculates the fraction of
cases where that species was identified as each of the different taxonomic
names, highlighting which taxonomic concepts are being confused. Inter-rater
consistency was estimated using Cohen's (1960) kappa (
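To make the confusion matrix and the kappa statistic concrete, here is a small stdlib sketch. It is not the caret implementation used in the study; the species labels and counts are invented. Each row of the matrix is normalised to fractions of the definitive identification, and kappa compares the observed agreement with the agreement expected from the row and column totals alone.

```python
from collections import Counter

def confusion_fractions(pairs):
    """Row-normalised confusion matrix: {definitive: {given: fraction}}."""
    counts = Counter(pairs)
    row_totals = Counter(definitive for definitive, _ in pairs)
    matrix = {}
    for (definitive, given), n in counts.items():
        matrix.setdefault(definitive, {})[given] = n / row_totals[definitive]
    return matrix

def cohens_kappa(pairs):
    """Cohen's kappa for (definitive, given) label pairs."""
    n = len(pairs)
    observed = sum(d == g for d, g in pairs) / n
    definitive_totals = Counter(d for d, _ in pairs)
    given_totals = Counter(g for _, g in pairs)
    # Chance agreement: product of marginal proportions, summed over labels.
    expected = sum(definitive_totals[k] * given_totals[k] for k in definitive_totals) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical identifications of two species, "A" and "B".
pairs = [("A", "A")] * 4 + [("A", "B")] + [("B", "B")] * 3 + [("B", "A")] * 2
matrix = confusion_fractions(pairs)
kappa = cohens_kappa(pairs)
print(matrix["A"], round(kappa, 2))
```

In this toy case 70 % of identifications agree, but half of that agreement would be expected by chance, so kappa is 0.4.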
To quantify whether a species' morphological uniqueness affects the accuracy
with which it is identified, a measure of distinctiveness was calculated for
each species. Species were scored for a set of traits (trait data from Aze et
al., 2011):
Chamber arrangement: angulo-conical, clavate, flat, globorotaliform,
globular, planispiral (which includes low trochospiral), spherical; Colour: pink, white; Keel: yes, no; Supplementary apertures: yes, no; Wall texture: cancellate (either irregularly or coarsely), hispid, smooth,
cancellate with smooth cortex.
These traits were used to create a dendrogram, from which the ED score (evolutionary distinctiveness: the metric was first applied to phylogenies; Isaac et al., 2007) was calculated; larger values indicate more distinctive species. For modelling purposes, this score was centred on the mean and scaled by the standard deviation.
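The ED calculation and the subsequent standardisation can be sketched on a toy dendrogram. This assumes the usual "fair proportion" form of ED, in which each branch length is divided equally among the tips it subtends, and it assumes the sample standard deviation for scaling. The three-tip tree and its branch lengths are invented.

```python
from statistics import mean, stdev

# Hypothetical dendrogram ((A:1, B:1):2, C:3), stored as child -> (parent, branch length).
TREE = {"A": ("n1", 1.0), "B": ("n1", 1.0), "n1": ("root", 2.0), "C": ("root", 3.0)}
TIPS = ["A", "B", "C"]

def path_to_root(tree, node):
    """Yield every node whose branch lies between `node` and the root."""
    while node in tree:
        yield node
        node = tree[node][0]

def ed_scores(tree, tips):
    """Share each branch length equally among the tips that descend from it."""
    n_tips = {}
    for tip in tips:
        for node in path_to_root(tree, tip):
            n_tips[node] = n_tips.get(node, 0) + 1
    return {tip: sum(tree[node][1] / n_tips[node] for node in path_to_root(tree, tip))
            for tip in tips}

def standardise(scores):
    """Centre on the mean and scale by the (sample) standard deviation."""
    m, s = mean(scores.values()), stdev(scores.values())
    return {name: (value - m) / s for name, value in scores.items()}

ed = ed_scores(TREE, TIPS)
z = standardise(ed)
print(ed, z)
```

Species C sits on its own long branch and so receives the largest score, and the scores sum to the total branch length of the tree.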
In the consistency analysis, the researchers were identified as either more
or less experienced. However, this split conflates several different aspects.
So for the modelling, researchers' experience was instead quantified in two
ways. The number of years a person had been working on planktonic
foraminifera was measured as a four-level ordered factor split by quantiles:
A generalised linear mixed-effects model was run to investigate the predictors of accuracy. The response variable was whether the specimen was correctly identified; as it is a true/false value, binomial errors were used with a logit link function. Specimens identified as “juvenile” or “nonmacro” were not included in this analysis, unlike the consistency analyses, as many of the species-level explanatory variables do not apply to them. The eight explanatory variables can be grouped into three categories. At the species level, they were distinctiveness of that species, the participant's confidence in identifying that species and whether that species was taught on the course (see Sect. S1). At the scientist level, variables included how long that person had been working with planktonic foraminifera, their experience with a tropical extant community and their gender. The specimen-level variables were the participant's confidence in identification of that specimen and the log of the mean diameter (which was centred on the mean and scaled by the standard deviation before analysis). An interaction between log size (measured as the mean diameter) and the other variables was included in the initial model, as the influence of size is likely to depend on the other parameter values. For example, size may be a less strong predictor of accuracy for more experienced researchers. Participant identity, the definitive species identity and (nested within that) the number of the specimen were initially included as random effects. These were modelled as random effects as they are likely to contribute to the accuracy of the identification, but we are not interested in estimating them from the model (Crawley, 2007).
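The role of the logit link and of a size interaction can be illustrated with a deliberately simplified sketch. It is not the fitted model: the coefficients are made up, only one interaction (size by whether the species was taught) is shown, and the random effects are reduced to simple additive intercepts.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))

def p_correct(log_size_z, taught, coef, participant_effect=0.0, specimen_effect=0.0):
    """Probability of a correct identification under a logit-link model.

    log_size_z: standardised log mean diameter; taught: 1 if taught, else 0.
    """
    eta = (coef["intercept"]
           + coef["size"] * log_size_z
           + coef["taught"] * taught
           + coef["size:taught"] * log_size_z * taught  # interaction changes the size slope
           + participant_effect + specimen_effect)      # random intercepts (simplified)
    return inverse_logit(eta)

# Made-up coefficients, chosen only to show how an interaction can alter a slope.
COEF = {"intercept": 0.0, "size": -0.5, "taught": 1.0, "size:taught": 1.0}

for taught in (0, 1):
    print(taught, [round(p_correct(z, taught, COEF), 2) for z in (-1, 0, 1)])
```

With these invented values the size slope is negative for untaught species and positive for taught ones, which is the kind of pattern an interaction term allows the model to capture.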
Figure 1. Box plots showing the accuracy of the identifications of the
different groups of participants split by different categories.
To determine the optimal random-effects structure, following Zuur et
al. (2009), the AIC (Akaike information criterion) value was used to compare all combinations of the random
effects fitted to a maximal model. Specimen number nested within species
identity was tested with a random slope versus size as well as a random
intercept to allow for the possibility that the effect of size on accuracy
could be species-specific. Using the optimal random-effects structure and the
maximal model, model simplification of the fixed effects was then performed
to remove nonsignificant terms (following Crawley, 2007). With the final
model, the marginal effects of the variables were determined by removing each
explanatory variable in turn from the model and calculating the difference in
the
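The AIC comparison of candidate random-effects structures can be sketched as below. The log-likelihoods and parameter counts are invented for illustration; only the formula, AIC = 2k − 2 ln L, is standard, and the structure names are shorthand rather than the authors' model labels.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical (log-likelihood, number of parameters) for each candidate structure.
CANDIDATES = {
    "participant": (-1210.0, 20),
    "participant + species": (-1150.0, 21),
    "participant + species/specimen": (-1100.0, 22),
    "participant + species/specimen with size slope": (-1095.0, 24),
}

scores = {name: aic(*fit) for name, fit in CANDIDATES.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The extra parameters of a more complex structure are only worthwhile when the gain in log-likelihood outweighs the penalty of 2 per parameter.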
Table 1. The percentage accuracy of the different groups of participants, split by species- or specimen-level confidence and size.
Participants achieved a median percentage accuracy (compared to the
definitive ID) of 59 %; the value was 79 % for the four more
experienced participants and 57 % for the 19 less experienced
participants including students on the course (Table 1, Fig. 1a). When the
results are restricted to only include those species the participant is
confident in identifying, the median accuracy is 77 % overall (85 %
for experienced workers and 75 % for students; Table 1, Fig. 1b). Only
5 of the 23 participants used “maybe” to classify their species-level
confidence, so there are few data for that category. Additionally, accuracy
was highest (86 %) for the person who was confident in all the species.
The percentage accuracy for only those specimens which the participant
identified confidently rises to 77 % (93 % for experienced
participants, 75 % for students; Table 1, Fig. 1c). Focussing only on
those specimens where the participant expressed confidence in both their
knowledge of the species and their identification of the specimen, accuracy
rises to 84 % overall, and 97 % for experienced participants
(Table 2). Larger specimens were more consistently identified correctly
(Table 1, Fig. 1d), with accuracy for the largest size fraction
(> 400
Table 2. The percentage accuracy of the different groups of participants, split by their confidence at both species and specimen level. The numbers in brackets show the median number of specimens (first) and the number of participants who used that category (second).
The confusion matrix (Fig. 2) shows the fraction of specimens that were
classified under different taxonomic names, with all data included. This
matrix had a kappa value of 0.58 which is classified as fair/moderate
agreement (Fleiss et al., 2013; Landis and Koch, 1977). Some species, e.g.
Table 3. The ANOVA for the fixed effects of the final model, showing the
degrees of freedom (df), the Chi-squared value (
The best random-effects structure, based on AIC, had random slopes versus
size for the specimen number nested within species identity and random
intercepts for participant identity (see Sect. S4). Following model
simplification, the evolutionary distinctiveness and gender terms drop out
(for the fixed effects included in the final model and their significance,
see Table 3). This model had a marginal
Table 4. The marginal effects of each explanatory variable. The marginal
Size interacts with a set of variables, so its relationship with agreement is more complex. Generally, larger specimens had a higher level of agreement but there are a few exceptions (Fig. 3). Where the species had not been taught on the course, larger specimens were more likely to be identified incorrectly (Fig. 3a). The impact of specimen size is less important for the more practised participants (the relationship levels off at larger sizes, Fig. 3b). Participants with a greater experience of working with the modern planktonic foraminifera tended to be more accurate in their identifications, although the effect is more pronounced at the smaller size fractions (Fig. 3c).
Figure 2. A confusion matrix showing the species that are most
frequently confused for all participants. The definitive ID is the taxonomic
name considered correct in this study. The individual ID is the name which
was given by the participant. For each definitive ID the coloured squares in
that row indicate names which were used by the participants for that
species. Grey cells indicate that combination did not occur. If all
specimens of all species were accurately identified then all the points
would plot along the diagonal, with a fraction of 1; additionally each row
sums to a fraction of 1. The numbers on the right hand side refer to the
number of specimens of that species in the study in the definitive ID. The
numbers along the top refer to the number of times that species was
identified in the study (n.b. specimens that were lost are excluded from
this analysis).
Providing accurate identifications of planktonic foraminifera is important for a wide range of subjects, including biostratigraphy, geochemistry and biological research. Our results suggest that, with only a short period of training and relatively low-powered microscopes, researchers are able, on average, to correctly identify 75 % of the specimens belonging to the species they know (Table 1, Fig. 1b). Considering only those specimens of these species for which they express confidence, their accuracy rises to 84 % (Table 2). Accuracy was higher among more experienced participants, for whom the corresponding values are 79 % and 97 %, respectively (Tables 1, 2 and Fig. 1). These results suggest that projects requiring identification of only a few species can be performed well with relatively little training. However, for a complete community analysis of a sample, additional experience and/or more in-depth training are likely to be required.
Figure 3. The effects of the interaction terms in the generalised linear
mixed-effects model, showing how the size–accuracy relationship is
influenced by the different factor levels.
By looking further into these results, with a mixed-effects model, we find
that the biggest effects on accuracy come from whether the participants had been
taught the species and from their confidence in the identification of that
specimen (Table 4). More generally this indicates that spending time
immediately before starting a project refreshing the key characteristics of
species that will be the focus of the study is particularly beneficial.
Usually, larger specimens have a greater chance of being identified
correctly. However, the direction of this trend is reversed in species that
were not taught; the largest untaught specimens were likely to be incorrectly
identified (Fig. 3a). These results come mainly from two species –
The confusion matrices (Fig. 2, Sect. S3) are particularly useful for
identifying the species where people are unsure. These matrices highlight
which species are most easily confused; if a participant is focussing on
particular species for their study they would obviously do well to consider
the distinguishing characteristics from similar species. Often this confusion
is within a genus, e.g. the
The measures of consistency obtained from this study rely on the “definitive identification” being correct. Without performing DNA analyses (something that would be impossible on this particular set of specimens as they were taken from sediment cores) there is no way of being absolutely certain of the species of a specimen. However, by using the consensus of the more experienced foraminiferal workers (see Sect. 2.1), we have aimed to obtain as “correct” a taxonomy as possible (see Al-Sabouni et al., 2018, for further discussion of this point). This method might tend to cause a slight inflation in the accuracy of the experienced workers, as they are the ones who defined what is correct; however, having an external judge (who was not otherwise involved in the study) for specimens where there was disagreement should reduce any impact of this effect.
Beyond the variables we were able to model, there are a number of other factors which
are likely to contribute to accuracy in the identification of planktonic
foraminifera. The power of the microscope being used for the analysis is
likely to have a significant effect, particularly at the smaller size
fraction. In this study everyone used the same model of microscope to remove
any variation from this factor. However, in order to obtain sufficient
microscopes for everyone on the course, it was necessary to use relatively
low powered (35
The mixed-effects models indicate that the largest variation in
identification outside the variables we have modelled comes from
specimen-level differences. Even after accounting for species identity and
size variation within a species, some specimens remain more challenging to
identify. The specimens used in this study were chosen to at least have all
the defining characteristics, making them easier to identify than more
damaged or fragmented specimens. However, they were taken from a typical
field sample, so they had a certain amount of sediment still attached making
some identifications more challenging. For instance, detecting the presence
of supplementary apertures for distinguishing between
Species identification was the next most important random effect, whilst person-specific factors (other than experience which was a fixed effect) only had a variance of 0.11. This suggests that the main variation between people occurs as a result of their experience. Gender had no influence on the accuracy of identifications. Additionally, an individual's results can vary over time (Zachariasse et al., 1978). In this study participants were encouraged to focus on accuracy rather than speed in their identifications. Where researchers are working under more time pressure, identifications are likely to be less accurate. Factors such as how tired the participant is, how long they have been identifying samples for that day and whether they are expecting to find a particular species in a sample are also likely to have a small effect on the analysis, but quantifying these additional effects was outside the scope of this study.
The way the specimens were presented might have reduced the accuracy of the identifications. For practical purposes (to enable specimen-level identification), each specimen was placed individually in a slide well. Whilst the presentation we used is more realistic than fixed specimens or images (cf. Al-Sabouni et al., 2018), it is still not completely realistic. More typically, specimens are grouped by species during identification, meaning that morphologically distinct misidentifications are more likely to stand out. Although we were unable to test for the potential positive effects of this practice, we advise grouping specimens by species to further reduce the chances of misidentification.
This analysis focussed on one specific time period; however, Zachariasse et al. (1978) pointed out that delimiting species, particularly if multiple samples are compared through time, is challenging with planktonic foraminifera as a result of their very high resolution fossil record. Species descriptions are based on the concept of types, where specimens are related to a typical morphology. When the full ancestor–descendant lineage is present, however, some of the transitional forms will fit more than one morphospecies definition (Pearson, 1998). Our analysis has highlighted that confidence in a species concept tends to increase the accuracy of the identification. However, in a study where that species evolves, confidence in identification might be misplaced.
In this study, we show that one of the largest effects on accuracy was whether
scientists were confident in their identification of a specimen. Researcher
assessments of their own confidence are largely accurate – they know whether
they know – offering a natural path to improved accuracy by re-examining
problematic specimens more closely, with more use of literature and/or other
people's expertise. Students who have had only a few days of training and who
are using low-powered microscopes (35
The median accuracy of 59 % for all the participants found in this study is lower than the 68 % found in Al-Sabouni et al. (2018). These results suggest that, unsurprisingly, it takes more than a few days of training for individuals to become reliable planktonic foraminiferal taxonomists. However, for those species they are confident in, the students were achieving more comparable answers. The more experienced participants here reached a median of 79 %, which is significantly higher than the more global set of participants in Al-Sabouni et al. (2018), suggesting that at least some of the disagreement among workers can come as a result of differences in taxonomic concepts. However, even among experienced participants using the same agreed taxonomy there are disagreements in species definitions.
More generally, there are several things that can be done to minimise taxonomic errors. Taxonomic training is demonstrably beneficial to provide a good grounding in taxonomic concepts. If a study is focussing on particular species (e.g. for isotope analyses), then it is worth considering closely related species or those with similar morphology that could be confused. Additionally, picking out clean and whole specimens is likely to give higher accuracy, as is revisiting specimens where the original identification was made without confidence. Before a full community analysis, it is advisable to revisit the taxonomy of the species that are likely to be present to determine how they can be differentiated. Even many of the more experienced workers in this analysis did not know all the species in the recent community, as that is not their main study focus.
Boltovskoy (1965) suggested that the more consistent use of photographs in taxonomic papers would reduce taxonomic problems by making it clearer which species concept is being used. This opinion is still valid today, particularly with the building of large datasets with data from multiple sources, such as the MARGO (Kučera et al., 2005a) and the ForCenS (Siccha and Kučera, 2017) databases. Considering the gradual evolutionary change of many lineages, it is unlikely that taxonomic disagreements will ever be fully resolved. However, if all studies included their taxonomic list and their main references, ideally with associated descriptions or photographs in several relevant orientations, such as is done in Rillo et al. (2016), it would make comparisons between studies more robust.
The specimens and their associated images are deposited in
the Natural History Museum, London (NHM UK PM PF 74556–74565). The data and
the code required to run these analyses are available in Fenton (2018;
The supplement related to this article is available online at:
ISF and AP designed the experiments. All authors, with the exception of PNP, performed the identifications; PNP acted as an arbitrator for taxonomic decisions. ISF performed the analyses. ISF prepared the manuscript with contributions from all co-authors.
The authors declare that they have no conflict of interest.
This study was initially performed at the Natural Environment Research Council Advanced Training Short Course NE/N019024/1 in Taxonomy and Biostratigraphy of Cenozoic Planktonic Foraminifera 2017. Isabel S. Fenton was funded by NERC Standard Grant NE/M003736/1 during the completion of this study. We would like to thank the Angela Marmont Centre for providing the microscopes used in this analysis. We would also like to thank Manuel Weinkauf, Pincelli Hull and an anonymous reviewer for their helpful comments which have improved this manuscript.
Edited by: Sev Kender
Reviewed by: Manuel Weinkauf, Pincelli Hull, and one anonymous referee