SemEval-2010 Task 8: Multi-Way Classification
of Semantic Relations Between Pairs of Nominals
Iris Hendrickx∗, Su Nam Kim†, Zornitsa Kozareva‡, Preslav Nakov§,
Ó Séaghdha¶, Sebastian Padó, Marco Pennacchiotti∗∗,
Lorenza Romano††, Stan Szpakowicz‡‡
[...] way classification of semantic relations between [...] to the problem and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.

SemEval-2010 Task 8 focused on semantic relations between pairs of nominals. For example, tea and ginseng are in an ENTITY-ORIGIN relation in “The cup contained tea from dried ginseng.” The automatic recognition of semantic relations has many applications, such as information extraction, document summarization, machine translation, or construction of thesauri and semantic networks. It can also facilitate auxiliary tasks such as word sense disambiguation, language modeling, paraphrasing, and recognizing textual entailment.

Our goal was to create a testbed for automatic classification of semantic relations. In developing the task we met several challenges: selecting a suitable set of relations, specifying the annotation procedure, and deciding on the details of the task itself. They are discussed briefly in Section 2; see also Hendrickx et al. (2009), which includes a survey of related work. The direct predecessor of Task 8 was Classification of semantic relations between nominals, Task 4 at SemEval-1 (Girju et al., 2009). That task had a separate binary-labeled dataset for each of seven relations. We have defined SemEval-2010 Task 8 as multi-way classification, where the label for each example must be chosen from the complete set of ten relations. We have also produced more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.

We first decided on an inventory of semantic relations. Ideally, it should be exhaustive (enable the description of relations between any pair of nominals) and mutually exclusive (each pair of nominals in context should map onto only one relation). The literature, however, suggests that no relation inventory satisfies both needs, and, in practice, some trade-off between them must be accepted.

As a pragmatic compromise, we selected nine relations with coverage sufficiently broad to be of general and practical interest. We aimed at avoiding semantic overlap as much as possible. We included, however, two groups of strongly related relations (ENTITY-ORIGIN / ENTITY-DESTINATION and COMPONENT-WHOLE / MEMBER-COLLECTION) to assess models’ ability to make such fine-grained distinctions.

Our inventory is given below. The first four were also used in SemEval-1 Task 4, but the annotation guidelines have been revised, and thus no complete [...]

Cause-Effect (CE). An event or object leads to an effect. Example: those cancers were caused [...]
Instrument-Agency (IA). An agent uses an instrument. [...]

Product-Producer (PP). A producer causes a product to exist. Example: a factory manufactures [...]

Content-Container (CC). [...] stored in a delineated area of space. Example: [...]

Entity-Origin (EO). An entity is coming or is derived from an origin (e.g., position or material). Example: letters from foreign countries

Entity-Destination (ED). An entity is moving towards a destination. Example: the boy went to bed

Component-Whole (CW). An object is a component of a larger whole. Example: my apartment [...]

Member-Collection (MC). [...] nonfunctional part of a collection. Example: [...]

Communication-Topic (CT). An act of communication, written or spoken, is about a topic.

∗ University of Lisbon, [email protected]
† University of Melbourne, [email protected]
‡ University of Alicante, [email protected]
§ National University of Singapore, [email protected]
¶ University of Cambridge, [email protected]
University of Stuttgart, [email protected]
‡‡ University of Ottawa and Polish Academy of Sciences,

We defined a set of general annotation guidelines as well as detailed guidelines for each semantic relation. Here, we describe the general guidelines, which delineate the scope of the data to be collected and state general principles relevant to the [...]

Our objective is to annotate instances of semantic relations which are true in the sense of holding in the most plausible truth-conditional interpretation of the sentence. This is in the tradition of the Textual Entailment or Information Validation paradigm (Dagan et al., 2006), and in contrast to “aboutness” annotation such as semantic roles (Carreras and Màrquez, 2004; Carreras and Màrquez, 2005) or the BioNLP 2009 task (Kim et al., 2009) where negated relations are also labelled as positive. Similarly, we exclude instances of semantic relations which hold only in hypothetical or counterfactual scenarios. In practice, this means disallowing annotations within the scope of modals or negations, e.g., “Smoking may/may not [...]

We accept as relation arguments only noun phrases with common-noun heads. This distinguishes our task from much work in Information Extraction, which tends to focus on specific classes of named entities and on considerably more fine-grained relations than we do. Named entities are a specific category of nominal expressions best dealt with using techniques which do not apply to common nouns. We only mark up the semantic heads of nominals, which usually span a single word, except for lexicalized terms such as science fiction.

We also impose a syntactic locality requirement on example candidates, thus excluding instances where the relation arguments occur in separate sentential clauses. Permissible syntactic patterns include simple and relative clauses, compounds, and pre- and post-nominal modification. In addition, we did not annotate examples whose interpretation relied on discourse knowledge, which led to the exclusion of pronouns as arguments. See the guidelines for details on other issues (noun compounds, aspectual phenomena, temporal relations).

The full task guidelines are available at docs.google.

The annotation took place in three rounds. First, we manually collected around 1,200 sentences for each relation through pattern-based Web search. In order to ensure a wide variety of example sentences, we used a substantial number of patterns for each relation, typically between one hundred and several hundred. Importantly, in the first round, the relation itself was not annotated: the goal was merely to collect positive and near-miss candidate instances. A rough aim was to have 90% of candidates which instantiate the target relation (“positive instances”).

In the second round, the collected candidates for each relation went to two independent annotators for labeling. Since we have a multi-way classification task, the annotators used the full inventory of nine relations plus OTHER. The annotation was made easier by the fact that the cases of overlap were largely systematic, arising from general phenomena like metaphorical use and situations where more than one relation holds. For example, there is a systematic potential overlap between CONTENT-CONTAINER and ENTITY-DESTINATION depending on whether the situation described in the sentence is static or dynamic, e.g., “When I came, the <e1>apples</e1> were already put in the <e2>basket</e2>.” is CC(e1, e2), while “Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>.” is ED(e1, e2).

In the third round, the remaining disagreements were resolved, and, if no consensus could be achieved, the examples were removed. Finally, we collected all positive examples for each relation from the respective dataset, and sampled OTHER examples from the examples labelled with this pseudo-relation from all datasets. The final merged
dataset has a size of 10,717 instances. We released 8,000 for training and kept the rest for testing.2

Table 1 shows some statistics about the dataset. The first column (Freq) shows the absolute and relative frequencies of each relation. The second column (Pos) shows that the average share of positive instances was closer to 75% than to 90%, indicating that the patterns catch a substantial amount of “near-miss” cases. However, this effect varies a lot across relations, causing the non-uniform relation distribution in the test set (first row).3 After the second round, we also computed inter-annotator agreement (third column, IAA). Inter-annotator agreement was computed on the sentence level, as the percentage of sentences for which the two annotations were identical. That is, these figures can be interpreted as exact-match accuracies. We do not report Kappa, since chance agreement on preselected candidates is difficult to estimate.4 IAA is between 60% and 95%, again with large relation-dependent variation. Some of the relations were particularly easy to annotate, notably CONTENT-CONTAINER, despite the systematic ambiguity mentioned above, which can be resolved through relatively clear criteria. ENTITY-ORIGIN was the hardest relation to annotate. We encountered ontological difficulties in defining both Entity (e.g., in contrast to Effect) and Origin (as opposed to Cause). Our numbers are on average around 10% higher than those reported by Girju et al. (2009). This may be a side effect of our data collection method. To gather 1,200 examples in realistic time, we had to seek productive search query patterns, which invited a certain homogeneity. For example, many queries for CONTENT-CONTAINER centered on “usual suspects” such as box or suitcase. Many instances of MEMBER-COLLECTION were collected on the basis of available lists of collective names.

Table 1: Annotation Statistics. Freq: Absolute and relative frequency in the dataset; Pos: percentage of “positive” relation instances in the candidate set; [...]

2 This set includes 891 examples from SemEval-1 Task 4. We re-annotated them and assigned them as the last examples of the training dataset to ensure that the test set was unseen.
3 To what extent our candidate selection produces a biased sample is a question that we cannot address within this paper.
4 We do not report Pos or IAA for OTHER, since OTHER is a pseudo-relation that was not annotated in its own right. The numbers would therefore not be comparable to other relations.

The participating systems had to solve the following task: given a sentence and two tagged nominals, predict the relation between those nominals and the [...]

We released a detailed scorer which outputs (1) a confusion matrix, (2) accuracy and coverage, (3) precision (P), recall (R), and F1-Score for each relation, (4) micro-averaged P, R, F1, (5) macro-averaged P, R, F1. For (4) and (5), the calculations ignored the OTHER relation. Our official scoring metric is macro-averaged F1-Score for (9+1)-way classification, taking directionality into account.

The teams were asked to submit test data predictions for varying fractions of the training data. Specifically, we requested results for the first 1000, 2000, 4000, and 8000 training instances, called TD1 through TD4. TD4 was the full training set.

Table 2: Participants of SemEval-2010 Task 8. Res: Resources used (WN: WordNet data; WP: Wikipedia data; S: syntax; LC: Levin classes; G: Google n-grams, RT: Roget’s Thesaurus, PB/NB: PropBank/NomBank). Class: Classification style (ME: Maximum Entropy; BN: Bayes Net; DR: Decision Rules/Trees; CRF: Conditional Random Fields; 2S: two-step classification)

Table 3: F1-Score of all submitted systems on the test dataset as a function of training data: TD1=1000, TD2=2000, TD3=4000, TD4=8000 training examples. Official results are calculated on TD4. The results marked with ∗ were submitted after the deadline. The best-performing run for each participant is italicized.

Table 2 lists the participants and provides a rough overview of the system features. Table 3 shows the results. Unless noted otherwise, all quoted numbers [...] the teams by the performance of their best system on TD4, since a per-system ranking would favor teams with many submitted runs. UTD submitted the best system, with a performance of over 82%, more than 4% better than the second-best system. FBK IRST places second, with 77.62%, a tiny margin ahead of ISI (77.57%). Notably, the ISI system outperforms the FBK IRST system for TD1 to TD3, where it was second-best. The accuracy numbers for TD4 (Acc TD4) lead to the same overall ranking: micro- versus macro-averaging does not appear to make much difference either. A random baseline gives an uninteresting score of 6%. Our baseline system is a simple Naive Bayes classifier which relies on words in the sentential context only; two systems scored below this baseline.

As for the amount of training data, we see a substantial improvement for all systems between TD1
and TD4, with diminishing returns for the transition between TD3 and TD4 for many, but not all, systems. Overall, the differences between systems are smaller for TD4 than they are for TD1. The spread between the top three systems is around 10% at TD1, but below 5% at TD4. Still, there are clear differences in the influence of training data size even among systems with the same overall architecture. Notably, ECNU-SR-4 is the second-best system at TD1 (67.95%), but gains only 7% from the eightfold increase of the size of the training data. At the same time, ECNU-SR-3 improves from less than 40% to almost 69%. The difference between the systems is that ECNU-SR-4 uses a multi-way classifier including the class OTHER, while ECNU-SR-3 uses binary classifiers and assigns OTHER if no other relation was assigned with p>0.5. It appears that these probability estimates for classes are only reliable enough for TD3 and TD4.

The Influence of System Architecture. [...] investigate the classification scheme and the resources used. Almost all systems used either MaxEnt or SVM classifiers, with no clear advantage for either. Similarly, two systems, UTD and ISTI (rank 1 and 6) split the task into two classification steps (relation and direction), but the 2nd- and 3rd-ranked systems do not. The use of a sequence model did not show a benefit either.

The systems use a variety of resources. Generally, richer feature sets lead to better performance (although the differences are often small – compare the different FBK IRST systems). This can be explained by the need for semantic generalization from training to test data. This need can be addressed using WordNet (contrast ECNU-1 to -3 with ECNU-4 to -6), the Google n-gram collection (see ISI and UTD), or a “deep” semantic resource (FBK IRST uses Cyc). Yet, most of these resources are also included in the less successful systems, so beneficial integration of knowledge sources into semantic relation classification seems to be difficult.

[...] the systems suggest that it might be possible to achieve improvements by building an ensemble system. When we combine the top three systems (UTD, FBK IRST-12VBCA, and ISI) by predicting their majority vote, or OTHER if there was none, we obtain a small improvement over the UTD system with an F1-Score of 82.79%. A combination of
the top five systems using the same method shows a worse performance, however (80.42%). This suggests that the best system outperforms the rest by a margin that cannot be compensated with system combination, at least not with a crude majority vote. We see a similar pattern among the ECNU systems, where the ECNU-SR-7 combination system is outperformed by ECNU-SR-5, presumably since it incorporates the inferior ECNU-SR-1 system.

[...] performance on individual relations, especially the extremes. There are very stable patterns across all systems. The best relation (presumably the easiest to classify) is CE, far ahead of ED and MC. Notably, the performance for the best relation is 75% or above for almost all systems, with comparatively small differences between the systems. The hardest relation is generally IA, followed by PP.5 Here, the spread among the systems is much larger: the highest-ranking systems outperform others on the difficult relations. Recall was the main problem for both IA and PP: many examples of these two relations are misclassified, most frequently as OTHER. Even at TD4, these datasets seem to be less homogeneous than the others. Intriguingly, PP shows a very high inter-annotator agreement (Table 1). Its difficulty may therefore be due not to questionable annotation, but to genuine variability, or at least the selection of difficult patterns by the dataset creator. Conversely, MC, among the easiest relations to model, shows only a modest IAA.

5 The relation OTHER, which we ignore in the overall F1-score, does even worse, often below 40%. This is to be expected, since the OTHER examples in our datasets are near misses for other relations, thus making a very incoherent class.

[...] that are classified incorrectly by all systems. We analyze them, looking for sources of errors. In addition to a handful of annotation errors and some borderline cases, they are made up of instances which illustrate the limits of current shallow modeling approaches in that they require more lexical knowledge and complex reasoning. A case in point: The bottle carrier converts your <e1>bottle</e1> into a <e2>canteen</e2>. [...] OTHER is misclassified either as CC (due to the nominals) or as ED (because of the preposition into). Another example: [...] <e1>Rudders</e1> are used by <e2>towboats</e2> and other vessels that require a high degree of manoeuvrability. This is an instance of CW misclassified as IA, probably on account of the verb use which is a frequent [...]

There is little doubt that 19-way classification is a non-trivial challenge. It is even harder when the domain is lexical semantics, with its idiosyncrasies, and when the classes are not necessarily disjoint, despite our best intentions. It speaks to the success of the exercise that the participating systems’ performance was generally high, well over an order of magnitude above random guessing. This may be due to the impressive array of tools and lexical-semantic resources deployed by the participants.

Section 4 suggests a few ways of interpreting and analyzing the results. Long-term lessons will undoubtedly emerge from the workshop discussion. One optimistic-pessimistic conclusion concerns the size of the training data. The notable gain TD3 → TD4 suggests that even more data would be even better, but that is so much easier said than done: it took the organizers in excess of 1000 person-hours to pin down the problem, hone the guidelines and relation definitions, construct sufficient amounts of trustworthy training data, and run the task.

X. Carreras and L. Màrquez. 2004. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. In Proc. CoNLL-04, Boston, MA.

X. Carreras and L. Màrquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. [...]

I. Dagan, O. Glickman, and B. Magnini. 2006. Machine Learning Challenges, volume 3944 of Lecture Notes in Computer Science, chapter The PASCAL Recognising Textual Entailment Challenge, pages [...]

R. Girju, P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2009. Classification of Semantic Relations between Nominals. Language Resources and Evaluation, 43(2):105–121.

I. Hendrickx, S. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz. 2009. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. In Proc. NAACL Workshop on Semantic Evaluations, Boulder, CO.

J. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii. 2009. Overview of BioNLP’09 Shared Task on Event Extraction. In Proc. BioNLP09, Boulder, CO.
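
The official metric described in the task definition (macro-averaged F1 for (9+1)-way classification, ignoring OTHER and taking directionality into account) can be illustrated with a minimal sketch. This is not the released scorer. The label format `CE(e1,e2)`, the helper names `relation_of` and `macro_f1`, and the choice to average over the relations that actually occur in the gold or predicted labels (rather than a fixed list of nine) are all assumptions made for illustration.

```python
from collections import Counter


def relation_of(label):
    """Strip the direction: 'CE(e1,e2)' -> 'CE'; 'OTHER' stays 'OTHER'.
    The label format is an assumption, not taken from the paper."""
    return label.split("(", 1)[0]


def macro_f1(gold, pred, ignore="OTHER"):
    """Macro-averaged F1 over relations, skipping the `ignore` class.
    A prediction counts as correct only if relation AND direction match."""
    true_pos, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        n_gold[relation_of(g)] += 1
        n_pred[relation_of(p)] += 1
        if g == p:  # full label equality enforces directionality
            true_pos[relation_of(g)] += 1

    relations = sorted((set(n_gold) | set(n_pred)) - {ignore})
    f1_scores = []
    for rel in relations:
        precision = true_pos[rel] / n_pred[rel] if n_pred[rel] else 0.0
        recall = true_pos[rel] / n_gold[rel] if n_gold[rel] else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


gold = ["CE(e1,e2)", "CE(e2,e1)", "ED(e1,e2)", "OTHER"]
pred = ["CE(e1,e2)", "CE(e1,e2)", "ED(e1,e2)", "ED(e1,e2)"]
print(round(macro_f1(gold, pred), 4))  # 0.5833
```

In the toy example, the second prediction names the right relation with the wrong direction, so it counts against both precision and recall for CE. The OTHER instance predicted as ED lowers the precision of ED, although OTHER itself is not averaged.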
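
The system combination experiment reported in the results (predict the majority vote of several systems, or OTHER if there is none) can be sketched as follows. The function name `majority_vote` and the reading of "majority" as a strict majority (more than half of the systems) are assumptions. The paper does not spell out the rule used for the five-system combination.

```python
from collections import Counter


def majority_vote(system_predictions, fallback="OTHER"):
    """Combine per-instance labels from several systems.

    `system_predictions` holds one list of labels per system, all of the
    same length. A label wins only with a strict majority of the systems
    (an assumption); otherwise the fallback label is used.
    """
    combined = []
    for labels in zip(*system_predictions):
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count > len(labels) / 2 else fallback)
    return combined


sys_a = ["CE(e1,e2)", "IA(e2,e1)", "PP(e1,e2)"]
sys_b = ["CE(e1,e2)", "CW(e1,e2)", "PP(e1,e2)"]
sys_c = ["ED(e1,e2)", "OTHER", "PP(e1,e2)"]
print(majority_vote([sys_a, sys_b, sys_c]))
```

For the second instance the three hypothetical systems disagree completely, so the combination falls back to OTHER.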