nlpado.de

SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals
Iris Hendrickx∗, Su Nam Kim†, Zornitsa Kozareva‡, Preslav Nakov§, Diarmuid Ó Séaghdha¶, Sebastian Padó, Marco Pennacchiotti∗∗, Lorenza Romano††, Stan Szpakowicz‡‡
∗ University of Lisbon; † University of Melbourne; ‡ University of Alicante; § National University of Singapore; ¶ University of Cambridge; University of Stuttgart; ‡‡ University of Ottawa and Polish Academy of Sciences
[...] multi-way classification of semantic relations between [...] to the problem and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.
SemEval-2010 Task 8 focused on semantic relations between pairs of nominals. For example, tea and ginseng are in an ENTITY-ORIGIN relation in “The cup contained tea from dried ginseng.”. The automatic recognition of semantic relations has many applications, such as information extraction, document summarization, machine translation, or construction of thesauri and semantic networks. It can also facilitate auxiliary tasks such as word sense disambiguation, language modeling, paraphrasing, and recognizing textual entailment.
Our goal was to create a testbed for automatic classification of semantic relations. In developing the task we met several challenges: selecting a suitable set of relations, specifying the annotation procedure, and deciding on the details of the task itself. They are discussed briefly in Section 2; see also Hendrickx et al. (2009), which includes a survey of related work. The direct predecessor of Task 8 was Classification of semantic relations between nominals, Task 4 at SemEval-1 (Girju et al., 2009). That task had a separate binary-labeled dataset for each of seven relations. We have defined SemEval-2010 Task 8 as multi-way classification, where the label for each example must be chosen from the complete set of ten relations. We have also produced more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.
We first decided on an inventory of semantic relations. Ideally, it should be exhaustive (enable the description of relations between any pair of nominals) and mutually exclusive (each pair of nominals in context should map onto only one relation). The literature, however, suggests that no relation inventory satisfies both needs, and, in practice, some trade-off between them must be accepted.
As a pragmatic compromise, we selected nine relations with coverage sufficiently broad to be of general and practical interest. We aimed at avoiding semantic overlap as much as possible. We included, however, two groups of strongly related relations (ENTITY-ORIGIN / ENTITY-DESTINATION [...] COMPONENT-WHOLE / MEMBER-COLLECTION) to assess models’ ability to make such fine-grained distinctions.
Our inventory is given below. The first four were also used in SemEval-1 Task 4, but the annotation guidelines have been revised, and thus no complete [...]
Cause-Effect (CE). An event or object leads to an effect. Example: those cancers were caused [...]
Instrument-Agency (IA). An agent uses an instrument [...]
Product-Producer (PP). A producer causes a product to exist. Example: a factory man[...]
Content-Container (CC). [...] stored in a delineated area of space. Example: [...]
Entity-Origin (EO). An entity is coming or is derived from an origin (e.g., position or material). Example: letters from foreign countries
Entity-Destination (ED). An entity is moving towards a destination. Example: the boy went to bed
Component-Whole (CW). An object is a component of a larger whole. Example: my apart[...]
Member-Collection (MC). [...] nonfunctional part of a collection. Example: [...]
Communication-Topic (CT). An act of communication, written or spoken, is about a topic.
We defined a set of general annotation guidelines as well as detailed guidelines for each semantic relation. Here, we describe the general guidelines, which delineate the scope of the data to be collected and state general principles relevant to the [...]
Our objective is to annotate instances of semantic relations which are true in the sense of holding in the most plausible truth-conditional interpretation of the sentence. This is in the tradition of the Textual Entailment or Information Validation paradigm (Dagan et al., 2006), and in contrast to “aboutness” annotation such as semantic roles (Carreras and Màrquez, 2004; Carreras and Màrquez, 2005) or the BioNLP 2009 task (Kim et al., 2009) where negated relations are also labelled as positive. Similarly, we exclude instances of semantic relations which hold only in hypothetical or counterfactual scenarios. In practice, this means disallowing annotations within the scope of modals or negations, e.g., “Smoking may/may not [...]
We accept as relation arguments only noun phrases with common-noun heads. This distinguishes our task from much work in Information Extraction, which tends to focus on specific classes of named entities and on considerably more fine-grained relations than we do. Named entities are a specific category of nominal expressions best dealt with using techniques which do not apply to common nouns. We only mark up the semantic heads of nominals, which usually span a single word, except for lexicalized terms such as science fiction.
We also impose a syntactic locality requirement on example candidates, thus excluding instances where the relation arguments occur in separate sentential clauses. Permissible syntactic patterns include simple and relative clauses, compounds, and pre- and post-nominal modification. In addition, we did not annotate examples whose interpretation relied on discourse knowledge, which led to the exclusion of pronouns as arguments. See the guidelines for details on other issues (noun compounds, aspectual phenomena, temporal relations).
The full task guidelines are available at docs.google.
The annotation took place in three rounds. First, we manually collected around 1,200 sentences for each relation through pattern-based Web search. In order to ensure a wide variety of example sentences, we used a substantial number of patterns for each relation, typically between one hundred and several hundred. Importantly, in the first round, the relation itself was not annotated: the goal was merely to collect positive and near-miss candidate instances. A rough aim was to have 90% of candidates which instantiate the target relation (“positive instances”).
In the second round, the collected candidates for each relation went to two independent annotators for labeling. Since we have a multi-way classification task, the annotators used the full inventory of nine relations plus OTHER. The annotation was made easier by the fact that the cases of overlap were largely systematic, arising from general phenomena like metaphorical use and situations where more than one relation holds. For example, there is a systematic potential overlap between CONTENT-CONTAINER and ENTITY-DESTINATION, depending on whether the situation described in the sentence is static or dynamic, e.g., “When I came, the <e1>apples</e1> were already put in the <e2>basket</e2>.” is CC(e1, e2), while “Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>.” is ED(e1, e2).
In the third round, the remaining disagreements were resolved, and, if no consensus could be achieved, the examples were removed. Finally, we collected all positive examples for each relation from the respective dataset, and sampled OTHER examples from the examples labelled with this pseudo-relation from all datasets. The final merged dataset has a size of 10,717 instances. We released 8,000 for training and kept the rest for testing.² Table 1 shows some statistics about the dataset.
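The annotated examples quoted above mark the two nominals inline with <e1>…</e1> and <e2>…</e2> and label them with a directed relation such as CC(e1, e2). As a minimal sketch of how such a sentence could be read programmatically, the snippet below extracts the two nominals and the untagged text. The helper names `parse_example` and `format_label` are hypothetical and are not part of the task release; the markup is assumed to be exactly as in the examples above.

```python
import re

# Hypothetical helpers (not part of the task release). They assume the inline
# <e1>...</e1> / <e2>...</e2> markup shown in the annotated examples above.
TAG = re.compile(r"<(e[12])>(.*?)</\1>")

def parse_example(sentence):
    """Return (plain_text, {"e1": ..., "e2": ...}) for one tagged sentence."""
    matches = TAG.findall(sentence)
    # Require exactly one <e1> span and exactly one <e2> span.
    if sorted(tag for tag, _ in matches) != ["e1", "e2"]:
        raise ValueError("expected exactly one <e1> and one <e2> span")
    nominals = dict(matches)
    # Strip the tags but keep the nominal text.
    plain = TAG.sub(lambda m: m.group(2), sentence)
    return plain, nominals

def format_label(relation, first, second):
    """Render a directed label in the style used above, e.g. 'ED(e1, e2)'."""
    return f"{relation}({first}, {second})"

plain, nominals = parse_example(
    "Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>."
)
label = format_label("ED", "e1", "e2")
```

Keeping the direction in the label matters because the same relation with swapped arguments counts as a different class in this task.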
² This set includes 891 examples from SemEval-1 Task 4. We re-annotated them and assigned them as the last examples of the training dataset to ensure that the test set was unseen.
The first column (Freq) shows the absolute and relative frequencies of each relation. The second column (Pos) shows that the average share of positive instances was closer to 75% than to 90%, indicating that the patterns catch a substantial amount of “near-miss” cases. However, this effect varies a lot across relations, causing the non-uniform relation distribution in the test set (first row).³ After the second round, we also computed inter-annotator agreement (third column, IAA). Inter-annotator agreement was computed on the sentence level, as the percentage of sentences for which the two annotations were identical. That is, these figures can be interpreted as exact-match accuracies. We do not report Kappa, since chance agreement on preselected candidates is difficult to estimate.⁴ IAA is between 60% and 95%, again with large relation-dependent variation. Some of the relations were particularly easy to annotate, notably CONTENT-CONTAINER, despite the systematic ambiguity mentioned above, which can be resolved through relatively clear criteria. ENTITY-ORIGIN was the hardest relation to annotate. We encountered ontological difficulties in defining both Entity (e.g., in contrast to Effect) and Origin (as opposed to Cause). Our numbers are on average around 10% higher than those reported by Girju et al. (2009). This may be a side effect of our data collection method. To gather 1,200 examples in realistic time, we had to seek productive search query patterns, which invited certain homogeneity. For example, many queries for CONTENT-CONTAINER centered on “usual suspects” such as box or suitcase. Many instances of MEMBER-COLLECTION were collected on the basis of available lists of collective names.
³ To what extent our candidate selection produces a biased sample is a question that we cannot address within this paper.
Table 1: Annotation Statistics. Freq: Absolute and relative frequency in the dataset; Pos: percentage of “positive” relation instances in the candidate set; [...]
The participating systems had to solve the following task: given a sentence and two tagged nominals, predict the relation between those nominals and the [...]
We released a detailed scorer which outputs (1) a confusion matrix, (2) accuracy and coverage, (3) precision (P), recall (R), and F1-Score for each relation, (4) micro-averaged P, R, F1, (5) macro-averaged P, R, F1. For (4) and (5), the calculations ignored the OTHER relation. Our official scoring metric is macro-averaged F1-Score for (9+1)-way classification, taking directionality into account.
The teams were asked to submit test data predictions for varying fractions of the training data. Specifically, we requested results for the first 1000, 2000, 4000, and 8000 training instances, called TD1 through TD4. TD4 was the full training set.
Table 2 lists the participants and provides a rough overview of the system features. Table 3 shows the results. Unless noted otherwise, all quoted numbers [...]
[...] the teams by the performance of their best system on TD4, since a per-system ranking would favor teams with many submitted runs. UTD submitted the best system, with a performance of over 82%, more than 4% better than the second-best system. FBK IRST places second, with 77.62%, a tiny margin ahead of ISI (77.57%). Notably, the ISI system outperforms the FBK IRST system for TD1 to TD3, where it was second-best. The accuracy numbers for TD4 (Acc TD4) lead to the same overall ranking: micro- versus macro-averaging does not appear to make much difference either. A random baseline gives an uninteresting score of 6%. Our baseline system is a simple Naive Bayes classifier which relies on words in the sentential context only; two systems scored below this baseline.
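The official metric described earlier in this section, macro-averaged F1 over the nine relations with OTHER ignored and directionality taken into account, can be illustrated with a short sketch. This is not the released scorer: the label format ("CE(e1,e2)", "OTHER") and the way direction is folded in (a prediction counts as correct only when both relation and direction match, while precision and recall are pooled per relation) are assumptions made for illustration.

```python
def relation_of(label):
    """'CE(e1,e2)' -> 'CE'; 'OTHER' -> 'OTHER'."""
    return label.split("(", 1)[0]

def macro_f1(gold, pred, relations):
    """Macro-averaged F1 over `relations`; the caller leaves OTHER out.

    Assumption: a prediction is a true positive only if the relation AND
    the direction match the gold label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for rel in relations:
        tp = sum(1 for g, p in zip(gold, pred)
                 if g == p and relation_of(g) == rel)
        n_pred = sum(1 for p in pred if relation_of(p) == rel)
        n_gold = sum(1 for g in gold if relation_of(g) == rel)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    # Unweighted mean over relations, so rare relations count as much as frequent ones.
    return sum(scores) / len(scores) if scores else 0.0

gold = ["CE(e1,e2)", "CE(e2,e1)", "ED(e1,e2)", "OTHER"]
pred = ["CE(e1,e2)", "CE(e1,e2)", "OTHER", "OTHER"]
score = macro_f1(gold, pred, relations=["CE", "ED"])
```

In this toy input the second example has the right relation but the wrong direction, so it lowers both precision and recall for CE. OTHER never contributes a score of its own, but predicting OTHER for a real relation still costs recall.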
⁴ We do not report Pos or IAA for OTHER, since OTHER is a pseudo-relation that was not annotated in its own right. The numbers would therefore not be comparable to other relations.
Table 2: Participants of SemEval-2010 Task 8. Res: Resources used (WN: WordNet data; WP: Wikipedia data; S: syntax; LC: Levin classes; G: Google n-grams; RT: Roget’s Thesaurus; PB/NB: PropBank/NomBank). Class: Classification style (ME: Maximum Entropy; BN: Bayes Net; DR: Decision Rules/Trees; CRF: Conditional Random Fields; 2S: two-step classification)
Table 3: F1-Score of all submitted systems on the test dataset as a function of training data: TD1=1000, TD2=2000, TD3=4000, TD4=8000 training examples. Official results are calculated on TD4. The results marked with ∗ were submitted after the deadline. The best-performing run for each participant is italicized.
As for the amount of training data, we see a substantial improvement for all systems between TD1 and TD4, with diminishing returns for the transition between TD3 and TD4 for many, but not all, systems. Overall, the differences between systems are smaller for TD4 than they are for TD1. The spread between the top three systems is around 10% at TD1, but below 5% at TD4. Still, there are clear differences in the influence of training data size even among systems with the same overall architecture. Notably, ECNU-SR-4 is the second-best system at TD1 (67.95%), but gains only 7% from the eightfold increase of the size of the training data. At the same time, ECNU-SR-3 improves from less than 40% to almost 69%. The difference between the systems is that ECNU-SR-4 uses a multi-way classifier including the class OTHER, while ECNU-SR-3 uses binary classifiers and assigns OTHER if no other relation was assigned with p>0.5. It appears that these probability estimates for classes are only reliable enough for TD3 and TD4.
The Influence of System Architecture. [...] investigate the classification scheme and the resources used. Almost all systems used either MaxEnt or SVM classifiers, with no clear advantage for either. Similarly, two systems, UTD and ISTI (rank 1 and 6) split the task into two classification steps (relation and direction), but the 2nd- and 3rd-ranked systems do not. The use of a sequence model did not show a benefit either.
The systems use a variety of resources. Generally, richer feature sets lead to better performance (although the differences are often small; compare the different FBK IRST systems). This can be explained by the need for semantic generalization from training to test data. This need can be addressed using WordNet (contrast ECNU-1 to -3 with ECNU-4 to -6), the Google n-gram collection (see ISI and UTD), or a “deep” semantic resource (FBK IRST uses Cyc). Yet, most of these resources are also included in the less successful systems, so beneficial integration of knowledge sources into semantic relation classification seems to be difficult.
[...] This is an instance of CW misclassified as IA, probably on account of the verb use which is a frequent [...]
[...] the systems suggest that it might be possible to achieve improvements by building an ensemble system. When we combine the top three systems (UTD, FBK IRST-12VBCA, and ISI) by predicting their majority vote, or OTHER if there was none, we obtain a small improvement over the UTD system with an F1-Score of 82.79%. A combination of the top five systems using the same method shows a worse performance, however (80.42%). This suggests that the best system outperforms the rest by a margin that cannot be compensated with system combination, at least not with a crude majority vote.
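The majority-vote combination described above can be sketched in a few lines. The function names are hypothetical, and one detail is an assumption: "majority" is read here as a strict majority (more than half of the systems agreeing on the same directed label), with OTHER as the fallback when no label reaches it. The text does not say how ties or pluralities were handled for the five-system combination.

```python
from collections import Counter

def majority_vote(predictions, fallback="OTHER"):
    """Return the label predicted by more than half of the systems, else `fallback`.

    Assumption: labels that differ only in direction count as different votes.
    """
    if not predictions:
        return fallback
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else fallback

def combine(system_outputs, fallback="OTHER"):
    """Combine aligned label lists, one list per system, example by example."""
    return [majority_vote(votes, fallback) for votes in zip(*system_outputs)]

# Toy outputs of three hypothetical systems on three test examples.
system_a = ["CE(e1,e2)", "IA(e1,e2)", "PP(e1,e2)"]
system_b = ["CE(e1,e2)", "CW(e1,e2)", "PP(e2,e1)"]
system_c = ["ED(e1,e2)", "CW(e1,e2)", "OTHER"]
combined = combine([system_a, system_b, system_c])
```

On the third example the three systems disagree, two of them only in direction, so the combination falls back to OTHER. This illustrates why such a vote is crude: it discards partial agreement on the relation itself.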
We see a similar pattern among the ECNU systems, where the ECNU-SR-7 combination system is outperformed by ECNU-SR-5, presumably since it incorporates the inferior ECNU-SR-1 system.
[...] performance on individual relations, especially the extremes. There are very stable patterns across all systems. The best relation (presumably the easiest to classify) is CE, far ahead of ED and MC. Notably, the performance for the best relation is 75% or above for almost all systems, with comparatively small differences between the systems. The hardest relation is generally IA, followed by PP.⁵ Here, the spread among the systems is much larger: even the highest-ranking systems outperform others on the difficult relations. Recall was the main problem for both IA and PP: many examples of these two relations are misclassified, most frequently as OTHER. Even at TD4, these datasets seem to be less homogeneous than the others. Intriguingly, PP shows a very high inter-annotator agreement (Table 1). Its difficulty may therefore be due not to questionable annotation, but to genuine variability, or at least the selection of difficult patterns by the dataset creator. Conversely, MC, among the easiest relations to model, shows only a modest IAA.
⁵ The relation OTHER, which we ignore in the overall F score, does even worse, often below 40%. This is to be expected, since the OTHER examples in our datasets are near misses for other relations, thus making a very incoherent class.
[...] that are classified incorrectly by all systems. We analyze them, looking for sources of errors. In addition to a handful of annotation errors and some borderline cases, they are made up of instances which illustrate the limits of current shallow modeling approaches in that they require more lexical knowledge and complex reasoning. A case in point: The bottle carrier converts your <e1>bottle</e1> into a <e2>canteen</e2>. [...] OTHER is misclassified either as CC (due to the nominals) or as ED (because of the preposition into). Another example: [...] <e1>Rudders</e1> are used by <e2>towboats</e2> and other vessels that require a high degree of manoeuvrability.
There is little doubt that 19-way classification is a non-trivial challenge. It is even harder when the domain is lexical semantics, with its idiosyncrasies, and when the classes are not necessarily disjoint, despite our best intentions. It speaks to the success of the exercise that the participating systems’ performance was generally high, well over an order of magnitude above random guessing. This may be due to the impressive array of tools and lexical-semantic resources deployed by the participants.
Section 4 suggests a few ways of interpreting and analyzing the results. Long-term lessons will undoubtedly emerge from the workshop discussion. One optimistic-pessimistic conclusion concerns the size of the training data. The notable gain TD3 → TD4 suggests that even more data would be better, but that is so much easier said than done: it took the organizers in excess of 1000 person-hours to pin down the problem, hone the guidelines and relation definitions, construct sufficient amounts of trustworthy training data, and run the task.
X. Carreras and L. Màrquez. 2004. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. In Proc. CoNLL-04, Boston, MA.
X. Carreras and L. Màrquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling.
I. Dagan, O. Glickman, and B. Magnini. 2006. Machine Learning Challenges, volume 3944 of Lecture Notes in Computer Science, chapter The PASCAL Recognising Textual Entailment Challenge, pages [...]
R. Girju, P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2009. Classification of Semantic Relations between Nominals. Language Resources and Evaluation, 43(2):105–121.
I. Hendrickx, S. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz. 2009. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. In Proc. NAACL Workshop on Semantic Evaluations, Boulder, CO.
J. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii. 2009. Overview of BioNLP’09 Shared Task on Event Extraction. In Proc. BioNLP09, Boulder, CO.

Source: http://www.nlpado.de/~sebastian/pub/papers/semeval10_hendrickx.pdf
