The notion of Base Concepts was introduced in the EuroWordNet project to reach maximum overlap and compatibility across wordnets in different languages while still allowing wordnets to be developed in a distributed way around the world, each wordnet keeping its own language-specific structure and lexicalization pattern. The Base Concepts are supposed to be the concepts that play the most important role in the various wordnets of different languages. This role was measured in terms of two main criteria:
The Base Concepts are thus the fundamental building blocks for establishing the relations in a wordnet and give information about the dominant lexicalization patterns in languages. Base Concepts should not be confused with Basic Level Concepts as defined by Rosch (1977). Basic Level Concepts are the result of a compromise between two conflicting principles of categorization:
As a result, Basic Level Concepts typically occur in the middle of hierarchies and have fewer than the maximum number of relations. Base Concepts mostly involve the first principle only. They are generalizations of features or semantic components and thus apply to a maximum number of concepts.
The following types of Base Concepts have been distinguished:
The selection of the Base Concepts is an approximation based on:
The structural properties of wordnets are partially arbitrary and thus only weakly indicative. The working assumption so far has been that independent selections from a large number of languages will still give a good approximation. The following properties have been used:
Sense frequencies are not available and word frequencies were shown to be unreliable. Furthermore, it should be noted that many wordnets are developed by expanding from Princeton WordNet and therefore do not contribute to the definition of the Global Base Concepts. Roughly two approaches have been followed for building wordnets:
Obviously, the merge approach would give more independent suggestions for BCs. However, the expand approach can still contribute if the resulting local wordnet structure is revised and validated in a later phase and the Base Concepts are then selected from it according to the same criteria.
In EuroWordNet, an initial set of 1024 Common Base Concepts (CBCs) was selected and defined as Princeton WordNet 1.5 synsets. These CBCs play a BC role in at least two independent wordnets. The languages in EuroWordNet are English, Dutch, German, French, Spanish, Italian, Czech and Estonian, but for the initial selection only English, Dutch, Spanish and Italian were used. For the 1024 CBCs, EuroWordNet defined a top-ontology that has been the common semantic framework for defining the relations in each individual wordnet separately. On the next page you can find a definition of the EuroWordNet CBCs and the top-ontology classification: EuroWordNet Base Concepts and Top-Ontology
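The "at least two independent wordnets" criterion can be illustrated with a minimal sketch. It assumes that each wordnet's local Base Concept selection has already been mapped to shared synset identifiers; the function name `common_base_concepts`, the language codes, and the toy identifiers are hypothetical and are not taken from the EuroWordNet data. The sketch does not check independence: wordnets built by the expand approach would have to be filtered out or validated beforehand.

```python
from collections import Counter


def common_base_concepts(selections, min_wordnets=2):
    """Return the synset IDs chosen as Base Concepts in at least
    `min_wordnets` of the given wordnets.

    `selections` maps a wordnet name to the collection of synset IDs
    that this wordnet selected locally as Base Concepts.
    """
    counts = Counter()
    for synsets in selections.values():
        # Use a set so that each wordnet votes at most once per synset.
        counts.update(set(synsets))
    return {synset for synset, n in counts.items() if n >= min_wordnets}


# Hypothetical toy selections, for illustration only.
selections = {
    "en": {"entity-1", "object-1", "animal-1"},
    "nl": {"object-1", "animal-1", "place-1"},
    "es": {"animal-1", "act-2"},
    "it": {"place-1", "feeling-1"},
}

cbcs = common_base_concepts(selections)
print(sorted(cbcs))  # ['animal-1', 'object-1', 'place-1']
```

Raising `min_wordnets` gives a stricter and smaller set; with the toy data above, a threshold of 3 leaves only `animal-1`.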
In the BalkaNet project, a similar approach was applied to another set of languages: Greek, Romanian, Serbian, Turkish and Bulgarian. BalkaNet extended the set to 4689 synsets and upgraded the mapping of the CBCs to Princeton WordNet 2.0. The 4689 CBCs as WordNet 2.0 synsets can be downloaded here:
4689 Common Base Concepts from EuroWordNet and BalkaNet as Wordnet2.0 synsets
The Base Concepts have been defined so far in two European projects, EuroWordNet and BalkaNet, where they played a crucial role in building the wordnets. More information can be found in this PowerPoint presentation: Building Wordnets, by Piek Vossen. Below is a short description of the approach.
Each wordnet was developed in two phases according to a top-down approach:
This approach guarantees that the cores of the wordnets are highly compatible and comparable, while language-specific structures and lexicalizations can still be expressed. For developing the core wordnets, we followed this approach:
For developing the extended wordnets, various criteria were used, among which:
Core wordnets typically contain between 5,000 and 10,000 synsets. Extended wordnets go beyond 20,000 synsets.