WikiWord/Excerpt

From BrightByte

Outline of a method for building a multilingual thesaurus from Wikipedia

English translation of selected chapters of the WikiWord thesis "Automatischer Aufbau eines multilingualen Thesaurus durch Extraktion semantischer und lexikalischer Relationen aus der Wikipedia" by Daniel Kinzler. Translation by the author. There is also an English-language paper written for Wikimania 2009: "WikiWord - multilingual image search and more".

Synopsis

The present thesis [1] describes and analyses methods to generate a multilingual thesaurus from the data of Wikipedia projects in different languages. Of special interest is the extraction of relations between terms (labels, words) and language-independent concepts, as well as relations between such concepts, such as subsumption, similarity and relatedness. To this end, the requirements and the available data will be analyzed, a prototype for extracting the desired data will be developed, and the data created using that prototype will be evaluated against the requirements.

Translator's note: some parts of the following text may appear redundant. This is because key ideas are reiterated throughout the thesis, and this excerpt especially reproduces parts of the text that describe the key ideas.

Fundamentals (Chapter 1)

Automatic language processing has gained importance over the last years: the growing flood of textual information creates the need to develop better methods to automatically index, search and filter text. It turned out that this requires not only appropriate methods and sufficient computing power and memory, but that machine-usable knowledge is also of crucial importance to the processing of natural language [1][2]. Such knowledge ranges from formal ontologies, as used by expert systems and for knowledge representation in artificial intelligence, to "soft", "flat" knowledge about language, for example identifying names, finding the end of a sentence or knowing which terms refer to which concept in which language.

The use of so-called "world knowledge" (common sense) in software systems is captured in the Knowledge as Power Hypothesis of B.G. Buchanan [3]:

The power of an intelligent program to perform its task well depends primarily on the quantity and quality of knowledge it has about that task.

And D.B. Lenat says about the need to generalize, i.e. the use of concept hierarchies[4]:

To behave intelligently in unexpected situations, an agent must be capable of falling back on increasingly general knowledge.

Lenat further states that programs without knowledge are brittle, because they can only perform tasks that have been fully foreseen by the programmer [5].

The goal of the present thesis is to create a collection of such knowledge, a thesaurus, automatically from the data provided by Wikipedia. Wikipedia appears well suited as a source for such data, since it is a large, well-structured and up-to-date corpus, which is also available under a free license and in a great number of languages.

The idea is to design a software system (henceforth WikiWord) which is able to extract knowledge about relations between, and terms for, concepts from Wikipedia. This knowledge is contained in the structure of Wikipedia more or less explicitly: relations are expressed through special syntactical structures (markup) as well as through conventions about the structure and format of content. This thesis aims to develop and test methods for evaluating these structures, especially methods that work without natural language processing, that is, without considering the actual textual content.

[...]

Idea (Chapter 3)

Translator's note: in the following text, the following terminology is used:
  • "articles" are Wikipedia pages that describe and define a concept. Other types of pages include redirects, disambiguations, lists and categories. More information can be found in the category Wikipedia policies and guidelines.
  • "Link text" refers to the visible surface text of wiki links (hyper links), also known as the link's label. For more information about the markup and other features used on Wikipedia pages, such as language links, categorization links, templates, etc, please refer to the category Wikipedia help.

WikiWord, the software system to be created, should extract the relations needed for a multilingual concept-based thesaurus in an efficient manner from Wikipedia (for a detailed list of requirements, see chapter 5, not translated). The basic idea is to limit the extraction to mining the different types of hyper links that may be used in Wikipedia articles (see appendix A.3, not translated):

  • Simple wiki links provide information about the semantic context of concepts — they can be used to derive semantic relatedness, among other things (compare sections 2.3 and 2.4, not translated, and especially [6]).
  • The link text provides information about which terms or labels (the link text) refer to which concept (the link target) (compare section 2.3, not translated, and especially [7][8][9]).
  • Categorization links provide a relation of subsumption (abstraction) (compare sections 2.3 and 2.4, not translated, and especially [10]).
  • Language links connect descriptions of the same concept (or similar concepts) in different languages. This allows semantic similarity of concepts to be derived, between languages and also within a single language (i.e. within a wiki). They can thus be used to build a multilingual thesaurus by combining concepts from different languages into language-independent (or pan-lingual) concepts (compare section 4.3). This appears to be a novel approach which has so far not been researched.
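
As an illustration of the link types listed above, the following Python sketch sorts the wiki links found on a page into plain links (with their link text), categorization links and language links. The regular expression, the function name classify_links and the hard-coded prefix sets are assumptions made for this example; they are not taken from WikiWord, whose actual patterns are described in chapter 8 (not translated).

```python
import re

# Matches [[target]] and [[target|link text]]; a simplification of real wiki syntax.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Hypothetical prefixes; a real wiki defines many more, and they differ per project.
CATEGORY_PREFIXES = {"category"}
LANGUAGE_PREFIXES = {"de", "en", "fr", "nl"}


def classify_links(wiki_text):
    """Sort the links in wiki_text into plain, category and language links."""
    links = {"plain": [], "category": [], "language": []}
    for match in WIKI_LINK.finditer(wiki_text):
        target = match.group(1).strip()
        label = match.group(2)
        prefix, sep, rest = target.partition(":")
        prefix = prefix.strip().lower()
        if sep and prefix in CATEGORY_PREFIXES:
            # For categorization links the part after "|" is a sort key, not a label.
            links["category"].append((rest.strip(), label))
        elif sep and prefix in LANGUAGE_PREFIXES:
            links["language"].append((prefix, rest.strip()))
        else:
            # Without explicit link text, the target itself is the visible label.
            links["plain"].append((target, label.strip() if label else target))
    return links


if __name__ == "__main__":
    sample = "The [[domestic cat|cat]] is a [[mammal]]. [[Category:Mammals|Cat]] [[de:Hauskatze]]"
    print(classify_links(sample))
```

Each plain link yields a (target, link text) pair, which is the raw material for both the relatedness relation and the assignment of terms to concepts.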

Besides wiki links, additional properties of wiki pages may be used, for example the templates used on the page. These can be used especially for classifying pages and concepts (see 8.5, 8.6 and 2.6, not translated). Also, redirects, disambiguation pages and some special "magic words" may be used to assign additional terms to a concept (see 8.9, 8.10 and 9.1, not translated).

This method of analysis, which focuses on wiki links and other markup elements, bypasses a number of difficulties encountered by classical methods of automatic thesaurus generation [11]: in particular, no natural language processing is performed, and even problematic tasks on the lexical level, like stemming, are avoided. Also, human knowledge encoded in the application of markup, rules and conventions on Wikipedia is used as directly as possible, instead of relying on statistics-based heuristics (compare 1.2, 8, 9.1, not translated).

Process (Chapter 4)

Transformations (fig. 4.1)

The approach WikiWord takes to generating the thesaurus can be seen as a sequence of transformations: fig. 4.1 shows data models and processing steps, in which each step converts the data from one model to the next. The implementation of WikiWord roughly follows this pattern (see section 7.1, not translated), although in the concrete realization some of the steps have been parallelized and nested (see section 7.2, not translated).

The individual transformations as well as the resulting data models will be described in the following sections.

Analysis of Wiki Text (Section 4.1)

The first transformation is the parsing and analysis of wiki text (markup) from the XML dump [12][13]. In this step, relevant information is extracted from the wiki text and represented in the resource model, which represents the wiki page and provides direct access to the properties of the page. Most of the textual content is discarded here; the page is regarded as an unsorted collection of features such as links, categories, templates, etc.

The resource model offers access specifically to the following properties of a wiki page:

  • The title of the page, as well as the namespace it belongs to
  • The text of the page, with different levels of processing, ranging from the unchanged wiki text from Wikipedia to plain text stripped of any markup, see section 8.3 (not translated).
  • all wiki links, that is, all wiki pages that are referenced from this page, along with the link text used in the links. The method for extracting and handling wiki links is described in 8.7 (not translated).
    Some wiki links have special semantics (compare A.3, not translated) and are interpreted by WikiWord in a special way. Such special wiki links define:
    • the categories a page belongs to
    • language links defined on the page
  • all sections (headings) on the page
  • all templates used on the page, including parameters applied to them (see A.5, not translated). The method for extracting template references is described in section 8.4 (not translated).
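
The properties listed above could be held in a simple record type. The following Python sketch is only meant to make the "unsorted collection of features" view concrete: the class name Resource, its field names and the helper extract_sections are assumptions for this example, and the actual resource model is the one defined in appendix D.1 (not translated).

```python
import re
from dataclasses import dataclass, field

# Wiki headings look like "== Title ==" with two to six equals signs on each side.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


@dataclass
class Resource:
    """Feature-bag view of one wiki page (all field names are illustrative)."""
    title: str
    namespace: str = ""
    wiki_text: str = ""
    links: list = field(default_factory=list)           # (target, link text) pairs
    categories: list = field(default_factory=list)      # (category, sort key) pairs
    language_links: dict = field(default_factory=dict)  # language code -> title
    sections: list = field(default_factory=list)        # heading texts
    templates: dict = field(default_factory=dict)       # template name -> parameters


def extract_sections(wiki_text):
    """Return the heading texts found in wiki_text, in document order."""
    return [match.group(2) for match in HEADING.finditer(wiki_text)]


if __name__ == "__main__":
    text = "== History ==\nSome text.\n=== Early years ===\nMore text.\n"
    page = Resource(title="Cat", wiki_text=text, sections=extract_sections(text))
    print(page.sections)
```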

This information is extracted mainly using simple pattern matching applied to wiki text. Based upon this information, more specific properties are determined (mostly also using pattern matching):

  • the type of the page. This property determines how the page is processed further. Possible page types (resource types) are: article, redirect, disambiguation, list, and category. Details of this classification are described in section 8.5 (not translated).
  • the type of the concept that is described by the page, in case it is an article. This classification has no influence on the further processing by WikiWord, but may be useful when using the thesaurus, especially for the task of named entity recognition. Possible types of concepts are: place, person, organization, name, time, number, life form, and other. This classification is described in detail in section 8.6 (not translated).
  • A set of terms that can be determined as labels for the concept directly from the page itself, especially from the title (for details, see 8.9, not translated).
  • The first sentence of the page's text, for use as the definition (gloss) of the concept described by an article. For the method used for extracting the first sentence, see section 8.8 (not translated).
  • disambiguation-links, that is, such wiki links on disambiguation pages that link to one of the meanings of the term that is being disambiguated. The algorithm used for determining those links is described in section 8.10 (not translated).
  • the target of the redirect, if the page is a redirect.
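
How pattern matching can determine the page type may be sketched as follows. The order of the checks, the template names, the "List of " title prefix and the function names are assumptions chosen for an English-language wiki; they stand in for the project-specific patterns of section 8.5 (not translated) and are not WikiWord's actual rules.

```python
import re

# "#REDIRECT [[Target]]" at the start of the page marks a redirect.
REDIRECT = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

# Hypothetical, project-specific configuration for an English-language wiki.
DISAMBIGUATION_TEMPLATES = {"disambig", "disambiguation"}
LIST_TITLE_PREFIX = "List of "
CATEGORY_NAMESPACE = "Category"


def page_type(title, namespace, wiki_text, templates):
    """Return one of: category, redirect, disambiguation, list, article."""
    if namespace == CATEGORY_NAMESPACE:
        return "category"
    if REDIRECT.match(wiki_text):
        return "redirect"
    if DISAMBIGUATION_TEMPLATES & {name.lower() for name in templates}:
        return "disambiguation"
    if title.startswith(LIST_TITLE_PREFIX):
        return "list"
    # Anything not recognized as special is treated as an article.
    return "article"


def redirect_target(wiki_text):
    """Return the title a redirect page points to, or None for other pages."""
    match = REDIRECT.match(wiki_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(page_type("Cat", "", "#REDIRECT [[Domestic cat]]", []))
    print(redirect_target("#REDIRECT [[Domestic cat]]"))
```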

Some of the patterns used to detect specific pieces of information in the wiki text are specific to a wiki project. Note that such patterns are generally not determined ad hoc, but are determined by the wiki markup (see appendix A, not translated) or model explicit conventions or guidelines of individual wiki projects [14][15][16].

The methods that are used to determine the properties of a wiki page are described in more detail in chapter 8 (not translated). The Resource model itself is defined in appendix D.1 (not translated).

Construction of the local data model (section 4.2)

Local Data Model (fig 4.2)

The second transformation is the processing of the data of the resource model in order to import it into the local data model (the concept model). In this step, the information from Wikipedia is interpreted as elements of a thesaurus (compare section 9.1, not translated).

The local data model defines the structure of a monolingual, concept-oriented thesaurus. Concrete local datasets, that is, concrete monolingual thesauri, are stored in this structure. The data model is a term-concept network (see figure 4.2): it contains relations between terms and concepts (the relation of designation or signification) and between different concepts (for example, subsumption/abstraction and relatedness), as shown in fig. 4.2(a). Relations between terms emerge implicitly. This model is similar to the structure suggested for this purpose in the literature, especially in [6] and other works mentioned in section 2.4 (not translated). The implementation of this model as a schema for a relational database system is described in appendix C.1 (not translated).

Specifically, the model consists of the following elements:

  • Terms: Terms (shown in fig. 4.2(b) as circles) are lexical entities: words, forms of words, groups of words, phrases, names, etc. Each term exists only once in a data-set (that is, at most once per language). Terms are designators (labels) for concepts.
  • Concepts: Concepts (shown in fig. 4.2(b) as rectangular boxes) are logical entities: they represent (abstract or concrete) things. Concepts are the meanings of terms.
    For each concept, further properties are stored in addition to its identity, among them a name to be used for display in a user interface and the type of the concept (as described in the section about the resource model above), as well as secondary properties like an IDF value (see 9.3, not translated; IDF refers to the inverse document frequency).
  • Designation: The designation relation (in fig. 4.2(b) shown as a solid line) connects terms to concepts, for example the terms a, b and c to concept A. Any number of terms may be assigned to (i.e. designate) any number of concepts. The designation relation helps bridge the semantic gap between the text as a sequence of words and knowledge about concepts (refer to [17] and once more [6]). This relation is weighted: the number of times a specific term was used to refer to a concept is recorded.
    The dotted connections in the figure emerge implicitly from the relation of designation: They show synonymy between such terms that both refer to the same concept (for example a and b because of their connections to A) and homonymy between concepts that are connected to the same term via the designation relation (for example between A and C because of their relations to b and c). Note that for clarity, not all those implicit relations are represented in the figure.
  • Subsumption: The relation of subsumption (in fig. 4.2(b) shown as an arrow) connects more general with more specific concepts. It is irreflexive and anti-symmetrical, as well as (conceptually) transitive, and it constitutes a directed acyclic graph over the set of concepts. Ideally (but not necessarily), all concepts are reachable from a single root. The concrete semantics of the subsumption remains unspecified; it covers such different relations as abstraction (subtype), instantiation (is-a), aggregation (part-of) and attribution (property-of); see [18][10][19] as well as other research presented in section 2.4 (not translated). This relation corresponds to the relation of hyponymy between terms, that is, the BT/NT relation as defined by ISO 2788.
  • Relatedness and similarity: Semantic relatedness and similarity of concepts (shown in fig. 4.2 as dashed lines) is a symmetrical relation. It is useful among other things for the automatic disambiguation of terms as well as for the task of query expansion (compare the research presented in section 2.3, not translated, especially [20] and [21]).
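
The way synonymy and homonymy emerge implicitly from the weighted designation relation can be sketched in a few lines of Python. The class TermConceptNetwork and its method names are hypothetical and only serve as an in-memory illustration; WikiWord itself stores this model in a relational database (appendix C.1, not translated). The toy data is loosely modelled on the terms a, b, c and the concepts A and C mentioned above.

```python
from collections import defaultdict
from itertools import combinations


class TermConceptNetwork:
    """In-memory toy version of the term-concept network (illustrative only)."""

    def __init__(self):
        # designation[term][concept] = number of times term was used for concept
        self.designation = defaultdict(lambda: defaultdict(int))

    def designate(self, term, concept, count=1):
        self.designation[term][concept] += count

    def terms_for(self, concept):
        return {term for term, concepts in self.designation.items() if concept in concepts}

    def synonyms(self):
        """Pairs of terms that designate at least one common concept."""
        concepts = {c for cs in self.designation.values() for c in cs}
        pairs = set()
        for concept in concepts:
            pairs.update(combinations(sorted(self.terms_for(concept)), 2))
        return pairs

    def homonyms(self):
        """Pairs of concepts that are designated by at least one common term."""
        pairs = set()
        for concepts in self.designation.values():
            pairs.update(combinations(sorted(concepts), 2))
        return pairs


if __name__ == "__main__":
    net = TermConceptNetwork()
    for term, concept in [("a", "A"), ("b", "A"), ("c", "A"), ("b", "C"), ("c", "C")]:
        net.designate(term, concept)
    print(sorted(net.synonyms()))
    print(sorted(net.homonyms()))
```

Neither relation is stored; both are computed from the designation table on demand, which mirrors the statement that relations between terms emerge implicitly.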

Local datasets are stored in a relational database (see appendix C.1, not translated). For the programmatic use of individual concepts, there also exists an object-oriented representation as transfer objects of the data access objects (DAO layer, see appendix C.3, not translated). Each concept is represented as an object which provides access to the properties of the concept, e.g. to the terms for this concept as well as lists of related or subsumed concepts (see appendix D.2, not translated).

The relations of the local data model are constructed from the resource model, that is, the information contained in the individual resources (wiki pages) are interpreted and stored as relations in the local data model (i.e. in the thesaurus). Specifically, the following interpretations are applied:

  • Pages that are not a redirect, disambiguation, list, category or otherwise special page are considered to be articles, that is, it is assumed that they describe exactly one concept. Consequently, a concept record is created for every such page, with the concept type determined for that page by the resource model. This corresponds to the approach suggested by the literature, specifically the research presented in sections 2.3 and 2.4 (not translated).
  • The relation of subsumption is derived directly from the categorization of pages. No difference is made with respect to categorization between category pages and articles, that is, categories are simply handled as concepts (see 9.2, not translated). According to the research cited in section 2.4 (not translated), this is common practice.
  • The relatedness of concepts is determined based on cross-references (wiki links) between articles: if two articles refer to each other using wiki links, it is assumed that they describe related concepts[6], see 9.1.3 (not translated).
  • The similarity of concepts is determined using language links: if two articles refer to the same article in another language using language links, these two articles are considered similar. The reason is that language links should always point to similar (or, ideally, equivalent) articles [22], and the relation of similarity is considered to be transitive and symmetrical. Language links are evaluated in [23]; the idea of using them to determine semantic similarity, however, appears not to have been researched yet.
  • The labels (terms) for a concept are determined from a variety of sources, among others:
    • from the title of the article, as well as the use of the magic word DISPLAYTITLE. The latter has not yet been used for this purpose in the cited literature.
    • from the title of redirect pages that refer to the article. Redirects are evaluated by most research that analyses Wikipedia, compare sections 2.3 and 2.4 (not translated).
    • from the title of disambiguation pages that refer to the article. Disambiguation pages have also been analyzed before, compare for instance [24].
    • from the link text of wiki links that refer to the article. This information has rarely been utilized by prior research (with the notable exception of [7] and research in the field of named entity recognition, see section 2.6, not translated), even though it can be very valuable for information retrieval (compare [8][9]).
    • from sort keys used when categorizing articles, as well as the use of the magic word DEFAULTSORT. These sources of alternative terms have not been studied in the cited literature.
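
The two link-based interpretations above, relatedness from mutual wiki links and similarity from a shared language-link target, can be sketched as follows. The function names and the dictionary-based input format are assumptions for this example; the procedure actually used is described in section 9.1.3 (not translated).

```python
from collections import defaultdict
from itertools import combinations


def related_pairs(links):
    """links maps each article to the set of articles it links to.
    Two articles count as related if they link to each other."""
    pairs = set()
    for source, targets in links.items():
        for target in targets:
            if source != target and source in links.get(target, ()):
                pairs.add(tuple(sorted((source, target))))
    return pairs


def similar_pairs(language_links):
    """language_links maps each article to a set of (language, title) targets.
    Two articles count as similar if they share a language-link target."""
    by_target = defaultdict(set)
    for article, targets in language_links.items():
        for target in targets:
            by_target[target].add(article)
    pairs = set()
    for articles in by_target.values():
        pairs.update(combinations(sorted(articles), 2))
    return pairs


if __name__ == "__main__":
    links = {"Cat": {"Mammal", "Felidae"}, "Felidae": {"Cat"}, "Mammal": {"Animal"}}
    print(related_pairs(links))
    language_links = {
        "House cat": {("de", "Hauskatze")},
        "Cat": {("de", "Hauskatze"), ("fr", "Chat")},
        "Dog": {("de", "Haushund")},
    }
    print(similar_pairs(language_links))
```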

The interpretation of the information from the wiki text is described in detail in section 9.1 (not translated). After importing the data from the individual wiki pages, some post-processing is applied (see 9.1.3, not translated). All redirects and categorization aliases are resolved in that step (see 9.1.2 and 9.1.3, not translated) and the semantic relatedness and similarity of concepts is determined (see 9.1.3, not translated). The result is a local dataset which represents a monolingual thesaurus.

Merging (Section 4.3)

The third transformation is the merging of several local datasets into a global dataset: groups of similar concepts from different languages are combined into one language-independent concept each. Such a method for creating a multilingual thesaurus was not described in the cited literature.

The global data model contains mainly information about which concepts from the individual language-specific data sets have been combined into a language-independent (pan-lingual) concept and how the relations between the local concepts are mapped to relations of language-independent concepts (see appendix C.1, not translated). Together with the local dataset that it references, a global dataset constitutes a collection of data sets that represents a multilingual thesaurus.

Each concept in the global dataset is a set of local concepts, with at most one concept from each language. Language-specific properties (especially terms and glosses) are not stored again in the global dataset; instead, they are taken from the respective local dataset if required.

The main task when creating the global dataset is thus to find groups of concepts from the different languages (i.e. local datasets) such that the concepts in each group are as similar as possible (or, ideally, equivalent) to each other. The (possibly severe) differences in granularity and coverage of the Wikipedias in the different languages have to be taken into account when doing this. The algorithm used can be summarized as follows:

  1. import all concepts from each language in the collection.
  2. determine which concepts refer directly, using language links, to which other concepts in the multilingual thesaurus. This way, pairs of similar concepts from different languages are marked in the thesaurus.
  3. determine which pairs of concepts refer to each other in this way. Because language links generally refer to equivalent or more general concepts, it can be assumed that, if two concepts reference each other via language links, they are equivalent or at least very similar. This approach is analogous to the method used to determine related concepts within one language, following [6].
  4. merge pairs of equivalent concepts, combining all the concepts' properties. While doing this, it is recorded which language-independent concept covers which languages. Each language may be present in each resulting pan-lingual concept only once.
  5. merge pairs until no more suitable pairs of equivalent concepts that have no covered languages in common are available. If two concepts are connected by language links in both directions, but the sets of languages they cover overlap, those concepts are conflicting and cannot be combined. Such pairs usually consist of very similar concepts.
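
The five steps above can be sketched in Python as a greedy merge over mutually linked pairs. This is a minimal illustration under stated assumptions: concepts are identified by (language, name) tuples, pairs are processed in sorted order, and the function name merge_concepts is hypothetical. The order in which WikiWord considers pairs, and its handling of concept properties, are described in section 9.2 (not translated) and are not reproduced here.

```python
def merge_concepts(language_links):
    """language_links maps (lang, name) -> set of (lang, name) link targets.
    Returns (groups, conflicts): groups is a list of dicts lang -> name,
    conflicts lists mutually linked pairs that could not be merged."""
    # Step 1: every local concept starts as its own pan-lingual group.
    group_of = {concept: {concept[0]: concept[1]} for concept in language_links}
    # Steps 2 and 3: keep only pairs that link to each other in both directions.
    mutual = sorted(
        (a, b)
        for a, targets in language_links.items()
        for b in targets
        if a < b and a in language_links.get(b, ())
    )
    conflicts = []
    # Steps 4 and 5: merge pairs whose groups cover disjoint sets of languages.
    for a, b in mutual:
        group_a, group_b = group_of[a], group_of[b]
        if group_a is group_b:
            continue  # already merged via another pair
        if group_a.keys() & group_b.keys():
            conflicts.append((a, b))  # a language would occur twice
            continue
        group_a.update(group_b)
        for lang, name in group_b.items():
            group_of[(lang, name)] = group_a
    # Several local concepts share one group object; report each group once.
    unique = {id(group): group for group in group_of.values()}
    return list(unique.values()), conflicts


if __name__ == "__main__":
    links = {
        ("en", "Cat"): {("de", "Hauskatze"), ("fr", "Chat")},
        ("de", "Hauskatze"): {("en", "Cat")},
        ("fr", "Chat"): {("en", "Cat")},
        ("de", "Katze"): {("en", "Cat")},  # one-way link only: not merged
    }
    groups, conflicts = merge_concepts(links)
    print(groups)
    print(conflicts)
```

Pairs whose language sets overlap are reported rather than merged, which corresponds to the conflicting concepts described in step 5.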

This algorithm is described in more detail in section 9.2 (not translated). Following the merging proper, some post-processing is performed for consolidating the data (see 9.2.3, not translated). The relatedness and similarity between concepts is re-calculated, based on the combined relations of the concepts from different languages.

Translator's note: this process is also described in a separate paper by the author, see [25].

Export (Section 4.4)

The fourth and last transformation is the export of the local and global datasets into a form suitable for reuse, namely the thesaurus model (see chapter 10, not translated). It represents the relations that have been extracted from Wikipedia in a standardized form that is usable by, and useful for, other software, namely RDF/SKOS (see section 10.3, not translated). The mapping of the thesaurus data to RDF/SKOS follows the pertinent literature, especially [6][26][27].
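
To give an impression of what such an export might look like, the following Python sketch serializes a single concept as SKOS in Turtle syntax. The SKOS properties used (skos:prefLabel, skos:altLabel, skos:broader, skos:related) are part of the SKOS vocabulary, but which thesaurus relation is mapped to which property here, the base URI and the function name concept_to_turtle are assumptions for this example; the mapping actually chosen in chapter 10 (not translated) may differ.

```python
# Prefix declaration that a complete Turtle document would start with.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(base, concept_id, lang, pref_label,
                      alt_labels=(), broader=(), related=()):
    """Serialize one concept; base is a hypothetical URI prefix for concepts."""
    subject = f"<{base}{concept_id}>"
    statements = ["a skos:Concept", f'skos:prefLabel "{escape(pref_label)}"@{lang}']
    statements += [f'skos:altLabel "{escape(label)}"@{lang}' for label in alt_labels]
    statements += [f"skos:broader <{base}{other}>" for other in broader]
    statements += [f"skos:related <{base}{other}>" for other in related]
    return subject + " " + " ;\n    ".join(statements) + " ."


if __name__ == "__main__":
    print(SKOS_PREFIX)
    print(concept_to_turtle("http://example.org/concept/", "Cat", "en", "Cat",
                            alt_labels=["house cat"], broader=["Felidae"]))
```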

[...]

Summary (Chapter 15)

The present thesis discussed the extraction of a multilingual thesaurus from Wikipedia. At the outset, some basics about Wikipedia and thesauri were covered (chapter 1, mostly not translated) and some related work was evaluated (chapter 2, not translated). Further, a method was designed for extracting the relevant information from wiki text (chapter 4). Also, a prototype implementing this method, WikiWord, was developed (chapters 8 and 9, not translated), which saves the generated data in a relational database (appendix C, not translated). For the thesaurus thus created, a method was developed to export it into an exchange format based on the standards RDF and SKOS (see chapter 10, not translated). WikiWord comprises a total of about 31 000 lines of source code in nearly 200 classes (compare appendix G, not translated).

The prototype was used to import the data of several Wikipedia projects and generate a multilingual thesaurus from them. This was then evaluated with respect to validity as well as usefulness for different tasks related to automatic language processing and information retrieval (part IV, mostly not translated).

Features (Section 15.1)

The WikiWord system described by the present thesis implements an end-to-end solution for the task of automatically generating a multilingual thesaurus using the data of Wikipedia. It mainly relies upon the information encoded in the structure of hyperlinks as well as categorization and the use of templates on wiki pages. Features that set WikiWord apart from the systems and methods described in chapter 2 (not translated) are, in particular:

  • To determine the terms (labels) used for concepts, the link text of wiki links referring to the concept's wiki page is used, among other things (see section 8.9, not translated). The frequency with which a given term is used to refer to a given concept is also recorded. The use of link text has indeed been studied before, however generally not in the context of a thesaurus structure but rather with respect to disambiguation, especially for named entity recognition (compare sections 2.3 and 2.4, not translated).
  • To determine the terms (labels) additional resources have been tapped in addition to the commonly used sources like page titles, redirects and disambiguation pages, especially sort keys given when categorizing pages as well as magic words like DISPLAYTITLE and DEFAULTSORT (see section 9.1, not translated).
  • For processing disambiguation pages, an algorithm was developed which goes beyond what previous work in this field described (compare 2.3, not translated).
  • A method was designed for determining the "main article" of a category and consequently mapping both the article and the category to the same concept in the thesaurus (9.1.2, not translated). This method is language independent and does not use stemming. Such a method has not yet been suggested in the cited works.
  • Based on different features, such as the categories and templates used, pages and concepts have been classified (see 8.5 and 8.6, not translated). Such a classification of concepts has been described before in the context of named entity recognition, but has not previously been combined with a thesaurus.
  • The use of language links for determining semantic similarity between concepts has been studied for the first time (see 9.1.3, not translated). Generally, research seems to focus on determining semantic relatedness using Wikipedia; semantic similarity is rarely covered.
  • The combination of data from multiple Wikipedias by merging equivalent concepts from different languages, identified using their language links, was also not described before. Merely [28] use language links to provide glosses for concepts in different languages. They rely solely on the English language Wikipedia for structural data, while WikiWord treats all languages equally. The use of language links for comparing the category structure of different wikis is suggested in [23], however, merging those structures is not discussed.
  • WikiWord also takes into account concepts that are represented only as sections of articles, and not as articles in their own right (9.1.1, not translated).
  • It also considers concepts that are referenced, but do not have an article. While little information is known about such concepts, they can still be useful for e.g. indexing documents.
  • The data structure generated by WikiWord abstracts from the structure of Wikipedia. Specifically, in contrast to [29], categories are not directly stored at all and wiki pages are only represented as auxiliary entities.
  • Offering a mapping to RDF/SKOS, as described in chapter 10 (not translated), allows the data to be used immediately in existing systems for processing knowledge networks. A similar approach is presented by [6]. [28] also offer such a mapping, however with respect to representing properties of concepts and semantically strong relations between concepts, such as can be typically extracted from infoboxes, not with respect to the relations employed by a thesaurus. [26] studies the representation of thesauri in SKOS, however applied to manually maintained thesauri like MeSH.
  • WikiWord does not use natural language processing. Textual content as such has been ignored by the import process, with the exception of the first sentence, which has been extracted as a gloss for the concept. The methods used are largely language independent. Project-specific patterns that model conventions and peculiarities of the respective Wikipedia, can be adjusted using a set of configuration variables.
  • WikiWord does not attempt to determine concrete semantic relations between concepts or property values of the subjects of articles. The processing of such structured data is not in the scope of this research.

WikiWord uses a variety of sources of information in the wiki text, some of which have not so far been considered. This way, it does not only create a thesaurus useful for common tasks such as indexing, disambiguation and determining semantic similarity and relatedness. The data generated also offers a richer basis to which several of the methods for using Wikipedia, as already described in the literature, could be applied.

Results and Outlook (Section 15.2)

The evaluation of the data generated by WikiWord shows that it provides a very broad foundation of data which satisfies the requirements defined in chapter 5 (not translated). The quality of the diverse types of data that have been extracted is satisfactory to good, and in some cases methods have been proposed that allow the quality of the data to be increased, generally at the expense of coverage.

In particular, the relation of designation (assigning terms or labels to concepts, see 12.5, not translated), the calculation of semantic relatedness as well as the automatic disambiguation (in chapter 14, not translated) have been evaluated. The quality of merged concepts created from concepts from different languages when building a multilingual thesaurus has been reviewed and found sufficient (11.3, not translated); further investigations and comparisons to alternative methods, however, appear to be in order.

Another question that deserves further research is to what extent the partially unclear or erroneous category structure displayed by some wikis may be improved by aptly combining the information from several wikis, as [23] suggests.

Another point which is pending an evaluation of suitability for practical use is the export to RDF/SKOS (see chapter 10, not translated). These standards offer some guidance about what type of information should be represented in what form [27]; integration tests with one or more real-life systems able to process SKOS, however, appear to be indicated. Projects which would probably benefit from the data provided by WikiWord, and would therefore be obvious choices for integration testing, include DBpedia [28][30], Wortschatz [31][32], Freebase [33] as well as the free dictionary project OmegaWiki [34][35].

The question to what extent the multilingual thesaurus created by WikiWord is suitable as a basis for translation systems also remains to be investigated. Especially the effect that using this data has on an existing system might prove instructive.

The present diploma thesis describes methods for the extraction of knowledge about the relations of concepts among each other as well as about the labels used for concepts. A working prototype was developed and applied to the data of several Wikipedias. It was shown that the methods described here satisfy the requirements, and a number of opportunities for further research were identified. The author hopes to have made, with this thesis, a contribution towards improving systems for natural language processing and information retrieval by providing access to a large amount of detailed lexical-semantic world knowledge.

Knowledge is not enough. We have to apply it.
— J.W. v. Goethe

Appendix

Translator's note: some information about the usage of WikiWord as well as the testing data used for evaluation and the source code will be translated soon and will be added here.

References

  1. "Feature Generation for Textual Information Using World Knowledge", Evgeniy Gabrilovich, 2006
  2. "Word Sense Disambiguation: The State of the Art", Nancy Ide and Jean Veronis, 1998
  3. "Forward", B. G. Buchanan and E. A. Feigenbaum, 1982
  4. "On the thresholds of knowledge", D. B. Lenat and E. A. Feigenbaum, 1991
  5. D. B. Lenat, RV Guha, K. Pittman, D. Pratt, and M. Shepherd, 1990
  6. "Mining a Large-Scale Term-Concept Network from Wikipedia", Andrew Gregorowicz and Mark A. Kramer, 2006
  7. "Using Wikipedia for Automatic Word Sense Disambiguation", Rada Mihalcea, 2007
  8. "Analysis of anchor text for web search", N. Eiron and K. Mccurley, 2003
  9. "Mining anchor text for query refinement", Reiner Kraft and Jason Zien, 2004
  10. "Deriving a large scale taxonomy from wikipedia", Simone P. Ponzetto and Michael Strube, 2007
  11. "A Thesaurus Construction Method from Large Scale Web Dictionaries", Kotaro Nakayama, Takahiro Hara and Shojiro Nishio, 2007
  12. "Wikimedia dump service"
  13. Help:Export
  14. "Wikipedia:What is an article?"
  15. "Wikipedia:Categorization"
  16. "Wikipedia:Disambiguation"
  17. "Yago: a core of semantic knowledge", F. M. Suchanek, G. Kasneci and G. Weikum, 2007
  18. "Collaborative thesaurus tagging the Wikipedia way", Jakob Voß, 2006
  19. "Semantic Wikipedia - Checking the Premises", Rainer Hammwöhner, 2007
  20. "Comparing Wikipedia and German WordNet by Evaluating Semantic Relatedness on Multiple Datasets", Torsten Zesch, Iryna Gurevych and Max Mühlhäuser, 2007
  21. "WikiRelate! Computing Semantic Relatedness Using Wikipedia", M. Strube and S. P. Ponzetto, 2006
  22. "Help:Interlanguage links"
  23. "Interlingual Aspects of Wikipedia’s Quality", Rainer Hammwöhner, 2007
  24. "Knowledge Derived from Wikipedia for Computing Semantic Relatedness", Simone P. Ponzetto and Michael Strube, 2007
  25. "Building Language-Independent Concepts from Wikipedia", Daniel Kinzler, 2008
  26. "A Method to Convert Thesauri to SKOS", Mark van Assem, Veronique Malaise, Alistair Miles and Guus Schreiber, 2006
  27. "Quick Guide to Publishing a Thesaurus on the Semantic Web", Alistair Miles (ed.), 2005
  28. "DBpedia: A Nucleus for a Web of Open Data", Sören Auer, C. Bizer, G. Kobilarov, Jens Lehmann, R. Cyganiak and Z. Ives, 2007
  29. "Analyzing and Accessing Wikipedia as a Lexical Semantic Resource", Torsten Zesch, Iryna Gurevych and Max Mühlhäuser, 2007
  30. DBpedia
  31. "Projekt Deutscher Wortschatz", Uwe Quasthoff, 1998
  32. Wortschatz
  33. Freebase
  34. "An Online Ontology: WiktionaryZ", E. M. van Mulligen, E. Möller, P. J. Roes, M. Weeber, G. Meijssen, C. Chichester and B. Mons, 2006
  35. OmegaWiki
