WikiWord/scrap

From BrightByte

Analyzer

  • Check out warnings, refine rules
  • Check out table stats, look for unused types
  • more concept types:
    • NUMBER
    • ORGANIZATION
    • CHARACTER
    • UNIT
    • (OTHER_)PROPER_NAME: WORK, BRAND, etc
  • handle lists!
  • ignore very long concept names and terms (and definitions)
  • more test cases, esp en, nl, no
  • Agenda:
    • resume in buildXXX (reset steps!) -> test!
    • flush prompt?!
  • Exclude:
    • Portals/Projects (if not in own namespace - check this!)
    • Shortcuts (WP:xxx)
    • -> badLink pattern
  • fix: term-from-sortkey (?!)
  • decode url-encoded links
  • infer terms for "missing" concepts!
  • min/max length for terms/concepts!
  • TODO: strip left-over [] from definitions
  • {{merge}}...
  • discuss/handle {{duplicate}} or {{merge}}
  • discuss/handle {{otheruses}}
  • NOT focus:
    • Categories
    • Link-Structure
  • Focus:
    • Semantic Gap
    • Term-Concept
    • Multilang
    • Disambig (?!)
  • Pluggable sentence-splitter
  • Keep "content" of "inline"-templates
  • use sentence/words/links from disambig-line
  • en: ConceptType: ORGANIZATION...
  • keep section links with:colons
  • FIXME: fr inline templates... lots of them...
  • check all warnings
  • eyeball microcorpus...
  • mine style-X disambig: otheruses, disambiglink...


  • mis-spellings: skos:hiddenLabel
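
Two of the clean-up items above (decoding URL-encoded links, stripping left-over [] from definitions) could be sketched roughly as follows. The function names `normalize_link_target` and `strip_leftover_brackets` are hypothetical and not taken from WikiWord. The sketch assumes that link targets are percent-encoded UTF-8 and that underscores stand for spaces, as in MediaWiki titles.

```python
import re
from urllib.parse import unquote


def normalize_link_target(target: str) -> str:
    """Decode percent-escapes and normalize underscores/whitespace in a link target."""
    decoded = unquote(target)            # percent-decoding, UTF-8 by default
    decoded = decoded.replace("_", " ")  # assumption: underscores mean spaces
    return re.sub(r"\s+", " ", decoded).strip()


def strip_leftover_brackets(definition: str) -> str:
    """Remove stray square brackets that survived wiki-markup stripping."""
    return re.sub(r"[\[\]]", "", definition)
```

Dropping every remaining bracket is the bluntest possible rule; a real analyzer might prefer to log a warning when brackets are still present.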

from AI3

From AI3: 99 Wikipedia Sources Aiding the Semantic Web

But much broader data mining and text mining and analysis is being conducted against Wikipedia, that is currently defining the state-of-the-art in these areas, too:

   * Ontology development and categorization
   * Word sense disambiguation
   * Named entity recognition
   * Named entity disambiguation
   * Semantic relatedness and relations.

These objectives, in turn, are mining and extracting these various kinds of structure for these purposes in Wikipedia:

  • Articles
    • First paragraph — Definitions
    • Full text — Description of meaning; related terms; translations
    • Redirects — Synonymy; spelling variations, misspellings; abbreviations
    • Title — Named entities; domain specific terms or senses
    • Subject — Category suggestion (phrase marked in bold or in first paragraph)
    • Section heading — Category suggestions
  • Article links
    • Context — Related terms; co-occurrences
    • Label — Synonyms; spelling variations; related terms
    • Target — Link graph; related terms
    • LinksTo — Category suggestion
    • LinkedBy — Category suggestion
  • Categories
    • Category — Category suggestion
    • Contained articles — Semantically related terms (siblings)
    • Hierarchy — Hyponymic and meronymic relations between terms
  • Disambiguation pages
    • Article links — Sense inventory
  • Infobox Templates
    • Name –
    • Item — Category suggestion; entity suggestion
  • Lists
    • Hyponyms

Clustering

  • Implement similarity clusterator
  • Greedy: Allow multi-concept merge in single round
    • hope: less conflicts, get [A, B, C] [D] instead of [A B] [C D]!
  • TRY: collect weakly connected concepts once no more merging of strongly connected concepts is possible
  • Interwiki:
    • multiple targets in the same language from a single source (if source merged with cat page or redir)
    • interwiki -> redir
    • interwiki -> disambig
    • interwiki -> cat (!!)
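
The multi-concept merge idea above can be illustrated with a union-find pass: every pair regarded as strongly connected is merged within a single round, so a chain A-B, B-C collapses into [A, B, C] rather than stopping at [A B]. This is only a sketch under simplifying assumptions: `merge_round` and its arguments are hypothetical names, and merge conflicts (which the notes hope to reduce) are not modelled at all.

```python
def merge_round(concepts, strong_pairs):
    """Merge all strongly connected pairs in one round; return sorted clusters."""
    parent = {c: c for c in concepts}

    def find(c):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in strong_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two clusters

    clusters = {}
    for c in concepts:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())
```

For example, `merge_round(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])` yields `[["A", "B", "C"], ["D"]]`.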

Database

  • "similarity" -> "siblings"
    • make symmetrical in the end!
  • for large updates (idLinks, ...):
    • chunked
    • programmatic
    • file (shell out, or implement...)
    • medusa
  • for each concept, calculate:
    • idf -> (Nakayama, regarding Wikipedia)
    • local generality (indeg/outdeg) -> Muchnik et al.
    • discover hyperonyms by picking very general outlinks (but beware years and units)
  • relations as feature vectors
  • check type conflicts on merge? check out warnings!
  • JobQueue-Workers -> Daemon?!
  • disable keys for clustering? benchmark!
  • global meanings: not needed! (using local meanings via origin-table should be fast enough)
  • in meaning table, flag meanings that come *only* from link-text rule (and show frequency)
    • min-freq for link-text-only meanings
    • drop all stuff below threshold?
  • confidence level (esp for broad/narrow) ?
  • link/reference table for global thesaurus
  • fix statistics for global thesaurus
  • fix table-stats: which schema?!
  • section links:
    • ignore if on-page!
    • link section-concept to "parent"?
  • fill title not listed as term in meanings! (run query to find them - example: de:Seestrandkiefer)
  • eval meaning survey: coverage for different modes (how much is stripped?)
  • RAND for missing!
  • no name for global concepts! preferred label for local concepts!
  • TRY: when building global concepts, exclude UNKNOWN concepts at first, import later! (they have no translations, thus can't be merged!)
  • concept names in Broader table! etc...
  • ConceptDescription.getName
  • FIXME: langlinks -> redirect: need resolve-redirect before buildLangPrep()!
  • TODO: VERIFY import/clustering of micro-corpus!
  • DON'T DELETE WARNINGS TABLE!
  • smart cutoff for getMeanings!
  • include links in conceptinfo!
  • FIXME: "unknown" concepts generated by redirects (and other links) to disambig pages. should probably be ignored! entries in meaning table are misleading!
  • XXX: name clashes between disambig and category -> false positives when deleting bad links from broader-table.
  • Filter terms: remove terms that also apply to broader (or narrower) concepts
  • FIXME: break cycles (with/without leaves)
    • -> alternative algorithm: prune leaves/roots iteratively, until only loops are left.
  • FIXME: make super-root!
    • Note: need "substance" (origin/concept)!
  • TODO: "similar" by simple langlink, "very similar" by langmatch (merge-clash)
  • TODO: "similarity" based on bidi-links -> skos:related
    • check semantics for skos:related - conflict with skos:broaderTransitive?
  • TODO: check consistency!
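
The alternative cycle-breaking idea noted above (prune leaves and roots iteratively until only loops are left) might look roughly like this. The sketch assumes the broader-table can be read as a list of (narrower, broader) pairs; `prune_to_cycles` is a hypothetical name. Note that nodes on a path connecting two cycles also survive the pruning, so the result is a superset of the actual loops.

```python
def prune_to_cycles(edges):
    """Iteratively drop edges that touch a root or a leaf; return what remains."""
    remaining = set(edges)
    while True:
        sources = {a for a, _ in remaining}
        targets = {b for _, b in remaining}
        # Keep an edge only if its start node still has an incoming edge
        # and its end node still has an outgoing edge.
        kept = {(a, b) for a, b in remaining if a in targets and b in sources}
        if kept == remaining:
            return kept
        remaining = kept
```

An empty result means the relation is already acyclic; otherwise the remaining edges are the candidates for cutting.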

Micro Corpus

  • redir, redir->redir, redir->nowhere, redir->disambig
  • disambigs
  • category pages (structured)
  • test:
    • for languages: import concepts, extract text, build concept info, build statistics, build thesaurus
    • for thesaurus: build concept info, build statistics
  • verify:
    • ogle via web
    • check in db: all entities, all relations...
  • Concept: Merge with Category (duped interwikis...)

Test Corpus

  • Missing redirects in Mountains-Corpus: wp:en:Gerizim, wp:de:Mount_Blanc
  • include categories
  • include *all* redirects
  • include some disambiguations
  • TRY SIMPLE ENGLISH!

Web Interface

  • names for thesauri -> into title
  • show warnings for given concept/resource
  • corpus matrix: overview
  • show concept-relations:
    • broader, narrower
    • in-links, out-links -> maintain global pagelink table!
    • langlinks?
    • siblings/similar (clustering conflicts)
    • new stuff: cooc, co-cooc, disambig-context (?)
  • fix page: statistics, warnings
  • log:
    • indent (need repeat-macro for yates!)
    • show context
    • hide start/end
    • fix parameters
    • pretty duration
  • Integrate Zipfer
  • concept langlinks: no links? no ids!
  • fix terms for global concept: maintain lang!
  • fix dataset selection
  • fix language-set detection
  • fix justify-terms (in global mode)

Directories

  • Diplomarbeit
    • Paper
    • WikiWord
      • doc
      • src
      • lib
    • Data
      • data..
    • Evaluation
      • data...
  • LICENSE !

Outline

Abstract

Motivation

  • Semantic bootstrapping, semantic dictionary -> Glossary/Thesaurus
    • synonyms (terms), homonyms (meanings), translations
  • Wikipedia is nice:
    • unique IDs
    • disambiguated
    • high quality
    • low redundancy
    • conventions -> recurring patterns
  • Potential users
    • Wortschatz
    • DBPedia
    • OmegaWiki
    • YAGO
    • FreeBase
    • Wikipedia (search feature)

Scope

  • what is it?
    • thesaurus ?
    • semantic lexicon?
    • semantic dictionary?
  • features
    • term (lexeme) <-> concept relation -> homonyms, synonyms
    • translingual concepts -> translations, knowledge transfer
    • heuristics/patterns model per-project conventions
      • give examples of explicit conventions!
      • only per-language info is list of abbreviations for sentence splitting. only for def-extraction
    • extract definition, plain text (icing on the cake)
    • low complexity (calculate!)
    • largely ignores templates -> no problem with maintenance categories
    • only light use of category structure
    • concept types (identify proper names, etc)
    • virtually no stopwords
    • nearly exclusively nouns
    • multi-word phrases, proper nouns/individuals/named entities
    • (common) inflected forms, casual/contextualized forms ("greek")
    • special: "section concepts"
  • others
    • full text corpus stats
    • nlp parsing (extracting semantics)
    • structured data extraction (infoboxes)
    • taxonomy
    • semantic relations
  • Result
    • soft data, some errors, lots of blur
    • no good for reasoning
    • good as context data for
      • disambiguation
      • query expansion
      • etc...
      • needs experiment!

Architecture

  • UML: storage, import
  • import flow
  • db layout
  • entry points -> use cases
  • parallelization, performance

Heuristics

  • most heuristics model conventions
  • tags, categories, titles, and other patterns
  • style guides:
    • definition first
    • structure of disambig
    • use of sortkey
    • ...
  • unique ID for heuristic
    • at explanation + reference to spec in wikipedia
    • in source code / javadoc

Clustering Algorithms

  • reciprocal links
  • translation set similarity
    • conflict: negative similarity (extreme: neg inf)
    • weight by granularity (project size)?
    • high weight for direct reference to the local concept itself.
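
One possible reading of the translation set similarity above, as a minimal sketch: each concept carries its own (language, title) plus its langlinks, agreeing entries add to the score, and a disagreement in any shared language counts as a conflict with similarity negative infinity. The data layout, the function name `translation_similarity` and the value of `direct_weight` are assumptions made for illustration; weighting by project size is not modelled.

```python
import math


def translation_similarity(a, b, direct_weight=2.0):
    """Score two concepts by agreement of their translation sets.

    a, b: dicts with keys "lang", "title" and "links" ({lang: title}).
    """
    # A concept's translation set includes the concept itself.
    set_a = {**a["links"], a["lang"]: a["title"]}
    set_b = {**b["links"], b["lang"]: b["title"]}
    score = 0.0
    for lang in set_a.keys() & set_b.keys():
        if set_a[lang] != set_b[lang]:
            return -math.inf  # conflict: different targets in the same language
        # Agreement in one of the two concepts' own languages is treated as
        # a direct reference to the local concept and weighted higher.
        score += direct_weight if lang in (a["lang"], b["lang"]) else 1.0
    return score
```

Under this reading, two different concepts from the same language always conflict, which may be stricter than intended.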

Disambig

  • use per-concept context, cooc-frequency, generality
  • inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
  • eval:
    • map to wordnet synsets / compare
    • compare wordnet/SUMO taxonomy

Data Evaluation

  • eval concepts
    • 50+50 pages per wiki
    • resourceType, conceptType
    • Definition: missing, good, truncated, broken, wrong
      • Stats per type?!
    • (+plain text)
    • (+hyperonyms, langlinks, outlinks)
  • eval terms for each concept:
    • pick random concept -> prefer "common" concepts!
    • exact match
    • inflected
    • capitalized
    • too broad -> abbreviation (contextualization): [[griechische|Griechische Mythologie]] (how common?)
    • too narrow -> generalization: [[griechische Mythologie|Griechenland]] (how common?)
    • extra: personification (Amerika -> Amerikaner; Bass -> Bassist; ...)
    • plain wrong / misleading
    • broken
    • calculate percentage unweighted/weighted by fq/log(fq)
    • with/without terms-from-links
    • with cutoff (min ~2)
    • with cutoff only for link-text-only terms
    • without "missing" terms!
    • TODO: eval individual rules! (use link table)
  • eval retrieval (use as search index)
    • for 50+50 terms per language:
    • list concepts for term
    • compare with gold std
      • WordNet, ResearchCyc
      • wikipedia search
      • google
      • manual
      • WordNet
      • YAGO
    • calculate percentage unweighted/weighted by fq/log(fq)
    • with/without terms-from-links
    • with cutoff (min ~2)
    • with cutoff only for link-text-only terms
  • eval clustering
    • try excerpts: Mountains, Domesticated animals, Popes, ...
    • verify cluster content
    • verify siblings ("similar concepts")
  • TODO: BAD concept! (wrong red link)
  • TODO: check for stat. significance!
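
The "percentage unweighted/weighted by fq/log(fq)" figures mentioned above could be computed along these lines. This is a sketch assuming each evaluated sample is a (rating, frequency) pair; `rated_share` and the rating labels are hypothetical. With log weighting, a sample with frequency 1 contributes zero weight, which fits the idea of a cutoff at a minimum of about 2.

```python
import math


def rated_share(samples, rating, weight="none"):
    """Share of samples with the given rating; weight is "none", "fq" or "log"."""
    def w(fq):
        if weight == "fq":
            return float(fq)
        if weight == "log":
            return math.log(fq)  # fq == 1 yields weight 0
        return 1.0

    total = sum(w(fq) for _, fq in samples)
    if total == 0:
        return 0.0  # no weight at all, e.g. only fq == 1 samples with log weighting
    hits = sum(w(fq) for r, fq in samples if r == rating)
    return hits / total
```

For samples `[("exact", 8), ("wrong", 2)]` the share of "exact" is 0.5 unweighted, 0.8 weighted by fq, and 0.75 weighted by log(fq).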

Data Export

  • SKOS
  • Topic Maps
  • ...?

Outlook

Disambig

  • use per-concept context, cooc-frequency, generality
  • inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
  • map to wordnet synsets / compare
  • compare wordnet/SUMO taxonomy

Networks

  • for each: cluster, degree distribution (small world? zipf?)
  • try to construct a hierarchy by clustering (-> similarity via feature-vectors = relations)
  • Net of:
    • links (in/out/undir)
    • cooc, co-cooc
    • coref, co-coref
  • try PageRank, HITS
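
As a starting point for the degree distribution question above, a small sketch that counts how many nodes have each degree in a link network, for in-, out- and undirected degree. It assumes the network is available as a list of (source, target) pairs; `degree_distribution` is a hypothetical name.

```python
from collections import Counter


def degree_distribution(edges, mode="in"):
    """Map degree -> number of nodes; mode is "in", "out" or "undir"."""
    if mode not in ("in", "out", "undir"):
        raise ValueError("mode must be 'in', 'out' or 'undir'")
    degrees = Counter()
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if mode in ("out", "undir"):
            degrees[src] += 1
        if mode in ("in", "undir"):
            degrees[dst] += 1
    # Nodes that never gained a degree count as degree 0.
    return dict(Counter(degrees[n] for n in nodes))
```

Plotting the result on a log-log scale would be one way to check for a Zipf-like shape.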

Information Retrieval

  • use as search index
  • use for QE
  • use for resource selection
  • ...

Conclusion

CLI

  • query interface


Tools

  • Medusa [1]
  • TinyCC [2]
  • Findlinks, Text2Satz, Sentrick (lecture: Textdatenbanken) [3]
  • abbr-detector (lecture: Textdatenbanken, email uq)
  • ASV Toolbox [4]
  • OpenNLP (?) [5]


People

  • Medusa -> Marco Büchler
  • Tomas Wittig
  • S. Bordag
  • Mathias Richter (mathias dot richter at info uni le) -> Wikipedia cooc

Misc

  • Licenses...
    • Thesis: GFDL+CC-BY-SA
    • Program: GPL (libs: also LGPL, BSD, Apache, etc)