Analyzer
- Check out warnings, refine rules
- Check out table stats, look for unused types
- more concept types:
- NUMBER
- ORGANIZATION
- CHARACTER
- UNIT
- (OTHER_)PROPER_NAME: WORK, BRAND, etc
- handle lists!
- ignore very long concept names and terms (and definitions)
- more test cases, esp en, nl, no
- Agenda:
- resume in buildXXX (reset steps!) -> test!
- flush prompt?!
- Exclude:
- Portals/Projects (if not in own namespace - check this!)
- Shortcuts (WP:xxx) -> badLink pattern
- fix: term-from-sortkey (?!)
- decode url-encoded links
- infer terms for "missing" concepts!
- min/max length for terms/concepts!
- TODO: strip left-over [] from definitions
- {{merge}}...
- discuss/handle {{duplicate}} and {{merge}}, respectively
- discuss/handle {{otheruses}}
- NOT in focus:
- Categories
- Link-Structure
- Focus:
- Semantic Gap
- Term-Concept
- Multilang
- Disambig (?!)
- evil disambig: wp:de:Samal
- Pluggable sentence-splitter
- Keep "content" of "inline"-templates
- use sentence/words/links from disambig-line
- en: ConceptType: ORGANIZATION...
- keep section links with:colons
- FIXME: fr inline templates... lots of them...
- check all warnings
- eyeball microcorpus...
- mine style-X disambig: otheruses, disambiglink...
- mis-spellings: skos:hiddenLabel
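Two of the Analyzer items above ("decode url-encoded links" and "strip left-over [] from definitions") are simple enough to sketch. This is a minimal illustration, not the analyzer's actual code; the function names `normalize_link_target` and `strip_leftover_brackets` are hypothetical, and treating underscores as spaces is an assumption about how link targets should be compared.

```python
import re
from urllib.parse import unquote


def normalize_link_target(raw: str) -> str:
    """Decode a url-encoded wiki link target (hypothetical helper)."""
    target = unquote(raw)              # "%C3%BC" -> "ü"
    target = target.replace("_", " ")  # assumption: underscores equal spaces
    return " ".join(target.split())    # collapse runs of whitespace


def strip_leftover_brackets(definition: str) -> str:
    """Drop stray square brackets that survived link resolution."""
    return re.sub(r"[\[\]]", "", definition)
```

For example, `normalize_link_target("Gr%C3%BCner_Tee")` yields `Grüner Tee`.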
from AI3
From AI3: 99 Wikipedia Sources Aiding the Semantic Web
But much broader data mining and text mining and analysis is being conducted against Wikipedia, that is currently defining the state-of-the-art in these areas, too:
- Ontology development and categorization
- Word sense disambiguation
- Named entity recognition
- Named entity disambiguation
- Semantic relatedness and relations.
These objectives, in turn, are mining and extracting these various kinds of structure for these purposes in Wikipedia:
- Articles
- First paragraph — Definitions
- Full text — Description of meaning; related terms; translations
- Redirects — Synonymy; spelling variations, misspellings; abbreviations
- Title — Named entities; domain specific terms or senses
- Subject — Category suggestion (phrase marked in bold or in first paragraph)
- Section heading — Category suggestions
- Article links
- Context — Related terms; co-occurrences
- Label — Synonyms; spelling variations; related terms
- Target — Link graph; related terms
- LinksTo — Category suggestion
- LinkedBy — Category suggestion
- Categories
- Category — Category suggestion
- Contained articles — Semantically related terms (siblings)
- Hierarchy — Hyponymic and meronymic relations between terms
- Disambiguation pages
- Article links — Sense inventory
- Infobox Templates
- Name –
- Item — Category suggestion; entity suggestion
- Lists
- Hyponyms
Clustering
- Implement similarity clusterator
- Greedy: Allow multi-concept merge in single round
- hope: fewer conflicts, get [A, B, C] [D] instead of [A B] [C D]!
- TRY: collect weakly connected once no more merging of strongly connected concepts is possible
- Interwiki:
- multiple targets in the same language from single source (if source merged with cat page or redir)
- interwiki -> redir
- interwiki -> disambig
- interwiki -> cat (!!)
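The "Greedy: allow multi-concept merge in single round" idea can be sketched as follows. This is only one possible reading of the note: concepts are modelled as hypothetical `(lang, name)` tuples, candidate pairs carry a similarity score, and a merge is assumed to conflict when it would put two concepts of the same language into one cluster.

```python
def greedy_merge(concepts, scored_pairs):
    """Greedily merge concepts, best-scored pairs first.

    concepts:     iterable of (lang, name) tuples
    scored_pairs: iterable of (score, concept_a, concept_b)
    Returns a set of frozensets, one per resulting cluster.
    """
    cluster_of = {c: frozenset([c]) for c in concepts}
    for _score, a, b in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same cluster
        langs_a = {lang for lang, _ in ca}
        langs_b = {lang for lang, _ in cb}
        if langs_a & langs_b:
            continue  # assumed conflict: two concepts from one language
        merged = ca | cb
        for c in merged:  # clusters may grow past two members per round
            cluster_of[c] = merged
    return set(cluster_of.values())
```

With the pairs A-B, B-C and C-D in descending score order, where A and D share a language, this produces [A, B, C] [D] rather than [A B] [C D].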
Database
- "similarity" -> "siblings"
- make symmetrical in the end!
- for large updates (idLinks, ...):
- chunked
- programmatic
- file (shell out, or implement...)
- medusa
- for each concept, calculate:
- idf -> (Nakayama, regarding Wikipedia)
- local generality (indeg/outdeg) -> Muchnik et al.
- discover hyperonyms by picking very general outlinks (but beware years and units)
- relations as feature vectors
- check type conflicts on merge? check out warnings!
- JobQueue-Workers -> Daemon?!
- disable keys for clustering? benchmark!
- global meanings: not needed! (using local meanings via origin-table should be fast enough)
- in meaning table, flag meanings that come *only* from link-text rule (and show frequency)
- min-freq for link-text-only meanings
- drop all stuff below threshold?
- confidence level (esp for broad/narrow) ?
- link/reference table for global thesaurus
- fix statistics for global thesaurus
- fix table-stats: which schema?!
- section links:
- ignore if on-page!
- link section-concept to "parent"?
- fill title not listed as term in meanings! (run query to find them - example: de:Seestrandkiefer)
- eval meaning survey: coverage for different modes (how much is stripped?)
- RAND for missing!
- no name for global concepts! preferred label for local concepts!
- TRY: when building global concepts, exclude UNKNOWN concepts at first, import later! (they have no translations, thus can't be merged!)
- concept names in Broader table! etc...
- ConceptDescription.getName
- FIXME: langlinks -> redirect: need resolve-redirect before buildLangPrep()!
- TODO: VERIFY import/clustering of micro-corpus!
- DON'T DELETE WARNINGS TABLE!
- smart cutoff for getMeanings!
- include links in conceptinfo!
- FIXME: "unknown" concepts generated by redirects (and other links) to disambig pages. should probably be ignored! entries in meaning table are misleading!
- XXX: name clashes between disambig and category -> false positives when deleting bad links from broader-table.
- Filter terms: remove terms that also apply to broader (or narrower) concepts
- FIXME: break cycles (with/without leaves)
- -> alternative algorithm: prune leaves/roots iteratively, until only loops are left.
- FIXME: make super-root!
- Note: need "substance" (origin/concept)!
- TODO: "similar" by simple langlink, "very similar" by langmatch (merge-clash)
- TODO: "similarity" based on bidi-links -> skos:related
- check semantics for skos:related - conflict with skos:broaderTransitive?
- TODO: check consistency!
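The alternative cycle-breaking approach noted above (prune leaves and roots iteratively) could look roughly like this. It is a sketch under assumptions: the broader-relation is taken to be a list of hypothetical `(narrower, broader)` pairs, and `prune_to_cycles` is an illustrative name. Note that what remains is every node that lies on a cycle or on a path connecting two cycles, which is slightly more than "only loops".

```python
from collections import defaultdict


def prune_to_cycles(edges):
    """Iteratively remove nodes without in-links or without out-links.

    edges: iterable of (narrower, broader) pairs.
    Returns the nodes that survive pruning.
    """
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for a, b in edges:
        succ[a].add(b)
        pred[b].add(a)
        nodes.update((a, b))

    # Start with all current roots and leaves.
    queue = [n for n in nodes if not succ[n] or not pred[n]]
    removed = set()
    while queue:
        n = queue.pop()
        if n in removed:
            continue
        removed.add(n)
        for m in succ[n]:          # n no longer points at m
            pred[m].discard(n)
            if not pred[m] and m not in removed:
                queue.append(m)
        for m in pred[n]:          # m no longer points at n
            succ[m].discard(n)
            if not succ[m] and m not in removed:
                queue.append(m)
    return nodes - removed
```

Only the edges among the surviving nodes then need to be inspected when deciding which link to cut.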
Micro Corpus
- redir, redir->redir, redir->nowhere, redir->disambig
- disambigs
- category pages (structured)
- test:
- for languages: import concepts, extract text, build concept info, build statistics, build thesaurus
- for thesaurus: build concept info, build statistics
- verify:
- ogle via web
- check in db: all entities, all relations...
- Concept: Merge with Category (duped interwikis...)
Test Corpus
- Missing redirects in Mountains-Corpus: wp:en:Gerizim, wp:de:Mount_Blanc
- include categories
- include *all* redirects
- include some disambiguations
- TRY SIMPLE ENGLISH!
Web Interface
- names for thesauri -> into title
- show warnings for given concept/resource
- corpus matrix: overview
- show concept-relations:
- broader, narrower
- in-links, out-links -> maintain global pagelink table!
- langlinks?
- siblings/similar (clustering conflicts)
- new stuff: cooc, co-cooc, disambig-context (?)
- fix page: statistics, warnings
- log:
- indent (need repeat-macro for yates!)
- show context
- hide start/end
- fix parameters
- pretty duration
- Integrate Zipfer
- concept langlinks: no links? no ids!
- fix terms for global concept: maintain lang!
- fix dataset selection
- fix language-set detection
- fix justify-terms (in global mode)
Directories
- Diplomarbeit
- Paper
- WikiWord
- doc
- src
- lib
- Data
- data..
- Evaluation
- data...
- LICENSE !
Outline
Abstract
Motivation
- Semantic bootstrapping, semantic dictionary -> Glossary/Thesaurus
- synonyms (terms), homonyms (meanings), translations
- Wikipedia is nice:
- unique IDs
- disambiguated
- high quality
- low redundancy
- conventions -> recurring patterns
- Potential users
- Wortschatz
- DBPedia
- OmegaWiki
- YAGO
- FreeBase
- Wikipedia (search feature)
Scope
- what is it?
- thesaurus ?
- semantic lexicon?
- semantic dictionary?
- features
- term (lexeme) <-> concept relation -> homonyms, synonyms
- translingual concepts -> translations, knowledge transfer
- heuristics/patterns model per-project conventions
- give examples of explicit conventions!
- the only per-language info is a list of abbreviations for sentence splitting, used only for definition extraction
- extract definition, plain text (icing on the cake)
- low complexity (calculate!)
- largely ignores templates -> no problem with maintenance categories
- only light use of category structure
- concept types (identify proper names, etc)
- virtually no stopwords
- nearly exclusively nouns
- multi-word phrases, proper nouns/individuals/named entities
- (common) inflected forms, casual/contextualized forms ("greek")
- special: "section concepts"
- others
- full text corpus stats
- nlp parsing (extracting semantics)
- structured data extraction (infoboxes)
- taxonomy
- semantic relations
- Result
- soft data, some errors, lots of blur
- no good for reasoning
- good as context data for
- disambiguation
- query expansion
- etc...
- needs experiment!
Architecture
- UML: storage, import
- import flow
- db layout
- entry points -> use cases
- parallelization, performance
Heuristics
- most heuristics model conventions
- tags, categories, titles, and other patterns
- style guides:
- definition first
- structure of disambig
- use of sortkey
- ...
- unique ID for heuristic
- add explanation + reference to spec in wikipedia
- in source code / javadoc
Clustering Algorithms
- reciprocal links
- translation set similarity
- conflict: negative similarity (extreme: neg inf)
- weight by granularity (project size)?
- high weight for direct reference to the local concept itself.
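A possible shape for the translation-set similarity with conflicts is sketched below. Everything concrete here is an assumption for illustration: each concept is represented as a dict from language code to the set of page titles it is linked to, a language that both concepts cover with disjoint titles counts as a conflict (negative infinity), and the optional `weights` dict stands in for the open question of weighting by project size.

```python
import math


def translation_similarity(a, b, weights=None):
    """Score two concepts by their shared translations.

    a, b:    dict mapping language code -> set of page titles
    weights: optional dict mapping language code -> weight (default 1.0)
    """
    weights = weights or {}
    score = 0.0
    for lang in a.keys() & b.keys():
        shared = a[lang] & b[lang]
        if not shared:
            # Both concepts cover this language but disagree: conflict.
            return -math.inf
        score += weights.get(lang, 1.0) * len(shared)
    return score
```

A pair with no language in common scores 0.0, so it is neither merged nor treated as a conflict.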
Disambig
- use per-concept context, cooc-frequency, generality
- inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
- eval:
- map to wordnet synsets / compare
- compare wordnet/SUMO taxonomy
Data Evaluation
- eval concepts
- 50+50 pages per wiki
- resourceType, conceptType
- Definition: missing, good, truncated, broken, wrong
- Stats per type?!
- (+plain text)
- (+hyperonyms, langlinks, outlinks)
- eval terms for each concept:
- pick random concept -> prefer "common" concepts!
- exact match
- inflected
- capitalized
- too broad -> abbreviation (contextualization): [[griechische|Griechische Mythologie]] (how common?)
- too narrow -> generalization: [[griechische Mythologie|Griechenland]] (how common?)
- extra: personification (Amerika -> Amerikaner; Bass -> Bassist; ...)
- plain wrong / misleading
- broken
- calculate percentage unweighted/weighted by fq/log(fq)
- with/without terms-from-links
- with cutoff (min ~2)
- with cutoff only for link-text-only terms
- without "missing" terms!
- TODO: eval individual rules! (use link table)
- eval retrieval (use as search index)
- for 50+50 terms per language:
- list concepts for term
- compare with gold std
- WordNet, ResearchCyc
- wikipedia search
- manual
- WordNet
- YAGO
- calculate percentage unweighted/weighted by fq/log(fq)
- with/without terms-from-links
- with cutoff (min ~2)
- with cutoff only for link-text-only terms
- eval clustering
- try excerpts: Mountains, Domesticated animals, Popes, ...
- verify cluster content
- verify siblings ("similar concepts")
- TODO: BAD concept! (wrong red link)
- TODO: check for stat. significance!
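The "percentage unweighted/weighted by fq/log(fq)" figures, with an optional frequency cutoff, might be computed along these lines. The input format (a list of `(frequency, is_correct)` judgments) and the function name are assumptions. Note that with a plain log(fq) weight, terms seen only once get weight zero.

```python
import math


def weighted_precision(judgments, weight=lambda fq: 1.0, min_freq=1):
    """Share of correct terms, optionally weighted by term frequency.

    judgments: list of (frequency, is_correct) pairs from manual evaluation
    weight:    maps a frequency to a weight (default: unweighted)
    min_freq:  cutoff; terms seen fewer times are ignored
    Returns None when no weight is left after filtering.
    """
    kept = [(fq, ok) for fq, ok in judgments if fq >= min_freq]
    total = sum(weight(fq) for fq, _ in kept)
    if total == 0:
        return None
    good = sum(weight(fq) for fq, ok in kept if ok)
    return good / total


sample = [(100, True), (10, True), (1, False), (1, False)]
unweighted = weighted_precision(sample)
by_fq = weighted_precision(sample, weight=float)
by_log_fq = weighted_precision(sample, weight=math.log)
with_cutoff = weighted_precision(sample, min_freq=2)
```

Reporting all variants side by side shows how much of the error mass sits in rare terms.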
Data Export
- SKOS
- Topic Maps
- Vocabulary -> Maicher
- ...?
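For the SKOS export, a concept with its terms could be serialized to Turtle roughly as below. This is a sketch only: the `example.org` URIs and the function name are placeholders, and mapping the preferred term to `skos:prefLabel`, other terms to `skos:altLabel`, and misspellings to `skos:hiddenLabel` is one plausible mapping, not a settled design.

```python
import json

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


def skos_turtle(uri, pref_label, lang, alt_labels=(), hidden_labels=(), broader=()):
    """Render one concept as a Turtle snippet (illustrative only)."""

    def literal(text):
        # json.dumps yields a double-quoted string with escapes that are
        # also valid in Turtle string literals.
        return f"{json.dumps(text, ensure_ascii=False)}@{lang}"

    lines = [f"<{uri}> a skos:Concept",
             f"    skos:prefLabel {literal(pref_label)}"]
    lines += [f"    skos:altLabel {literal(t)}" for t in alt_labels]
    lines += [f"    skos:hiddenLabel {literal(t)}" for t in hidden_labels]
    lines += [f"    skos:broader <{b}>" for b in broader]
    return SKOS_PREFIX + " ;\n".join(lines) + " .\n"


example = skos_turtle(
    "http://example.org/concept/Berg", "Berg", "de",
    alt_labels=["Berge"], hidden_labels=["Brg"],
    broader=["http://example.org/concept/Landform"],
)
```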
Outlook
Disambig
- use per-concept context, cooc-frequency, generality
- inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
- map to wordnet synsets / compare
- compare wordnet/SUMO taxonomy
Networks
- for each: cluster, degree distribution (small world? zipf?)
- try to construct a hierarchy by clustering (-> similarity via feature-vectors = relations)
- Net of:
- links (in/out/undir)
- cooc, co-cooc
- coref, co-coref
- try PageRank, HITS
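Two of the network experiments above, the degree distribution and PageRank, can be tried on a small link graph without extra libraries. The sketch assumes the graph is a dict from node to a list of link targets; the damping factor 0.85 and the even redistribution of rank from nodes without out-links are common textbook choices, not decisions taken from these notes.

```python
from collections import Counter


def in_degree_distribution(links):
    """Map in-degree -> number of nodes with that in-degree."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    indeg = Counter(t for targets in links.values() for t in targets)
    return Counter(indeg[n] for n in nodes)  # missing nodes count as 0


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict node -> list of targets."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Nodes without out-links spread their rank over all nodes.
            targets = links.get(node) or nodes
            share = damping * rank[node] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank
```

The same functions can be run on the cooc and coref nets by swapping in the corresponding edge lists.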
Information Retrieval
- use as search index
- use for QE
- use for resource selection
- ...
Conclusion
CLI
- query interface
Tools
- Medusa [1]
- TinyCC [2]
- Findlinks, Text2Satz, Sentrick (lecture "Textdatenbanken") [3]
- abbr-detector (lecture "Textdatenbanken", email uq)
- ASV Toolbox [4]
- OpenNLP (?) [5]
People
- Medusa -> Marco Büchler
- Tomas Wittig
- S. Bordag
- Mathias Richter (mathias dot richter at info uni le) -> Wikipedia cooc
Misc
- Licenses...
- Thesis: GFDL+CC-BY-SA
- Program: GPL (libs: also LGPL, BSD, Apache, etc)



