Knowledge Base Completion for Long-Tail Entities

Questions

1. What is the main challenge in enriching a KG with crisp SPO facts using LM-based approaches?

2. How does the two-stage pipeline in the Corroboration and Canonicalization section work?

3. What are the limitations of KBC techniques?

4. What is the two-stage KBC method?

5. What is the focus of the MALT dataset?

6. How does GenIE compare in precision and recall?

Accepted Answer

The main challenge in enriching a KG with crisp SPO facts using LM-based approaches is the non-negligible fraction of false or 'hallucinated' outputs from the LM, which leads to error rates above 10 percent. In addition, even correct answers are not properly canonicalized: they are surface phrases rather than unique entities in the KG. The problem is aggravated when the to-be-inferred O arguments are long-tail entities, defined here as entities with fewer than 14 triples in Wikidata. Because so few facts are known about them, these entities are the pain point that calls for KBC, and the paper devises a method specifically geared to cope with them.
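As a rough illustration of the fewer-than-14-triples criterion, the sketch below flags long-tail entities in a list of (S, P, O) triples. Counting only the triples in which an entity appears as the subject is an assumption made here for simplicity, and the entity names and triples are hypothetical toy data, not Wikidata content.

```python
from collections import Counter

# Threshold from the answer above: long-tail means fewer than 14 triples.
LONG_TAIL_MAX_TRIPLES = 14


def subject_triple_counts(triples):
    """Count triples per entity, looking at the subject position only (assumption)."""
    return Counter(s for s, _p, _o in triples)


def long_tail_entities(triples, threshold=LONG_TAIL_MAX_TRIPLES):
    """Return the set of subjects that have fewer than `threshold` triples."""
    counts = subject_triple_counts(triples)
    return {entity for entity, n in counts.items() if n < threshold}


# Hypothetical toy KG: one well-covered entity and one sparsely covered entity.
toy_triples = [("Popular", "p", f"v{i}") for i in range(20)]
toy_triples += [("Rare", "p", "v0"), ("Rare", "q", "v1")]

tail = long_tail_entities(toy_triples)
```

Here `tail` contains only `"Rare"`, since `"Popular"` has 20 triples and is above the threshold.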

Accepted Answer

The two-stage pipeline in the Corroboration and Canonicalization section uses two different Transformer-based language models. In the first stage, candidate answers are generated for input prompts, and informative sentences are retrieved from sources such as Wikipedia. In the second stage, these candidates are validated and the answer strings are disambiguated onto entities in the underlying Knowledge Graph (KG). This helps map long-tail entities accurately, for example 'Lhasa' to 'Lhasa de Sela' and 'Bratsch' to 'Bratsch (band)'.
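To make the canonicalization step more concrete, here is a deliberately simple sketch that maps an ambiguous surface string to a KG entity by token overlap between a context sentence and entity descriptions. This is not the model used in the paper; it is a toy stand-in, and the `kg` entries, descriptions, and context sentence are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def disambiguate(surface, context, kg):
    """Pick the KG entity whose description overlaps most with the context.

    `kg` is a list of dicts with 'label' and 'description' keys (toy format).
    Returns the chosen label, or None if no label contains the surface string.
    """
    candidates = [e for e in kg if surface.lower() in e["label"].lower()]
    if not candidates:
        return None
    ctx = tokenize(context)
    best = max(candidates, key=lambda e: len(ctx & tokenize(e["description"])))
    return best["label"]


# Hypothetical toy KG with two entities sharing the surface name "Lhasa".
toy_kg = [
    {"label": "Lhasa", "description": "capital city of Tibet"},
    {"label": "Lhasa de Sela", "description": "singer and songwriter"},
]

choice = disambiguate("Lhasa", "the album was recorded by the singer Lhasa", toy_kg)
```

With this context, the word "singer" overlaps only with the second description, so `choice` is `"Lhasa de Sela"`. A real system would need far richer context modeling than token overlap.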

Accepted Answer

A limitation of KBC techniques is that many of the facts they predict are obvious and could be derived by simple rules for transitivity and inverse relations. This limitation was reported by Akrami et al. (2020) and Sun et al. (2020).
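The kind of rule-based derivation mentioned above can be sketched as follows. The function applies inverse-relation rules and a single pass of transitivity (not a full closure) to a set of triples; the relation names and triples are hypothetical, and the code only illustrates why such predictions count as obvious.

```python
def derive_by_rules(triples, inverse_of, transitive):
    """Derive new facts via inverse relations and one transitivity step.

    triples:     set of (subject, predicate, object) tuples
    inverse_of:  dict mapping a predicate to its inverse predicate
    transitive:  set of predicates treated as transitive
    """
    derived = set()
    # Inverse rule: (s, p, o) implies (o, inverse_of[p], s).
    for s, p, o in triples:
        if p in inverse_of:
            derived.add((o, inverse_of[p], s))
    # Transitivity rule (single pass): (a, p, b) and (b, p, c) imply (a, p, c).
    for s1, p1, o1 in triples:
        if p1 not in transitive:
            continue
        for s2, p2, o2 in triples:
            if p2 == p1 and s2 == o1:
                derived.add((s1, p1, o2))
    return derived - set(triples)


# Hypothetical toy triples and rules.
toy_triples = {
    ("A", "part_of", "B"),
    ("B", "part_of", "C"),
    ("X", "parent_of", "Y"),
}
rule_derived = derive_by_rules(
    toy_triples,
    inverse_of={"parent_of": "child_of"},
    transitive={"part_of"},
)
```

A predicted fact that already appears in `rule_derived` adds little value, because it did not require a learned model to infer.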

Accepted Answer

The two-stage KBC method is an unsupervised approach to Knowledge Base Completion (KBC) that uses Language Models (LMs) as a latent source of facts that cannot be inferred from the Knowledge Graph (KG) itself. It operates in two stages: (1) generating candidate facts for a given Subject-Predicate (S-P) pair, where the missing 'O' argument is an entity name, possibly a multi-word phrase; and (2) corroborating the candidates, retaining those with high confidence of being correct, and disambiguating the 'O' argument into a KG entity. The method employs a generic prompt template for cloze questions, built from the relation type-signature in the KG schema. It also leverages Wikipedia sentences from the S entity's article and the SpanBERT language model fine-tuned on SQuAD 2.0. The first stage yields a scored list of candidates; the second stage re-ranks them and prunes false positives using the generative entity disambiguation model GENRE, which is based on BART and fine-tuned on BLINK.
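The two stages described above can be sketched in plain Python. Everything here is an assumption made for illustration: the prompt wording, the `Candidate` fields, the 0.5 threshold, and the scores are hypothetical, and no actual language model is called. The sketch only shows the data flow of generating scored candidates and then re-ranking and pruning them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    surface: str          # answer string for the O argument
    lm_score: float       # stage-1 score (hypothetical value)
    corroboration: float  # stage-2 confidence (hypothetical value)


def build_prompt(subject, relation, object_type, mask="[MASK]"):
    """Build a cloze prompt from an S-P pair and the object type (assumed template)."""
    return f"{subject} {relation} the {object_type} {mask}."


def rank_candidates(candidates):
    """Stage 1 output: candidates ordered by their generation score."""
    return sorted(candidates, key=lambda c: c.lm_score, reverse=True)


def corroborate(candidates, threshold=0.5):
    """Stage 2: drop low-confidence candidates, then re-rank the rest."""
    kept = [c for c in candidates if c.corroboration >= threshold]
    return sorted(kept, key=lambda c: c.corroboration, reverse=True)


prompt = build_prompt("Jane Doe", "was educated at", "university")

# Hypothetical scored candidates, as if returned by the two models.
toy_candidates = [
    Candidate("A", lm_score=0.9, corroboration=0.2),
    Candidate("B", lm_score=0.4, corroboration=0.8),
    Candidate("C", lm_score=0.3, corroboration=0.6),
]
stage1 = rank_candidates(toy_candidates)
stage2 = corroborate(stage1, threshold=0.5)
```

In this toy run, candidate "A" ranks first after stage 1 but is pruned in stage 2, which is the behavior the corroboration step is meant to provide.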

Accepted Answer

The MALT dataset focuses on two challenges that existing benchmarks often overlook: long-tail entities and surface-name ambiguity. It covers three entity types (Business, MusicComposition, and Human) with a total of 8 predicates. It contains 65.3% multi-word phrase entities and 58.6% ambiguous facts. The dataset is intended as a more comprehensive benchmark for relation extraction and knowledge base completion tasks.

Accepted Answer

The GenIE baselines achieve good precision but poor recall. The paper's two-stage method achieves both good precision (44%) and good recall (43%), outperforming GenIE and the other baselines. There is still room for improvement in inferring facts for long-tail entities, so the method is best seen as a building block that aids human curators by suggesting facts that augment the KG. In a human evaluation, annotators assessed 250 fact candidates and found an average precision of 61% across all relations. For the relation 'educated at', precision reached 76%, highlighting the potential to close gaps in the KG.
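For readers who want to see how the precision and recall figures above are defined, the following sketch computes both over sets of predicted and gold facts, plus the harmonic mean (F1). The toy facts are hypothetical; the F1 value for the reported 44% precision and 43% recall is plain arithmetic on those two numbers and is not a figure stated in the answer.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted facts against gold facts."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy facts: 2 of 4 predictions are correct, 2 of 3 gold facts found.
toy_predicted = {"a", "b", "c", "d"}
toy_gold = {"a", "b", "e"}
p, r = precision_recall(toy_predicted, toy_gold)

# F1 implied by the reported precision (0.44) and recall (0.43): about 0.435.
reported_f1 = f1(0.44, 0.43)
```

A high-precision, low-recall system, as described for GenIE, would show a large gap between the two values returned by `precision_recall`.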
