1. What is the main challenge in enriching a KG with crisp SPO facts using LM-based approaches?
The main challenge is that a non-negligible fraction of the LM's outputs is false or 'hallucinated', which leads to error rates above 10 percent. In addition, even correct answers are not canonicalized: they are surface phrases rather than unique entities in the KG. The problem is aggravated when the O arguments to be inferred are long-tail entities, meaning entities with fewer than 14 triples in Wikidata. Such entities are the pain point that calls for KBC, so the approach devises a novel method specifically geared to cope with them.
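The long-tail criterion above can be illustrated with a minimal sketch. It assumes the KG is available as a list of (S, P, O) tuples and that a triple counts toward an entity whether the entity appears as subject or object; that counting convention, the function names, and the toy data are assumptions made for illustration, not details from the source.

```python
from collections import Counter

# Threshold taken from the text: long-tail entities have fewer than 14 triples.
LONG_TAIL_THRESHOLD = 14


def triple_counts(triples):
    """Count how many triples mention each entity as subject or object.

    Counting both positions is an assumption; the source does not say
    which positions are counted.
    """
    counts = Counter()
    for s, _p, o in triples:
        counts[s] += 1
        counts[o] += 1
    return counts


def is_long_tail(entity, counts, threshold=LONG_TAIL_THRESHOLD):
    """Return True if the entity has fewer than `threshold` triples."""
    return counts[entity] < threshold


# Hypothetical toy KG: "Hub" is well covered, "Rare" is not.
toy_triples = [("Hub", "p", f"o{i}") for i in range(14)] + [("Rare", "p", "o0")]
toy_counts = triple_counts(toy_triples)
print(is_long_tail("Hub", toy_counts), is_long_tail("Rare", toy_counts))
```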
2. How does the two-stage pipeline in the Corroboration and Canonicalization section work?
The two-stage pipeline uses two different Transformer-based language models. The first stage generates candidate answers for the input prompts and retrieves informative sentences from sources such as Wikipedia. The second stage validates these candidates and disambiguates the answer strings onto entities in the underlying Knowledge Graph (KG). This helps map long-tail entities accurately, for example 'Lhasa' to 'Lhasa de Sela' and 'Bratsch' to 'Bratsch (band)'.
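To make the second stage concrete, here is a toy sketch of corroboration followed by canonicalization. It does not use the models from the paper: the score threshold, the alias dictionary, the word-overlap heuristic, and the entity descriptions are all hypothetical stand-ins chosen so the example runs without external libraries.

```python
def corroborate_and_canonicalize(candidates, alias_to_entities, context, min_score=0.5):
    """Keep confident candidates and map each surface string to one KG entity.

    candidates:        list of (surface_string, score) pairs from stage one
    alias_to_entities: dict from lowercased alias to a list of
                       (entity_label, description) pairs (hypothetical KG view)
    context:           sentence used to pick among ambiguous entities
    """
    context_words = set(context.lower().split())
    results = []
    for surface, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score:
            continue  # prune low-confidence candidates
        entities = alias_to_entities.get(surface.lower(), [])
        if not entities:
            continue  # the surface string matches no KG entity
        # Toy disambiguation: pick the entity whose description shares
        # the most words with the context sentence.
        best_label, _ = max(
            entities,
            key=lambda e: len(context_words & set(e[1].lower().split())),
        )
        results.append((best_label, score))
    return results


# Illustrative data only; labels and descriptions are assumptions.
aliases = {
    "lhasa": [
        ("Lhasa (city)", "capital city of Tibet"),
        ("Lhasa de Sela", "singer and songwriter"),
    ]
}
stage_one_output = [("Lhasa", 0.9), ("Tibet", 0.2)]
print(corroborate_and_canonicalize(
    stage_one_output, aliases, "a song performed by the singer Lhasa"
))
```

In the pipeline described above, a learned model performs the disambiguation; the word-overlap heuristic here only shows where that step sits in the data flow.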
3. What are the limitations of KBC techniques?
A key limitation of KBC techniques is that many of the facts they predict are obvious and could be derived by simple rules for transitivity and inverse relations, as reported by Akrami et al. (2020) and Sun et al. (2020).
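A short sketch can show what "derivable by simple rules" means in practice. It checks whether a predicted fact already follows from the KG through an inverse relation or a single transitivity step. The relation names, the toy KG, and the restriction to one transitive hop are assumptions made for illustration.

```python
def derivable_by_rules(fact, kg, inverse_of, transitive):
    """Return True if `fact` follows from `kg` by an inverse or transitivity rule.

    fact:       (s, p, o) triple predicted by a KBC model
    kg:         set of (s, p, o) triples already in the KG
    inverse_of: dict mapping a relation to its inverse relation
    transitive: set of relations treated as transitive
    """
    s, p, o = fact
    # Inverse rule: (o, q, s) is in the KG, where q is the inverse of p.
    q = inverse_of.get(p)
    if q is not None and (o, q, s) in kg:
        return True
    # Transitivity rule (one hop only): (s, p, m) and (m, p, o) are in the KG.
    if p in transitive:
        middles = {m for (s2, p2, m) in kg if s2 == s and p2 == p}
        return any((m, p, o) in kg for m in middles)
    return False


# Hypothetical toy KG and rule sets.
kg = {
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
    ("Alice", "parent_of", "Bob"),
}
inverse_of = {"child_of": "parent_of", "parent_of": "child_of"}
transitive = {"located_in"}

print(derivable_by_rules(("Paris", "located_in", "Europe"), kg, inverse_of, transitive))
print(derivable_by_rules(("Bob", "child_of", "Alice"), kg, inverse_of, transitive))
```

Predictions that pass this check add little new knowledge to the KG, which is the limitation the cited studies point out.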
4. What is the two-stage KBC method?
The two-stage KBC method is an unsupervised approach to Knowledge Base Completion (KBC) that uses Language Models (LMs) as a latent source of facts that cannot be inferred from the Knowledge Graph (KG) itself. The first stage generates candidate facts for a given Subject-Predicate (S-P) pair, where the object O is an entity name and possibly a multi-word phrase. The second stage corroborates the candidates, retains those with high confidence of being correct, and disambiguates the O argument into a KG entity. The method employs a generic prompt template for cloze questions, built from the relation type-signature in the KG schema. It also leverages Wikipedia sentences from the S entity's article and the SpanBERT language model fine-tuned on SQuAD 2.0. The first stage yields a scored list of candidates; the second stage re-ranks them and prunes false positives using the generative entity disambiguation model GENRE, which is based on BART and fine-tuned on BLINK.
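As an illustration of a generic cloze prompt built from a relation's type signature, consider the sketch below. The exact template wording is not given here, so the phrasing, the function name, the mask token, and the example arguments are all assumptions.

```python
def cloze_prompt(subject, predicate_label, subject_type, object_type, mask="[MASK]"):
    """Build a cloze question for an S-P pair from the relation's type signature.

    The template wording is hypothetical; only the idea of filling in the
    subject type and object type from the KG schema comes from the text.
    """
    return f"{subject}, the {subject_type}, {predicate_label} the {object_type} {mask}."


# Placeholder subject and relation, chosen only to show the template shape.
print(cloze_prompt("Some Album", "was performed by", "album", "musician"))
```

An LM would then be asked to fill the mask, and its scored completions would form the candidate list passed to the second stage.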