Research:Newsletter/2023/December

Vol: 13 • Issue: 12 • December 2023 [contribute] [archives]

"LLMs Know More, Hallucinate Less" with Wikidata

"Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata"

Overview of how the authors' "WikiSP" semantic parser is used to answer a user's question:
"An entity linker is used to link entities in the user query to their unique ID in Wikidata; e.g. “A Bronx Tale” is linked to entity ID “Q1130705”. The query and entity linker outputs are fed to the WikiSP semantic parser to produce a modified version of SPARQL, where property IDs (e.g. “P915”) are replaced by their unique string identifiers (e.g. “filming_location”). If applying the [SPARQL] query to Wikidata fails to return a result, we default to [OpenAI's large language model] GPT-3, labeling the result as a GPT-3 guess. Returned answers are presented in the context of the query, so the user can tell if the answer is acceptable; if not, we also show the guess from GPT-3. Here WikiSP mistakenly uses “filming_location” instead of “narrative_location”; the user detects the mistake, thumbs down the answer, and the GPT-3 answer is provided."

This paper^[1] (by five graduate students at Stanford University's computer science department and Monica S. Lam as last author) sets out to show that

While large language models (LLMs) can answer many questions correctly, they can also hallucinate and give wrong answers. Wikidata, with its over 12 billion facts, can be used to ground LLMs to improve their factuality.

To do this, the paper "presents WikiSP, a few-shot sequence-to-sequence semantic parser for Wikidata that translates a user query, along with results from an entity linker, directly into SPARQL queries [to retrieve information from Wikidata]." It is obtained by fine-tuning one of Facebook/Meta's LLaMA 1 large language models.

For example, the user question "What year did giants win the world series?" is supposed to be converted into the query SELECT DISTINCT ?x WHERE {?y wdt:sports_season_of_league_or_competition wd:Q265538; wdt:winner wd:Q308966; wdt:point_in_time ?x. }. The paper uses a modified SPARQL syntax that replaces numerical property IDs (here, P3450) with their English-language label (here, "sports season of league or competition"). The authors motivate this choice by observing that "While zero-shot LLMs [e.g. ChatGPT] can generate SPARQL queries for the easiest and most common questions, they do not know all the PIDs and QIDs [property and item IDs in Wikidata], and nor is it possible to include them in a prompt."
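As a rough illustration of the modified syntax (a sketch in Python, not the authors' code), the snippet below maps property labels back to numeric property IDs so that a WikiSP-style query could be sent to a standard SPARQL endpoint. The function name `to_standard_sparql` and the lookup table are hypothetical. Only P3450 and P915 are given in the text above; the IDs for "winner" and "point in time" are assumed here and should be checked against Wikidata.

```python
import re

# Label-to-PID table. P3450 and P915 come from the text above; the other
# two IDs are assumptions that should be verified against Wikidata.
PROPERTY_IDS = {
    "sports_season_of_league_or_competition": "P3450",
    "filming_location": "P915",
    "winner": "P1346",        # assumption, not stated in the text above
    "point_in_time": "P585",  # assumption, not stated in the text above
}


def to_standard_sparql(query: str, property_ids: dict[str, str] = PROPERTY_IDS) -> str:
    """Replace every wdt:<label> with wdt:<PID>; raise KeyError on unknown labels."""

    def swap(match: re.Match) -> str:
        label = match.group(1)
        if label not in property_ids:
            raise KeyError(f"no property ID known for label {label!r}")
        return "wdt:" + property_ids[label]

    # Labels start with a lowercase letter or underscore, so numeric IDs
    # such as wdt:P31 (capital "P") are left untouched.
    return re.sub(r"\bwdt:([a-z_][a-z0-9_]*)", swap, query)


modified = (
    "SELECT DISTINCT ?x WHERE {?y wdt:sports_season_of_league_or_competition "
    "wd:Q265538; wdt:winner wd:Q308966; wdt:point_in_time ?x. }"
)
print(to_standard_sparql(modified))
```

Raising on an unknown label, rather than passing it through, is a design choice of this sketch: a silently unconverted label would produce a query that returns nothing for the wrong reason.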

To evaluate the performance of "WikiSP", and as a second contribution of the paper, the authors present

[...] WikiWebQuestions, a high-quality question answering benchmark for Wikidata. Ported over from WebQuestions for Freebase, it consists of real-world data with SPARQL annotation. [...]

Despite being the most popular large knowledge base for a long time, existing benchmarks on Wikidata with labeled SPARQL queries are unfortunately either small or of low quality. On the other hand, benchmarks over the deprecated Freebase still dominate the KBQA research with better-quality data.

Using this new benchmark, "Our experimental results demonstrate the effectiveness of [WikiSP], establishing a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively." However, the paper's "Limitations" section hints that despite the impressive "12 billion facts" factoid that the paper opens with, Wikidata's coverage may be too limited to answer most user questions in a satisfying manner:

Even though knowledge bases are an important source of facts, a large portion of the knowledge available in digital form (e.g. Wikipedia, news articles, etc.), is not organized into knowledge bases. As such, the results of this paper can be considered complementary to the larger body of fact-checking research based on free text.

To address this weakness, the authors combine this Wikidata-based setup with a standard LLM that provides the answer if the Wikidata query fails to return a result. They state that

By pairing our semantic parser with GPT-3, we combine verifiable results with qualified GPT-3 guesses to provide useful answers to 96% of the questions in dev.
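The control flow described above (entity linking, semantic parsing, a Wikidata query, and a labeled fallback guess) might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the four callables are hypothetical stand-ins for the entity linker, WikiSP, a SPARQL client and the GPT-3 call, and treating a failing query as "no result" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Answer:
    values: list[str]
    source: str       # "wikidata" or "gpt3_guess"
    verifiable: bool  # True only when the answer came from the Wikidata query


def answer_question(
    question: str,
    link_entities: Callable[[str], dict[str, str]],         # hypothetical entity linker
    parse_to_sparql: Callable[[str, dict[str, str]], str],  # hypothetical semantic parser
    run_sparql: Callable[[str], Sequence[str]],             # hypothetical SPARQL client
    ask_llm: Callable[[str], str],                          # hypothetical LLM call
) -> Answer:
    entities = link_entities(question)
    query = parse_to_sparql(question, entities)
    try:
        results = list(run_sparql(query))
    except Exception:
        # Assumption: a query that fails to execute counts as "no result".
        results = []
    if results:
        return Answer(results, "wikidata", True)
    # No result from Wikidata: fall back to the LLM and label it as a guess.
    return Answer([ask_llm(question)], "gpt3_guess", False)
```

Keeping the `source` label on every answer mirrors the point made in the paper excerpt: the user should always be able to tell a verifiable Wikidata result from a model guess.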

Data and evaluation code from the paper have been released in a GitHub repo, where the authors state that "We are now working on releasing fine-tuned models."

The paper's endeavour bears some similarity to a paper authored by a different team of Stanford graduate students with professor Lam that sought to use Wikipedia (rather than Wikidata) to reduce LLM hallucinations; see the review in our July issue: "Wikipedia-based LLM chatbot 'outperforms all baselines' regarding factual accuracy".

Briefly

See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
US-based editors wanted for workshop on research ethics: For a research project titled "Beyond the Individual: Community-Engaged Design and Implementation of a Framework for Ethical Online Communities Research", a team from the University of Minnesota's GroupLens lab is seeking US-based Wikipedia editors to participate in a 2-hour remote workshop, to discuss "ways that research can help or harm the community" (following up on a previous workshop with non-US-based English Wikipedia editors). Interested users can sign up here.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata"

From the abstract:^[2]

"In this work, we explore the use of Large Language Models (LLMs) for knowledge engineering tasks in the context of the ISWC 2023 LM-KBC Challenge. For this task, given subject and relation pairs sourced from Wikidata, we utilize pre-trained LLMs to produce the relevant objects in string format and link them to their respective Wikidata QIDs. [...] The method achieved a macro-averaged F1-score of 0.701 across the properties, with the scores varying from 1.00 to 0.328. These results demonstrate that the knowledge of LLMs varies significantly depending on the domain and that further experimentation is required to determine the circumstances under which LLMs can be used for automatic Knowledge Base (e.g., Wikidata) completion and correction. The investigation of the results also suggests the promising contribution of LLMs in collaborative knowledge engineering. LLMKE won Track 2 of the challenge.

"Large language models learn to organize concepts in ways that are strikingly similar to how concepts are organized in [Wikidata]"

From the abstract:^[3]

"Knowledge bases such as WikiData provide large-scale, high-quality representations of inferential semantics and world knowledge. We show that large language models learn to organize concepts in ways that are strikingly similar to how concepts are organized in such knowledge bases. Knowledge bases model collective, institutional knowledge, and large language models seem to induce such knowledge from raw text. We show that bigger and better models exhibit more human-like concept organization, across four families of language models and three knowledge graph embeddings."

"Enhancing Multilingual Language Model with Massive Multilingual Knowledge Triples" from Wikidata

From the abstract:^[4]

[...] we explore methods to make better use of the multilingual annotation and language agnostic property of KG [knowledge graph] triples, and present novel knowledge based multilingual language models (KMLMs) trained directly on the knowledge triples. We first generate a large amount of multilingual synthetic sentences using the Wikidata KG triples. Then based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to enable the LMs to not only memorize the factual knowledge but also learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual tasks, including named entity recognition (NER), factual knowledge retrieval, relation classification, and a newly designed logical reasoning task.

"KGConv, a Conversational Corpus grounded in Wikidata"

From the abstract:^[5]

"We present KGConv, a large, conversational corpus of 71k conversations where each question-answer pair is grounded in a Wikidata fact. Conversations contain on average 8.6 questions and for each Wikidata fact, we provide multiple variants (12 on average) of the corresponding question using templates, human annotations, hand-crafted rules and a question rewriting neural model. We provide baselines for the task of Knowledge-Based, Conversational Question Generation. [...]"

"WikiDialog" dataset: "Dialog inpainting" using Wikipedia

From the abstract^[6] of a paper presented by a team of Google researchers at the 2022 ICML conference:

"[...] conversational question answering (ConvQA) systems have long been stymied by scarce training data that is expensive to collect. To address this problem, we propose a new technique for synthetically generating diverse and high-quality dialog data: dialog inpainting. Our approach takes the text of any document and transforms it into a two-person dialog between the writer and an imagined reader: we treat sentences from the article as utterances spoken by the writer, and then use a dialog inpainter to predict what the imagined reader asked or said in between each of the writer's utterances. By applying this approach to passages from Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets totalling 19 million diverse information-seeking dialogs -- 1,000x larger than the largest existing ConvQA dataset. Furthermore, human raters judge the answer adequacy and conversationality of WikiDialog to be as good or better than existing manually-collected datasets."

As "a real example of a dialog inferred from a Wikipedia passage using dialog inpainting" the paper presents the following (abridged) exchange between an "imagined reader" of the Freshman 15 article and a Wikipedia "Writer" who (after the initial greeting) always answers with excerpts from the article, with all other sentences filled in by the inpainter:

Writer: "I'm an automated assistant. I can tell you about Freshman 15."
Imagined Reader: "How does the freshman 15 relate to eating habits?"
Writer: "In parental-supervised diets, students also usually ingest the proper proportion [...]"
Imagined Reader: "What is the cause of this?"
Writer: "This is because when students go off to college, they face an independence [...]"
Imagined Reader: "Do people tend to eat healthier or less healthy when they are away from home?"
Writer: "Research has shown that over 60 percent of college students commonly ingest [...]"
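The structure of such an inpainted dialog can be sketched in Python as follows. This is a simplified illustration, not the paper's implementation: `predict_reader_turn` is a hypothetical stand-in for the trained inpainter model, and the assumption that it sees the dialog so far plus the writer's next sentence is a simplification made for this sketch.

```python
from typing import Callable

Turn = tuple[str, str]  # (speaker, utterance)


def inpaint_dialog(
    title: str,
    sentences: list[str],
    predict_reader_turn: Callable[[list[Turn], str], str],  # hypothetical inpainter
) -> list[Turn]:
    """Interleave article sentences (Writer) with predicted reader turns."""
    dialog: list[Turn] = [
        ("Writer", f"I'm an automated assistant. I can tell you about {title}.")
    ]
    for sentence in sentences:
        # The inpainter fills in what the imagined reader might have said
        # just before the writer's next (already known) utterance.
        reader_turn = predict_reader_turn(dialog, sentence)
        dialog.append(("Imagined Reader", reader_turn))
        dialog.append(("Writer", sentence))
    return dialog
```

Only the reader turns are generated; every writer turn after the greeting is an unmodified sentence from the source passage, which is what keeps the writer's side of the dialog grounded in the article.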

Wikipedia-based "Retrieval Augmentation Reduces Hallucination in Conversation" with large language models

From the abstract of a 2021 paper by a team from Facebook AI Research:^[7]

"Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures [retrieving articles from Wikipedia] [...] for knowledge-grounded dialogue [...] We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots."

Large language models as an alternative to Wikidata?

From the abstract:^[8]

"Pre-trained language models (LMs) have recently [as of 2021] gained attention for their potential as an alternative to (or proxy for) explicit knowledge bases (KBs). In this position paper, we examine this hypothesis, identify strengths and limitations of both LMs and KBs, and discuss the complementary nature of the two paradigms."

The authors acknowledge that "Starting from [a 2019 paper], many works have explored whether this LM-as-KB paradigm [i.e. the ability of LLMs to answer factual questions, by now familiar to users of ChatGPT] could provide an alternative to structured knowledge bases such as Wikidata." However, the paper concludes, as of 2021,

[...] that LMs cannot broadly replace KBs as explicit repositories of structured knowledge. While the probabilistic nature of LM-based predictions is suitable for task-specific end-to-end learning, the inherent uncertainty of outputs does not meet the quality standards of KBs. LMs cannot separate facts from correlations, and this entails major impediments for KB maintenance. We advocate, on the other hand, that LMs can be valuable assets for KB curation, by providing a “second opinion” on new fact candidates or, in the absence of corroborated evidence, signal that the candidate should be refuted.

References

[1] Xu, Silei; Liu, Shicheng; Culhane, Theo; Pertseva, Elizaveta; Wu, Meng-Hsi; Semnani, Sina; Lam, Monica (December 2023). "Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata". In Bouamor, Houda; Pino, Juan; Bali, Kalika. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. EMNLP 2023. Singapore: Association for Computational Linguistics. pp. 5778–5791. doi:10.18653/v1/2023.emnlp-main.353. Data and evaluation code
[2] Zhang, Bohui; Reklos, Ioannis; Jain, Nitisha; Peñuela, Albert Meroño; Simperl, Elena (2023-09-15), Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata, arXiv code
[3] Gammelgaard, Mathias Lykke; Christiansen, Jonathan Gabel; Søgaard, Anders (2023-08-29), Large language models converge toward human-like concept organization, arXiv
[4] Liu, Linlin; Li, Xin; He, Ruidan; Bing, Lidong; Joty, Shafiq; Si, Luo (December 2022). "Enhancing Multilingual Language Model with Massive Multilingual Knowledge Triples". In Goldberg, Yoav; Kozareva, Zornitsa; Zhang, Yue. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. pp. 6878–6890. doi:10.18653/v1/2022.emnlp-main.462.
[5] Brabant, Quentin; Lecorve, Gwenole; Rojas-Barahona, Lina M.; Gardent, Claire (2023-08-29), KGConv, a Conversational Corpus grounded in Wikidata, arXiv
[6] Dai, Zhuyun; Chaganty, Arun Tejasvi; Zhao, Vincent Y.; Amini, Aida; Rashid, Qazi Mamunur; Green, Mike; Guu, Kelvin (2022-06-28). "Dialog Inpainting: Turning Documents into Dialogs". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 4558–4586. Dataset, poster presentation
[7] Shuster, Kurt; Poff, Spencer; Chen, Moya; Kiela, Douwe; Weston, Jason (2021). "Retrieval Augmentation Reduces Hallucination in Conversation". doi:10.48550/ARXIV.2104.07567. Also in: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3784–3803, November 7–11, 2021.
[8] Razniewski, Simon; Yates, Andrew; Kassner, Nora; Weikum, Gerhard (2021-10-10). "Language Models As or For Knowledge Bases". arXiv:2110.04888 [cs].

Wikimedia Research Newsletter

