
[Newcomer track] Machine learning for Wikipedia
Closed, ResolvedPublic

Description

I am proposing a session at the Wikimania Hackathon about machine learning projects for Wikipedia.
During a 30-45 minute session, I would like to cover the following topics:

  • Some common guidelines about the use of machine learning on Wikipedia, explained with examples
    • TODO: documents related to this will be linked to this phab ticket.
  • A list of project ideas that can be a starting point for your hacking sessions.
  • A general overview of technologies that can be used, and the limitations that apply to Wikipedia projects

Event Timeline

The session outline, links, suggested reading, and materials are given below. I will also have a presentation covering this content.

Machine learning

Wikimedia Engineering Architecture Principles are applicable:

Some principles to keep in mind:

As per the Developer Satisfaction Survey 2023, the majority of respondents indicated that English is not their first or primary language.
So I would personally focus on projects and ideas related to language diversity in this session.

Machine translation

Wikipedia now has a self-hosted machine translation service.

You can access the test instance at https://translate.wmcloud.org. It has translation APIs too.

Try building some cool applications with the machine translation API; it is free to use.

  • How about a browser plugin that translates selected text using MinT?
  • A Wikipedia gadget/script that translates Wikipedia sections?
  • Chain ASR → MT → TTS to do speech-to-speech translation?
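To get started with the API, here is a minimal Python sketch that builds a translation request using only the standard library. The endpoint path (`/api/translate/{source}/{target}`) and the JSON payload shape are assumptions; check the service repository's README for the actual contract.

```python
# Build (but do not send) a POST request for a MinT-style translation API.
# The endpoint path and payload shape are assumptions -- verify them against
# the mediawiki-services-machinetranslation README.
import json
import urllib.request

BASE_URL = "https://translate.wmcloud.org"  # or a local instance, e.g. http://localhost:8089

def build_translate_request(text, source_lang, target_lang, base_url=BASE_URL):
    """Return a ready-to-send urllib Request translating `text`."""
    url = f"{base_url}/api/translate/{source_lang}/{target_lang}"
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_translate_request("Hello, world!", "en", "es")
# To actually call the service: urllib.request.urlopen(req).read()
```

Requests with a `data` body default to POST; sending is left to `urllib.request.urlopen`, so the sketch itself makes no network calls.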

However, this is a test instance, so we don't offer any uptime guarantees. But don't worry: the service can also run on your laptop.
Just clone the repo from here, and run it: https://github.com/wikimedia/mediawiki-services-machinetranslation
Or download the prebuilt docker container and run it: https://docker-registry.wikimedia.org//wikimedia/mediawiki-services-machinetranslation/tags/

Example Docker deployment (remember to replace the tag with the latest one):

$ docker pull docker-registry.wikimedia.org/wikimedia/mediawiki-services-machinetranslation:2023-07-10-051738-production

$ docker run -dp 8089:8989 docker-registry.wikimedia.org/wikimedia/mediawiki-services-machinetranslation:2023-07-10-051738-production

You now have an MT service supporting 35,924 language pairs across 198 unique languages running on your laptop.

As you can see, MinT translates not only plain text but also HTML, JSON, Markdown, SVG, etc.

Could you host this service on your web server? Your university's web server? Or a Wikipedia chapter's server, helping distribute the computing cost for the WMF? Sounds like a good idea? Interested?
Read this document: https://docs.google.com/document/d/1zBX1H5qjQq15_5EREAxILRBTbeDxyCPUlEtVsNLS2uo/edit

Speech

Text to speech

Wikipedia started a project called Phonos to read out IPA (pronunciation representations) and provide general TTS capabilities. However, depending on Google's paid TTS to support a large set of languages is not a good idea. See https://phabricator.wikimedia.org/T317274

Interested in exploring and trying out alternate options? See https://github.com/coqui-ai/TTS.
Meta recently announced its Massively Multilingual Speech models (https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/), and coqui-ai/TTS recently integrated them.

https://tts.wmcloud.org is a demo web application that uses coqui-ai/TTS and Meta's MMS speech models.

  • Do you think a TTS service could help Wikipedia projects? Do you have project ideas involving TTS? Does this TTS support your language?
  • If it does, have you tried it? Is there a way to fine-tune it? If your language is not present, can we add support for it?
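For a quick experiment, here is a minimal sketch using the Coqui TTS Python API. It assumes `pip install TTS`; the model name below is only an example, so list the available models with `tts --list_models` and pick one for your language.

```python
# Sketch: synthesize speech to a WAV file with coqui-ai/TTS.
# Requires `pip install TTS`; the model name is an example only.
def synthesize(text, out_path="speech.wav",
               model_name="tts_models/en/ljspeech/tacotron2-DDC"):
    from TTS.api import TTS  # imported lazily so the sketch loads without the package
    tts = TTS(model_name=model_name)
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path

# Example: synthesize("Wikipedia is a free online encyclopedia.")
```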

Automatic speech recognition

https://github.com/openai/whisper (MIT licensed).
There is also whisper.cpp, which optimizes it to run on just CPUs (GPUs too). Try it on your computer?
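A minimal transcription sketch with the openai-whisper package (assumes `pip install openai-whisper` and ffmpeg on your PATH; the file name is a placeholder):

```python
# Sketch: transcribe an audio file with openai-whisper.
# Requires `pip install openai-whisper` and ffmpeg installed on the system.
def transcribe(audio_path, model_name="base"):
    import whisper  # imported lazily so the sketch loads without the package
    model = whisper.load_model(model_name)  # "tiny" and "base" run fine on CPU
    result = model.transcribe(audio_path)
    return result["text"]

# Example: print(transcribe("recording.mp3"))
```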

Does this ASR support your language? If it does, have you tried it? Is there a way to fine-tune it? If your language is not present, can we add support for it?
What are some applications of ASR in the Wikipedia context?

Compute optimization

Due to our open source policies, we have restrictions on the kinds of GPUs we can use: https://techblog.wikimedia.org/2020/04/06/saying-no-to-proprietary-code-in-production-is-hard-work-the-gpu-chapter/
Even if we could use such powerful GPUs, any optimization of inference saves energy and operational costs and makes these technologies accessible to more people.
See how we optimized machine translation models to run on CPUs: https://diff.wikimedia.org/2023/06/13/mint-supporting-underserved-languages-with-open-machine-translation/

Optical Character Recognition

Tesseract supports 100+ languages and various image formats:

https://github.com/tesseract-ocr/tesseract
An example OCR frontend with an API is running at https://tesseract.wmcloud.org.
There is also https://ocr.wmcloud.org for OCRing content on Commons; it uses Tesseract, Google Cloud Vision OCR, and Transkribus.

  • Does Tesseract work well for your language? Please try it and give feedback.
  • Build applications using OCR. Example: translate a JPEG image into another language?
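As a starting point, here is a small Python sketch that shells out to the tesseract CLI. It assumes the `tesseract` binary and the relevant language data are installed; passing `stdout` as the output base prints the recognized text instead of writing a file.

```python
# Sketch: OCR an image by invoking the tesseract command-line tool.
# Assumes `tesseract` and the needed traineddata files are installed.
import subprocess

def build_ocr_command(image_path, lang="eng"):
    # "stdout" as the output base makes tesseract print text to stdout
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="eng"):
    """Return the text tesseract recognizes in `image_path`."""
    cmd = build_ocr_command(image_path, lang)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```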

Transkribus is an AI-powered platform for text recognition, transcription, and searching of historical documents: https://readcoop.eu/transkribus

Large Language Models

Do read:

  1. https://en.wikipedia.org/wiki/Wikipedia:Large_language_models
  2. https://medium.com/freely-sharing-the-sum-of-all-knowledge/wikipedias-value-in-the-age-of-generative-ai-b19fec06bbee#6c31
  3. Thoughts on chatGPT and Wikimedia https://docs.google.com/document/d/1GB8PS26xJV2OR46UO5_6JyVLGh2HX_3l5VprBjXpvTQ/edit

For learning:

  1. https://simonwillison.net/2023/Aug/3/weird-world-of-llms/
  2. https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
  3. https://arxiv.org/pdf/2307.10169.pdf Challenges and Applications of Large Language Models

Experimenting with LLMs is costly. At present, the WMF does not run any LLM-based service.

There is a Wikipedia ChatGPT plugin: https://diff.wikimedia.org/2023/07/13/exploring-paths-for-the-future-of-free-knowledge-new-wikipedia-chatgpt-plugin-leveraging-rich-media-social-apps-and-other-experiments/

Language support in LLMs is also not broad.

There are some internal experiments using LLMs for a few use cases; all of them are experimental.

Retrieval Augmented Generation and natural language question answering:
https://thottingal.in/blog/2023/07/21/wikiqa/

Wikidata knowledge graph to articles

  • Creating summaries of an article based on facts retrieved from Wikidata
  • Creating placeholder articles based on facts retrieved from Wikidata
### Instruction: Write a paragraph based on the given data below in fluent English.
 
Place name: Tenerife
area: 2,034 km²
known for: tourism
visitors: 6 million per year.
popular resorts: Puerto de la Cruz and Playa de las Américas.

### Response:
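The facts-to-prompt step above can be sketched in plain Python. The field names are just the ones from the example, and the resulting string is meant to be fed to whichever LLM you are experimenting with:

```python
# Sketch: turn Wikidata-style facts into an instruction prompt like the
# example above.
def build_prompt(place, facts, language="English"):
    lines = [
        f"### Instruction: Write a paragraph based on the given data below in fluent {language}.",
        "",
        f"Place name: {place}",
    ]
    lines += [f"{key}: {value}" for key, value in facts.items()]
    lines += ["", "### Response:"]
    return "\n".join(lines)

prompt = build_prompt("Tenerife", {
    "area": "2,034 km²",
    "known for": "tourism",
    "visitors": "6 million per year",
    "popular resorts": "Puerto de la Cruz and Playa de las Américas",
})
```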

Related reading:

Datasets

See https://phabricator.wikimedia.org/T341907
Wikimedia's huggingface profile: https://huggingface.co/wikimedia

Slst2020 renamed this task from Session: Machine learning for Wikipedia to [Newcomer track] Machine learning for Wikipedia. Aug 12 2023, 2:02 PM

@santhosh: Thanks for participating in the Hackathon! We hope you had a great time.

  • If this session took place: Please change the task status to resolved via the Add Action...Change Status dropdown.
    • If there are session notes (e.g. on Etherpad or a wiki page), or if the session was recorded, please make sure these resources are linked from this task.
    • If there are specific follow-up tasks from this session / event: Please create dedicated tasks and add another active project tag to those tasks, so others can find those tasks (as likely nobody in the future will look at the Hackathon workboard when trying to find something they are interested in).
  • If this session / event did not take place: Please set the task status to declined.

Thank you,
Phabricator housekeeping service

No reply; resolving task.
