Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis
Closed, Resolved · Public

Assigned To

Authored By: Chtnnh, Mar 17 2020, 3:45 PM

Description

Profile Information

Name: Chaitanya Mittal
IRC nickname on Freenode: chtnnh
Web Profile: https://www.github.com/chtnnh
Resume

resume.pdf (132 KB)

Location: Dubai, AE
Typical working hours: 18:00 - 02:00 (UTC+4)

Synopsis

Short summary describing your project and how it will benefit Wikimedia projects

The automatic classification system currently in place for ptwiki is naive: it checks a few if-conditions and places articles accordingly. The existing system places articles into 6 target labels, 2 of which require editor approval. It will be replaced with the improved ‘articlequality’ model, which automatically labels articles by quality, and the ‘draftquality’ model, which filters out drafts that are spam and/or vandalism.

This proposal elaborates on implementing the ‘articlequality’ and ‘draftquality’ models for the Portuguese wiki by following a design like that in the English wiki, based largely on the work done by Morten Warncke-Wang et al.

Such an implementation would require feature extraction from ptwiki, training various models on these features and fitness testing these models to find the best fit.
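
As a rough illustration of the "train several models, then keep the best fit" step described above, the sketch below trains two toy classifiers on hand-made feature vectors and keeps the one with the higher held-out accuracy. Everything in it is an assumption made for illustration: the two features (word count, reference count), the labels, the data, and the helper names are hypothetical, and the toy classifiers merely stand in for whatever candidate models the project actually evaluates.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], str]          # (feature vector, quality label)
Model = Callable[[Sequence[float]], str]   # maps features to a predicted label


def train_majority(train: List[Row]) -> Model:
    """Baseline: always predict the most common training label."""
    top_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: top_label


def train_nearest_centroid(train: List[Row]) -> Model:
    """Predict the label whose mean feature vector is closest (squared distance)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(features: Sequence[float]) -> str:
        def sq_dist(lab: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(features, centroids[lab]))
        return min(centroids, key=sq_dist)

    return predict


def accuracy(model: Model, test: List[Row]) -> float:
    """Fraction of held-out rows the model labels correctly."""
    return sum(model(features) == label for features, label in test) / len(test)


def select_best(trainers: Dict[str, Callable[[List[Row]], Model]],
                train: List[Row], test: List[Row]) -> Tuple[str, Dict[str, float]]:
    """Fit every candidate on `train`, score it on `test`, return the best name and all scores."""
    scores = {name: accuracy(fit(train), test) for name, fit in trainers.items()}
    return max(scores, key=scores.get), scores


# Made-up data (hypothetical features): (word count, reference count) -> label.
TRAIN = [((100, 0), "stub"), ((150, 1), "stub"), ((120, 0), "stub"),
         ((2000, 20), "good"), ((2500, 30), "good")]
TEST = [((130, 1), "stub"), ((2200, 25), "good"), ((1800, 15), "good")]

best, scores = select_best(
    {"majority": train_majority, "nearest_centroid": train_nearest_centroid},
    TRAIN, TEST,
)
print(best, scores)
```

Keeping a trivial baseline such as the majority-class predictor among the candidates is a cheap way to check that a more elaborate model is actually learning something from the extracted features.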

The immediate use cases of this model would be:

Help increase the quality of automated article classification for ptwiki
Streamline work for editors on ptwiki when identifying articles that need expansion or improvement, or articles that can be featured.

The implementation would also pave the way for further work to be done in automating various wiki tasks for ptwiki.

Mentor(s): @Halfak @Darwinius
Have you contacted your mentors already? Yes!

Deliverables

Days/Dates	Milestone/Deadline/Subtask Accomplished
Apr 27 - May 17	Community bonding period: spend time interacting with analytics team at Wikimedia, understand common practices and norms
May 18 - May 24	Preliminary research on features to be extracted from ptwikis
May 25 - May 31	Completion and Integration of extractors for ptwiki
Jun 1 - Jun 7	Testing for Extractors and Implementation of feature_lists
Jun 8 - Jun 14	Testing feature_lists
Jun 15 - Jun 19	Phase 1 Evaluations
Jun 22 - Jun 28	Research various models for implementing articlequality
Jun 29 - Jul 5	Implement top few models to benchmark performance
Jul 6 - Jul 12	Testing and Implementation of top few models
Jul 13 - Jul 17	Phase 2 Evaluations
Jul 20 - Jul 26	Selection of top performing model
Jul 27 - Aug 2	Streamlining footprint of selected model
Aug 3 - Aug 9	Streamlining selected model and completing subtasks. Documenting the process and model for future reference in ORES engineering
Aug 10 - Aug 24	Final Evaluation

In addition to code, I plan to start a blog on my portfolio website where I will write about my work on this project once every two weeks. This will help with documentation and give Wikimedia AI projects some exposure.

Participation

In terms of participation, I plan to communicate mainly through five channels: Phabricator for documented information, IRC for general queries, Zulip for task-specific queries, and email and team meetings for official communication regarding progress.

As far as source code is concerned, I have learnt that the best way to share code is through commits. But in cases where this is not the best option, services like https://codeshare.io could be handy.

About Me

Hi! I am Chaitanya Mittal, an undergrad in Computer Science and Engineering currently in my first year. I am an algorithmic coder and machine learning enthusiast. I have the distinction of qualifying for the Asia Regionals of the ACM ICPC 2018. I have worked with the Mozilla Foundation and the Mifos Foundation previously, though only for a short period of time. I am an open source enthusiast and truly believe in the power it holds to influence the world.

In particular though, I have fallen in love with Wikimedia's vision, "Imagine a world where we can all share freely in the sum of all knowledge" and the fact that it stays true to that. In the spirit of free knowledge and collaborative code, I believe Wikimedia leads by example.

The time frame for the project is from June to August. My summer break runs from July until the end of August. I will have only minor college engagements during the first two weeks of the project, and I will strive not to let them affect my enthusiasm for the project in any way.

This proposal has been selected for GSoC 2020

What does making this project happen mean to you?

Having relied on Wikimedia since childhood, without even realizing it, I understand the role that Wikimedia plays and has been playing in shaping how knowledge is shared around the world. The successful completion of this project would directly improve wiki quality for a language with more than 200 million native speakers. To be able to make a small difference in how 200 million people access knowledge would mean the world to me.

It would help a 19-year-old realize that collaboration can lead to great things. This is what making this project happen means to me.

Past Experience

Having actively worked in open source for a year now, I have looked for a welcoming community working towards a cause I could relate to. In this process, I have encountered multiple projects (Mozilla, Mifos), developers and tasks. Although it is difficult for me to describe this experience quantitatively, I can affirm that it has helped me become a better developer. I have helped with some tasks here in the Wikimedia community as well!

T245068 is the first task that I have completed.
T246438 and T246663 are tasks I am currently working on with @Halfak and have made significant progress on, as of the writing of this proposal.

At a personal level, I actively program competitively and keep myself up to date on the latest machine learning algorithms being developed. I love both Python and C, although competitive programming does make me use C++ quite often. I am a native Linux and Bash user and prefer coding in vim or Visual Studio Code.

Any Other Info

References: T246663

Related Projects/Microtasks:

T246438 could be used as a microtask and the implemented features for text complexity can be utilized for all wikis instead of just enwiki.
Convert all extractors for various wikis to generators to handle 0 or more labels per template (currently all expect only 1 label per template)
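
The generator conversion in the second item can be sketched as follows. This is a minimal illustration, not the existing extractor code: the template name `quality`, the convention that parameters whose names start with `class` carry labels, and the function name `extract_labels` are all hypothetical.

```python
from typing import Dict, Iterator


def extract_labels(template_name: str, params: Dict[str, str]) -> Iterator[str]:
    """Yield each quality label carried by one template (possibly none)."""
    # Hypothetical template name; a real extractor would match the wiki's own templates.
    if template_name.strip().lower() != "quality":
        return  # not a quality template: the generator yields nothing
    for key, value in params.items():
        # Hypothetical convention: any parameter starting with "class" holds a label.
        if key.lower().startswith("class") and value.strip():
            yield value.strip().lower()


# Two labels on one template, a template with no labels, and an unrelated template.
print(list(extract_labels("Quality", {"class": "B", "class2": "A"})))
print(list(extract_labels("quality", {"importance": "high"})))
print(list(extract_labels("Infobox", {"class": "B"})))
```

Because the function yields rather than returns a single value, a caller can loop over the results the same way whether a template carries no label, one label, or several.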

Relevant Links:

This is the original proposal for the Google Summer of Code 2020 and the scope of the project has expanded. The final scope will be included in the reports that follow.

Related Objects

Status	Assigned	Task
Resolved	Chtnnh	T247847 Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis
Resolved	Chtnnh	T251171 Add `words_to_watch` to articlequality and draftquality models in ptwiki
Resolved	Chtnnh	T250809 Review model performance for ptwiki 'articlequality' and 'draftquality'
Resolved	Chtnnh	T246663 Build article quality model for ptwikipedia
Resolved	Chtnnh	T246667 Build draft quality model for ptwikipedia
Resolved	Chtnnh	T251905 Write report about misclassification reports
Open	None	T194509 Build article quality model for bswiki
Resolved	Chtnnh	T251571 Build article quality model for Ukrainian Wikipedia
Resolved	Chtnnh	T253672 Build improved 'articlequality' model for ptwiki
Resolved	Chtnnh	T258735 Build articlequality model for Hindi wiki

Event Timeline

Chtnnh created this task. (Mar 17 2020, 3:45 PM)

Restricted Application added a subscriber: Aklapper. (Mar 17 2020, 3:45 PM)

We've already started the first steps here. @GoEThe, would you be interested in co-mentoring this project?

By "we" let me be clear that @Chtnnh has already started to pick up the preliminary work for this task.

srishakatux moved this task from Backlog to Accepted Proposals on the Google-Summer-of-Code (2020) board. (Mar 17 2020, 6:08 PM)

In T247847#5975987, @Halfak wrote:

We've already started the first steps here. @GoEThe, would you be interested in co-mentoring this project?

Sure, I would be happy to.

edit: I guess, before signing up, I should know what is the time commitment for this.

In T247847#5978667, @GoEThe wrote:

Sure, I would be happy to.
edit: I guess, before signing up, I should know what is the time commitment for this.

Thank you! Typically, mentors are expected to spend 4-5 hours per week for each student. :)

You may also refer to the following:

GSoC mentor responsibilities: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Mentors
GSoC mentor guides: https://google.github.io/gsocguides/mentor/

Chtnnh renamed this task from Proposal (GSoC / Outreachy 2020): Implement articlequality model for ptwiki to Proposal (GSoC 2020): Implement articlequality model for ptwiki. (Mar 18 2020, 5:06 PM)

@GoEThe Do you think you would be able to commit to mentoring this project? I am asking because I am expected to submit the names of potential mentors in my final proposal.

Thank you so much!

Hi, sorry. Things are a bit unstable at the moment. I don't think I can commit for that amount of time.

That's alright 😄

@srishakatux @Pavithraes Do you have any suggestions for me?

@GoEThe, could you recommend anyone else from ptwiki who might be able to help us understand the language and community needs?

@Darwinius said that he might have some time to help. And of course I can answer some questions as they appear, if time is not critical.

@Darwinius Hello! Do you think you would be able to help us out with this proposal?

Answering questions with a 1-2 day lag would be perfectly acceptable. I think we'll primarily need support for local ptwiki and Portuguese language stuff. E.g. we're working on gathering data for the "drafttopic" model now and we'd like to have you check our assumptions on how we're interpreting the meaning of ER6 and ER20 deletion reasons. I expect to see more of that and maybe some help testing the models once we're ready to serve you predictions about some articles/drafts.

@Chtnnh hello! Yes, I hope so. What should I do? How can I help?

Hello Darwin! @Halfak and I would like to submit this proposal to the coming Google Summer of Code program and require a second mentor from the Portuguese wiki community. @GoEThe suggested your name to us. What we would need from you is about 4 hours a week to answer some questions about ptwiki and check any assumptions we may be making while developing this model. We would also require your assistance in testing the models once we're ready. Do you think it would be possible for you to commit your time to this?

The program lasts from June until August.

Halfak moved this task from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board. (Apr 6 2020, 5:00 PM)

He7d3r subscribed. (Apr 27 2020, 9:52 PM)

Chtnnh added subtasks: T246663: Build article quality model for ptwikipedia, T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki, T246667: Build draft quality model for ptwikipedia, T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'. (May 5 2020, 12:55 PM)

Pavithraes mentioned this in T247614: Proposal (GSoC 2020): Implement an NSFW image classifier with open_nsfw. (May 5 2020, 7:10 PM)

Chtnnh renamed this task from Proposal (GSoC 2020): Implement articlequality model for ptwiki to Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis. (May 7 2020, 11:10 AM)

Chtnnh set Due Date to Aug 23 2020, 8:00 PM.

Chtnnh updated the task description.

@Darwinius Hey Darwin! Can we find another way to collaborate and discuss the task? Something like irc where you can find me hanging in the #wikimedia-ai channel by the nick chtnnh. I am open to anything that works for you also.

Chtnnh closed subtask T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki as Resolved. (May 20 2020, 10:10 AM)

Chtnnh added subtasks: T252581: Train and test editquality models for Hindi Wikipedia, T194509: Build article quality model for bswiki, T130296: Train/test edit quality models for ukwiki. (May 20 2020, 10:18 AM)

Chtnnh added a subtask: T251571: Build article quality model for Ukrainian Wikipedia. (May 22 2020, 3:13 PM)

Chtnnh updated the task description. (May 25 2020, 8:02 AM)

Chtnnh closed subtask T246663: Build article quality model for ptwikipedia as Resolved. (May 26 2020, 6:02 PM)

Chtnnh added a subtask: T253672: Build improved 'articlequality' model for ptwiki. (May 26 2020, 6:29 PM)

Chtnnh removed subtasks: T130296: Train/test edit quality models for ukwiki, T252581: Train and test editquality models for Hindi Wikipedia. (Jun 1 2020, 3:53 PM)

Halfak closed subtask T246667: Build draft quality model for ptwikipedia as Resolved. (Jun 22 2020, 4:36 PM)

Chtnnh added a subtask: T258735: Build articlequality model for Hindi wiki. (Jul 24 2020, 7:29 AM)

Google-Summer-of-Code (2020) is over! I believe you have already documented your project here https://www.mediawiki.org/wiki/Google_Summer_of_Code/Past_projects#2020. If not, I would encourage you to do so. Also, is there anything else remaining in this task to address? If not, please consider closing this task as resolved.

srishakatux awarded a token. (Sep 23 2020, 4:00 AM)

calbon closed subtask T258735: Build articlequality model for Hindi wiki as Resolved. (Sep 23 2020, 4:15 PM)

• ACraze closed subtask T253672: Build improved 'articlequality' model for ptwiki as Resolved. (Sep 23 2020, 5:06 PM)

• ACraze closed subtask T250809: Review model performance for ptwiki 'articlequality' and 'draftquality' as Resolved. (Sep 23 2020, 5:08 PM)

calbon reopened subtask T258735: Build articlequality model for Hindi wiki as Open. (Sep 24 2020, 5:13 PM)

In T247847#6486056, @srishakatux wrote:

Google-Summer-of-Code (2020) is over! I believe you have already documented your project here https://www.mediawiki.org/wiki/Google_Summer_of_Code/Past_projects#2020. If not, I would encourage you to do so. Also, is there anything else remaining in this task to address? If not, please consider closing this task as resolved.

@Chtnnh: Ping. Can you please answer?

• ACraze moved this task from Backlog/Lift Wing to Backlog/Other on the Machine-Learning-Team board. (Jan 20 2021, 12:59 AM)

calbon closed subtask T258735: Build articlequality model for Hindi wiki as Resolved. (Jan 20 2021, 6:39 PM)

@Chtnnh: Hi, could you please answer the last question?

Aklapper removed Due Date. (Feb 8 2021, 9:56 AM)

Sorry for the delay in marking the task as resolved.

We have been able to successfully build and deploy articlequality and draftquality models for the Portuguese Wikipedia and had begun work on the models for Ukrainian and Hindi wikis.

Due to the change in long-term plans with the ML team, the Ukrainian and Hindi wiki models have been put on the backlog for the foreseeable future.

@Chtnnh is now focusing energies towards helping the ML team get Lift Wing to production. Uk and Hi wiki models will be further developed after that goal has been achieved.

@Chtnnh: Thanks for the update, and your work! :)

calbon closed subtask T251571: Build article quality model for Ukrainian Wikipedia as Resolved. (May 19 2021, 5:12 PM)

hashar mentioned this in T371035: Archive Gerrit repo mediawiki/services/open-nsfw. (Jul 25 2024, 2:58 PM)

hashar mentioned this in rMSNS mediawiki-services-open-nsfw. (Jul 25 2024, 3:04 PM)
