
Create Cyberbot Project on Labs
Closed, ResolvedPublic

Description

Hi. Cyberbot has hit a point where it's going to need more resources to operate. With its newest bot, the demand for resources is quite high, especially since I have been asked by Ocaasi, on behalf of the WMF, to provide this bot on 30 of the largest wikis, and later perhaps more. Having discussed this with Coren, it would seem preferable to create a project with the needed resources rather than a new exec node. With that said, it would also make sense to move Cyberbot entirely off of toollabs and into its own project. That would then free up the exec node reserved for Cyberbot, which could then be used for other bots and scripts.

While this new bot is still a work in progress (I am still working on making it as efficient as possible and developing the last function), some users have expressed a desire to see the bot in action soon, since it is now approved to run in its current state of development.

I'm not yet sure of the resources I will truly need to accomplish this, but my educated guess suggests the number is quite high. I will continue to improve the efficiency of the bot to avoid using resources needlessly, but I am opening this Phabricator ticket to at least get things started.

Event Timeline

Cyberpower678 assigned this task to coren.
Cyberpower678 raised the priority of this task from to Low.
Cyberpower678 updated the task description. (Show Details)
Cyberpower678 added a project: VPS-Projects.
Cyberpower678 added subscribers: Cyberpower678, Ocaasi.

It's not immediately clear to me what VM resources we're talking about.

Can you tell me more about what those numbers mean? Will the storage be on NFS or local instance storage? If the latter, will it be divided amongst many instances or just running on one big one? Does a 'worker' represent full-time use of a VM CPU?

<Cyberpower678> Ideally, I would like 170GB of RAM, and enough CPU to simultaneously

170 GB is about 50% of the RAM of a single hardware node, or about 5% of the RAM of all of Labs. That's a lot, so I'd probably cast a vote for the 12-day option, at least in the short run.

Even for that, we need to hold off for a few weeks until we're able to rack some more hardware. I can allocate you a smaller project in the meantime to get things off the ground; the default quota for a new project is about 50G.

Like I said earlier, the bot is still in development, and I will continue to improve its resource usage. I must say these are rather high demands, so I will keep shaving GBs off of the bot.

I don't understand. What kind of work are you doing that requires so much memory?

I want to run DeadLinksBot in parallel to improve its speed and handle the 5 million articles faster. Doing so will require more memory.

I also intend to run it on 29 other wikis.

But again, these resource demands are not final. I'm still working on the bot's resource usage.
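For illustration only, here is a minimal sketch of that kind of per-article parallelism, with hypothetical names (process_article, the title list); it is not the bot's actual code. The point it makes is that peak memory scales roughly with the number of concurrent workers, since each one holds its own page text and link-check state.

```
# Hypothetical sketch of per-article parallel workers; names are placeholders,
# not taken from the real bot.
from multiprocessing import Pool

def process_article(title):
    """Fetch one article, check its external links, and prepare any fixes.

    Each worker holds only a single article's text and link results at a time,
    so peak memory is roughly (per-article footprint) x (number of workers).
    """
    # ... fetch page text, extract links, check them, build the edit ...
    return title, 0  # (title, number of links repaired) as a placeholder

if __name__ == "__main__":
    titles = ["Example article 1", "Example article 2"]  # placeholder work list
    with Pool(processes=8) as pool:  # 8 workers -> roughly 8x single-worker RAM
        for title, fixed in pool.imap_unordered(process_article, titles):
            print(title, fixed)
```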

If I understand correctly, DeadLinksBot parses pages, fetching archived sources or submitting them for archival. That seems like a very linear task. Why does it need 500MB per worker?

Also, I guess you are estimating the task size based on page count or link count. What speed are you estimating?

If DeadLinksBot indeed checks dead web links, then we already have several (?) bots running for that purpose, including, AFAIR, @Giftpflanze's bot, for which there is also a Labs project, Dwl IIRC (dead web link something?).

So if the purpose is to detect dead web links, I would like a) to avoid making that a part of a whole Cyberbot system, and b) to combine the various endeavours.

Also, c), as checking web links is stateless, this feels like an ideal candidate for the new Kubernetes system that @yuvipanda is installing.
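To illustrate the "stateless" point: a single check needs nothing but the URL, so any number of identical workers (e.g. Kubernetes pods or jobs) can run side by side without sharing anything. This is only a sketch of the idea; the HEAD-then-GET fallback and thresholds are assumptions, not a description of any existing bot.

```
# Stateless dead-link probe: input is a URL, output is a boolean. Because no
# state is shared between calls, it scales out horizontally without changes.
import requests

def looks_dead(url, timeout=10):
    """Return True if the URL appears dead (connection failure or HTTP >= 400)."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, timeout=timeout, allow_redirects=True, stream=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True

if __name__ == "__main__":
    print(looks_dead("https://en.wikipedia.org/"))
```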

https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5: "Future feature requests, such as detecting unmarked dead links, should be made under a subsequent BRFA." From that I conclude that the workload won't overlap with my efforts. What is consuming the resources is probably keeping all data in memory instead of writing it to disk (which should be preferable, I guess?). A point which concerns me is that the bot traverses the complete article namespace via the API (for template transclusions and article text), which would better be done via the DB and/or dumps.
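For comparison, a sketch of what the DB route could look like against the Labs replicas, assuming the classic templatelinks schema and the enwiki replica naming of the time; the host, credentials file and template name are placeholders.

```
# Hypothetical: list pages transcluding a template via the Labs DB replica
# instead of walking the API. Host, database and template name are placeholders.
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",                       # assumed replica host name
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT page_title
            FROM templatelinks
            JOIN page ON page_id = tl_from
            WHERE tl_namespace = 10            -- Template namespace
              AND tl_title = %s
            """,
            ("Dead_link",),                     # placeholder template name
        )
        for (title,) in cur:
            print(title)
finally:
    conn.close()
```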

Cyberpower678 changed the task status from Open to Stalled.Sep 18 2015, 1:30 PM

Hey guys, I appreciate all the input; it's very useful. Per a discussion I had with Earwig last night, I'm temporarily suspending this request. I'm going to attempt some changes which, in theory, should drastically reduce the RAM requirements. I would appreciate it if you could hold off on commenting on or actioning this request until I have more info to post.

From my experience running a fast-editing global bot (http://meta.wikimedia.org/wiki/User:Addbot): 10 million articles can be edited in a month across all sites using 15 threads, so I guess 5 million could take 2 weeks on the same number of threads?

Personally I find the code that is going to be running very hard to follow, so I can't really offer many comments on this. But I imagine one of the slower bits will be checking whether the links are dead and getting the Wayback link?

Personally I would use the dump, check everything, get the links, and then make the edits that are needed. Run 1 thread per wiki making the edits that have already been worked out; as each edit should only be 1 API call, this would likely go rather fast.
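A rough sketch of that dump-first pass, assuming the standard pages-articles XML dump layout; the file name and the crude URL regex are placeholders.

```
# Hypothetical dump-first pass: stream a pages-articles XML dump, pull external
# links out of each page's wikitext, and report pages that may need edits.
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace of 2015-era dumps
URL_RE = re.compile(r"https?://[^\s\]<>|}]+")       # crude link matcher, placeholder

def pages_with_links(dump_path):
    """Yield (title, [urls]) for each page in the dump containing external links."""
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            urls = URL_RE.findall(text)
            if urls:
                yield title, urls
            elem.clear()  # keep memory flat while streaming the dump

if __name__ == "__main__":
    for title, urls in pages_with_links("enwiki-latest-pages-articles.xml"):
        print(title, len(urls))
```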

Anyway, it will be interesting to see how this develops after your discussion with Earwig!

I agree with @Addshore. I also don't think ops will be very happy with you making millions of API requests for page contents (if that was your original plan).

coren removed coren as the assignee of this task.Nov 16 2015, 6:41 PM
coren subscribed.
Cyberpower678 changed the task status from Stalled to Open.Mar 27 2016, 5:19 AM
Cyberpower678 raised the priority of this task from Low to Medium.

So resource usage has been improved significantly. Average usage without the checkIfDead class is roughly 30MB. Anticipating memory usage when the checkIfDead class is enabled, I'm going to make an educated guess of around 100-200MB, to account for all the websites it will load into memory when checking whether sites are dead. Factoring in future usage, I would like to request a 64GB project. Is that unreasonable?
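One way such a per-worker figure stays predictable (a sketch of the general idea, not of the checkIfDead class itself): stream each response and stop after a fixed number of bytes, so a single huge page cannot inflate a worker's memory.

```
# Sketch of a memory-capped liveness check: stream the body and read at most
# MAX_BYTES per URL, so one very large page cannot blow up a worker's footprint.
# Illustration only; not the actual checkIfDead implementation.
import requests

MAX_BYTES = 1 << 20  # read at most 1 MiB per URL

def is_alive(url, timeout=10):
    try:
        resp = requests.get(url, timeout=timeout, stream=True)
    except requests.RequestException:
        return False
    try:
        if resp.status_code >= 400:
            return False
        read = 0
        for chunk in resp.iter_content(chunk_size=8192):
            read += len(chunk)
            if read >= MAX_BYTES:
                break  # enough seen to judge the page; stop downloading
        return True
    except requests.RequestException:
        return False
    finally:
        resp.close()
```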

Assuming we can be upgraded in the future, we'll take 8GB to start with. That should be more than sufficient for now.
