The generic media structure looks like this:
<wrapper> figure, span
  <link> a, span
    <media element> audio, video, img, etc.
and the Less mirrors that:
figure[ typeof~='mw:File' ] > *:first-child > img, > audio, ...
The list of possible elements for the media element is growing. In addition to audio, video, and img, it can be a picture or, soon, a model. T304343 makes it clear that TimedMediaHandler would like to swap out the media element for some spans while keeping the styling.
In T304010, the request is:
"Please just use classes instead"
Mirroring the DOM in CSS is something we'd like to get away from as well; see T270150#7211201.
One level of child combinators is still necessary because media can be nested in the figcaption,
<wrapper>
  <link>...</link>
  <caption>...</caption>
</wrapper>
and styles should not apply in there. There's a slight issue that active formatting elements can be reopened as direct descendants of the figure, but that's being looked at in T314059.
So the Less would simplify to something like this:
figure[ typeof~='mw:File' ] > *:first-child .mw-file-element
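As a rough sketch of what that could look like in nested form, assuming the media element (or whatever replaces it) carries the mw-file-element class. The vertical-align declaration is only a placeholder to show where the shared media styles would go, not a rule taken from the actual stylesheet.

```less
figure[ typeof~='mw:File' ] {
	// Keep one child combinator so media nested in the figcaption is not matched.
	> *:first-child {
		// Match on the class rather than enumerating img, audio, video, etc.
		.mw-file-element {
			// Placeholder declaration (hypothetical); the real media styles go here.
			vertical-align: middle;
		}
	}
}
```

With this shape, adding a new media element type or swapping the element for spans would not require touching the selector, as long as the class is kept.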
The counterpoint is the bloat that adding a class to every media element introduces, which is a concern of T297984.