Wikidata:Property proposal/has sequenced genome

sequenced genome URL

Originally proposed at Wikidata:Property proposal/Generic

Description: sequenced and assembled genome for this taxon
Represents: genome (Q7020)
Data type: URL
Domain: taxon (Q16521)
Example 1: Homo sapiens (Q15978631) → https://www.ncbi.nlm.nih.gov/genome/51 (NCBI link)
Example 2: Drosophila melanogaster (Q130888) → https://www.ncbi.nlm.nih.gov/assembly/GCF_000001215.4 (NCBI link)
Example 3: Photinus pyralis (Q137821) → http://www.fireflybase.org/firefly_data.html (non-NCBI link)
Example 4: Renilla muelleri (Q5296168) → http://rmue.reefgenomics.org/download/ (non-NCBI link)
Source: Wikipedia:Lists of sequenced genomes, NCBI, scientific literature
Planned use: annotating the existence of genomic assemblies for given species

Motivation

(Moved from description:) genome as measured by available data (e.g. FASTA files, GFF3 files), a scientific preprint, or a published scientific article. ArthurPSmith (talk) 18:13, 10 April 2019 (UTC)

Hi there. Tens to hundreds of thousands of organismal genomes (including viruses, bacteria, and eukaryotes) have been sequenced. As it stands, there is no single source that records metadata on the existence of these sequenced genomes, including a citation of the preprint or published scientific article describing the genome and the location (URL) from which the raw genomic data (e.g. FASTA files, GFF3 files) can be downloaded. If the data are uploaded to NCBI[1], there is a decent record of such metadata, but unfortunately a significant subset of genomes, mostly eukaryotic ones (e.g. the firefly genome[2]), have not yet been uploaded to NCBI, or may never be, and are hosted on non-NCBI databases.

So, I think Wikidata could take on this challenge of recording whether the genome for a given taxon has been sequenced, what the citation is, and where the data can be downloaded. For genomes on NCBI, presumably bots could pull over the information (as NCBI Taxonomy IDs, and therefore the descending genomes, are linked from the appropriate Wikidata items), but for genomes not on NCBI (e.g. genome published but data hosted on a home-built webpage), people could manually input this metadata. At one point, Wikipedia was trying to keep track of all the sequenced genomes in categories, but the categories were deleted as too cumbersome, and Wikidata was suggested as an alternative (see the discussion in the "Sequenced genomes" section on this page: https://en.wikipedia.org/wiki/Wikipedia:Categories_for_discussion/Log/2015_October_10). The existing "Wikipedia:Lists of sequenced genomes" is several years out of date.

My proposal is that Wikidata add a "has sequenced genome" property that could be assigned to the taxon class, similar to existing descriptive properties like "has natural reservoir". This "has sequenced genome" property could then hold metadata on item pages for taxa which have a sequenced genome, e.g. Wikidata:Ignelater luminosus (see Wikipedia:Ignelater luminosus for a citation about the sequenced genome). On Wikidata, the "has sequenced genome" property textbox could hold the URL where the data is available, and the reference could point to the preprint, publication, or website where the genome is described. Eventually, it would be nice if this data could be linked into the Wikipedia:Template:Speciesbox or the Wikipedia:Template:Taxonbar, so this annotation could be automatically propagated to Wikipedia pages.

Ideally, this "has sequenced genome" property would only apply to taxa that have the "species" or "subspecies" taxon rank, as it doesn't make much sense for a higher taxonomic rank (e.g. genus, family) to have a single sequenced genome assigned to it. That being said, even for a single inbred line of a single species, which is as close to genetically/genomically identical individuals as we can produce in biology, there may be multiple different versions of the assembly produced over time (as sequencing technology has improved, people may redo the project to get a better genome assembly). For example, see the Tribolium castaneum genome version 3.0 versus genome version 5.2. I am not sure how Wikidata would handle this situation of multiple assembly versions. There are often also multiple genome assemblies within a single species, as people study the population genetics of the species and want to compare individuals.

Thoughts on this? Photocyte (talk) 13:03, 3 April 2019 (UTC)

Discussion

  • I had forgotten that I had proposed this data item be added to Wikidata back in 2015. It still seems to me to be the most logical place to hold the information. The data item should also be propagated to Wikispecies' taxon listings. I fully agree with it being just for taxa at the species level. Certainly it doesn't belong at the higher levels, including genus. All lower levels would automatically inherit the status of the species. Beeswaxcandle (talk) 20:00, 3 April 2019 (UTC)
  • @Photocyte: A few comments: (1) "has sequenced genome" sounds like a Boolean property, but you are pointing to a URL. I think the label should be just "sequenced genome" or even just "genome URL" or something like that. (2) There's no problem with having multiple values for a property; see any scholarly article here with multiple authors listed, for instance. (3) Your examples need to be fleshed out: what is the URL you would provide as the value of this property for your example cases? List at least one URL for each as a sample so people here can have a better feel for the data you are linking to. (4) You may want to involve the genewiki people in this? ArthurPSmith (talk) 20:22, 4 April 2019 (UTC)

  Notified participants of WikiProject Biology. WikiProject Taxonomy has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. ChristianKl 09:10, 8 April 2019 (UTC)

  • @ArthurPSmith: Thank you for your and others' comments! Regarding (1): indeed, "has sequenced genome" is a boolean property, and "sequenced genome URL" might be a more accurate way to describe the property I am suggesting here. As the presence of an annotated genome URL implies there is a sequenced genome for this species, is there a way to make Wikidata automatically propagate a "has sequenced genome = True" statement to all entries that have the "sequenced genome URL" filled out? I think the boolean property is only useful for computational annotation, whereas the URL is the actual valuable metadata that might be worth propagating to Wikipedia pages. Regarding (2): that seems a workable solution, although it would be nice if the "sequenced genome URL" property also had some record of the assembly version being pointed to. Regarding (3): the examples have been updated to show actual URLs, equivalent to what the property values could hold. Photocyte (talk) 12:43, 10 April 2019 (UTC)
  • Support. I've adjusted the formatting of the label, description, and examples to fit our usual template. On having a "record of the assembly version", this could presumably be done either via references on the statement, or perhaps a suitable qualifier. ArthurPSmith (talk) 18:14, 10 April 2019 (UTC)
  Done. @Photocyte, Mr. Fulano, Tinker Bell, Faendalimas, Vulphere: you get the round number, sequenced genome URL (P6800). Enjoy! --99of9 (talk) 05:36, 3 June 2019 (UTC)

Just a quick note, here is a Wikidata SPARQL query service example to search for this property:

# Taxa that have been annotated with a sequenced genome URL
SELECT DISTINCT ?item ?itemLabel ?itemDescription ?sitelinks ?url WHERE {
    ?item wdt:P31 wd:Q16521;              # any instance of taxon
          wdt:P6800 ?url;                 # that has a sequenced genome URL entered
          wikibase:sitelinks ?sitelinks.  # number of sitelinks, used for ordering

    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY DESC(?sitelinks)

Photocyte (talk) 19:19, 3 May 2021 (UTC)
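
For anyone who wants to run a query like the one above from a script, here is a minimal Python sketch. It assumes the public query service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout. The function names (build_request_url, parse_bindings, fetch_genome_urls) and the User-Agent string are made up for illustration and are not part of any Wikidata tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same idea as the query above: taxa with a sequenced genome URL (P6800).
QUERY = """
SELECT DISTINCT ?item ?itemLabel ?url WHERE {
  ?item wdt:P31 wd:Q16521;
        wdt:P6800 ?url.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""


def build_request_url(query, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params


def parse_bindings(payload):
    """Flatten a SPARQL JSON result into a list of plain dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({
            "item": binding["item"]["value"],
            # The label is optional, so fall back to an empty string.
            "label": binding.get("itemLabel", {}).get("value", ""),
            "url": binding["url"]["value"],
        })
    return rows


def fetch_genome_urls(query=QUERY):
    """Run the query over the network (hypothetical helper, not called here)."""
    request = urllib.request.Request(
        build_request_url(query),
        headers={"User-Agent": "genome-url-example/0.1 (illustrative only)"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

Calling fetch_genome_urls() should return one dict per taxon/URL pair, which could then be compared against an external list of sequenced genomes. The parsing step is kept separate from the network call so it can be checked offline against a saved response.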
