1. Autogenerate binary_fileset-temporary.pgen + .pvar + .psam. (The MAF filter has not yet been applied at this stage. See the order of operations page for more details.)
2. Read binary_fileset-temporary.pgen + .pvar + .psam. Calculate MAFs. Remove all variants with MAF < 0.05 from the current analysis.
3. Generate binary_fileset.pgen + .pvar + .psam. Any samples/variants removed from the current analysis are also not present in this fileset. (This is the --make-pgen step.)
In contrast, the fileset left behind by --keep-autoconv is just the result of step 1.
--make-bed creates a PLINK 1 binary fileset instead, while --make-bpgen creates a hybrid fileset (main genotype table is in PLINK 2 format, sample and variant files use the PLINK 1 representation) loadable with --bpfile.
Other notes:
The 'vzs' modifier causes the variant file to be Zstd-compressed.
The 'format=' modifier requests an uncompressed fixed-variant-width .pgen file, which may be easier for some programs to read. (These do not directly support multiallelic variants.) For now, the only supported format code is '2', which is just like PLINK 1 .bed, except with an extended (12-byte instead of 3-byte) header containing variant and sample counts, and rotated genotype codes (00 = hom ref, 01 = het, 10 = hom alt, 11 = missing).
The 'erase-phase' and 'erase-dosage' modifiers prevent phase and dosage information from being written to the new .pgen.
When a hardcall is missing but the corresponding dosage is present, 'fill-missing-from-dosage' causes the (Euclidean-)nearest hardcall to be filled in, with ties broken in favor of the lower-index allele.
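As a rough illustration of the 'fill-missing-from-dosage' rule for a biallelic variant, the Python sketch below picks the hardcall (ALT allele count 0, 1, or 2) nearest to an ALT dosage on the 0..2 scale, resolving ties toward the genotype with more copies of the lower-index (REF) allele. The function name and the use of None for "missing" are illustrative assumptions. This is not PLINK 2's implementation, and multiallelic variants are not covered.

```python
def fill_missing_from_dosage(hardcall, alt_dosage):
    """Return an ALT-allele-count hardcall (0, 1, or 2) for a biallelic variant.

    hardcall: existing hardcall, or None when missing (illustrative convention).
    alt_dosage: ALT dosage on a 0..2 scale, or None when absent.
    """
    if hardcall is not None or alt_dosage is None:
        return hardcall  # nothing to fill in
    # Nearest candidate wins; on a tie, the smaller ALT count (more copies of
    # the lower-index REF allele) wins because it compares lower.
    return min((0, 1, 2), key=lambda g: (abs(alt_dosage - g), g))
```

For example, a dosage of 1.5 is equidistant from het and hom-ALT, and this sketch resolves it to het.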
The first five columns of a .pvar file are always #CHROM/POS/ID/REF/ALT. Supported optional .pvar column sets are:
xheader: All ## header lines (yeah, this is technically not a column). Without this, only the #CHROM header line is kept.
vcfheader: xheader, with additions to make it a valid VCF header. (On its own, this doesn't force the entire file to conform to the VCF format; use "pvar-cols=vcfheader,qual,filter,info" for that.)
maybequal: QUAL. Omitted if all loaded values are missing.
qual: Force QUAL column to be written even when empty.
maybefilter: FILTER. Omitted if all loaded values are missing.
filter: Force FILTER column to be written even when empty.
maybeinfo: INFO. Omitted if all loaded values are missing, or if INFO/PR is the only subfield.
info: Force INFO column to be written.
maybecm: Centimorgan coordinate. Omitted if all loaded values are 0.
cm: Force CM column to be written even when empty.
The default is xheader,maybequal,maybefilter,maybeinfo,maybecm.
The following .psam column sets are supported:
maybefid: Family ID, '0' = missing. Omitted if all loaded values are missing.
fid: Force FID column to be written even when empty.
(IID is always present, and positioned here.)
maybesid: Source ID (useful when multiple samples are collected from a single organism), '0' = missing. Omitted if all loaded values are missing.
sid: Force SID column to be written even when empty.
maybeparents: Father and mother IIDs, '0' = missing. Omitted if all loaded values are missing.
parents: Force PAT and MAT columns to be written even when empty.
sex: '1' = male, '2' = female, 'NA' = missing.
pheno1: First active phenotype. If no phenotypes are loaded, all entries are set to the --output-missing-phenotype string.
phenos: All active phenotypes, if any. (Can be combined with pheno1 to force at least one phenotype column to be written.)
The default is maybefid,maybesid,maybeparents,sex,phenos.
Note that maybefid/maybesid/maybeparents are defined differently here than they are in other commands.
By default, --make-[b]pgen/--make-bed (as well as --make-just-pvar/--make-just-bim below) do not resort the variants, and they'll error out if the input file is not at least sorted by chromosome. (This is a change from PLINK 1.x.) However, if you add --sort-vars, the variants will be resorted by chromosome code, then position, then ID. The following string-comparison modes are supported:
'natural'/'n': Natural sort (default).
'ascii'/'a': ASCII.
Regular chromosomes are sorted (in numeric code order; for humans, PAR1 has an effective numeric code of 22.5, PAR2 23.5) before other contigs.
The --sort-vars setting also controls --pmerge[-list]'s variant output order.
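The ordering described above can be sketched as a Python sort key. This is only an illustration for human data: the tuple layout is an assumption, Y/XY/MT are given PLINK's usual numeric codes 24/25/26, and contig names and variant IDs are compared as plain strings (roughly the 'ascii' mode) rather than with natural sort.

```python
# Numeric codes for human chromosomes; PAR1/PAR2 use the effective codes
# quoted in the text.
HUMAN_CODES = {str(i): float(i) for i in range(1, 23)}
HUMAN_CODES.update({"PAR1": 22.5, "X": 23.0, "PAR2": 23.5,
                    "Y": 24.0, "XY": 25.0, "MT": 26.0})

def sort_key(variant):
    """variant is a (chrom, pos, variant_id) tuple."""
    chrom, pos, variant_id = variant
    if chrom in HUMAN_CODES:
        # Regular chromosomes come first, in numeric code order.
        return (0, HUMAN_CODES[chrom], "", pos, variant_id)
    # Other contigs follow, ordered here by plain string comparison.
    return (1, 0.0, chrom, pos, variant_id)

variants = [("X", 5000, "rs3"), ("GL000192.1", 10, "rs9"),
            ("PAR1", 70000, "rs2"), ("2", 300, "rsB"),
            ("2", 300, "rsA"), ("PAR2", 155000000, "rs4")]
ordered = sorted(variants, key=sort_key)
```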
--make-just-pvar is a variant of --make-pgen which only generates a .pvar file, and --make-just-psam plays the same role for .psam files. Similarly, --make-just-bim just generates a .bim file, and --make-just-fam just generates a .fam file. Unlike most other PLINK commands, these do not require genotype data (though you won't have access to many filtering flags when using these in no-genotype mode).
Use these cautiously. It is very easy to desynchronize your binary genotype data and your .pvar/.psam indexes if you use these commands improperly. If you have any doubt, stick with --make-[b]pgen/--make-bed.
--export creates a new fileset, after sample/variant filters have been applied. The following output formats are currently supported:
A: Sample-major additive (0/1/2) coding, suitable for loading from R. Dosages are now supported. Haploid genotypes are coded on a 0-2 scale. If you need uncounted alleles to be named in the header line, add the 'include-alt' modifier.
By default, REF alleles are now counted (this is a change from PLINK 1.x); this can be adjusted with --export-allele. --export-allele's input file should have variant IDs in the first column and allele IDs in the second.
AD: Sample-major additive (0/1/2) + dominant (het=1/hom=0) coding. Also supports dosages and 'include-alt'.
Av: Variant-major 0/1/2. Dosages are supported. (For backward compatibility, 'A-transpose' also refers to this format, but that name is misleading: export of this format doesn't involve a transpose operation, while export of the 'A' and 'AD' formats does require transpose and is less efficient as a consequence.)
bgen-1.1: Older Oxford-format .bgen + .sample.
bgen-1.2, bgen-1.3: Newer Oxford-format .bgen + .sample.
Single-part sample IDs are stored in the .bgen; the 'id-paste' modifier controls which .psam columns are used to construct the IDs (choices are maybefid, fid, iid, maybesid, and sid; default is maybefid,iid,maybesid) there, while the 'id-delim' modifier sets the character between the ID pieces (default '_'). By default, two-part IDs are written to the .sample file (but see the 'sample-v2' modifier discussed below). Default probability precision is 16-bit; use the 'bits=' modifier to change this.
haps, hapslegend: Oxford-format .haps + .sample[ + .legend]. All data must be biallelic and phased. Add the 'bgz' modifier to block-gzip the .haps file.
oxford, oxford-v2: Oldest Oxford-format .gen + .sample. Add the 'bgz' modifier to block-gzip the .gen file. 'oxford' requests the original .gen file format with 5 leading columns (understood by older PLINK builds), storing chromosome codes in the "SNP ID" column; 'oxford-v2' requests the current 6-leading-column flavor, and stores the same variant IDs in the "SNP ID" and "rsID" columns.
ped, compound-genotypes: PLINK 1 sample-major .ped + .map. This format is simultaneously highly inefficient, even relative to other text formats, and limited in scope (unobserved minor allele codes can't be stored); continued use is strongly discouraged.
phylip: Relaxed PHYLIP format, with IUPAC ambiguity codes used for heterozygous genotypes. Input must be SNP-only.
phylip-phased: Relaxed PHYLIP format, but with two alignments per sample; "<ID delimiter>A" and "<ID delimiter>B" are appended to the sample IDs. Input must be SNP-only, all-diploid, and fully phased.
tped: PLINK 1 variant-major .tped + .tfam. (For backward compatibility, 'transpose' also refers to this format, but that name is misleading: export of this format doesn't involve a transpose operation, while export of the 'ped' format does require transpose.)
vcf, vcf-4.2, bcf, bcf-4.2: VCF (default version 4.3). If PAR1 and PAR2 are present, they are automatically merged with chrX, with proper handling of chromosome codes and male ploidy. If the 'bgz' modifier is added, the VCF file is block-gzipped (this always happens with BCF output).
The 'id-paste' and 'id-delim' modifiers have the usual effect.
Genotypes are always exported. If you want to export a sites-only VCF instead, see --make-pgen/--make-just-pvar's 'vcfheader' column set.
Dosages are not exported unless the 'vcf-dosage=' modifier is present. The following five dosage export modes are supported:
generates new_text_fileset.bgen and new_text_fileset.sample from the data in binary_fileset.pgen + .pvar + .psam, while
plink2 --pfile binary_fileset --export vcf id-paste=iid --out new
generates new.vcf from the same data, removing family IDs in the process.
In addition,
The '12' modifier causes REF alleles to be coded as '1' and ALT1 alleles as '2', while '01' maps REF → 0 and ALT1 → 1. Note that this is essentially the reverse of PLINK 1.x; the reversal was not properly documented before 7 Mar 2023.
The 'spaces' modifier makes the output space-delimited instead of tab-delimited, whenever both are permitted.
The 'used-sites' modifier causes a .used_sites.tsv file to be included with PHYLIP-format output.
For biallelic formats where it's unspecified whether the REF/major allele should appear first or second, --export defaults to second for compatibility with PLINK 1.9. Use the 'ref-first' modifier to change this.
'sample-v2' causes .sample files to be exported according to the QCTOOLv2 rather than the original specification. Only one sample ID column is exported ('id-paste' and 'id-delim' settings apply), parental IDs are exported if present, and category names are preserved rather than converted to positive integers.
'bgen-omit-sample-id-block' causes the sample ID block to be omitted from exported bgen-1.2 and -1.3 files.
When biallelic genotype posterior probabilities are exported, PLINK 2 assigns zero probability to the furthest genotype. E.g. when dosage(C)=1.3 at an A/C SNP, PLINK 2 exports {P(AA)=0, P(AC)=0.7, P(CC)=0.3}, while when dosage(C)=1, {P(AA)=0, P(AC)=1, P(CC)=0} is exported. (Genotype-posterior-probability export is not supported when multiallelic variants with non-integer dosages are present.)
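The dosage-to-probability rule in the last note can be written out as a small Python function. This is a sketch of the stated rule only (the function name is hypothetical), assuming the dosage counts the second-listed allele on a 0..2 scale.

```python
def dosage_to_probs(alt_dosage):
    """Map a biallelic dosage in [0, 2] to (P(hom first), P(het), P(hom second))."""
    if not 0.0 <= alt_dosage <= 2.0:
        raise ValueError("dosage must be between 0 and 2")
    if alt_dosage <= 1.0:
        # The genotype with two counted alleles is furthest: probability 0.
        return (1.0 - alt_dosage, alt_dosage, 0.0)
    # The genotype with zero counted alleles is furthest: probability 0.
    return (0.0, 2.0 - alt_dosage, alt_dosage - 1.0)
```

With dosage(C)=1.3 at an A/C SNP this yields approximately (0, 0.7, 0.3), matching the example in the text.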
PLINK 1.9 and 2.0 support seven chromosome coding schemes in output files. You can select between them by providing the desired human mitochondrial code:
26: Always numeric; see the --chr documentation for details. (XY, PAR1, and PAR2 are all assigned the XY numeric code, so this isn't quite a one-to-one mapping.) This was the default in PLINK 1.x.
M: Autosomes numeric, X/Y/M single-character, XY/PAR1/PAR2 as usual.
MT: Autosomes numeric, X/Y single-character, MT two-character, XY/PAR1/PAR2 as usual. This is the default in PLINK 2.
0M: Autosomes numeric, 0X/0Y/0M two-character, XY/PAR1/PAR2 as usual.
chr26: PAR1/PAR2 as usual, other chromosomes are 'chr' followed by a numeric code.
chrM: Autosomes are 'chr' followed by a numeric code, X/Y/XY/M are preceded by 'chr', PAR1/PAR2 as usual.
chrMT: Autosomes are 'chr' followed by a numeric code, X/Y/XY/MT are preceded by 'chr', PAR1/PAR2 as usual.
PLINK correctly interprets all of these encodings in input files.
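To make the seven schemes easier to compare, here is a Python sketch that renders a human chromosome name under each of them. The function and its input vocabulary ('1'..'22', 'X', 'Y', 'XY', 'MT', 'PAR1', 'PAR2') are illustrative assumptions, each scheme is taken to use the mitochondrial code it is named after, and the numeric codes 23-26 for X/Y/XY/MT are PLINK's usual ones.

```python
# PLINK's numeric codes for the human non-autosomal chromosomes.
NUMERIC = {"X": 23, "Y": 24, "XY": 25, "MT": 26}
SCHEMES = {"26", "M", "MT", "0M", "chr26", "chrM", "chrMT"}

def render_chrom(chrom, scheme="MT"):
    """Render a human chromosome name under one of the seven output schemes."""
    if scheme not in SCHEMES:
        raise ValueError("unknown scheme: " + scheme)
    prefix = "chr" if scheme.startswith("chr") else ""
    mito = scheme[len(prefix):]  # '26', 'M', 'MT', or '0M'
    if chrom in ("PAR1", "PAR2"):
        # Only the all-numeric scheme rewrites PAR1/PAR2 (to the XY code).
        return str(NUMERIC["XY"]) if scheme == "26" else chrom
    if chrom.isdigit():
        return prefix + chrom  # autosome
    if mito == "26":
        return prefix + str(NUMERIC[chrom])
    if chrom == "MT":
        return prefix + mito
    if chrom == "XY":
        return prefix + "XY"
    # X or Y: two-character '0X'/'0Y' only under the 0M scheme.
    return prefix + ("0" + chrom if mito == "0M" else chrom)
```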
--output-missing-genotype allows you to change the character (default '.', or '0' for "--export ped" and "--export tped", where '.' would break PLINK 1.9) used to represent missing genotypes in PLINK-format files generated by --make-[b]pgen/--make-bed/--export, while --output-missing-phenotype changes the string (default 'NA' for .psam, '-9' for older formats) representing missing phenotypes. Note that these defaults are mostly different from PLINK 1.x.
These flags do not affect --pmerge[-list] or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-[b]pgen/--make-bed if you want to change missing genotype/phenotype coding when performing those operations.
--set-invalid-haploid-missing (--set-hh-missing before a6) causes heterozygous haploid hardcalls and all female chrY calls to be erased during --make-[b]pgen/--make-bed.
Note that the most common source of heterozygous haploid errors is imported data which doesn't follow PLINK's convention for representing the X chromosome pseudo-autosomal region. This should be addressed with --split-par below, not --set-invalid-haploid-missing.
This can no longer be combined with --export.
Unknown-sex chrY genotypes are not erased; this is a change from PLINK 1.x.
By default, dosages associated with the erased hardcalls are also erased. To keep all dosages instead, add the 'keep-dosage' modifier.
If phased haploid dosages are present, the phase information is cleared.
Mitochondrial DNA is subject to heteroplasmy, so PLINK 2 normally saves MT dosages near 0.5 as 'heterozygous' genotypes, and these are not erased by --set-invalid-haploid-missing. However, some analytical methods don't use these mixed MT genotype calls, and instead assume that they don't exist. The --set-mixed-mt-missing flag can be used with --make-[b]pgen/--make-bed to generate a dataset with mixed MT hardcalls erased.
--split-par <last bp position of head> <first bp position of tail>
--split-par <build code>
--merge-par
PLINK 2 prefers to represent the X chromosome's pseudo-autosomal region as 'PAR1' and 'PAR2' regions; this removes the need for special handling of male X heterozygous calls. This has a major computational advantage over PLINK 1.x's 'XY' convention: splitting and remerging no longer require resorting of the variants.
Thus, PLINK 1.9's --split-x flag has been retired in favor of --split-par, which takes the base-pair boundaries of the pseudo-autosomal regions, and treats all chrX variants in those regions as if their chromosome codes were PAR1/PAR2 instead. As (typo-resistant) shorthand, you can pass one of the following build codes to --split-par:
'b36'/'hg18': NCBI build 36/UCSC human genome 18, boundaries 2709521 and 154584237
'b37'/'hg19': GRCh37/UCSC human genome 19, boundaries 2699520 and 154931044
'b38'/'hg38': GRCh38/UCSC human genome 38, boundaries 2781479 and 155701383
'chm13': T2T-CHM13 sequence, boundaries 2394410 and 153925835
--split-par errors out if the dataset already contains a PAR1 or PAR2 region.
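The boundary logic can be sketched in Python as follows. The lookup table copies the build codes listed above; the function name and the per-variant interface are illustrative assumptions, and both boundaries are treated as inclusive since they are described as the last position of the head and the first position of the tail.

```python
# (last bp of the PAR1 head, first bp of the PAR2 tail) for each build code.
PAR_BOUNDARIES = {
    "b36": (2709521, 154584237), "hg18": (2709521, 154584237),
    "b37": (2699520, 154931044), "hg19": (2699520, 154931044),
    "b38": (2781479, 155701383), "hg38": (2781479, 155701383),
    "chm13": (2394410, 153925835),
}

def split_par(chrom, pos, build="b38"):
    """Return the chromosome code a variant would have after --split-par."""
    if chrom in ("PAR1", "PAR2"):
        raise ValueError("dataset already contains a PAR1 or PAR2 region")
    head_end, tail_start = PAR_BOUNDARIES[build]
    if chrom != "X":
        return chrom  # only chrX variants are affected
    if pos <= head_end:
        return "PAR1"
    if pos >= tail_start:
        return "PAR2"
    return "X"
```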
Conversely, --merge-par treats all variants in PAR1/PAR2 as if their chromosome code was X.
Note that "--export vcf" has special-case logic for chrX/PAR1/PAR2: chromosome codes are all saved as chrX, but male ploidies are rendered using the PAR1/PAR2 boundaries. It should not be combined with --merge-par.
--merge-x
To import PLINK 1.x-style data with 'XY' codes,
Use --merge-x + --sort-vars + --make-bed, to convert the 'XY' chromosome codes back to 'X' and put the variants back in standard order.
You can then use --split-par to add the new PAR1/PAR2 codes when appropriate.
As of the 18 Oct 2023 alpha 6 build, PLINK 2 includes unknown-sex samples for most purposes on chrY: usually, most or all of the genotypes for actually-female samples are missing, and that's not true for the actually-male samples, so results are similar or even identical to what you'd get with complete sex information. However, missingness-rate is an exception; otherwise "--geno 0.1" and similar filters would break. (Het-haploid-rate is also an exception since --geno/--mind may take it into account.)
When you do want unknown-sex samples to be included in chrY missingness-rate and het-haploid-rate computations, use the --y-nosex-missing-stats flag. This affects --geno/--mind, --genotyping-rate, and --missing. (It does not affect --freq or --geno-counts.)
Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.
--set-missing-var-ids (which just replaces missing IDs) and --set-all-var-ids (which overwrites everything) provide one way to do this. The parameter taken by these flags is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one @ and one # must be present.) For example, given a .pvar file starting with
#CHROM POS ID REF ALT
chr1 10583 . G A
chr1 886817 . T C
chr1 886817 . C CATTTT
"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]'... and the third variant also gets the name 'chr1:886817[b37]'.
To maintain unique IDs in this situation, you can include '$r'/'$a' in your template string to refer to the REF/first ALT allele. So, if we're using a bash shell, we can try again with
--set-missing-var-ids @:#[b37]\$r,\$a
which would name the first variant 'chr1:10583[b37]G,A', the second variant 'chr1:886817[b37]T,C', and the third variant 'chr1:886817[b37]C,CATTTT'. Note the extra backslashes: they are necessary in bash because '$' is a reserved character there.
(PLINK 1.9's '$1'/'$2' syntax for referring to those two alleles in ASCII-sort order is still supported as well, and it has a place when no reference genome exists. However, we recommend avoiding it most of the time, since it does not distinguish between deletions and insertions in some cases, whereas '$r'/'$a' doesn't have that problem.)
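The template expansion described above is easy to mimic in Python, which can be handy for predicting the IDs a given template will produce. The helper below is a sketch with a hypothetical name and signature; it only handles '@', '#', '$r', and '$a', and ignores the allele-length limits discussed later.

```python
import re

def apply_id_template(template, chrom, pos, variant_id, ref, alt1,
                      missing_code="."):
    """Name a variant from the template if its current ID is missing."""
    if variant_id != missing_code:
        return variant_id  # --set-missing-var-ids leaves existing IDs alone
    fields = {"@": chrom, "#": str(pos), "$r": ref, "$a": alt1}
    # One regex pass, so substituted text is never re-scanned for placeholders.
    return re.sub(r"@|#|\$r|\$a", lambda m: fields[m.group(0)], template)

rows = [("chr1", 10583, ".", "G", "A"),
        ("chr1", 886817, ".", "T", "C"),
        ("chr1", 886817, ".", "C", "CATTTT")]
plain = [apply_id_template("@:#[b37]", *row) for row in rows]
unique = [apply_id_template("@:#[b37]$r,$a", *row) for row in rows]
```

Here `plain` contains a duplicated ID for the last two variants, while `unique` does not.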
In combination with either flag above, --var-id-multi can be used to specify a special template to use for just multiallelic variants (since it may not make sense to mention the first ALT allele in this case), and --var-id-multi-nonsnp does the same for variants that are both multiallelic and not SNPs (i.e. at least one allele code has length > 1).
Allele names associated with indels are occasionally very, very long, and the synthetic variant ID names which would be generated from such long alleles are very inconvenient to work with. As a result, if any allele codes are longer than 23 characters, PLINK 2 requires you to use --new-id-max-allele-len to explicitly specify how they should be handled. Its first parameter is a length threshold, and its optional second parameter specifies how allele codes longer than the length threshold should be handled (default is now 'error'; 'missing' causes such variants to be assigned the unnamed-variant ID, while 'truncate' does what it sounds like and is a bit dangerous).
As of alpha 6, these flags are applied to all --pmerge[-list] inputs.
--recover-var-ids provides a simple way to invert a --set-all-var-ids (or other variant-ID-changing) operation: given a .pvar/VCF/.bim file with the original IDs, it replaces the current IDs with the originals whenever there is an unambiguous CHROM+POS+alleles match. Allele order is also required to match, unless a .bim file is provided; and in the latter subcase, you can specify 'strict-bim-order' to require A1=ALT, A2=REF.
If any variant has multiple matching records in the original-ID file, and the IDs conflict, --recover-var-ids writes the affected (current) ID(s) to plink2.recoverid.dup, and normally errors out. If the original-ID file has the same number of variants in the same order, you can still recover the old IDs with the 'rigid' modifier in this case (or with a simple bash script, but this is still slightly more convenient). Alternatively, to proceed and assign the missing-ID code to these ambiguous variants, add the 'force' modifier. (The .recoverid.dup file is still written when 'rigid' or 'force' is specified; we strongly suggest using e.g. --rm-dup to resolve the ambiguities when you have time.)
--recover-var-ids normally expects to replace all variant IDs, and errors out if any are left untouched. Add the 'partial' modifier when you actually want to update just a proper subset.
'.' is the default missing-variant-ID code. You can use --missing-var-code to change this; e.g. "--missing-var-code NA" would be appropriate for a .pvar file starting with
#CHROM POS ID REF ALT
chr1 10583 NA G A
chr1 886817 NA T C
chr1 886817 NA C CATTTT
--update-chr <filename> [chr col. number] [variant ID col.] [skip]
--update-cm <filename> [cm col. number] [variant ID col.] [skip]
--update-name <filename> [new ID col. number] [old ID col.] [skip]
--update-map <filename> [bp col. number] [variant ID col.] [skip]
--update-alleles ['allow-mismatch'] ['strict-missing'] <filename>
--allele1234 ['multichar']
--alleleACGT ['multichar']
--update-chr, --update-cm, --update-map, and --update-name update variant chromosomes, centimorgan positions, base-pair positions, and IDs, respectively. By default, the new value is read from column 2 and the (old) variant ID from column 1, but you can adjust these positions with the second and third parameters. The optional fourth 'skip' parameter is either a nonnegative integer, in which case it indicates the number of lines to skip at the top of the file, or a single nonnumeric character, which causes each line with that leading character to be skipped. (Note that, if you want to specify '#' as the skip character, you need to surround it with single- or double-quotes in some Unix shells.)
For example, if the --update-name file is
SNP_A-1919191 rs123456
SNP_A-64646464 rs222222
and no column numbers are specified, SNP_A-1919191 will be renamed to rs123456, and SNP_A-64646464 will be renamed to rs222222. (Note that "--update-name <filename> 1 2" would invert the operation if all variant IDs are unique.)
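The column and 'skip' parameters can be illustrated with a short Python reader. This is a sketch of the described semantics, not PLINK 2's parser: the function name is hypothetical, fields are assumed to be whitespace-delimited, and blank lines are silently ignored.

```python
def read_update_pairs(lines, value_col=2, id_col=1, skip=None):
    """Return {variant ID: new value} from whitespace-delimited lines.

    value_col / id_col are 1-based, mirroring the second and third parameters.
    skip is a nonnegative integer (leading lines to drop) or a single
    nonnumeric character (drop every line starting with it).
    """
    skip_count, skip_char = 0, None
    if skip is not None:
        text = str(skip)
        if text.isdigit():
            skip_count = int(text)
        elif len(text) == 1:
            skip_char = text
        else:
            raise ValueError("skip must be a count or a single character")
    pairs = {}
    for line_idx, line in enumerate(lines):
        if line_idx < skip_count:
            continue
        if skip_char is not None and line.startswith(skip_char):
            continue
        fields = line.split()
        if fields:
            pairs[fields[id_col - 1]] = fields[value_col - 1]
    return pairs

example = ["SNP_A-1919191 rs123456", "SNP_A-64646464 rs222222"]
forward = read_update_pairs(example)         # old ID -> new ID
inverse = read_update_pairs(example, 1, 2)   # same file, roles swapped
```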
Strictly speaking, you can use Unix tail, cut, paste, and/or sed to perform the same job (albeit with more time and hassle) as the three optional parameters we have introduced. If you have not used these Unix commands before, we recommend that you familiarize yourself with what they do because they are still likely to come in handy in other scenarios.
You can combine --update-chr, --update-cm, and/or --update-map in the same run. (However, to avoid confusion regarding whether old or new variant IDs apply, we force --update-name to be run separately.)
When invoking --update-chr, you must use --make-bed/--make-[b]pgen and --sort-vars in the same run, and no other output commands. Otherwise, we still recommend that you use --make-bed/--make-[b]pgen once instead of --update-... over and over, but it's not absolutely required.
Also note that if you're trying to change how chromosome codes are formatted (e.g. "23" vs. "X" vs. "chrX" for human data), you need to use --output-chr, not --update-chr.
--update-alleles updates variant allele codes. Its input should have the following three fields:
Variant ID
Comma-separated old allele codes
Comma-separated new allele codes
For example, if the --update-alleles file is
rs10001 A,B G,T
rs10002 A,B A,C
allele A for rs10001 will be changed to G, allele B for rs10001 will be changed to T, allele A for rs10002 will be unchanged, and allele B for rs10002 will be changed to C.
Other notes:
The PLINK 1.x 5-column input file format is also accepted.
If you just want to permute REF/ALT allele assignments in the .pvar/.bim files without changing the meaning of the genotype data, you must use a flag like --ref-allele instead.
Usually, if any old allele code doesn't line up with an allele code in the main dataset, the variant is skipped (and logged to plink2.allele.no.snp). If you want --update-alleles to perform a partial update when an old allele code matches and another one doesn't, add the 'allow-mismatch' modifier.
By default, if an allele code in the main dataset is missing, it is treated as a wildcard. The 'strict-missing' modifier causes it to only match missing allele codes.
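A minimal Python sketch of the default and 'allow-mismatch' behaviors follows. The function is hypothetical and deliberately simplified: it does not model missing allele codes acting as wildcards, the 'strict-missing' modifier, or the PLINK 1.x 5-column format.

```python
def update_alleles(current, old, new, allow_mismatch=False):
    """Return the updated allele list, or None if the variant is skipped.

    current: allele codes in the dataset, e.g. ['A', 'B'].
    old, new: comma-separated strings from the --update-alleles file.
    """
    old_codes, new_codes = old.split(","), new.split(",")
    if len(old_codes) != len(new_codes):
        raise ValueError("old and new allele lists differ in length")
    mapping = dict(zip(old_codes, new_codes))
    unmatched = [a for a in old_codes if a not in current]
    if unmatched and not allow_mismatch:
        return None  # default: the whole variant is skipped
    # Replace the alleles that line up; leave the rest untouched.
    return [mapping.get(a, a) for a in current]
```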
--allele1234 interprets and/or recodes A/C/G/T alleles in the input as 1/2/3/4, while --alleleACGT does the reverse. With the 'multichar' modifier, these will translate multi-character alleles as well, e.g. '--allele1234 multichar' converts 'TT' to '44'.
--update-ids, --update-parents, and --update-sex update sample IDs, parental codes, and sexes, respectively. --update-parents also updates founder/nonfounder status in the current run when appropriate.
--update-ids expects a file with old sample IDs and new sample IDs.
If there is a header line starting with '#OLD_FID' or '#OLD_IID', #OLD_FID must be followed by 'OLD_IID', the remaining columns must be a subset of {OLD_SID, NEW_FID, NEW_IID, NEW_SID} (ok for all to be present), they must appear in the aforementioned order, and NEW_IID cannot be omitted.
If there's no recognized header line, the file body must contain exactly two or four columns. If it has two, it's interpreted as <old IID>, <new IID> (and old FIDs are 0). If it has four, it's interpreted as <old FID>, <old IID>, <new FID>, <new IID>.
When no new-FID column is present, it is treated as always-0.
When no new-SID column is present, the old SID value is always retained.
For example, if the --update-ids file is
1001 I0001
1002.dup I0002
the sample with FID=0, IID=1001 will have its IID changed to I0001, and the sample with FID=0, IID=1002.dup will have its IID changed to I0002.
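The headerless two- and four-column interpretations can be sketched in Python like this. The function name and the returned dictionary shape are illustrative assumptions; header lines and SID columns are not handled.

```python
def parse_update_ids(lines):
    """Map (old FID, old IID) -> (new FID, new IID) for a headerless file."""
    mapping = {}
    width = None
    for line in lines:
        fields = line.split()
        if width is None:
            width = len(fields)
        if len(fields) != width or width not in (2, 4):
            raise ValueError("file body must have exactly two or four columns")
        if width == 2:
            # <old IID>, <new IID>; old and new FIDs are treated as 0.
            old_key, new_key = ("0", fields[0]), ("0", fields[1])
        else:
            # <old FID>, <old IID>, <new FID>, <new IID>
            old_key, new_key = (fields[0], fields[1]), (fields[2], fields[3])
        mapping[old_key] = new_key
    return mapping

id_map = parse_update_ids(["1001 I0001", "1002.dup I0002"])
```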
To avoid confusion regarding whether old or new IDs should be used in the latter files, we do not allow --update-ids to be used in the same run as --update-parents or --update-sex.
--update-parents expects a file with sample IDs in front, followed by parental ID columns.
If there is a recognized header line (starting with '#FID' or '#IID'), it defaults to loading paternal IDs from the first column titled 'PAT', and maternal IDs from the following column (which must be titled 'MAT').
Otherwise, if the file contains exactly 3 columns, the last two are interpreted as PAT/MAT and the first as IID; and if it contains 4 or more, the first two are interpreted as FID/IID and columns 3-4 are interpreted as PAT/MAT. (As a consequence, PLINK .fam and .ped files are valid input for --update-parents.)
PLINK does not check whether the new parents actually exist in the current dataset.
--update-sex expects a file with sample IDs in front, and a sex information column.
If there is a recognized header line (starting with '#FID' or '#IID'), it defaults to loading sex information from the first column titled 'SEX' (any capitalization); otherwise it assumes the 3rd column. To force a specific column number, use the 'col-num=' modifier.
Only the first character in the sex column is processed. By default, '1'/'M'/'m' is interpreted as male, '2'/'F'/'f' is interpreted as female, and '0'/'N'/'U'/'u' is interpreted as unknown-sex. To change this to '0'/'M'/'m' = male, '1'/'F'/'f' = female, anything else other than '2' = unknown-sex, add the 'male0' modifier.
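The two sex-code conventions can be summarized in a small Python function. This is a sketch with a hypothetical name; raising an error for characters the text does not assign a meaning to (anything unlisted in default mode, and '2' under 'male0') is an assumption, not documented behavior.

```python
def parse_sex(token, male0=False):
    """Return 'male', 'female', or 'unknown' from the first character of token."""
    c = token[0]  # only the first character is processed
    if male0:
        if c in "0Mm":
            return "male"
        if c in "1Ff":
            return "female"
        if c == "2":
            # Assumption: the text excludes '2' from unknown-sex under male0.
            raise ValueError("'2' is not accepted with male0")
        return "unknown"
    if c in "1Mm":
        return "male"
    if c in "2Ff":
        return "female"
    if c in "0NUu":
        return "unknown"
    # Assumption: other characters are rejected in default mode.
    raise ValueError("unrecognized sex code: " + token)
```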
--ref-allele ['force'] <filename> [REF col. number] [variant ID col.] [skip]
--alt-allele ['force'] <filename> [ALT col. number] [variant ID col.] [skip]
--alt1-allele ['force'] <filename> [ALT1 col. number] [variant ID col.] [skip]
--ref-from-fa ['force']
--ref-allele, --alt-allele, and --alt1-allele reorder REF/ALT alleles. For each input row, --ref-allele sets the given allele to REF, --alt1-allele sets the given allele to ALT1, and --alt-allele can set multiple comma-separated ALT alleles. Column and skip parameters work the same way as with --update-chr and friends.
In combination with a FASTA file, --ref-from-fa sets REF alleles when it can be done unambiguously. (Note that this is never possible for deletions and some insertions.)
These can only be used in runs with --make-bed/--make-[b]pgen/--export and no other commands.
--ref-allele can be used with --alt[1]-allele in the same run.
"--ref-allele <VCF filename> 4 3 '#'", which scrapes reference allele assignments from a VCF file, is especially useful.
By default, these error out when asked to change a 'known' reference allele. Add the 'force' modifier to permit that (when e.g. switching to a new reference genome).
When --alt[1]-allele moves the previous REF allele to an ALT position, and the new REF allele isn't explicitly set by --ref-allele, the (first) free ALT allele is set to REF and marked as provisional. All other REF allele assignments made by these flags are marked as 'known'.
These flags did not update INFO Number=A and Number=R entries before July 2024. (Note that reannotation is still recommended for Number=A: for any allele moved from REF to ALT, PLINK 2 sets the corresponding INFO values to missing.)
--maj-ref ['force']
--maj-ref sets major alleles to REF, like PLINK 1.x automatically did. (This is now opt-in instead of opt-out; --keep-allele-order is no longer necessary to prevent allele-swapping.) For multiallelic variants, this also sorts ALT alleles by decreasing frequency, with ties going to the originally-earlier allele.
This is always based on current-dataset allele frequencies. To reduce potential for confusion, July 2024 and later builds do not permit --maj-ref and --read-freq in the same run.
This can only be used in runs with --make-bed/--make-[b]pgen/--export and no other commands.
By default, this only affects variants marked as having 'provisional' reference alleles. (This behavior was implemented incorrectly before 25 Jun 2024.) Add 'force' to apply this to all variants.
All REF allele assignments made by --maj-ref are marked as provisional.
This flag did not update INFO Number=A and Number=R entries before July 2024.
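The allele reordering that --maj-ref performs on a single variant can be pictured with the following Python sketch. The function name and the count-based input are assumptions, and the 'provisional' versus 'known' REF bookkeeping is left out.

```python
def maj_ref_order(alleles, counts):
    """Reorder alleles so the most frequent comes first (becomes REF).

    sorted() is stable, so alleles with equal counts keep their original
    relative order, i.e. ties go to the originally-earlier allele.
    """
    order = sorted(range(len(alleles)), key=lambda i: -counts[i])
    return [alleles[i] for i in order]
```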
--real-ref-alleles
When a PLINK 1 fileset is loaded, PLINK 2 normally treats its A2 alleles as provisional-REF. Use --real-ref-alleles to specify that they're from a real reference genome.
In combination with a FASTA file, --normalize tries to left-normalize all variants, using the algorithm described in Tan A, Abecasis GR, Kang HM (2015) Unified representation of genetic variants. It currently assumes no differences in capitalization between the FASTA and the allele codes, and skips variants with one or more symbolic alleles (starting with '<').
The 'list' modifier causes the IDs of all modified variants to be written to plink2.normalized.
By default, variants with a '*' overlapping-deletion allele are left alone. (This was not true before 25 Apr 2022.) The 'adjust-overlapping-deletions' modifier allows such variants to be normalized based on the other alleles; this is usually valid, but it can occasionally overshoot the left end of the deletion, and in some contexts it can lose a bit of information.
Note that left-normalization has a "blind spot" when it comes to non-tandem-repeat deletions of differing lengths ending at the same position: they won't end up in the same multiallelic variant after split + left-normalize + join. Consider handling this case separately.
In the order of operations, --normalize happens before the --make-[b]pgen/--make-bed step where variant-splitting occurs when specified. Unfortunately, it is the other order of operations that is usually desired here, and when it is, it's necessary to split the job across two PLINK 2 runs. This detail is not obvious to most "bcftools norm" users, so a warning (which will be upgraded to an error in a future build) is now printed when such a job is not split. You can disable this warning/error with --allow-normalize-with-split.
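For readers unfamiliar with left-normalization, here is a compact Python rendering of the algorithm from the Tan et al. paper cited above. It is a sketch rather than PLINK 2's code: the function name is hypothetical, positions are 1-based, the reference is passed as an in-memory string, alleles are assumed to be distinct plain nucleotide strings in matching case, and symbolic or '*' alleles are not handled.

```python
def left_normalize(ref_seq, pos, alleles):
    """Left-normalize one variant; alleles[0] is REF starting at 1-based pos."""
    alleles = list(alleles)
    while True:
        changed = False
        # Drop a shared rightmost nucleotide.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend every allele one base leftward.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
        if not changed:
            break
    # Drop shared leftmost nucleotides while every allele keeps >= 1 base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles

# A CA-repeat deletion written at position 6 shifts left to position 3.
normalized = left_normalize("GGGCACACACAGGG", 6, ["CAC", "C"])
```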
--indiv-sort allows you to specify how samples should be sorted when generating new datasets. The four modes are:
'none'/'0': Stick to the order the samples were loaded in. This is the PLINK default for all operations except merges.
'natural'/'n': "Natural sort" of family and within-family IDs, similar to the logic used in macOS and Windows file selection dialogs; e.g. 'id2' < 'ID3' < 'id10'. This is the PLINK 2 default when merging datasets.
'ascii'/'a': Sort in ASCII order, e.g. 'ID3' < 'id10' < 'id2'. This may be more appropriate than natural sort if you need an ordering that's trivial to regenerate in other software, or if your IDs mix letters and digits in a random and meaningless fashion.
'file'/'f': Use the order in another file (named in the second parameter), which must be in PLINK 2 sample-ID-list format.
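The difference between the 'natural' and 'ascii' modes can be reproduced in Python. The key function below is a common approximation of natural sort (case-insensitive text runs, numerically compared digit runs); it is an assumption that this matches PLINK's ordering beyond the examples given above.

```python
import re

def natural_key(sample_id):
    # re.split with a capture group yields alternating text / digit runs,
    # always starting with a (possibly empty) text run.
    parts = re.split(r"([0-9]+)", sample_id)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

ids = ["id10", "ID3", "id2"]
natural_order = sorted(ids, key=natural_key)
ascii_order = sorted(ids)  # plain code-point comparison
```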
If covariates are defined, an updated version (with all filters applied) is automatically written to plink2.cov whenever --make-pgen, --make-just-psam, --export, or a similar command is present. However, if you do not wish to simultaneously generate a new sample file, you can use --write-covar to just produce a pruned covariate file.
--write-covar supports the following column sets:
maybefid: FID, if the column was present in the input.
fid: Force FID column to be written when absent from input.
(IID is always present, and positioned here.)
maybesid: SID, if the column was present in the input.
sid: Force SID column to be written when absent from input.
maybeparents: Father and mother IIDs, '0' = missing, if columns in input.
parents: Force PAT and MAT columns to be written even when absent in input.
sex: '1' = male, '2' = female, 'NA' = missing.
pheno1: First active phenotype. If no phenotypes are loaded, all entries are set to the --output-missing-phenotype string.
phenos: All active phenotypes, if any. (Can be combined with pheno1 to force at least one phenotype column to be written.)
(Covariates are always present, and positioned here.)
--variance-standardize linearly transforms named quantitative phenotypes and covariates to mean-zero, variance 1. If no parameters are provided, all quantitative phenotypes and covariates are affected. --covar-variance-standardize does the same for just quantitative covariates.
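The transformation itself can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is hypothetical, missing values are not handled, and the use of the sample (n-1) variance is an assumption, since the text above does not say which variance estimator is used.

```python
from statistics import mean, stdev

def variance_standardize(values):
    """Linearly rescale values to mean 0 and variance 1.

    Assumption: the sample (n-1) standard deviation is used.
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

z = variance_standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```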
--quantile-normalize forces named quantitative phenotypes and covariates to a N(0, 1) distribution, preserving only the original rank orders; if no parameters are provided, all quantitative phenotypes and covariates are affected. --pheno-quantile-normalize does the same for just quantitative phenotypes, while --covar-quantile-normalize does this for just quantitative covariates.
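A minimal Python sketch of rank-based normalization follows. It is not PLINK 2's implementation: the function name is hypothetical, and both the rank-to-quantile offset, (rank - 0.5) / n, and the absence of tie handling are assumptions made for the example.

```python
from statistics import NormalDist

def quantile_normalize(values):
    """Map values onto N(0, 1) quantiles, keeping only their rank order.

    Assumptions: 1-based rank r maps to quantile (r - 0.5) / n, and the
    input has no ties or missing values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    standard_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        out[i] = standard_normal.inv_cdf((rank - 0.5) / n)
    return out

q = quantile_normalize([10.0, 250.0, 3.0])
```

Note that the outlier 250.0 ends up no further from the median than 3.0 does; only the ordering of the inputs survives.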
--split-cat-pheno splits n-category phenotype(s) into n (or n-1 if 'omit-most' or 'omit-last' is used to exclude one category) binary phenotypes, with names of the form '<original phenotype name>=<category name>'. (As a consequence, affected phenotypes and categories are not permitted to contain the '=' character.)
This happens after all sample filters.
If no phenotype or covariate names are provided, all categorical phenotypes (but not covariates) are processed by --split-cat-pheno.
'omit-most' causes the largest category to be omitted (breaking ties in favor of removing the first-seen category), while 'omit-last' always removes the last-seen category. (It is often necessary to omit one category to avoid creating linear dependence between the covariates, which breaks --glm.)
By default, generated covariates are coded as 1=false, 2=true. To code them as 0=false, 1=true instead, add the 'covar-01' modifier.
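The splitting rules above can be sketched as follows. The split_cat_pheno function here is a hypothetical Python helper, not PLINK code. It assumes "last-seen" means the category whose first appearance comes last, ignores missing values, and applies the covariate coding (1/2 by default, 0/1 with covar_01=True) to every generated column.

```python
from collections import Counter

def split_cat_pheno(name, categories, omit=None, covar_01=False):
    """Split one categorical column into binary columns '<name>=<category>'.

    omit is None, 'omit-most', or 'omit-last'.
    """
    seen = list(dict.fromkeys(categories))  # categories in first-seen order
    if omit == "omit-most":
        counts = Counter(categories)
        # max() returns the first maximal element, so ties drop the
        # first-seen category.
        seen.remove(max(seen, key=lambda c: counts[c]))
    elif omit == "omit-last":
        seen.pop()  # assumption: drop the category that first appeared last
    false_code, true_code = (0, 1) if covar_01 else (1, 2)
    return {
        f"{name}={cat}": [true_code if c == cat else false_code for c in categories]
        for cat in seen
    }

cols = split_cat_pheno("SITE", ["A", "B", "A", "C"], omit="omit-most", covar_01=True)
print(cols)  # {'SITE=B': [0, 1, 0, 0], 'SITE=C': [0, 0, 0, 1]}
```

With one category omitted, the remaining columns no longer sum to a constant, which is what avoids the linear dependence mentioned above.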
--pheno-svd <# output phenotypes> ['force'] ['scols='<col set descriptor>]
['pcols='<col set descriptor>]
--pheno-svd variance=<#> ['force'] ['scols='<col set descriptor>]
['pcols='<col set descriptor>]
--pheno-svd performs singular value decomposition of full rows (i.e. samples with no missing phenotype values) of the phenotype matrix, generates new phenotypes (named 'SVDPHENO1', 'SVDPHENO2', etc.) equal to the top left-singular vectors, and saves them to plink2.svd.pheno. Singular values + input-phenotype weights are written to plink2.svd.pheno_wts.
By default, if fewer than half of the remaining samples have full phenotype rows, this command errors out. In this case, you may want to use a method like softImpute to fill in some missing phenotype values. Sample R command sequence:
Alternatively, you can use 'force' to override the error; this usually isn't the best idea.
Other notes:
The first argument determines the number of new phenotypes to generate. You can either directly specify a number, or use 'variance=' to specify how much variance in the original phenotype matrix they must explain.
A mix of binary and quantitative phenotypes is permitted. Binary phenotypes are encoded as control=0, case=1 before the SVD is performed.
If there are any later commands in the same run (e.g. --glm), they will only see the newly-generated phenotypes.
'scols=' can be used to customize how sample IDs appear in the .svd.pheno file, while 'pcols=' can be used to customize the .svd.pheno_wts file.
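To make the 'variance=' form more concrete, here is a hedged Python sketch of one common way to turn a variance target into a component count. It assumes the target is a fraction between 0 and 1 and that variance explained is proportional to the squared singular values. Neither assumption is confirmed by the text above, and the function name is hypothetical.

```python
def components_for_variance(singular_values, target_fraction):
    """Smallest k whose top-k singular values reach the variance target.

    Assumption: variance explained is proportional to squared singular values.
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    running = 0.0
    for k, sq in enumerate(squared, start=1):
        running += sq
        if running >= target_fraction * total:
            return k
    return len(squared)

k = components_for_variance([3.0, 2.0, 1.0], 0.9)
print(k)  # 2
```

In this example the top two singular values account for 13/14 of the total squared mass, so two output phenotypes would suffice for a 0.9 target under these assumptions.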
--pmerge merges one binary fileset with the main fileset. The 'vzs' modifier works as with --pfile.
--pmerge-list merges all of the filesets specified in the given file. If there is a main fileset, it's also included in the merge (as if it were the first entry in the --pmerge-list file). The lines of the --pmerge-list file are interpreted as follows:
If a line contains three entries, they are normally treated as full filenames for a binary fileset. The .pgen/.bed must appear first, then the .pvar/.bim, then the .psam/.fam.
If a line contains exactly one entry, its interpretation depends on the mode:
bfile: Prefix for .bed + .bim + .fam fileset.
bpfile: Prefix for .pgen + .bim + .fam fileset.
pfile (default): Prefix for .pgen + .pvar + .psam fileset.
pfile-vzs: Prefix for .pgen + .pvar.zst + .psam fileset.
You can specify a common directory prefix for these files with --pmerge-list-dir.
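The line-interpretation rules can be summarized with a small Python sketch. The resolve_pmerge_list_line function is hypothetical and only mirrors the rules as described here. It assumes the directory prefix is prepended verbatim to every filename, and it rejects lines with any other entry count, which the text above does not cover.

```python
SUFFIXES = {
    "bfile": (".bed", ".bim", ".fam"),
    "bpfile": (".pgen", ".bim", ".fam"),
    "pfile": (".pgen", ".pvar", ".psam"),
    "pfile-vzs": (".pgen", ".pvar.zst", ".psam"),
}

def resolve_pmerge_list_line(line, mode="pfile", dir_prefix=""):
    """Return (genotype, variant, sample) filenames for one list line.

    Assumptions: dir_prefix is prepended as-is, and lines with neither one
    nor three entries are treated as errors.
    """
    entries = line.split()
    if len(entries) == 3:
        names = entries  # full filenames: genotype file, variant file, sample file
    elif len(entries) == 1:
        names = [entries[0] + suffix for suffix in SUFFIXES[mode]]
    else:
        raise ValueError(f"expected 1 or 3 entries, got {len(entries)}")
    return tuple(dir_prefix + name for name in names)

print(resolve_pmerge_list_line("cohortA"))
# ('cohortA.pgen', 'cohortA.pvar', 'cohortA.psam')
```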
In both cases, the result is written to plink2.pgen + .pvar[.zst] + .psam (unless a later operation in the same run would overwrite one of these files, in which case the prefix is plink2-merge). The .pvar is normally uncompressed, but you can request compression with --pmerge-output-vzs.
Merge tends to be a much more expensive operation than e.g. VCF autoconversion, so (unlike the case with VCF autoconversion) PLINK 1.9 and 2.0 default to keeping its output files around. You can use --delete-pmerge-result to request deletion at the end of the run.
By default, --pmerge[-list] performs "outer joins": the merged fileset contains the union of the samples, variants, and phenotypes in the input filesets. To specify intersections instead, use the --sample-inner-join, --variant-inner-join, and/or --pheno-inner-join flags.
--merge-mode, --merge-parents-mode, --merge-sex-mode, and --merge-pheno-mode define conflict resolution behavior for genotypes/dosages, parental IDs, sexes, and phenotypes, respectively. The following modes are supported for these flags:
'nm-match'/'1' (default): If nonmissing values match, keep that; otherwise set to missing.
'nm-first'/'2': First nonmissing value is kept.
'first'/'4': First value is kept, even if it's missing.
Note that PLINK 1.x's --merge-mode 6/7 has been replaced by --pgen-diff. (Tip: to find all genotype-conflict positions in a multiway merge, you can perform both "--merge-mode nm-match" and "--merge-mode nm-first" merges, and then run a "--pgen-diff include-missing" comparison between them.)
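The three conflict-resolution modes can be illustrated with a short Python sketch. The resolve_conflict function is a hypothetical helper that operates on one field's values in input order, with None standing in for a missing value. It is meant to show the logic of the modes, not PLINK 2's internal representation.

```python
def resolve_conflict(values, mode="nm-match"):
    """Resolve one field's values across inputs; None means missing."""
    if not values:
        return None
    if mode == "first":
        return values[0]  # kept even if it is missing
    nonmissing = [v for v in values if v is not None]
    if not nonmissing:
        return None
    if mode == "nm-first":
        return nonmissing[0]
    if mode == "nm-match":
        # Keep the value only when every nonmissing input agrees.
        agree = all(v == nonmissing[0] for v in nonmissing)
        return nonmissing[0] if agree else None
    raise ValueError(f"unknown mode: {mode}")

calls = [None, "G/G", "A/G"]
print(resolve_conflict(calls, "nm-match"))  # None
print(resolve_conflict(calls, "nm-first"))  # G/G
print(resolve_conflict(calls, "first"))     # None
```

The difference between the 'nm-match' and 'nm-first' results on conflicting inputs is what makes the two-merge comparison in the tip above useful for locating conflicts.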
--merge-xheader-mode defines conflict resolution behavior for .pvar header entries. (For '##' header lines where the first '=' character is followed by a '<', the key is everything up to the first comma (or '>' if there is none) in the '<' expression, and the value is everything after the comma; otherwise, the key is everything up to the '=' and the value is everything after.) The following modes are supported:
erase: Remove all header lines. (Must be used with "--merge-info-mode erase".)
match: Discard when there's any difference in the values (even capitalization).
first (default): First value is kept.
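The key/value rule for '##' header lines can be sketched in Python as below. This is one reading of the rule, not a confirmed implementation: the split_header_line name is hypothetical, the key is assumed to keep the leading '##...=<' text, and the value is assumed to be empty when the '<' expression contains no comma.

```python
def split_header_line(line):
    """Split a '##' header line into (key, value).

    Assumptions: the key includes everything from the start of the line,
    and a '<' expression without a comma has an empty value.
    """
    eq = line.find("=")
    if eq == -1:
        return line, ""  # case not covered by the rule; treated as key-only
    if line[eq + 1 : eq + 2] == "<":
        comma = line.find(",", eq + 2)
        if comma != -1:
            return line[:comma], line[comma + 1 :]
        close = line.find(">", eq + 2)
        if close != -1:
            return line[:close], ""
        return line, ""
    return line[:eq], line[eq + 1 :]

print(split_header_line("##fileformat=VCFv4.3"))
# ('##fileformat', 'VCFv4.3')
print(split_header_line("##INFO=<ID=DP,Number=1,Type=Integer>"))
# ('##INFO=<ID=DP', 'Number=1,Type=Integer>')
```

Under 'match' mode, two lines with the same key would then be compared on their value strings exactly, including capitalization.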
--merge-qual-mode, --merge-filter-mode, --merge-info-mode, and --merge-cm-mode define conflict resolution behavior for QUAL, FILTER, INFO, and CM entries, respectively. The following modes are supported:
erase: Remove the column.
nm-match: If nonmissing values match, keep that; otherwise set to missing.
nm-first: First nonmissing value is kept (INFO/CM default).
first: First value is kept. For QUAL/FILTER/CM, "first value" is defined as the first column appearance, so a column with only missing values is treated differently than an omitted column. For INFO entries, "first value" is defined as the first key appearance (even for keys with value type Flag, so the 'nm-match', 'nm-first', and 'first' modes have identical behavior for INFO flags).
min (--merge-qual-mode only, and its default): Keep smallest value. Missing value is treated as negative infinity.
np-union (--merge-filter-mode only, and its default): Keep all non-PASS values when at least one is present; otherwise, PASS if at least one present.
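The two mode-specific defaults, 'min' and 'np-union', can be sketched in Python as follows. Both functions are hypothetical helpers. The QUAL sketch reads "treated as negative infinity" as meaning that any missing input makes the merged QUAL missing, and the FILTER sketch assumes ';'-separated FILTER strings with '.' as the missing value and first-seen output order.

```python
import math

def merge_qual_min(quals):
    """'min' mode: smallest QUAL wins; None (missing) counts as -infinity."""
    if not quals:
        return None
    lowest = min(-math.inf if q is None else q for q in quals)
    return None if lowest == -math.inf else lowest

def merge_filter_np_union(filters):
    """'np-union' mode: union of non-PASS values, else PASS, else missing."""
    non_pass = []
    saw_pass = False
    for value in filters:
        for item in value.split(";"):
            if item == "PASS":
                saw_pass = True
            elif item not in ("", ".") and item not in non_pass:
                non_pass.append(item)
    if non_pass:
        return ";".join(non_pass)
    return "PASS" if saw_pass else "."

print(merge_qual_min([30.0, 12.5]))                        # 12.5
print(merge_filter_np_union(["PASS", "q10", "q10;s50"]))   # q10;s50
```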
--merge-pheno-sort and --merge-info-sort define the phenotype column and INFO key sort orders, respectively. The following modes are supported:
'none'/'0' (default): Keep in the loaded order.
'ascii'/'a': ASCII order.
'natural'/'n': "Natural sort" order.
--merge-max-allele-ct causes a merged variant to be excluded from the result if it has more than the specified number of alleles.
Other notes:
All input filesets must be sorted by position, and have the same chromosome order. (When this isn't true, use --make-pgen + --sort-vars on each fileset first.)
REF alleles for a variant must match, unless mismatches are flagged as provisional-REF.
Variants are only merged if their IDs and positions match; this is a change from PLINK 1.x.
By default, if an input .pvar file appears to have a 'split' multiallelic variant under a single ID, --pmerge[-list] errors out, since such variants must be 'joined' (with e.g. "bcftools norm -m +") before a correct merge can occur. (Exception: if "--merge-max-allele-ct 2" is specified, all variants that would be incorrectly merged are excluded instead, so no error is thrown.) If you want to keep such variants split, you'll need to assign the pieces unique IDs during the merge, with e.g. --set-all-var-ids (which now applies to all --pmerge[-list] input files). There's also an edge case where this error is a false alarm; we've yet to see a real instance of this scenario, but if you are sure that your data does not contain any same-position same-ID variant groups that should be joined, you can declare this with --multiallelics-already-joined.
Large --pmerge-list jobs may require a significant amount of disk space for temporary files.
--write-samples writes IDs of all samples which pass the filters and inclusion thresholds you've specified to plink2.id, while --write-snplist does the same for variants (output filename plink2.snplist[.zst]).
By default, --write-samples (and almost all other .id-generating commands) includes a header line in the output file. You can use --no-id-header to generate headerless .id file(s) instead. This normally forces two-column FID/IID output; add the 'iid-only' modifier to produce single-column IID output instead.
Meanwhile, since the actual variants referred to by the .snplist file can be ambiguous when duplicate variant IDs are present, --write-snplist now errors out in that case unless you specify 'allow-dups'.