
How do PLINK files and HapMap Phased files differ?


I know that PLINK and HapMap files show the same information, but can you give a thorough explanation of exactly how they differ?


According to http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped:

The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory:

  1. Family ID
  2. Individual ID
  3. Paternal ID
  4. Maternal ID
  5. Sex (1=male; 2=female; other=unknown)
  6. Phenotype

[… ]

Genotypes (column 7 onwards) should also be white-space delimited; they can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 which is, by default, the missing genotype character. All markers should be biallelic. All SNPs (whether haploid or not) must have two alleles specified. Either both alleles should be missing (i.e. 0) or neither. No header row should be given. For example, here are two individuals typed for 3 SNPs (one row = one person):

FAM001 1 0 0 1 2 A A G G A C
FAM001 2 0 0 1 2 A A A G 0 0…

And here is what I find at the beginning of a HapMap .ped file I got a few years ago (hapmap3_r2_b36_fwd.YRI.qc.poly.ped):

Y001 NA18488 0 0 2 -9 C C T T…
Y014 NA18519 0 0 1 -9 C C T T…

So far, it seems to me that this is plain .ped format: the number of "header" columns is the same, and it seems to conform to the specifications given in the above-mentioned web page.
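To make the comparison concrete, here is a minimal Python sketch that splits one .ped line into the six mandatory columns and the allele pairs, following the rules quoted above. The names `PedRecord` and `parse_ped_line` are hypothetical helpers for illustration, not part of PLINK or HapMap tooling.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    """One row of a PED file (hypothetical container, for illustration only)."""
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]


def parse_ped_line(line: str) -> PedRecord:
    """Split a whitespace-delimited PED line into its header columns and allele pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    for first, second in genotypes:
        # The quoted documentation requires both alleles missing ('0') or neither.
        if (first == "0") != (second == "0"):
            raise ValueError("both alleles must be missing or neither")
    return PedRecord(*fields[:6], genotypes)
```

Applied to the start of the HapMap line shown above (without the trailing "…"), this yields individual ID `NA18488`, sex `2`, phenotype `-9`, and the allele pairs `C C` and `T T`, which is what one would expect from a plain .ped file.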

Now let's have a look at the .map files.

By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:

  1. chromosome (1-22, X, Y or 0 if unplaced)
  2. rs# or snp identifier
  3. Genetic distance (morgans)
  4. Base-pair position (bp units)

[… ]

Note: Most analyses do not require a genetic map to be specified in any case; specifying a genetic (cM) map is most crucial for a set of analyses that look for shared segments between individuals. For basic association testing, the genetic distance column can be set at 0.

[… ]

The autosomes should be coded 1 through 22. The following other codes can be used to specify other chromosome types:

X    X chromosome                    -> 23
Y    Y chromosome                    -> 24
XY   Pseudo-autosomal region of X    -> 25
MT   Mitochondrial                   -> 26

The numbers on the right represent PLINK's internal numeric coding of these chromosomes: these will appear in all output rather than the original chromosome codes.

Here we have something that may be different. The end of the .map file corresponding to the HapMap .ped file looks like this:

26 rs28357376 0 15825
26 rs2853510 0 15925
26 rs2854125 0 16149

The HapMap .map file uses "plink's internal numeric coding" for the chromosome instead of the letter code (MT).

Otherwise, it looks like a pretty standard .map file, with no genetic distance indicated.
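As a rough illustration of why the letter code and the numeric code are interchangeable, the following Python sketch parses a four-column .map line and normalizes the chromosome to the numeric coding quoted above. `CHROM_CODES` and `parse_map_line` are hypothetical names chosen for this example.

```python
# Letter codes and their numeric equivalents, as listed in the quoted documentation.
CHROM_CODES = {"X": 23, "Y": 24, "XY": 25, "MT": 26}


def parse_map_line(line):
    """Return (chromosome_number, snp_id, genetic_distance, bp_position)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("a default MAP line has exactly 4 columns")
    chrom, snp_id, distance, position = fields
    chrom_number = CHROM_CODES.get(chrom.upper())
    if chrom_number is None:
        # Autosomes (1-22), unplaced markers (0), or already-numeric codes such as 26.
        chrom_number = int(chrom)
    return chrom_number, snp_id, float(distance), int(position)
```

With this normalization, a line starting with `26` and a line starting with `MT` describe the same chromosome, so the HapMap .map file above reads as an ordinary PLINK .map file.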



HapMap 3 is the third phase of the International HapMap project. This phase increases the number of DNA samples covered from 270 in phases I and II to 1,301 samples from a variety of human populations. This is the draft release 3.

The definitive data are available from the HapMap ftp site. The data available from these pages at the Sanger Institute are raw unfiltered data, provided as a resource to the community.

Populations

The following population samples were studied:

  • ASW: African ancestry in Southwest USA
  • CEU: Utah residents with Northern and Western European ancestry from the CEPH collection
  • CHB: Han Chinese in Beijing, China
  • CHD: Chinese in Metropolitan Denver, Colorado
  • GIH: Gujarati Indians in Houston, Texas
  • JPT: Japanese in Tokyo, Japan
  • LWK: Luhya in Webuye, Kenya
  • MXL: Mexican ancestry in Los Angeles, California
  • MKK: Maasai in Kinyawa, Kenya
  • TSI: Toscani in Italia
  • YRI: Yoruba in Ibadan, Nigeria

Data released

HapMap 3 Release 3
  • 99.3% platform concordance
  • 99.7% call rate
  • 1198 founders and 199 non-founders
  • 683 males, 714 females
  • 23238 monomorphic SNPs removed from consensus
I) Population samples for genotyping
Population                          Individuals genotyped in this release (of total)   SNPs in this release (after QC)
ASW                                 71 (of 90)                                         1,632,186
CEU                                 162 (of 180)                                       1,634,020
CHB                                 82 (of 92)                                         1,637,672
CHD                                 70 (of 90)                                         1,619,203
GIH                                 83 (of 90)                                         1,631,060
JPT                                 82 (of 89)                                         1,637,610
LWK                                 83 (of 90)                                         1,631,688
MXL                                 71 (of 90)                                         1,614,892
MKK                                 171 (of 180)                                       1,621,427
TSI                                 77 (of 90)                                         1,629,957
YRI                                 163 (of 180)                                       1,634,666
Total                               1115 (of 1261)                                     1,525,445
Consensus (polymorphic) dataset *   1115 (of 1261)                                     1,490,422

* Consensus (polymorphic) dataset of this release (35,023 monomorphic SNPs removed)

II) Population Samples for PCR Resequencing

For each population the number of individuals for whom sequence was generated is shown:

ASW 55
CEU 119
CHB 90
CHD 30
GIH 60
JPT 91
LWK 60
MXL 27
MKK 0
TSI 60
YRI 120
Total 712

Data download


FTP Download

The data can be downloaded from the Sanger Institute FTP site or the HapMap FTP site.

To access ENCODE III PCR resequencing data:

  • list of 712 unrelated samples sequenced [60 KB]
  • genotypes of 10,076 SNP sites by 712 samples [626 KB]
  • QC+ genotypes of 6,223 SNP sites by 692 samples [8700 KB]

Data Release Policy

The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. The report from that meeting can be viewed here.

The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work.

The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.

Production and QC

I) Genotyping

Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs).

Data from the two platforms was merged using PLINK (--merge-mode 1), keeping only genotype calls if there is consensus between non-missing genotype calls (that is, merged genotype is set to missing if the two platforms give different, non-missing calls).
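The consensus rule can be illustrated with a small Python sketch. It is not PLINK's implementation: `consensus_call` is a hypothetical function, a missing call is represented as `None`, and the sketch assumes that when only one platform has a call, that call is kept.

```python
from typing import Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None stands for a missing call (assumption)


def consensus_call(call_a: Genotype, call_b: Genotype) -> Genotype:
    """Merge two platform calls for one sample at one SNP."""
    if call_a is None:
        return call_b
    if call_b is None:
        return call_a
    # Allele order within an unphased genotype carries no meaning, so compare sorted pairs.
    if sorted(call_a) == sorted(call_b):
        return call_a
    # Different non-missing calls: the merged genotype is set to missing.
    return None
```

For example, `("A", "G")` and `("G", "A")` agree and are kept, while `("A", "A")` against `("A", "G")` is discordant and becomes missing.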

Quality control at the individual level was performed separately for different platforms. Only individuals with QC passed genotype data on both platforms were kept in this release. The following criteria were used to keep SNPs in the data sets of this release:

  • Hardy-Weinberg p > 0.000001 (per population)
  • missingness < 0.05 (per population)
  • < 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MXL, MKK)
  • SNP must have an rsID and map to a unique genomic location

The "consensus" data set contains data for all individuals (558 males, 557 females; 924 founders and 191 non-founders), only keeping SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus/polymorphic" data set additionally has the monomorphic SNPs removed.

In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.
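As a sketch of how the SNP-level criteria above combine, the following Python snippet applies them to per-population summary statistics. The `SnpStats` container, its field names, and the functions `passes_qc` and `in_consensus` are hypothetical; the thresholds are the ones stated in the text.

```python
from dataclasses import dataclass

# Populations for which the Mendel-error criterion applies, per the text above.
MENDEL_POPULATIONS = {"YRI", "CEU", "ASW", "MXL", "MKK"}


@dataclass
class SnpStats:
    """Per-population summary for one SNP (hypothetical structure)."""
    hwe_p: float
    missing_rate: float
    mendel_errors: int
    has_rsid: bool
    unique_location: bool


def passes_qc(stats: SnpStats, population: str) -> bool:
    """Apply the stated thresholds for a single population."""
    if stats.hwe_p <= 0.000001:
        return False
    if stats.missing_rate >= 0.05:
        return False
    if population in MENDEL_POPULATIONS and stats.mendel_errors >= 3:
        return False
    return stats.has_rsid and stats.unique_location


def in_consensus(stats_by_population: dict) -> bool:
    """A SNP is kept in the consensus set only if it passes QC in every population."""
    return all(passes_qc(s, pop) for pop, s in stats_by_population.items())
```

The `in_consensus` helper mirrors the statement that the consensus data set keeps only SNPs passing QC in all populations.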

II) PCR Resequencing

The sequence based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the following regions:

Region Chromosome Coordinates Status
ENm010 7 27 124 046 - 27 224 045 ENCODE I
ENr321 8 119 082 221 - 119 182 220 ENCODE I
ENr232 9 130 925 123 - 131 025 122 ENCODE I
ENr123 12 38 826 477 - 38 926 476 ENCODE I
ENr213 18 23 919 232 - 24 019 231 ENCODE I
ENr331 2 220 185 590 - 220 285 589 New
ENr221 5 56 071 007 - 56 171 006 New
ENr233 15 41 720 089 - 41 820 088 New
ENr313 16 61 033 950 - 61 133 949 New
ENr133 21 39 444 467 - 39 544 466 New

Following filtration of low quality reads the data were analyzed with SNP Detector version 3, for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness in samples, or with too many conflicting genotype calls in two different strands.

In the "QC+" data set, we applied the HapMap QC parameters: specifically, we filtered out samples with low completeness, and filtered out SNPs with low call rate in each population (<80%) and not in HWE (P<0.001). In the QC+ data set, the overall false positive rate is 3.2%, based on a limited number of validation assays.

Caveats

I) Genotyping

Missing from this release are Illumina SNPs that are A/T or C/G due to strandedness issues. Missing from this release are Illumina SNPs that are mitochondrial (as they do not have rsIDs). There may be a few remaining SNPs (Illumina) in this release that are still on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.

II) PCR Resequencing

Not all variant calls have been validated: we estimate that there is currently a false positive rate of 12% among all calls, with a slightly higher rate (14%) if considering just the singletons. Additional validation is ongoing; PCR sequencing of additional samples (Maasai) is also ongoing.

Analysis plans

Below are the analysis plans that the consortium is pursuing:

  • SNP allele frequency estimation
  • Population differentiation
  • Linkage disequilibrium analysis
  • SNP Tagging
  • Imputation efficiency
  • Genomic locations of human CNVs
  • Genotypes for CNVs
  • Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
  • Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
  • Linkage disequilibrium properties of CNVs
  • Tagging and imputation of CNVs
  • Signals of selection around CNVs
  • Association of SNPs and CNVs with expression phenotypes

Institutions and funding

The data for HapMap3 has been produced by the following institutions:

  • Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC)
  • Broad Institute (BI)
  • Wellcome Sanger Institute (WTSI)

Funding for phase 3 of the International HapMap project has been provided by:


Why are my results different from an analysis using program X?

This is obviously a difficult question to answer without specific details. Therefore, if you send me a question along these lines and want to get an answer, please make it as specific as possible, to put it bluntly! Ideally, include example data that replicates the problem / illustrates the difference.

  • The analytic routines themselves are slightly different. Are the results dramatically different? Do not expect exact numerical similarity between similar analyses (i.e. even for a simple case, --assoc, --fisher and --logistic will give slightly different p-values for a simple single SNP test, but this is to be expected). So, is the difference really meaningful? Perhaps more importantly, are you sure the other routine really is implementing a similar test, with similar assumptions, etc?
  • A common reason for apparent differences between PLINK and other analysis packages is that PLINK implements some default filtering of the data, i.e. first removing individuals or SNPs with below threshold genotyping rate. Look at the LOG file to check that exactly the same set of individuals were actually included in both analyses. In other words: be sure to check how missing data were handled in each case.


Access data from the third phase of the International HapMap Project.


Archive Page

This page is maintained as a historical record and is no longer being updated.



Covariates and additional SNPs

Covariates can be included with the --covar option, the same as for --linear and --logistic models. By default, all covariates in that file will be used. Covariates always feature under both the alternate and null models.

./plink --file mydata --hap-snps rs1001-rs1005 --chap --covar myfile.cov

which generates an additional set of entries in the plink.chap output file, representing the coefficients (no other statistical tests are performed for the covariates, i.e. no p-values, etc.).

In a similar manner, additional SNPs can be included, which can be SNPs other than those included in the --hap-snps command. These SNPs are not considered in any way during the phasing process: the alleles are simply entered in an allelic dosage manner. The command --condition followed by a list of SNPs, or --condition-list followed by a filename with a list of SNP names, includes these.

./plink --file mydata --hap-snps rs1001-rs1005 --chap --condition rs1006

which adds lines for the conditioning SNP to the output file.

Unlike for standard covariates, it is also possible to request that a SNP effect be dropped under the null model. This allows, for example, a test of a SNP controlling for a set of haplotypes at a different locus: here, one would want to include all haplotype effects under the null, and use the --test-snp command to drop one or more of the conditioning SNPs:

./plink --file mydata --hap-snps rs1001-rs1005 --chap --null-group % --condition rs1006 --test-snp rs1006

In this case, an extra degree of freedom is added to the model comparison test. As the --null-group % command was used to effectively control for all haplotypic effects whilst testing this particular SNP, rs1006, the test will be a 1 df test.

It is also possible to specify more than one conditioning SNP (and to drop none, some or all of these under the null).


Dimension reduction

PLINK 1.9 provides two dimension reduction routines: --pca, for principal components analysis (PCA) based on the variance-standardized relationship matrix, and --mds-plot, for multidimensional scaling (MDS) based on raw Hamming distances. Top principal components are generally used as covariates in association analysis regressions to help correct for population stratification, while MDS coordinates help with visualizing genetic distances.

--pca-cluster-names <name(s)...>
--pca-clusters <filename>

By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix; you can change the number by passing a numeric parameter. Eigenvectors are written to plink.eigenvec, and top eigenvalues are written to plink.eigenval. The 'header' modifier adds a header line to the .eigenvec file(s), and the 'tabs' modifier makes the .eigenvec file(s) tab- instead of space-delimited. You can request variant weights with the 'var-wts' modifier, and dump the matrix by using --pca in combination with --make-rel/--make-grm-gz/--make-grm-bin.

This is a simple port of GCTA's --pca flag, which generates the same files from a previously computed relationship matrix. For more full-featured principal component analysis, including automatic outlier removal, high-speed randomized approximation for very large datasets, and LD regression, try EIGENSOFT 6.
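To give a feel for the matrix that gets decomposed, here is a minimal pure-Python sketch of a variance-standardized relationship matrix. It assumes the commonly cited GCTA-style definition, in which each variant's allele counts are centered by 2p and scaled by 2p(1-p) before averaging over variants; whether this matches PLINK's computation in every detail (missing data, haploid chromosomes) is an assumption, and the function name is hypothetical.

```python
def variance_standardized_grm(genotypes):
    """Return an N x N relationship matrix as nested lists.

    genotypes[v][s] is the allele-1 count (0, 1 or 2) of sample s at variant v;
    missing calls are not handled in this sketch.
    """
    n_samples = len(genotypes[0])
    totals = [[0.0] * n_samples for _ in range(n_samples)]
    used_variants = 0
    for counts in genotypes:
        freq = sum(counts) / (2.0 * n_samples)
        if freq <= 0.0 or freq >= 1.0:
            continue  # monomorphic variants have zero variance and are skipped
        variance = 2.0 * freq * (1.0 - freq)
        centered = [c - 2.0 * freq for c in counts]
        for j in range(n_samples):
            for k in range(n_samples):
                totals[j][k] += centered[j] * centered[k] / variance
        used_variants += 1
    if used_variants == 0:
        raise ValueError("no polymorphic variants")
    return [[value / used_variants for value in row] for row in totals]
```

The principal components would then be the top eigenvectors of this matrix; the eigendecomposition itself is left out here to keep the sketch dependency-free.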

If clusters are defined (via --within), you can base the principal components off a subset of samples and then project everyone else onto those PCs with --pca-cluster-names and/or --pca-clusters. --pca-cluster-names accepts a space-delimited sequence of cluster names on the command line, while --pca-clusters takes the name of a file with one cluster name per line. If you also want the MAFs used in the relationship matrix calculation to be based on only samples in those clusters, dump those MAFs in a separate run with --freqx + --keep-cluster-names/--keep-clusters, and then load them during your PCA run with --read-freq.

In combination with --cluster, --mds-plot produces a Haploview-friendly multidimensional scaling report. By default, multidimensional scaling is performed on an inter-sample distance matrix; use the 'by-cluster' modifier to perform it on an inter-cluster distance matrix (calculated by averaging all inter-sample distances for each cluster pair) instead.

The default, singular value decomposition-based algorithm is designed to give the same results as PLINK 1.07 and the R cmdscale() function (up to rounding errors and sign flips, anyway). The 'eigendecomp' modifier requests a faster eigendecomposition-based algorithm which yields slightly different results.

The 'eigvals' modifier causes top eigenvalues to be written to plink.mds.eigvals (one per line; the first value corresponds to the first dimension in the .mds file, etc.).


File format reference

A text file with a header line, and then one line per set or polymorphic variant with the following 8-11 fields:

CHR          Chromosome code. Not present with set tests.
'SNP'/'SET'  Variant/set identifier
UNADJ        Unadjusted p-value
GC           Devlin & Roeder (1999) genomic control corrected p-value. Requires an additive model.
QQ           P-value quantile. Only present with 'qq-plot' modifier.
BONF         Bonferroni correction
HOLM         Holm-Bonferroni (1979) adjusted p-value
SIDAK_SS     Šidák single-step adjusted p-value
SIDAK_SD     Šidák step-down adjusted p-value
FDR_BH       Benjamini & Hochberg (1995) step-up false discovery control
FDR_BY       Benjamini & Yekutieli (2001) step-up false discovery control

Variants/sets are sorted in p-value order. (As a result, if the QQ field is present, its values just increase linearly.)
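For readers unfamiliar with these corrections, the following Python sketch computes several of the adjusted columns from a list of unadjusted p-values using the textbook definitions. It is an illustration only: the function `adjust_pvalues` is hypothetical, the GC, QQ and FDR_BY columns are omitted, and PLINK's own output may differ in rounding or edge-case handling.

```python
def adjust_pvalues(pvalues):
    """Return one dict per test, sorted by ascending unadjusted p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    sorted_p = [pvalues[i] for i in order]

    # Single-step corrections depend only on the number of tests.
    bonf = [min(1.0, m * p) for p in sorted_p]
    sidak_ss = [1.0 - (1.0 - p) ** m for p in sorted_p]

    # Step-down corrections: walk from the smallest p-value, enforcing monotonicity.
    holm, sidak_sd = [], []
    running_holm = running_sidak = 0.0
    for rank, p in enumerate(sorted_p):
        remaining = m - rank
        running_holm = max(running_holm, min(1.0, remaining * p))
        running_sidak = max(running_sidak, 1.0 - (1.0 - p) ** remaining)
        holm.append(running_holm)
        sidak_sd.append(running_sidak)

    # Benjamini-Hochberg step-up: walk from the largest p-value downwards.
    fdr_bh = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        running_min = min(running_min, m * sorted_p[rank] / (rank + 1))
        fdr_bh[rank] = running_min

    return [
        {"index": order[r], "UNADJ": sorted_p[r], "BONF": bonf[r], "HOLM": holm[r],
         "SIDAK_SS": sidak_ss[r], "SIDAK_SD": sidak_sd[r], "FDR_BH": fdr_bh[r]}
        for r in range(m)
    ]
```

The returned rows are in p-value order, matching the sorting described above; the `index` key records each test's position in the input list.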

.allele.no.snp (allele mismatch report)

Produced by --update-alleles when there is a mismatch between the loaded alleles for a variant and columns 2-3 of the --update-alleles input file.

A text file with no header line, and one line per mismatching variant with the following three fields:

  1. Variant identifier
  2. Expected allele #1 (from --update-alleles input file)
  3. Expected allele #2

.assoc, .assoc.fisher (case/control association allelic test report)

Produced by --assoc acting on a case/control phenotype.

A text file with a header line, and then one line per variant typically with the following 9-10 fields:

CHR    Chromosome code
SNP    Variant identifier
BP     Base-pair coordinate
A1     Allele 1 (usually minor)
F_A    Allele 1 frequency among cases
F_U    Allele 1 frequency among controls
A2     Allele 2
CHISQ  Allelic test chi-square statistic. Not present with 'fisher'/'fisher-midp' modifier.
P      Allelic test p-value
OR     odds(allele 1 | case) / odds(allele 1 | control)

If the 'counts' modifier is present, the 5th and 6th fields are replaced with:

C_A    Allele 1 count among cases
C_U    Allele 1 count among controls

If --ci 0.xy has also been specified, there are three additional fields at the end:

SE     Standard error of odds ratio estimate
Lxy    Bottom of xy% symmetric approx. confidence interval for odds ratio
Hxy    Top of xy% approx. confidence interval for odds ratio
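The OR definition and the confidence-interval columns can be illustrated from the four allele counts. The sketch below assumes the usual large-sample approach, in which the standard error is computed on the log odds ratio scale (Woolf's formula) and the interval is symmetric on that scale; that this is exactly what the SE, Lxy and Hxy fields contain is an assumption, and `allelic_odds_ratio` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def allelic_odds_ratio(case_a1, case_a2, ctrl_a1, ctrl_a2, ci=0.95):
    """Return (odds_ratio, se_log_or, lower, upper) from allele counts."""
    if min(case_a1, case_a2, ctrl_a1, ctrl_a2) <= 0:
        raise ValueError("all four allele counts must be positive")
    # odds(allele 1 | case) / odds(allele 1 | control)
    odds_ratio = (case_a1 / case_a2) / (ctrl_a1 / ctrl_a2)
    # Standard error of log(OR), assumed here (Woolf's formula).
    se_log_or = math.sqrt(1 / case_a1 + 1 / case_a2 + 1 / ctrl_a1 + 1 / ctrl_a2)
    # Two-sided normal quantile for the requested coverage, e.g. about 1.96 for 0.95.
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper
```

With 30 and 70 copies of alleles 1 and 2 among cases, and 20 and 80 among controls, the odds ratio is (30/70)/(20/80) = 12/7, and the interval brackets that value.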

.assoc.dosage (dosage association analysis report)

A text file with a header line, and then usually one line per variant with the following 8-10 fields:

CHR          Chromosome code. Requires --map.
SNP          Variant identifier.
BP           Base-pair coordinate. Requires --map.
A1           Allele 1 (usually minor)
A2           Allele 2 (usually major)
FRQ          Allele 1 frequency
INFO         R-squared quality metric/information content
'BETA'/'OR'  Regression coefficient (for quantitative traits) or odds ratio
SE           Standard error of effect (not odds ratio) estimate
P            Association test p-value

If the 'case-control-freqs' modifier is present, the FRQ column is replaced with FRQ_A and FRQ_U columns reporting case and control frequencies, respectively, and NCHROBS will not include missing-phenotype samples. (Unless the phenotype is quantitative instead of case/control; then phenotypes are ignored and FRQ_A and FRQ_U are both equal to the overall FRQ value.)

.assoc.linear, .assoc.logistic (multi-covariate association analysis report)

A text file with a header line, and T lines per variant typically with the following nine fields (where T is normally the number of terms, but the 'genotypic' and 'hethom' modifiers and the --tests flag can change this):

CHR          Chromosome code. Not present with 'no-snp' modifier.
SNP          Variant identifier. Not present with 'no-snp'.
BP           Base-pair coordinate. Not present with 'no-snp'.
A1           Allele 1 (usually minor). Not present with 'no-snp'.
TEST         Test identifier
NMISS        Number of observations (nonmissing genotype, phenotype, and covariates)
'BETA'/'OR'  Regression coefficient (--linear, "--logistic beta") or odds ratio (--logistic without 'beta')
STAT         T-statistic
P            Asymptotic p-value for t-statistic

If --ci 0.xy has also been specified, the following three fields are inserted before 'STAT':

SE     Standard error of beta (log-odds) estimate
Lxy    Bottom of xy% symmetric approx. confidence interval
Hxy    Top of xy% approx. confidence interval

Refer to the PLINK 1.07 documentation for more details.

.auto.R (R plugin function results)

A text file with no header line, and one line per variant, each with at least four fields. The first four are:

  1. Chromosome code
  2. Variant identifier
  3. Base-pair coordinate
  4. Allele 1 (corresponding to allele counts in GENO matrix; usually minor)

Subsequent fields are defined by the plugin function. Lines are permitted to contain different numbers of fields.

.bcf (1000 Genomes Project binary Variant Call Format, version 2)

Variant information + sample ID + genotype call binary file, loaded with --bcf. Cannot currently be generated by PLINK; use "--recode vcf<,-fid,-iid>" to produce a VCF file for now.

The specification for this format is at https://github.com/samtools/hts-specs.

.beagle.dat, .chr-*.dat, .chr-*.map (BEAGLE unphased genotype and variant information files)

Produced by "--recode beagle[-nomap]", for use by BEAGLE. In 'beagle' mode, one file pair is generated per autosome, while in 'beagle-nomap' mode, a single .beagle.dat file is generated containing all autosomes. This format cannot be loaded by PLINK.

Each .dat file produced by PLINK is a text file with three header lines, followed by one line per variant with 2N+2 fields where N is the number of samples:

1st header line        2nd header line        3rd header line                       Subsequent contents
'P'                    'I'                    'A' for C/C pheno., 'T' for scalar    'M'
'FID'                  'IID'                  'PHE'                                 Variant identifier
FIDs, 2x per sample    IIDs, 2x per sample    Phenotypes, 2x per sample             Allele calls (unphased)

Each .chr-*.map file produced by PLINK is a text file with no header line, and one line per variant with the following four fields:

  1. Variant identifier
  2. Base-pair coordinate
  3. Allele 1 (usually minor), 'X' if absent
  4. Allele 2 (usually major), 'X' if absent

.bed (PLINK binary biallelic genotype table)

Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different.

The first three bytes should be 0x6c, 0x1b, and 0x01 in that order. (There are old versions of the .bed format which start with a different "magic number"; PLINK 1.9 recognizes them, but will convert sample-major files to the current variant-major format on sight. See the bottom of the original .bed definition page for details; that page also contains a more verbose version of the discussion below.)

The rest of the file is a sequence of V blocks of N/4 (rounded up) bytes each, where V is the number of variants and N is the number of samples. The first block corresponds to the first marker in the .bim file, etc.

The low-order two bits of a block's first byte store the first sample's genotype code. ("First sample" here means the first sample listed in the accompanying .fam file.) The next two bits store the second sample's genotype code, and so on for the 3rd and 4th samples. The second byte stores genotype codes for the 5th-8th samples, the third byte stores codes for the 9th-12th, etc.

The two-bit genotype codes have the following meanings:

00    Homozygous for first allele in .bim file
01    Missing genotype
10    Heterozygous
11    Homozygous for second allele in .bim file

If N is not divisible by four, the extra high-order bits in the last byte of each block are always zero.

For example, consider the following text fileset:

test.ped:
1 1 0 0 1 0 G G 2 2 C C
1 2 0 0 2 0 A A 0 0 A C
1 3 1 2 1 2 0 0 1 2 A C
2 1 0 0 1 0 A A 2 2 0 0
2 2 0 0 2 2 A A 2 2 0 0
2 3 1 2 1 2 A A 2 2 A A

test.map:
1 snp1 0 1
1 snp2 0 2
1 snp3 0 3

If you load it in PLINK 1.9, a .bed file containing the following sequence of bytes will be autogenerated (you can view it with e.g. Unix xxd):

0x6c 0x1b 0x01 0xdc 0x0f 0xe7 0x0f 0x6b 0x01

and the following .bim file will accompany it:

1 snp1 0 1 G A
1 snp2 0 2 1 2
1 snp3 0 3 A C

(For brevity, we don't reproduce the .fam here.) We can decompose the .bed file as follows:

  • The first three bytes are the magic number.
  • Since there are six samples, each marker block has size 2 bytes (six divided by four, rounded up). Thus genotype data for the first marker ('snp1') is stored in the 4th and 5th bytes.
  • The 4th byte value of 0xdc is 11 01 11 00 in binary. Since the low-order two bits are ' 00 ', the first sample is homozygous for the first allele for this marker listed in the .bim file, which is 'G'. The second sample has genotype code ' 11 ', which means she's homozygous for the second allele ('A'). The third sample's code of ' 01 ' designates a missing genotype call, and the fourth code of ' 11 ' indicates another AA.
  • The 5th byte value of 0x0f is 0000 11 11 in binary. This indicates that the fifth and sixth samples also have the AA genotype at snp1. There is no sample #7 or #8, so the high-order 4 bits of this byte are zero.
  • The 6th and 7th bytes store genotype data for the second marker ('snp2'). The 6th byte value of 0xe7 is 11 10 01 11 in binary. The ' 11 ' code for the first sample means that he's homozygous for the second snp2 allele ('2'), the ' 01 ' code for the second sample indicates a missing call, the ' 10 ' code for the third indicates a heterozygous genotype, and ' 11 ' for the fourth indicates another homozygous '2'. The 7th byte value of 0x0f indicates that the fifth and sixth samples also have homozygous '2' genotypes.
  • Finally, the 8th and 9th bytes store genotype data for the third marker ('snp3'). You can test your understanding of the file format by interpreting this by hand and then comparing to the .ped file above.
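The decoding walked through above can also be sketched in a few lines of Python. This is an illustrative reader for the current variant-major layout only; `decode_bed` and the genotype labels are hypothetical names, and older .bed variants with a different magic number are not handled.

```python
MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit genotype codes, as listed in the table above.
CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_bed(data, n_samples, n_variants):
    """Return one list of genotype labels per variant (variant-major order)."""
    if data[:3] != MAGIC:
        raise ValueError("unexpected magic number")
    block_size = (n_samples + 3) // 4  # N/4 bytes per variant, rounded up
    body = data[3:]
    if len(body) != block_size * n_variants:
        raise ValueError("file size does not match sample/variant counts")
    result = []
    for v in range(n_variants):
        block = body[v * block_size:(v + 1) * block_size]
        calls = []
        for s in range(n_samples):
            # Sample s lives in byte s // 4, starting from the low-order bit pair.
            code = (block[s // 4] >> (2 * (s % 4))) & 0b11
            calls.append(CODES[code])
        result.append(calls)
    return result


example = bytes([0x6C, 0x1B, 0x01, 0xDC, 0x0F, 0xE7, 0x0F, 0x6B, 0x01])
for calls in decode_bed(example, n_samples=6, n_variants=3):
    print(calls)
```

Running it on the nine example bytes reproduces the walk-through for snp1 and snp2, and the third printed row can be used to check a hand decoding of snp3 against the .ped file.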
.bim (PLINK extended MAP file)

Extended variant information file accompanying a .bed binary genotype table. (--make-just-bim can be used to update just this file.)

A text file with no header line, and one line per variant with the following six fields:

  1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
  2. Variant identifier
  3. Position in morgans or centimorgans (safe to use dummy value of '0')
  4. Base-pair coordinate (1-based; limited to 2^31 - 2)
  5. Allele 1 (corresponding to clear bits in .bed; usually minor)
  6. Allele 2 (corresponding to set bits in .bed; usually major)

Allele codes can contain more than one character. Variants with negative bp coordinates are ignored by PLINK.

.blocks, .blocks.det (haplotype blocks, estimated using Haploview's default algorithm)

.blocks files contain one line per block, each with an asterisk followed by variant IDs.

.blocks.det files have a header line, followed by one line per block with the following six fields:

CHR    Chromosome code
BP1    First base-pair coordinate
BP2    Last base-pair coordinate
KB     Block length in kbs
NSNPS  Number of variants in block
SNPS   '|'-delimited variant IDs

.clst (cluster membership file)

Produced by --write-cluster. Valid input for --within.

A text file with no header line, and one line per sample with the following three fields:

  1. Family ID
  2. Within-family ID
  3. Cluster name

Samples may not appear more than once.

.clumped, .clumped.best, .clumped.ranges (reprocessed LD-clumped reports)

The .clumped file normally has one header line, followed by one line per index variant (lowest p-values first) with the following 11-12 fields:

CHR    Chromosome code
F      1-based file number
SNP    Index variant identifier
BP     Base-pair coordinate
P      Index variant p-value
TOTAL  Number of other variants in clump
NSIG   Number of clumped variants with p ≥ .05
S05    Number of clumped variants with .01 ≤ p < .05
S01    Number of clumped variants with .001 ≤ p < .01
S001   Number of clumped variants with .0001 ≤ p < .001
S0001  Number of clumped variants with p < .0001
SP2    Comma-delimited IDs and file numbers of members with p < --clump-p2 threshold. Not present with --clump-verbose.

With --clump-verbose, the header line above is repeated for every clump, instead of just appearing once, and dashed line dividers are present between clumps. Also, each nonempty clump has its own subsection, with the different header line below, one line corresponding to the index variant (with '(INDEX)' before the variant ID), a blank line, and then one line for each other clump member with the following 6-7 fields:

(blank)  Variant identifier
KB       [current variant bp coordinate] - [index bp coordinate], signed
RSQ      Squared correlation coefficient with index variant
ALLELES  Minor allele for index variant, more-common-than-expected haplotypes otherwise
F        1-based file number
P        P-value
ANNOT    Comma-delimited extra fields. Requires --clump-annotate.

Each nonempty clump also has the following 2-3 footer lines:

  1. 'RANGE:', followed by 'chr<#>:<bp1>..<bp2>' (including --clump-range-border padding)
  2. 'SPAN:', followed by range length in kbs
  3. "GENES w/SNPs:", followed by names of regions containing at least one variant in the clump (only present with --clump-range)

Finally, with --clump-range + --clump-verbose, there is a final footer line starting with 'GENES:', followed by names of regions physically overlapping the clump. (This is reported even for empty clumps.)

If --clump-range is used without --clump-verbose, region overlaps are reported in a separate .clumped.ranges file instead. This has a header line, followed by one line per clump with the following seven fields:

CHR     Chromosome code
SNP     Index variant identifier
P       Index variant p-value
N       Number of variants in clump (including index variant)
POS     Base-pair range, as 'chr<#>:<bp1>..<bp2>' (including --clump-range-border padding)
KB      Range length in kbs (i.e. (<bp2> - <bp1> + 1) / 1000)
RANGES  Comma-delimited names of overlapped --clump-range regions, in brackets
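
The KB formula counts both endpoints of the range. A small sketch of how the POS and KB columns relate follows; `RANGE_RE` and `range_length_kb` are names I made up, and the pattern tolerates stray spaces inside the range string.

```python
import re

# Matches a range such as 'chr1:1000..2999'; optional spaces are tolerated.
RANGE_RE = re.compile(r"chr\s*(\S+?)\s*:\s*(\d+)\s*\.\.\s*(\d+)")

def range_length_kb(pos):
    """Return (<bp2> - <bp1> + 1) / 1000 for a POS string."""
    match = RANGE_RE.search(pos)
    if match is None:
        raise ValueError(f"unrecognized range: {pos!r}")
    bp1, bp2 = int(match.group(2)), int(match.group(3))
    return (bp2 - bp1 + 1) / 1000

# Both endpoints are counted, so 1000..2999 spans exactly 2000 bases.
print(range_length_kb("chr1:1000..2999"))
```
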

Finally, if --clump-best is specified, a .clumped.best file is generated. This has a header line, followed by one line per clump with the following 7-8 fields:

INDEX    Index variant identifier
PSNP     ID of best proxy (maximum r-squared), or 'NA' if there is none
RSQ      Squared correlation coefficient between index and proxy
KB       <proxy bp coordinate> - <index bp coordinate>, signed
P        Proxy p-value
ALLELES  More-common-than-expected haplotypes
F        Proxy file number
(blank)  Comma-delimited extra fields for proxy variant. Requires --clump-annotate.

.cluster1, .cluster2, .cluster3, .cluster3.missing (hierarchical clustering reports)

--cluster normally generates three files, with the extensions .cluster1, .cluster2, and .cluster3[.missing]. The .cluster2 file shares the .clst format, so it is valid input for --within. The other two files are also text files with no header line.

.cluster1 files contain one line per cluster, with a cluster name in front ('SOL-0', 'SOL-1', ...), followed by IDs of the cluster's members (formatted as FID + '_' + IID + possibly case/control status in parentheses).

.cluster3[.missing] files contain one line per sample, with their FID and IID as the first two fields (not merged with an underscore here), followed by a sequence of nonnegative integers representing the sample's cluster assignment at each stage of the clustering process.

.cmh (Cochran-Mantel-Haenszel 2x2xK test report)

A text file with a header line, and then one line per variant with the following 12-14 fields (where 0.xy is the --ci parameter, or 0.95 if none was specified):

CHR       Chromosome code
SNP       Variant identifier
BP        Base-pair coordinate
A1        Allele 1 (usually minor)
MAF       Allele 1 frequency
A2        Allele 2 (usually major)
CHISQ     Cochran-Mantel-Haenszel statistic (1df)
P         Asymptotic p-value for CMH test statistic
OR        CMH odds ratio
SE        Standard error of odds ratio estimate
Lxy       Bottom of xy% symmetric approx. confidence interval
Hxy       Top of xy% approx. confidence interval
CHISQ_BD  Breslow-Day test statistic. Requires --bd.
P_BD      Asymptotic p-value for Breslow-Day test statistic. Requires --bd.

.cmh2 (Cochran-Mantel-Haenszel IxJxK test report)

A text file with a header line, and then one line per variant with the following five fields:

CHR    Chromosome code
SNP    Variant identifier
CHISQ  Cochran-Mantel-Haenszel IxJxK test statistic
DF     Chi-square degrees of freedom
P      Asymptotic p-value

(DF was not directly reported by PLINK 1.07.)

.cnv (segmental copy number variant data)

Produced by postprocessing the output of Birdsuite or a similar package. Loaded with --cnv-list/--cfile. Must be accompanied by a .fam file.

A text file with an optional header line, and one line per segmental call with the following eight fields:

FID    Family ID
IID    Within-family ID
CHR    Chromosome code
BP1    First base-pair coordinate
BP2    Last base-pair coordinate
TYPE   Number of copies of variant
SCORE  Confidence score associated with variant (safe to use dummy value of '0')
SITES  Number of probes in the variant (safe to use dummy value of '0')

.cnv.indiv (per-sample segment summary)

Produced whenever --cfile/--cnv-list loading completes.

A text file with a header line, and one line per sample with the following 6-7 fields:

FID    Family ID
IID    Within-family ID
PHE    Phenotype
NSEG   Number of segments that sample has
KB     Total kilobase distance spanned by segments
KBAVG  Average segment size
COUNT  (Only present with --cnv-count, which is not yet implemented.)

.cnv.overlap (overlapping CNV segment report)

A text file with a header line, and one line per overlap with the following five fields:

FID  Family ID
IID  Within-family ID
CHR  Chromosome code
BP1  Segment start (base-pair units)
BP2  Segment end

.cnv.summary (per-variant CNV summary)

Produced whenever --cfile/--cnv-list loading completes.

A text file with a header line, and one line per variant with the following five fields:

CHR    Chromosome code
SNP    Variant identifier
BP     Base-pair coordinate
AFF    CNV count at variant, all cases
UNAFF  CNV count at variant, all controls

.cov (covariate table)

Produced by --write-covar, --make-bed, and --recode when an input covariate table has been named with --covar. Valid input for --covar.

A text file with a header line, and one line per sample with the following 2+C or 6+C fields (where C is the number of covariates):

FID               Family ID
IID               Within-family ID
PAT               Paternal within-family ID. Requires --with-phenotype without 'no-parents'.
MAT               Maternal within-family ID. Requires --with-phenotype without 'no-parents'.
SEX               Sex. Requires --with-phenotype without 'no-sex'.
PHENOTYPE         Main phenotype value. Only present with --with-phenotype.
Covariate IDs...  Covariate values

Note that --covar can also be used with files lacking a header row.

.dfam (sib-TDT association report)

A text file with a header line, and then one line per variant with the following eight fields:

CHR    Chromosome code
SNP    Variant identifier
A1     Allele 1 (usually minor)
A2     Allele 2 (usually major)
OBS    Number of observed A1 alleles
EXP    Expected number of A1 alleles
CHISQ  Sib-TDT test statistic
P      Asymptotic p-value for sib-TDT test statistic

.diff (merge conflict report)

Produced by --merge/--bmerge + --merge-mode 6 or 7.

A text file with a header line, and then one line per conflict with the following five fields:

SNP  Variant identifier
FID  Family ID
IID  Within-family ID
NEW  Genotype in merge fileset (named in --merge/--bmerge)
OLD  Genotype in reference fileset (loaded with e.g. --bfile)

.dist (genomic Hamming distance matrix)

A tab-delimited text file that is either lower-triangular (first line has only one entry containing the <genome 1-genome 2> Hamming distance, second line has two entries containing the <genome 1-genome 3> and <genome 2-genome 3> Hamming distances in that order, etc.) or square. If square, the upper-right triangle may be either zeroed out or the mirror-image of the lower-left triangle, depending on whether the 'square0' or 'square' modifier was used.

When missing values are present, the affected raw Hamming distances are rescaled to be comparable to pairwise distances unaffected by missing data.
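
For the lower-triangular form, a reader along these lines can rebuild the full symmetric matrix. This is only a sketch: `read_triangular_dist` is a hypothetical helper, it takes an iterable of text lines, and it splits on any whitespace rather than strictly on tabs.

```python
def read_triangular_dist(lines):
    """Expand a lower-triangular .dist listing into a square matrix.

    Line k (1-based) holds the distances between sample k+1 and samples
    1..k, so a file with N-1 lines describes N samples.
    """
    rows = [[float(x) for x in line.split()] for line in lines if line.strip()]
    n = len(rows) + 1
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows, start=1):
        if len(row) != i:
            raise ValueError(f"line {i} should have {i} entries, found {len(row)}")
        for j, value in enumerate(row):
            square[i][j] = value
            square[j][i] = value  # mirror into the upper-right triangle
    return square

# Three samples: d(1,2) = 3, d(1,3) = 5, d(2,3) = 2.
for row in read_triangular_dist(["3", "5\t2"]):
    print(row)
```

Mirroring the values corresponds to what the 'square' modifier would write; leaving the upper triangle at zero corresponds to 'square0'.
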

.dupvar (duplicate-position-and-alleles variant report)

Normally a tab-delimited text file with a header line, followed by one line per duplicate variant group with the following 4 columns:

CHR      Chromosome code
POS      Base-pair coordinate
ALLELES  Comma-separated allele codes
IDS      Space-separated variant IDs

With the 'ids-only' modifier, the header and the position/allele columns are omitted; only space-delimited lists of variant IDs remain. (This form is directly usable with --extract/--exclude.)

With 'require-same-ref' (and without 'ids-only'), the ALLELES column is replaced with the following two columns:

REF  A2 allele
ALT  A1 allele (will become a comma-separated list in PLINK 2.0)

.eigenvec, .eigenvec.var (principal components)

Produced by --pca. Accompanied by an .eigenval file, which contains one eigenvalue per line.

The .eigenvec file is, by default, a space-delimited text file with no header line and 2+V columns per sample, where V is the number of requested principal components. The --pca 'header' modifier causes a header line to be written, and the 'tabs' modifier makes this file tab-delimited. The first two columns are the sample's FID/IID, and the rest are principal component weights in the same order as the .eigenval values (if the header line is present, these columns are titled 'PC1', 'PC2', ...).

With the 'var-wts' modifier, an .eigenvec.var file is also generated. It replaces the FID/IID columns with 'CHR', 'VAR', 'A1', and 'A2' columns containing chromosome codes, variant IDs, A1 alleles, and A2 alleles, respectively; otherwise the formats are identical.

.epi.{cc,co,qt}, .epi.{cc,co,qt}.summary (epistatic interaction scan reports)

Produced by --epistasis and --fast-epistasis. 'cc' secondary extension indicates a case/control test, 'co' indicates "--fast-epistasis case-only", and 'qt' indicates --epistasis linear regression on a quantitative trait.

The main report is normally a text file with a header line, followed by one line per variant pair clearing the --epi1 threshold with the following 5-7 fields:

CHR1                 Variant 1 chromosome code
SNP1                 Variant 1 identifier
CHR2                 Variant 2 chromosome code
SNP2                 Variant 2 identifier
'OR_INT'/'BETA_INT'  Odds ratio (case/control) or regression coefficient (QT). Requires --epistasis.
STAT                 Chi-square statistic
DF                   Chi-square degrees of freedom. Only present with 'boost'.
P                    Chi-square p-value. Not present with --fast-epistasis 'nop' modifier.

The .summary file is a text file with a header line, followed by one line per variant (or just one line per variant in set #1, if 'set-by-set' or 'set-by-all' was specified) with the following 7-8 fields:

CHR         Chromosome code
SNP         Variant identifier
N_SIG       Number of 'significant' (based on --epi2 value) epistatic test results
N_TOT       Total number of valid test results
PROP        Proportion significant. Not always present in intermediate --parallel files.
BEST_CHISQ  Largest chi-square statistic (approximate when 'boost' test and ≤ --epi1 threshold)
BEST_CHR    Chromosome of largest-statistic variant
BEST_SNP    ID of largest-statistic variant

For the 'boost' test, the BEST_CHISQ/BEST_CHR/BEST_SNP entry occasionally doesn't correspond to lowest p-value, since DF is variable.

For two-set tests, if variant v1 is in both sets but v2 is only in set #1, the v1-v2 test is only counted in the v2 summary row. (This is a change from PLINK 1.07.)

.fam (PLINK sample information file)

Sample information file accompanying a .bed binary genotype table. (--make-just-fam can be used to update just this file.) Also generated by "--recode lgen" and "--recode rlist".

A text file with no header line, and one line per sample with the following six fields:

  1. Family ID ('FID')
  2. Within-family ID ('IID'; cannot be '0')
  3. Within-family ID of father ('0' if father isn't in dataset)
  4. Within-family ID of mother ('0' if mother isn't in dataset)
  5. Sex code ('1' = male, '2' = female, '0' = unknown)
  6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

With the use of additional loading flag(s), PLINK can also correctly interpret some .fam files missing one or more of these fields.

If there are any numeric phenotype values other than {-9, 0, 1, 2}, the phenotype is interpreted as a quantitative trait instead of case/control status. In this case, -9 normally still designates a missing phenotype; use --missing-phenotype if this is problematic.

Several PLINK commands (e.g. --cluster) merge the FID and IID with an underscore in their reports; for example, a sample with FID = 'Chang' and IID = 'Christopher' would be referenced as 'Chang_Christopher'. We preserve this behavior for backwards compatibility, so you should avoid using underscores in FIDs and IIDs (consider '

If your case/control phenotype is encoded as '0' = control and '1' = case, you'll need to specify --1 to load it properly.
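
The phenotype rule can be sketched as follows. This is a rough illustration, not PLINK's actual logic: the function names are hypothetical, and treating non-finite tokens such as 'nan' as non-numeric is my own assumption.

```python
import math

# Numeric codes that are compatible with a case/control phenotype.
CASE_CONTROL_CODES = {-9.0, 0.0, 1.0, 2.0}

def parse_number(token):
    """Return the token as a float, or None if it is not a finite number."""
    try:
        value = float(token)
    except ValueError:
        return None
    # Assumption: 'nan'/'inf' are treated like non-numeric tokens.
    return value if math.isfinite(value) else None

def classify_phenotypes(tokens):
    """Return 'quantitative' if any numeric value falls outside -9/0/1/2."""
    numeric = [v for v in map(parse_number, tokens) if v is not None]
    if any(v not in CASE_CONTROL_CODES for v in numeric):
        return "quantitative"
    return "case/control"

def decode_case_control(token):
    """Map a column-6 token to 'control', 'case' or 'missing'."""
    value = parse_number(token)
    if value == 1.0:
        return "control"
    if value == 2.0:
        return "case"
    return "missing"  # '-9', '0' and non-numeric tokens

print(classify_phenotypes(["1", "2", "-9", "NA"]))   # only case/control codes
print(classify_phenotypes(["1", "2", "3.5"]))        # 3.5 forces quantitative
print([decode_case_control(t) for t in ["1", "2", "-9", "0", "NA"]])
```

Files coded as '0' = control and '1' = case would be misread by `decode_case_control` as written, which is the situation the --1 flag exists for.
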

.flipscan, .flipscan.verbose (case/control strand inconsistency report)

The .flipscan file is a text file with a header line, and one line per variant with the following 11 fields:

CHR      Chromosome code
SNP      Variant identifier
BP       Base-pair coordinate
A1       Allele 1 (usually minor)
A2       Allele 2 (usually major)
F        Allele 1 frequency
POS      Number of positive LD matches
R_POS    Positive LD match average correlation
NEG      Number of negative LD matches
R_NEG    Negative LD match average correlation
NEGSNPS  Negative LD match ID(s), '|'-delimited

If the 'verbose' modifier is present, a .flipscan.verbose file is also generated. This is a text file with a header line, and one line per relevant variant pair (i.e. index variant has at least one negative LD match, and case and/or control correlation has sufficient absolute value) with the following nine fields:

CHR_INDX  Chromosome code
SNP_INDX  Index variant identifier
BP_INDX   Index variant base-pair coordinate
A1_INDX   Index variant allele 1
SNP_PAIR  Second variant identifier
BP_PAIR   Second variant base-pair coordinate
A1_PAIR   Second variant allele 1
R_A       Case-only correlation
R_U       Control-only correlation

.frq (basic allele frequency report)

Produced by --freq. Valid input for --read-freq.

A text file with a header line, and then one line per variant with the following six fields:

CHR      Chromosome code
SNP      Variant identifier
A1       Allele 1 (usually minor)
A2       Allele 2 (usually major)
MAF      Allele 1 frequency
NCHROBS  Number of allele observations

.frq.cc (case/control phenotype-stratified allele frequency report)

Produced by "--freq case-control". Not valid input for --read-freq.

A text file with a header line, and then one line per variant with the following eight fields:

CHR        Chromosome code
SNP        Variant identifier
A1         Allele 1 (usually minor)
A2         Allele 2 (usually major)
MAF_A      Allele 1 frequency in cases
MAF_U      Allele 1 frequency in controls
NCHROBS_A  Number of case allele observations
NCHROBS_U  Number of control allele observations

.frq.count (basic allele count report)

A text file with a header line, and then one line per variant with the following seven fields:

CHR  Chromosome code
SNP  Variant identifier
A1   Allele 1 (usually minor)
A2   Allele 2 (usually major)
C1   Allele 1 count
C2   Allele 2 count
G0   Missing genotype count (so C1 + C2 + 2 * G0 is constant on autosomal variants)

.frq.strat (cluster-stratified allele frequency report)

Produced by --freq when used with --within/--family. Not valid input for --read-freq.

A text file with a header line, and then C lines per variant (where C is the number of clusters) with the following 8-9 fields:

CHR      Chromosome code
SNP      Variant identifier
CLST     Cluster identifier
A1       Allele 1 (usually minor)
A2       Allele 2 (usually major)
MAF      Allele 1 frequency in cluster
MAC      Allele 1 count in cluster
NCHROBS  Number of allele observations in cluster

.frqx (genotype count report)

Produced by --freqx. Valid input for --read-freq.

A text file with a header line, and then one line per variant with the following ten fields:

CHR         Chromosome code
SNP         Variant identifier
A1          Allele 1 (usually minor)
A2          Allele 2 (usually major)
C(HOM A1)   A1 homozygote count
C(HET)      Heterozygote count
C(HOM A2)   A2 homozygote count
C(HAP A1)   Haploid A1 count (includes male X chromosome)
C(HAP A2)   Haploid A2 count
C(MISSING)  Missing genotype count
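
These genotype counts are enough to recover an allele frequency. The sketch below assumes that each diploid call contributes two allele observations and each haploid call one; that counting convention and the name `a1_frequency` are my assumptions, not taken from the PLINK source.

```python
def a1_frequency(hom_a1, het, hom_a2, hap_a1=0, hap_a2=0):
    """Return the A1 allele frequency implied by .frqx-style counts.

    Missing genotypes are simply left out of both numerator and denominator.
    """
    a1_obs = 2 * hom_a1 + het + hap_a1
    total_obs = 2 * (hom_a1 + het + hom_a2) + hap_a1 + hap_a2
    if total_obs == 0:
        return None  # no non-missing calls, so the frequency is undefined
    return a1_obs / total_obs

# 10 A1 homozygotes, 20 heterozygotes, 70 A2 homozygotes.
print(a1_frequency(10, 20, 70))
```
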

.fst (fixation index report)

A text file with a header line, and then one line per autosomal diploid variant with the following five fields:

CHR    Chromosome code
SNP    Variant identifier
POS    Base-pair coordinate
NMISS  Number of genotype calls considered
FST    Wright's FST estimate, via Weir and Cockerham's method

.gen (Oxford genotype file format)

Native text genotype file format for Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST. Should always be accompanied by a .sample file. Loaded with --data/--gen, and produced by "--recode oxford".

A text file with no header line, and one line per variant with either 3N+5 or 3N+6 fields where N is the number of samples. Each line stores information for a single SNP.

In the 3N+5 case (corresponding to the original specification), the first five fields are:

  1. "SNP ID"
  2. rsID (treated by PLINK as the main variant ID)
  3. Base-pair coordinate
  4. Allele 1 (usually minor)
  5. Allele 2 (usually major)

Unless the chromosome code was declared with --oxford-single-chr (in which case the SNP ID column is ignored), PLINK has no choice but to assume that the "SNP ID" column actually stores chromosome codes. (This is the convention when PLINK exports a 5-leading-column .gen file.)

The newer 3N+6 column flavor has a dedicated chromosome column in front. This was not supported by PLINK 1.9 or 2.0 before 16 Apr 2021.

Each subsequent triplet of values then indicates likelihoods of homozygote A1, heterozygote, and homozygote A2 genotypes at this SNP, respectively, for one sample. If they add up to less than one, the remainder is a no-call probability weight.

Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are currently treated as missing, and the rest are treated as hard calls. (This behavior can be changed with --hard-call-threshold.) This limitation is removed in PLINK 2.0.
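
To make the hard-call rule concrete, here is a sketch that assumes "uncertainty" means one minus the largest of the three probabilities. That reading is my interpretation of the sentence above, and `split_gen_line` and `hard_call` are made-up names.

```python
def split_gen_line(line, n_leading=5):
    """Split a .gen line into its leading fields and per-sample triplets.

    n_leading is 5 for the original flavor and 6 for the newer one.
    """
    fields = line.split()
    leading = fields[:n_leading]
    values = [float(x) for x in fields[n_leading:]]
    if len(values) % 3 != 0:
        raise ValueError("probability columns are not a multiple of three")
    triplets = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return leading, triplets

def hard_call(p_a1a1, p_het, p_a2a2, threshold=0.1):
    """Return the most likely genotype, or None when it is too uncertain.

    Assumption: uncertainty = 1 - max(probabilities).
    """
    probs = (p_a1a1, p_het, p_a2a2)
    best = max(range(3), key=lambda i: probs[i])
    if 1.0 - probs[best] > threshold:
        return None  # treated as a missing call
    return ("A1/A1", "A1/A2", "A2/A2")[best]

leading, triplets = split_gen_line("1 rs1 1000 A G 1 0 0 0.05 0.95 0 0.5 0.5 0")
print(leading, [hard_call(*t) for t in triplets])
```
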

.genome (identity-by-descent report)

Produced by --genome. Valid input for --read-genome.

A text file with a header line, and one line per pair of distinct samples typically with the following 14 fields:

FID1    First sample's family ID
IID1    First sample's within-family ID
FID2    Second sample's family ID
IID2    Second sample's within-family ID
RT      Relationship type inferred from .fam/.ped file
EZ      IBD sharing expected value, based on just .fam/.ped relationship
Z0      P(IBD=0)
Z1      P(IBD=1)
Z2      P(IBD=2)
PI_HAT  Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)
PHE     Pairwise phenotypic code (1, 0, -1 = case-case, case-ctrl, and ctrl-ctrl pairs, respectively)
DST     IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)
PPC     IBS binomial test
RATIO   HETHET : IBS0 SNP ratio (expected value 2)

The pedigree relationship type codes are as follows:

With the 'full' modifier, there are five additional fields at the end:

IBS0    Number of IBS 0 nonmissing variants
IBS1    Number of IBS 1 nonmissing variants
IBS2    Number of IBS 2 nonmissing variants
HOMHOM  Number of IBS 0 SNP pairs used in PPC test
HETHET  Number of IBS 2 het/het SNP pairs used in PPC test
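
The PI_HAT and DST columns can be recomputed from the other columns with the formulas given in the main table. A minimal sketch, with function names of my own choosing:

```python
def pi_hat(z1, z2):
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1

def ibs_distance(ibs0, ibs1, ibs2):
    """DST: (IBS2 + 0.5 * IBS1) / (IBS0 + IBS1 + IBS2)."""
    return (ibs2 + 0.5 * ibs1) / (ibs0 + ibs1 + ibs2)

# A parent-offspring pair ideally has Z0 = 0, Z1 = 1, Z2 = 0.
print(pi_hat(z1=1.0, z2=0.0))
# Counts as they would appear in the IBS0/IBS1/IBS2 columns with 'full'.
print(ibs_distance(ibs0=10, ibs1=20, ibs2=70))
```
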

.grm (GCTA text relationship matrix)

A text file with no header line, and one line per pair of samples (not necessarily distinct) with the following four fields:

  1. 1-based index of first sample in .grm.id file
  2. 1-based index of second sample in .grm.id file
  3. Number of observations (variants where neither sample has a missing call)
  4. Relationship value
.grm.N.bin, .grm.bin (GCTA 1.1+ triangular binary relationship matrix)

These files contain single-precision (4-byte) floating point values. Using 1-based matrix indices, the first value in each file is the (1, 1) relationship value (.grm.bin) or observation count (.grm.N.bin); the second and third values are the (2, 1) and (2, 2) relationships/counts; the fourth through sixth values are the (3, 1), (3, 2) and (3, 3) relationships/counts in that order; and so on.

Note that .grm.bin files generated by GCTA versions before 1.1 have a different format.
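
The triangular layout can be unpacked with the standard `struct` module. In this sketch the little-endian byte order is an assumption on my part (it matches typical x86 output, but the text above does not state it), and `read_grm_bin` is a hypothetical helper that works on bytes already read from the file.

```python
import struct

def read_grm_bin(data, n_samples):
    """Rebuild a symmetric matrix from packed lower-triangular float32 values.

    Values are stored row by row: (1,1), (2,1), (2,2), (3,1), (3,2), (3,3), ...
    Assumption: little-endian 4-byte floats.
    """
    n_values = n_samples * (n_samples + 1) // 2
    values = struct.unpack(f"<{n_values}f", data[:4 * n_values])
    matrix = [[0.0] * n_samples for _ in range(n_samples)]
    k = 0
    for i in range(n_samples):
        for j in range(i + 1):
            matrix[i][j] = matrix[j][i] = values[k]
            k += 1
    return matrix

# Fake three-sample matrix, packed the same way for demonstration.
demo = struct.pack("<6f", 1.0, 0.5, 1.0, 0.25, 0.5, 1.0)
for row in read_grm_bin(demo, 3):
    print(row)
```

The number of samples is not stored in the binary file, so it has to come from the accompanying .grm.id file.
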

.gvar (genetic variant format)

Produced by packages such as Birdsuite. Loaded with --gfile. Must be accompanied by .fam and .map files.

A text file with no header line, and one line per variant call with the following seven fields:

  1. Family ID
  2. Within-family ID
  3. Variant name
  4. Code for allele from first parent
  5. Copy number for first allele (can be non-integer)
  6. Code for allele from second parent
  7. Copy number for second allele
.het (method-of-moments F coefficient estimates)

A text file with a header line, and one line per sample with the following six fields:

FID     Family ID
IID     Within-family ID
O(HOM)  Observed number of homozygotes
E(HOM)  Expected number of homozygotes
N(NM)   Number of non-missing autosomal genotypes
F       Method-of-moments F coefficient estimate

.hh (heterozygous haploid and nonmale Y chromosome call list)

Produced automatically when the input data contains heterozygous calls where they shouldn't be possible (haploid chromosomes, male X/Y), or there are nonmissing calls for nonmales on the Y chromosome.

A text file with one line per error (sorted primarily by variant ID, secondarily by sample ID) with the following three fields:

.hom (run-of-homozygosity list)

Produced when a flag in the --homozyg family is present. Accompanied by at least a .hom.indiv and a .hom.summary file.

A text file with a header line, and one line per run with the following thirteen fields:

FID      Family ID
IID      Within-family ID
PHE      Phenotype value
CHR      Chromosome code
SNP1     ID of first SNP in run
SNP2     ID of last SNP in run
POS1     Base-pair coordinate of SNP1
POS2     Base-pair coordinate of SNP2
KB       Length of region in kb
NSNP     Number of SNPs in run
DENSITY  Inverse SNP density in kb/SNP
PHOM     Proportion of calls homozygous
PHET     Proportion of calls heterozygous

Note that PHOM + PHET can be less than 1 when missing calls are present.

.hom.indiv (sample-based runs-of-homozygosity report)

Produced when a flag in the --homozyg family is present.

A text file with a header line, and one line per sample with the following six fields:

FID    Family ID
IID    Within-family ID
PHE    Phenotype value
NSEG   Number of runs of homozygosity
KB     Total length of runs (kb)
KBAVG  Average length of runs (kb)

.hom.overlap (run-of-homozygosity pool list)

.hom.overlap files contain a header line, and P+2 lines per segment pool (where P is the number of segments in the pool) with the following 13 fields:

Header  First P lines                              Last two lines
POOL    Pool ID                                    (same)
FID     Family ID                                  'CON'/'UNION'
IID     Within-family ID                           P
PHE     Phenotype value                            [case ct]:[noncase ct]
CHR     Chromosome code                            (same)
SNP1    ID of first SNP in segment                 (same)
SNP2    ID of last SNP in segment                  (same)
BP1     Base-pair coordinate of SNP1               (same)
BP2     Base-pair coordinate of SNP2               (same)
KB      Length of region in kb                     (same)
NSNP    Number of SNPs in run                      (same)
NSIM    Number of matching segments in pool        'NA'
GRP     Allelic-match group (see --homozyg-match)  'NA'

The second-to-last line for each pool describes the consensus match segment, while the last line describes the union of all segments in the pool. Pools are separated by blank lines, and sorted primarily by pool size (largest first) and secondarily by physical position. The first pool in the file has ID 'S1', the second pool has ID 'S2', etc.

PLINK 1.07's production of this file has a minor bug and a few quirks which are not replicated by PLINK 1.9: pairwise allelic matches are judged from (<# mismatches on joint-homozygous overlapping variants> / <# of overlapping variants>) instead of (<# mismatches on joint-homozygous overlapping variants> / <# of joint-homozygous overlapping variants>), contrary to the documentation; pools are sorted by reverse physical position; some ID numbers are skipped; and samples within an allelic-match group are written in an unsorted order.

.hom.overlap.S*.verbose (single ROH pool report)

"--homozyg group-verbose" also produces one .hom.overlap.<pool ID>.verbose file per pool. (Be careful with this, lest you inadvertently fill up your entire hard drive.) These files each contain G+3 sections, where G is the number of allelic-match groups. (Note that this format was not really intended to be machine-readable; if there is sufficient interest, we may clean it up in the future.)

The first section has a header line, followed by one line per sample in the pool with the following four fields:

(blank)  '1)', '2)', etc.
FID      Family ID
IID      Within-family ID
GRP      Allelic-match group (without trailing '*'s)

It ends with a single blank line.

The second section has a header line, followed by a blank line, followed by one line per variant in the segment union with the following P+1 fields:

SNP             Variant identifier
'1', '2', etc.  '/'-separated genotype call, [bracketed] when it's part of a ROH

There are single blank lines marking the beginning and end of the consensus match segment, and two consecutive blank lines at the end of this section.

The next G sections each start with the following S+6 header lines (where g is the 1-based allelic-match group index, S is the size of the group, and p is the 1-based index assigned to the sample in the first field of the first section):

  • 1. 'Group g'
  • 2. (blank line)
  • 3-(S+2). 4 fields: 'p)', FID, IID, phenotype value
  • S+3. (blank line)
  • S+4. (blank line)
  • S+5. S+1 fields: 'SNP', p1, ..., pS
  • S+6. (blank line)

This is followed by one line per variant with the following S+2 fields:

  • 1. Variant identifier
  • 2. Consensus haplotype, or '?' if there isn't one
  • 3-(S+2). Genotype call from section 2 (including brackets)

Single blank lines mark the beginning and end of the consensus match segment, as well as the end of the section.

The final section starts with two additional blank lines, followed by one line per variant with the following G+1 fields:

.hom.summary (SNP-based runs-of-homozygosity report)

Produced when a flag in the --homozyg family is present.

A text file with a header line, and one line per SNP with the following five fields:

CHR    Chromosome code
SNP    Variant identifier
BP     Base-pair coordinate
AFF    Number of cases with a run-of-homozygosity including this SNP
UNAFF  Number of non-cases with a ROH including this SNP

Note that samples with missing phenotypes are counted in the 'UNAFF' column. If the phenotype is quantitative, everyone will be counted in 'UNAFF'.

.homog (chi-square partitioning odds ratio homogeneity test report)

A text file with a header line, followed by K+3 lines per variant with the following 13 fields (where K > 1 is the number of clusters):

CHR    Chromosome code
SNP    Variant identifier
A1     Allele 1 (usually minor)
A2     Allele 2 (usually major)
F_A    Case A1 frequency
F_U    Control A1 frequency
N_A    Case allele count
N_U    Control allele count
TEST   Type of test: one of
CHISQ  Chi-square association statistic
DF     Degrees of freedom
P      Asymptotic p-value
OR     Odds ratio

.hwe (Hardy-Weinberg equilibrium exact test statistic report)

A text file with a header line, and one line per marker with the following nine fields:

CHR     Chromosome code
SNP     Variant identifier
TEST    Type of test: one of
A1      Allele 1 (usually minor)
A2      Allele 2 (usually major)
GENO    '/'-separated genotype counts (A1 hom, het, A2 hom)
O(HET)  Observed heterozygote frequency
E(HET)  Expected heterozygote frequency
P       Hardy-Weinberg equilibrium exact test p-value
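
The two heterozygote-frequency columns can be related to the GENO column with a short sketch. It assumes the expected frequency is the textbook 2pq under Hardy-Weinberg equilibrium, which I have not checked against the PLINK source; `het_frequencies` is a made-up name.

```python
def het_frequencies(geno):
    """Return (observed, expected) heterozygote frequency from 'hom1/het/hom2'.

    Assumption: expected frequency is 2 * p * (1 - p), with p the A1
    allele frequency estimated from the same counts.
    """
    hom_a1, het, hom_a2 = (int(x) for x in geno.split("/"))
    n = hom_a1 + het + hom_a2
    if n == 0:
        return None  # no genotype calls to work with
    p = (2 * hom_a1 + het) / (2 * n)
    return het / n, 2 * p * (1 - p)

observed, expected = het_frequencies("10/20/70")
print(f"O(HET) = {observed:.4f}, E(HET) = {expected:.4f}")
```

A large gap between the two values is what the exact test in the P column is assessing.
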

.ibc (GCTA inbreeding coefficient report)

A text file with a header line, and one line per sample with the following six fields:

FID     Family ID
IID     Within-family ID
NOMISS  Number of nonmissing genotype calls
Fhat1   Variance-standardized relationship minus 1
Fhat2   Excess homozygosity-based inbreeding estimate (same as PLINK --het)
Fhat3   Estimate based on correlation between uniting gametes

.imiss (sample-based missing data report)

Produced by --missing, with a companion .lmiss file.

A text file with a header line, and one line per sample with the following six fields:

FID         Family ID
IID         Within-family ID
MISS_PHENO  Phenotype missing? (Y/N)
N_MISS      Number of missing genotype call(s), not including obligatory missings or het. haploids
N_GENO      Number of potentially valid call(s)
F_MISS      Missing call rate

.info (Haploview map file)

Produced by "--recode HV[-1chr]", for use by Haploview. Accompanies a .ped file. With "--recode HV", one .ped + .info fileset is generated per chromosome, and the full file extensions are of the form .chr-<chromosome number>.info. This format cannot be loaded by PLINK.

A text file with no header line, and one line per variant with the following two fields:

.lasso (LASSO variant effect size estimates)

Produced by --lasso. Valid input for --score.

A text file with a header line, and one line per variant with the following four fields:

CHR     Chromosome code (or 'COV' for covariates)
SNP     Variant/covariate identifier
A1      Allele 1 (usually minor; 'NA' for covariates)
EFFECT  A1 effect size estimate on normalized phenotype ('NA' on monomorphic variants)

.ld (inter-variant correlation table or matrix)

If a matrix format was requested, the output is structured like a .dist file (space-delimited instead of tab-delimited if 'spaces' was specified), or its binary equivalent if the file extension ends in .bin. (See the R code snippet under the --distance documentation for an example of how to load the binary form.)

If a table report was requested instead, the file contains a header line, followed by one line per filtered variant pair with the following 7-11 fields:

CHR_A     Chromosome code for first variant
BP_A      Base-pair coordinate of first variant
SNP_A     ID of first variant
MAF_A     Allele 1 frequency for first variant. Requires 'with-freqs'.
CHR_B     Chromosome code for second variant
BP_B      Base-pair coordinate of second variant
SNP_B     ID of second variant
PHASE     In-phase allele pairs. Requires 'in-phase'.
MAF_B     Allele 1 frequency for second variant. Requires 'with-freqs'.
'R'/'R2'  Correlation coefficient (squared if --r2).
'D'/'DP'  Linkage disequilibrium D, or Lewontin's D-prime. Requires 'd'/'dprime'/'dprime-signed'.

.ldset (high-LD same-set variant pair report)

Produced by --set-r2 when the 'write' modifier is present.

A text file with no header line, and one section per set. A section has one line for each variant in the set, starting with the following two fields:

These are followed by a (space-delimited) list of ID(s) of other same-set variants which have pairwise r^2 ≥ 0.5 with the current variant.

Note that sets containing no significant variants are not present in this report; this is a change from PLINK 1.07's --write-set-r2's behavior. (Use "--set-p 1" if this is a problem.)

.lgen (PLINK long-format genotype file)

Produced by "--recode lgen" and "--recode lgen-ref". Accompanied by a .fam, .map, and possibly a .ref file. Loaded with --lfile.

A text file with no header line, and one line per genotype call (or just not-homozygous-major calls if 'lgen-ref' was invoked) usually with the following five fields:

  1. Family ID
  2. Within-family ID
  3. Variant identifier
  4. Allele call 1 ('0' for missing)
  5. Allele call 2

There are several variations which are also handled by PLINK; see the original discussion for details.

.list (genotype list file)

Produced by "--recode list". This format cannot be loaded by PLINK.

A text file with no header line, and four lines per variant. Each line starts with the following three fields:

This is followed by two additional fields (FID, then IID) for each sample with the specified genotype call at the variant.

.lmiss (variant-based missing data report)

Produced by --missing, with a companion .imiss file.

A text file with a header line, and K line(s) per variant with the following 5-7 fields (where K is the number of cluster(s) if --within/--family was specified, or 1 if it wasn't):

CHR      Chromosome code
SNP      Variant identifier
CLST     Cluster identifier. Only present with --within/--family.
N_MISS   Number of missing genotype call(s), not counting obligatory missings or het. haploids
N_CLST   Cluster size (does not include nonmales on chrY). Only present with --within/--family.
N_GENO   Number of potentially valid call(s)
F_MISS   Missing call rate

.map (PLINK text fileset variant information file)

Variant information file accompanying a .ped text pedigree + genotype table. Also generated by "--recode rlist".

A text file with no header line, and one line per variant with the following 3-4 fields:

  1. Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
  2. Variant identifier
  3. Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
  4. Base-pair coordinate

All lines must have the same number of columns (so either no lines contain the morgans/centimorgans column, or all of them do).
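
As a minimal sketch of the .map rules above (not PLINK's own parser), the following Python accepts either the 3- or 4-column layout and rejects a file that mixes the two. The names MapRecord and parse_map are hypothetical, chosen only for illustration; when the genetic-distance column is absent, the dummy value 0 is filled in.

```python
from typing import Iterable, List, NamedTuple


class MapRecord(NamedTuple):
    chrom: str        # chromosome code (kept as text: '1'-'22', 'X', ...)
    variant_id: str   # rs# or other identifier
    cm: float         # morgans/centimorgans; 0.0 if the column is absent
    bp: int           # base-pair coordinate


def parse_map(lines: Iterable[str]) -> List[MapRecord]:
    """Parse whitespace-delimited .map lines (hypothetical helper)."""
    records = []
    ncols = None  # column count fixed by the first non-empty line
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if ncols is None:
            if len(fields) not in (3, 4):
                raise ValueError("expected 3 or 4 columns, got %d" % len(fields))
            ncols = len(fields)
        elif len(fields) != ncols:
            raise ValueError("all lines must have the same number of columns")
        if ncols == 4:
            chrom, variant_id, cm, bp = fields
        else:
            chrom, variant_id, bp = fields
            cm = "0"  # dummy genetic distance
        records.append(MapRecord(chrom, variant_id, float(cm), int(bp)))
    return records
```

Chromosome codes are deliberately left as strings, since the field may hold 'X', 'Y', or (in PLINK 1.9) a contig name.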

.mdist (genomic distance proportion matrix)

A text file that is space-delimited if produced with --distance-matrix and tab-delimited otherwise. Shape and contents are identical to that of .dist files, except that all values are divided by twice the total variant count to convert them from Hamming distances to fractions between 0 and 1.

.mdist.missing (identity-by-missingness matrix)

A triangular space-delimited text file with identity-by-missingness coefficients.

.mds (Haploview-friendly multidimensional scaling report)

A text file with a header line with the following D+3 fields (where D is the number of requested dimensions), and one line per sample with the same fields:

FID   Family ID
IID   Within-family ID
SOL   Cluster index (0-based)
Cx    Position on dimension x (1-based dimension indices)

.mendel, .imendel, .fmendel, .lmendel (Mendel error reports)

The .mendel file is a text file with a header line, and one line per error with the following six columns:

FID     Family ID
KID     Child within-family ID
CHR     Chromosome code
SNP     Variant identifier
CODE    Numeric error code
ERROR   Description of error

Note that '*/*' in the error description does not (necessarily) refer to a missing genotype call; instead, it means a Mendel error is present regardless of what that parent's genotype is.

The .lmendel file has a header line, and one line per variant with the following three columns:

CHR   Chromosome code
SNP   Variant identifier
N     Number of Mendel errors

The .imendel file has a header line, and one subsection per nuclear family. Each subsection contains one line per family member with the following three columns:

FID   Family ID
IID   Within-family ID
N     Number of errors implicating this sample (only considering nuclear family)

Samples may appear more than once in this file.

Finally, the .fmendel file has a header line, and one line per nuclear family with the following five columns:

FID    Family ID
PAT    Paternal within-family ID (0 if missing)
MAT    Maternal within-family ID (0 if missing)
CHLD   Number of offspring in nuclear family
N      Number of Mendel errors in nuclear family

.meta (meta-analysis)

A text file with a header line, and then one line per analyzed variant with the following 8-(F+14) fields (where F is the number of input files):

CHR                 Chromosome code. Not present with 'no-map' modifier.
BP                  Base-pair coordinate. Not present with 'no-map' modifier.
SNP                 Variant identifier
A1                  Allele 1. Not present with 'no-map' or 'no-allele' modifier.
A2                  Allele 2. Not present with 'no-map' or 'no-allele' modifier.
N                   Number of valid studies for variant
P                   Fixed-effects meta-analysis p-value
P(R)                Random-effects meta-analysis p-value
'BETA'/'OR'         Fixed-effects BETA/OR estimate
'BETA(R)'/'OR(R)'   Random-effects BETA/OR estimate
Q                   p-value for Cochran's Q statistic
I                   I² heterogeneity index (0-100 scale)
WEIGHTED_Z          Weighted Z-score, as computed by METAL. Requires 'weighted-z' modifier.
P(WZ)               p-value for weighted Z-score. Requires 'weighted-z' modifier.
F[x]                Study x (0-based input file indices) effect estimate. Requires 'study' modifier.

.mibs (identity-by-state matrix)

A text file that is space-delimited if produced with --distance-matrix and tab-delimited otherwise. Possible shapes are the same as for .dist and .mdist files. Each identity-by-state value is just equal to one minus the corresponding .mdist value.
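
The relationship between .dist, .mdist, and .mibs values can be sketched as follows. This is only an illustration of the arithmetic described above, assuming no missing calls (any adjustment PLINK makes for missing data is not modeled); the function names are hypothetical.

```python
def dist_to_mdist(dist_rows, n_variants):
    """Convert Hamming distances (allele-count differences) to fractions in [0, 1]."""
    if n_variants <= 0:
        raise ValueError("n_variants must be positive")
    denom = 2 * n_variants  # divide by twice the total variant count
    return [[d / denom for d in row] for row in dist_rows]


def mdist_to_mibs(mdist_rows):
    """Each identity-by-state value is one minus the corresponding .mdist value."""
    return [[1.0 - m for m in row] for row in mdist_rows]


# Lower-triangular example for three samples typed at four variants.
dist = [[0], [2, 0], [4, 8, 0]]
mdist = dist_to_mdist(dist, n_variants=4)
mibs = mdist_to_mibs(mdist)
```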

.missing (case/control nonrandom missingness test report)

A text file with a header line, and then one line per nondegenerate variant with the following 5 fields:

CHR        Chromosome code
SNP        Variant identifier
F_MISS_A   Missing call frequency, cases
F_MISS_U   Missing call frequency, controls
P          Fisher's exact test p-value

.missing.hap (adjacent variant-based nonrandom missingness test report)

A text file with a header line, and then one section per autosomal diploid variant with 5+ missing calls. Each section contains one line per considered flanking haplotype, followed by a 'HETERO' line covering flanking heterozygosity (just one flanking call needs to be heterozygous), with the following 9 fields:

SNP         Central variant identifier
HAPLOTYPE   Haplotype allele(s), or 'HETERO'
F_0         Haplotype frequency, central call missing
F_1         Haplotype frequency, central call nonmissing
M_H1        #(central call missing, this hap.) / #(central call nonmissing, this hap.)
M_H2        #(central call missing, other hap.) / #(central call nonmissing, other hap.)
CHISQ       Chi-square statistic
P           Chi-square p-value
FLANKING    Flanking variant ID(s), '|'-delimited

Haplotype frequencies are estimated via the EM algorithm.

.model (case/control full model association report)

A text file with a header line, and then 1-5 lines per variant with the following 8-10 fields:

CHR     Chromosome code
SNP     Variant identifier
A1      A1 allele (usually minor)
A2      A2 allele (usually major)
TEST    Type of test: one of
AFF     '/'-separated genotype or allele counts among cases
UNAFF   '/'-separated genotype or allele counts among controls
CHISQ   Chi-square statistic. Not present with 'fisher'/'fisher-midp' modifier.
DF      Chi-square degrees of freedom. Not present with 'fisher'/'fisher-midp'.
P       P-value

Note that the Cochran-Armitage trend test is based on the full 2x3 genotype contingency table, even though only the 2x2 allele count table is displayed in the AFF/UNAFF columns on that line.

.*.mperm (max(T) permutation test report)

Produced by several association analysis commands when the 'mperm=<value>' modifier is used.

A text file with a header line, and then typically one line per variant with the following four fields:

CHR    Chromosome code
SNP    Variant identifier
EMP1   Empirical p-value (pointwise), or lower-p-value permutation count
EMP2   Corrected empirical p-value (max(T) familywise) or permutation count

In the --linear/--logistic no-snp case, there is instead one line per variable with the following three fields:

TEST   Test identifier
EMP1   Empirical p-value, or lower-p-value permutation count
NP     Number of permutations performed

.nearest (nearest neighbor distance report)

A text file with a header line, and n2-n1+1 lines per sample with the following 7-8 fields:

FID         Family ID
IID         Within-family ID
NN          Nearest neighbor level
MIN_DST     IBS distance of NNth nearest neighbor
Z           Z score of MIN_DST
FID2        FID of NNth nearest neighbor
IID2        IID of NNth nearest neighbor
PROP_DIFF   Proportion of neighbors below --ppc threshold. Not present without --ppc.

.occur.dosage (dosage data variant occurrence report)

A text file with no header line, and one line per variant with the following 2 fields:

.out.dosage (merged dosage data file)

A text file with a header line, and one line per variant with the following 3 initial fields:

SNP   Variant ID
A1    Allele 1 (usually minor)
A2    Allele 2 (usually major)

This is followed by N 2-field blocks in the header line (with FID/IIDs), and N blocks of m dosage data fields in subsequent lines (where m is the --dosage 'format' parameter).

.ped (PLINK/MERLIN/Haploview text pedigree + genotype table)

Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file; Haploview requires an accompanying .info file instead. Loaded with --file, and produced by --recode.

Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

If all alleles are single-character, PLINK 1.9 will correctly parse the more compact "compound genotype" variant of this format, where each genotype call is represented as a single two-character string. This does not require the use of an additional loading flag. You can produce such a file with "--recode compound-genotypes".
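
To make the layout concrete, here is a small Python sketch that splits one .ped line into its six leading fields and a list of allele pairs, handling both the standard layout (two fields per variant) and the compound-genotype layout (one two-character field per variant). It is an illustration under stated assumptions, not PLINK's parser: the function name and dictionary keys are hypothetical, and the caller is assumed to know V from the .map file.

```python
def parse_ped_line(line, n_variants):
    """Split a .ped line into sample fields and per-variant allele pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .ped line needs at least six leading fields")
    fid, iid, pat, mat, sex, pheno = fields[:6]
    calls = fields[6:]
    if len(calls) == 2 * n_variants:
        # Standard layout: alleles are separate whitespace-delimited fields.
        genotypes = [(calls[i], calls[i + 1]) for i in range(0, len(calls), 2)]
    elif len(calls) == n_variants and all(len(c) == 2 for c in calls):
        # Compound genotypes: each call is a single two-character string.
        genotypes = [(c[0], c[1]) for c in calls]
    else:
        raise ValueError("genotype field count does not match n_variants")
    return {
        "fid": fid, "iid": iid, "pat": pat, "mat": mat,
        "sex": sex, "pheno": pheno, "genotypes": genotypes,
    }


# First two variants of the HapMap YRI line quoted in the question.
sample = parse_ped_line("Y001 NA18488 0 0 2 -9 C C T T", n_variants=2)
```

A '0' allele is returned as-is; interpreting it as a missing call is left to the caller.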

.*.perm (adaptive permutation test report)

Produced by several association analysis commands when the 'perm' modifier is used.

A text file with a header line, and then one line per variant with the following 4-7 fields:

CHR        Chromosome code
SNP        Variant identifier
BETA       Regression slope for real data. Only present with "--qfam emp-se".
EMP_BETA   Sample mean of permutation regression slopes. Only present with "--qfam emp-se".
EMP_SE     Sample stdev of permutation regression slopes. Only present with "--qfam emp-se".
EMP1       Empirical p-value (pointwise), or lower-p-value permutation count
NP         Number of permutations performed for this variant

.pphe (phenotype permutations)

Produced by --make-perm-pheno. Valid input for --pheno.

A text file with no header line, and one line per sample with the following P+2 fields (where P is the requested number of permutations):

Missing phenotypes are always represented by the --[output-]missing-phenotype value (this is a very minor change from PLINK 1.07).

.prob (meta-analysis rejected variant list)

Produced by --meta-analysis, when at least one variant is rejected.

A text file with no header line, and then one line per problem with the following 3 fields:

Multiple problems may be reported for a single (filename, variant ID) pair.

.profile (allelic scoring results)

A text file with a header line, and then one line per sample with the following 4-6 fields:

FID                  Family ID
IID                  Within-family ID
PHENO                Phenotype value
CNT                  # of nonmissing alleles used for scoring. May require 'include-cnt'.
CNT2                 Sum of named allele counts. Not present with --dosage.
'SCORE'/'SCORESUM'   Score (normally an allele-based average, unless 'sum' modifier used)

.qassoc (quantitative trait association test report)

Produced by --assoc acting on a quantitative phenotype.

A text file with a header line, and then one line per variant with the following 9-11 fields:

CHR     Chromosome code
SNP     Variant identifier
BP      Base-pair coordinate
NMISS   Number of nonmissing genotype calls
BETA    Regression coefficient
SE      Standard error
R2      Regression r-squared
T       Wald test (based on t-distribution)
P       Wald test asymptotic p-value
LIN     Lin statistic. Only present with 'lin' modifier.
LIN_P   Lin test p-value. Only present with 'lin'.

.qassoc.gxe (quantitative trait interaction test report)

A text file with a header line, and then one line per variant with the following 10 fields:

CHR      Chromosome code
SNP      Variant identifier
NMISS1   Nonmissing genotype calls in first group
BETA1    Regression coefficient for first group
SE1      Regression coefficient standard error for first group
NMISS2   Nonmissing genotype calls in second group
BETA2    Regression coefficient for second group
SE2      Regression coefficient standard error for second group
Z_GXE    Z score, test for interaction
P_GXE    Asymptotic p-value

.qassoc.means (quantitative trait association genotype-stratified mean report)

A text file with a header line, and then five lines per variant with the following six fields:

CHR     Chromosome code
SNP     Variant identifier
VALUE   Type of value: one of
G11     Value for homozygous A1 genotype
G12     Value for heterozygous genotype
G22     Value for homozygous A2 genotype

.qfam.* (family-based quantitative trait association report)

Produced by the --qfam family of commands.

A .qfam.* file has a header line, and one line per variant with the following nine fields:

CHR     Chromosome code
SNP     Variant identifier
BP      Base-pair coordinate
A1      Allele 1 (usually minor)
TEST    Test type ('TOT', 'BET', or 'WITH')
NIND    Number of samples in linear regression
BETA    Regression coefficient
STAT    T-statistic (just for permutation test; don't use it directly)
RAW_P   Uncorrected p-value

.range.report (reprocessed gene-based report)

The .range.report file has one subsection per nonempty gene. Each subsection contains a header line of the form "<gene name> -- <start/end coordinate pairs, comma-separated if necessary> ( <kb length> ) [border description, if necessary]"; this is followed by a blank line, the original report's header line with 'DIST' inserted in front, and the lines in the original report which concerned SNPs in the gene (preceded by <current pos> - <gene start coordinate> DIST values). Subsections are separated by two blank lines.

There are four small changes from PLINK 1.07:

  • Genes now appear in natural-sorted instead of ASCII-sorted order (e.g. ABCA1 < ABCA3 < ABCA10, instead of the old ABCA1 < ABCA10 < ABCA3).
  • kb lengths are larger by 0.001, since intervals in gene region files are fully closed instead of half-open.
  • If --gene-list-border was specified, intervals and lengths in header lines do not include the additional padding.
  • When a gene contains several disjoint regions on the same chromosome, they are now reported in a single subsection.
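
The difference between natural-sorted and ASCII-sorted order in the first bullet can be reproduced with a short Python sketch. The key function below is a common way to implement natural sorting and is an assumption for illustration; it is not claimed to match PLINK's internal comparison in every edge case.

```python
import re


def natural_key(name):
    """Split a name into text and integer runs so that digits compare numerically."""
    # re.split with a capturing group alternates text (even index) and digit runs (odd index).
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


genes = ["ABCA10", "ABCA3", "ABCA1"]
ascii_order = sorted(genes)                      # plain string comparison
natural_order = sorted(genes, key=natural_key)   # numeric-aware comparison
```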

.raw (additive + dominant component file)

Produced by "--recode A" and "--recode AD", for use with R. This format cannot be loaded by PLINK.

A text file with a header line, and then one line per sample with V+6 (for "--recode A") or 2V+6 (for "--recode AD") fields, where V is the number of variants. The first six fields are:

FID         Family ID
IID         Within-family ID
PAT         Paternal within-family ID
MAT         Maternal within-family ID
SEX         Sex (1 = male, 2 = female, 0 = unknown)
PHENOTYPE   Main phenotype value

This is followed by one or two fields per variant:

<Variant ID>_<counted allele>   Allelic dosage (0/1/2/'NA' for diploid variants, 0/2/'NA' for haploid)
<Variant ID>_HET                Dominant component (1 = het, 0 otherwise). Requires "--recode AD".

If 'include-alt' was specified, the header line also names alternate allele codes in parentheses, e.g. 'rs5939319_G(/A)'.

.recode.{geno,pheno,pos}.txt (BIMBAM genotype, phenotype, and variant position files)

Produced by "--recode bimbam", for use by BIMBAM. This format cannot be loaded by PLINK.

The .recode.geno.txt file produced by PLINK is a comma-delimited text file. It starts with two short header lines: N on its own line (where N is the number of samples), followed by number of variants on its own line. The third header line starts with 'IND', and is followed by the IIDs of all samples.

The main body of the file has one line per variant with N+1 fields: the variant ID, followed by compound genotypes (with missing genotypes denoted by '??').

The .recode.pheno.txt file produced by PLINK is just a sequence of sample phenotype values, one per line.

The .recode.pos.txt file produced by PLINK is a text file with no header line, and one line per variant with the following 2-3 (space-delimited) fields:

  1. Variant identifier
  2. Base-pair coordinate
  3. Chromosome code (not present with 'bimbam-1chr')

.recode.phase.inp (fastPHASE format)

Produced by "--recode fastphase[-1chr]", for use by fastPHASE. With "--recode fastphase", one file is generated per chromosome, and the full file extensions are of the form .chr-<chromosome number>.recode.phase.inp. This format cannot be loaded by PLINK.

Each .phase.inp file produced by PLINK starts with two short header lines: number of samples on its own line, followed by V on its own line (where V is the number of variants). The third header line starts with 'P', and is followed by the base-pair coordinates of all variants.

The main body of the file has three lines per sample. The first line in each triplet is:

The second and third lines each have a single V-character string, with one character per allele call. Missing calls are coded as '?'.

.recode.strct_in (Structure format)

Produced by "--recode structure", for use by Structure. This format cannot be loaded by PLINK.

A text file with two header lines: the first header line lists all V variant IDs, while each entry in the second line is the difference between the current variant's base-pair coordinate and the previous variant's bp coordinate (or -1 when the current variant starts a new chromosome). This is followed by one line per sample with the following 2V+2 fields:

  1. Within-family ID
  2. Positive integer, unique for each FID
  3-(2V+2). Genotype calls, with the A1 allele coded as '1', A2 = '2', and missing = '0'
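
The second header line of the Structure file (inter-variant distances, with -1 marking the start of each chromosome) can be sketched in Python as below. The function name and the (chromosome, bp) input shape are hypothetical; variants are assumed to be listed in file order.

```python
def structure_distance_header(variants):
    """Return per-variant bp differences, using -1 where a new chromosome starts.

    `variants` is a sequence of (chromosome code, base-pair coordinate) pairs.
    """
    distances = []
    prev_chrom = None
    prev_bp = None
    for chrom, bp in variants:
        if chrom != prev_chrom:
            distances.append(-1)            # first variant on this chromosome
        else:
            distances.append(bp - prev_bp)  # gap to the previous variant
        prev_chrom, prev_bp = chrom, bp
    return distances


header = structure_distance_header([("1", 100), ("1", 250), ("2", 50), ("2", 60)])
```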

.ref (long-format reference allele file)

Reference allele file which accompanies a .lgen file when it's generated with "--recode lgen-ref". Loaded with --lfile + --reference.

A text file with no header line, and one line per polymorphic variant with the following 2-3 fields:

  1. Variant identifier
  2. Major allele
  3. Minor allele (not present if there is no minor allele)

.rel (text relationship matrix)

Contents are identical to that of a .grm/.grm.bin file. Possible shapes are essentially the same as for .dist files; the only difference is that .dist files have an omitted or zero diagonal while .rel files do not.

.rlist (rare genotype list file)

Produced by "--recode rlist". Accompanied by .fam and .map files. This format cannot be loaded by PLINK.

A text file with no header line, and 0-3 lines per variant. Each line starts with the following four fields:

  1. Variant identifier
  2. Genotype class ('HOM' = homozygous minor, 'HET' = heterozygous, 'NIL' = missing call)
  3. Allele 1 ('0' for missing)
  4. Allele 2

This is followed by two additional fields (FID, then IID) for each sample with the specified genotype call at the variant. If there are no such samples, the entire line is omitted from the file. (As a result, any variants with nothing but homozygous major genotypes are not mentioned at all.)

.sample (Oxford sample information file)

Sample information file accompanying a .gen genotype dosage file. Loaded with --data/--sample, and produced by "--recode oxford".

The .sample space-delimited files emitted by --recode have two header lines, and then one line per sample with 3-5 relevant fields:

First header line   Second header line   Subsequent contents
ID_1                0                    Family ID
ID_2                0                    Within-family ID
missing             0                    Missing call frequency
sex                 D                    Sex code ('1' = male, '2' = female, '0' = unknown)
phenotype           'B'/'P'              Binary ('0' = control, '1' = case) or continuous phenotype

A specification for this format is on the QCTOOL v2 website.

.set ('END'-terminated variant set membership list file)

Produced by --write-set, and loaded with --set.

A text file with a sequence of variant set definitions. Each set definition starts with the set ID, followed by IDs of all variants in the set, followed by 'END'. Spaces, tabs, and newlines are acceptable and equivalent token delimiters; the files emitted by --write-set have a single token per line and a blank line between sets, but you can, for example, describe an entire set per line instead, and --set will still read the file correctly. For instance:

GENE1
rs123456
rs10912
rs66222
END

GENE2 rs66222 rs929292
rs288222 END

assigns variants rs123456 and rs10912 to set 'GENE1', rs929292 and rs288222 to 'GENE2', and rs66222 to both sets.

When multiple set definitions share the same set ID, that currently results in an error rather than a merge.
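
Because all whitespace is equivalent, a reader for this format only needs to walk the token stream. The Python sketch below illustrates that idea and is not PLINK's implementation; the function name is hypothetical, and it mirrors the behavior described above by raising an error on a repeated set ID.

```python
def parse_set_file(text):
    """Parse 'END'-terminated set definitions into {set ID: [variant IDs]}."""
    sets = {}
    current_id = None   # set currently being read, or None between sets
    members = []
    for token in text.split():   # spaces, tabs, and newlines are equivalent
        if current_id is None:
            if token in sets:
                raise ValueError("duplicate set ID: %s" % token)
            current_id = token
            members = []
        elif token == "END":
            sets[current_id] = members
            current_id = None
        else:
            members.append(token)
    if current_id is not None:
        raise ValueError("set '%s' is missing its END token" % current_id)
    return sets


example = """GENE1
rs123456
rs10912
rs66222
END

GENE2 rs66222 rs929292
rs288222 END
"""
sets = parse_set_file(example)
```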

.set. (set association permutation test report)

Produced by --assoc/--model/--linear/--logistic/--tdt/--mh/--bd when run with the 'set-test' modifier.

A text file with a header line, and then one line per set with the following 6-7 fields:

SET    Set ID
NSNP   Set size
NSIG   Raw number of significant variants
ISIG   Final size of most-significant-variants subset (after --set-r2 and --set-max thresholds)
EMP1   Empirical set p-value, or lower-p-value permutation count
NP     Number of permutations performed. Requires 'perm-count'.
SNPS   '|'-delimited IDs for most-significant-variants subset ('NA' if empty)

Calculation of NSIG is no longer cut short when the --set-max value is hit.

.set.table (variant set membership table)

A tab-delimited text file with a header line, and then one line per variant with the following 3+S columns (where S is the number of sets):

SNP       Variant identifier
CHR       Chromosome code
BP        Base-pair coordinate
Set IDs   1 = member, 0 = nonmember

Variants which aren't a member of any set still appear in the table.

PLINK 1.07 wrote double-tabs on most lines between the 3rd and 4th columns; this no longer occurs.

.sexcheck (X chromosome-based sex validity report)

A text file with a header line, and then one line per sample with the following 6-7 fields:

FID      Family ID
IID      Within-family ID
PEDSEX   Sex code in input file
SNPSEX   Imputed sex code (1 = male, 2 = female, 0 = unknown)
STATUS   'OK' if PEDSEX and SNPSEX match and are nonzero, 'PROBLEM' otherwise
F        Inbreeding coefficient, considering only X chromosome. Not present with 'y-only'.
YCOUNT   Number of nonmissing genotype calls on Y chromosome. Requires 'ycount'/'y-only'.

.simfreq (simulation parameter file)

If generated by --simulate without the 'tags' or 'haps' modifier, it is a text file with no header line, and one line per SNP set with the following 6 fields:

  1. Number of SNPs in set (always 1 in autogenerated file)
  2. Label of this set of SNPs
  3. Reference allele frequency lower bound
  4. Reference allele frequency upper bound (equal to lower bound in autogenerated file)
  5. odds(case | heterozygote) / odds(case | homozygous for alternate allele)
  6. odds(case | homozygous for ref. allele) / odds(case | homozygous for alt. allele)

With 'tags' or 'haps', each line has the following 9 fields instead:

  1. Number of SNPs in set (always 1 in autogenerated file)
  2. Label of this set of SNPs
  3. Reference allele frequency lower bound, causal variant
  4. Reference allele frequency upper bound, causal variant
  5. Reference allele frequency lower bound, marker
  6. Reference allele frequency upper bound, marker
  7. Marker-causal variant LD
  8. odds(case | heterozygote) / odds(case | homozygous for alternate allele)
  9. odds(case | homozygous for ref. allele) / odds(case | homozygous for alt. allele)

With --simulate-qt, in both subcases the last two fields are replaced with:

.tags.list (tagging variant report)

Produced by --show-tags, when used in 'all' mode or with the --list-all flag.

A text file with a header line, and then one line per target variant with the following eight fields:

SNP      Variant identifier
CHR      Chromosome code
BP       Base-pair coordinate
NTAG     Number of other variants tagging this
LEFT     Base-pair coordinate of earliest tag variant, including this
RIGHT    Base-pair coordinate of latest tag variant, including this
KBSPAN   (RIGHT - LEFT + 1) / 1000
TAGS     '|'-delimited list of IDs of other variants tagging this (or 'NONE')

.tdt (transmission disequilibrium test report)

Produced by --tdt (unless parent-of-origin analysis was requested).

A text file with a header line, and then one line per autosomal/chrX variant typically with the following 14-15 fields:

CHR         Chromosome code
SNP         Variant identifier
BP          Base-pair coordinate
A1          Allele 1 (usually minor)
A2          Allele 2 (usually major)
T           Transmitted A1 allele count
U           Untransmitted A1 allele count
OR          TDT odds ratio
CHISQ       TDT chi-square statistic. Not present with 'exact'/'exact-midp'.
P           Chi-square (default) or binomial test (if 'exact'/'exact-midp' specified) p-value
A:U_PAR     Parental affected A2 excess:unaffected A2 excess
CHISQ_PAR   Parental discordance chi-square statistic
P_PAR       Parental discordance chi-square p-value
CHISQ_COM   Combined test chi-square statistic
P_COM       Combined test chi-square p-value

The last five fields do not appear if no considered trio has parents with discordant phenotypes.

If --ci 0.xy has also been specified, the following two fields are inserted after 'OR':

Lxy   Bottom of xy% symmetric approx. confidence interval for TDT odds ratio
Uxy   Top of xy% approx. confidence interval for TDT odds ratio

.tdt.poo (parent-of-origin analysis)

A text file with a header line, and then one line per autosomal/chrX variant with the following 11 fields:

CHR         Chromosome code
SNP         Variant identifier
A1:A2       Allele 1 code:allele 2 code
T:U_PAT     Paternal A1:A2 transmission counts
CHISQ_PAT   Paternal chi-square statistic
P_PAT       Paternal chi-square p-value
T:U_MAT     Maternal A1:A2 transmission counts
CHISQ_MAT   Maternal chi-square statistic
P_MAT       Maternal chi-square p-value
Z_POO       Z score for paternal/maternal odds ratio difference
P_POO       Asymptotic parent-of-origin test p-value

.tfam (PLINK sample information file)

Sample information file accompanying a .tped file; identical format to .fam files.

.tped (PLINK transposed text genotype table)

Variant information + genotype call text file. Must be accompanied by a .tfam file. Loaded with --tfile, and produced by "--recode transpose".

Contains no header line, and one line per variant with 2N+4 fields where N is the number of samples. The first four fields are the same as those in a .map file. The fifth and sixth fields are allele calls for the first sample in the .tfam file ('0' = no call); the 7th and 8th are allele calls for the second sample; and so on.
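
Since a .tped file is the transpose of a .ped genotype table, turning it back into per-sample genotype lists is a simple regrouping. The Python sketch below shows that regrouping; the function name and return shape are hypothetical, and it assumes the 4-column .map-style prefix on every line.

```python
def tped_to_sample_major(tped_lines):
    """Return (variants, genotypes), where genotypes[s][v] is an allele pair."""
    variants = []
    genotypes = None  # one list of allele pairs per sample, created on the first line
    for line in tped_lines:
        fields = line.split()
        if not fields:
            continue
        chrom, variant_id, cm, bp = fields[:4]
        calls = fields[4:]
        if len(calls) % 2 != 0:
            raise ValueError("odd number of allele calls for " + variant_id)
        pairs = [(calls[i], calls[i + 1]) for i in range(0, len(calls), 2)]
        if genotypes is None:
            genotypes = [[] for _ in pairs]
        elif len(pairs) != len(genotypes):
            raise ValueError("sample count changed at " + variant_id)
        for sample_calls, pair in zip(genotypes, pairs):
            sample_calls.append(pair)
        variants.append((chrom, variant_id, float(cm), int(bp)))
    return variants, genotypes or []


variants, genotypes = tped_to_sample_major([
    "1 rs1 0 1000 A A A G",
    "1 rs2 0 2000 0 0 C C",
])
```

Each inner list of `genotypes` then corresponds to fields 7 onward of one .ped line, in .tfam sample order.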

.traw (variant-major additive component file)

Produced by "--recode A-transpose", for use with R. This format can only be loaded by PLINK 2.0.

A text file with a header line, and then one line per variant with the following N+6 fields (where N is the number of samples):

CHR           Chromosome code
SNP           Variant identifier
(C)M          Position in morgans or centimorgans
POS           Base-pair coordinate
COUNTED       Counted allele (defaults to A1)
ALT           Other allele(s), comma-separated
<FID>_<IID>   Allelic dosages (0/1/2/'NA' for diploid variants, 0/2/'NA' for haploid)

Since this format is new to PLINK 1.9, it is tab-delimited by default; use the 'spacex' modifier to force spaces.

.twolocus (4x4 joint genotype count table, single variant pair)

A text file with 1-3 sections, depending on whether cases and/or controls are present. The first section starts with two header lines:

This is followed by two tables. Each table has two header lines of its own:

then rows corresponding to A1/A1, A1/A2, A2/A2, and missing first variant genotypes, then a fifth row with (sub)totals. The first table contains raw counts, while the second table contains proportions of the grand total.

This is followed by a 'Cases' section if there is at least one case, and finally a 'Controls' section if there is at least one control.

.var.ranges (equal-size variant ranges)

A text file with a header line, and then one line per range with the following two fields:

FIRST   First variant ID
LAST    Last variant ID

.vcf (1000 Genomes Project text Variant Call Format)

Variant information + sample ID + genotype call text file. Loaded with --vcf, and produced by "--recode vcf" (or vcf-fid/vcf-iid). Do not use PLINK for general-purpose VCF handling: all information in VCF files which cannot be represented by the PLINK 1 binary format is ignored.

The VCFv4.2 files emitted by --recode normally start with 5+C header lines, where C is the number of chromosomes:

  1. ##fileformat=VCFv4.2
  2. ##fileDate=<yyyymmdd date>
  3. ##source=PLINKv1.90
  4-(C+3). ##contig=<ID=<chromosome code>,length=<last bp coordinate value + 1, or 2^31 - 3 if unknown>>
  C+4. ##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
  C+5. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

(The INFO line is omitted when --real-ref-alleles is specified.)

This is followed by a tab-delimited header line with the following N+9 fields (where N is the number of samples), and one tab-delimited line per variant with the same fields:

#CHROM       Chromosome code/name
POS          Base-pair coordinate
ID           Variant identifier
REF          Allele 2 code (missing = 'N')
ALT          Allele 1 code (missing = '.')
QUAL         Left blank ('.')
FILTER       Left blank ('.')
INFO         Normally 'PR'; '.' when --real-ref-alleles specified
FORMAT       'GT' (signaling the presence of genotype calls)
Sample IDs   Genotype calls ('/'-separated if diploid, 0=ref, 1=alt, '.'=missing)

Allele codes are supposed to either start with '<', only contain characters in the set {A, C, G, T, N, a, c, g, t, n}, or represent a breakend. --recode issues a warning if an allele code does not satisfy this restriction.


Broad Institute

This is draft release 1 for genome-wide SNP genotyping and targeted sequencing in DNA samples from a variety of human populations (sometimes referred to as the "HapMap 3" samples).

This release contains the following data:

  • SNP genotype data generated from 1115 samples, collected using two platforms: the Illumina Human1M (by the Wellcome Trust Sanger Institute) and the Affymetrix SNP 6.0 (by the Broad Institute). Data from the two platforms have been merged for this release.
  • PCR-based resequencing data (by Baylor College of Medicine Human Genome Sequencing Center) across ten 100-kb regions (collectively referred to as "ENCODE 3") in 712 samples.

Since this is a draft release, we ask you to check this site regularly for updates and new releases.

HapMap 3 Samples

The HapMap 3 sample collection comprises 1,301 samples (including the original 270 samples used in Phase I and II of the International HapMap Project) from 11 populations, listed below alphabetically by their 3-letter labels.

label population sample number of samples
ASW African ancestry in Southwest USA 90
CEU Utah residents with Northern and Western European ancestry from the CEPH collection 180
CHB Han Chinese in Beijing, China 90
CHD Chinese in Metropolitan Denver, Colorado 100
GIH Gujarati Indians in Houston, Texas 100
JPT Japanese in Tokyo, Japan 91
LWK Luhya in Webuye, Kenya 100
MEX Mexican ancestry in Los Angeles, California 90
MKK Maasai in Kinyawa, Kenya 180
TSI Toscans in Italy 100
YRI Yoruba in Ibadan, Nigeria 180

ENCODE 3 Regions

Five of the ten ENCODE 3 regions overlap with the HapMap-ENCODE regions; the other five are regions selected at random from the ENCODE target regions (excluding the 10 HapMap-ENCODE regions). All ENCODE 3 regions are 100 kb in size, and are centered within each respective ENCODE region.

region chromosome coordinates (NCBI build 36) status
ENm010 7 27,124,046-27,224,045 HapMap-ENCODE
ENr321 8 119,082,221-119,182,220 HapMap-ENCODE
ENr232 9 130,925,123-131,025,122 HapMap-ENCODE
ENr123 12 38,826,477-38,926,476 HapMap-ENCODE
ENr213 18 23,919,232-24,019,231 HapMap-ENCODE
ENr331 2 220,185,590-220,285,589 New
ENr221 5 56,071,007-56,171,006 New
ENr233 15 41,720,089-41,820,088 New
ENr313 16 61,033,950-61,133,949 New
ENr133 21 39,444,467-39,544,466 New

Data Content Of This Release

label number of samples number of QC+ SNPs number of polymorphic QC+ SNPs
ASW 71 1632186 1536247
CEU 162 1634020 1403896
CHB 82 1637672 1311113
CHD 70 1619203 1270600
GIH 83 1631060 1391578
JPT 82 1637610 1272736
LWK 83 1631688 1507520
MEX 71 1614892 1430334
MKK 171 1621427 1525239
TSI 77 1629957 1393925
YRI 163 1634666 1484416
consensus 1115 1525445 1490422

Samples with PCR resequencing data (ENCODE 3 regions):

label number of samples
ASW 55
CEU 119
CHB 90
CHD 30
GIH 60
JPT 91
LWK 60
MEX 27
MKK 0
TSI 60
YRI 120
total 712

Quality Control For This Release

Genotyping concordance between the two platforms was 0.9931 (computed over 249889 overlapping SNPs). Data from the two platforms was merged using PLINK (--merge-mode 1), keeping only genotype calls if there is consensus between non-missing genotype calls (that is, merged genotype is set to missing if the two platforms give different, non-missing calls).
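
The consensus rule described above (keep a call when the platforms agree or when only one platform has a call; set it to missing when two non-missing calls differ) can be sketched in Python. This is an illustration of the stated rule, not PLINK's code: the function names are hypothetical, genotypes are modeled as allele pairs with '0' for missing, and treating A/C and C/A as the same genotype is an assumption.

```python
def normalize(call):
    """Return an order-independent genotype tuple, or None for a missing call."""
    if call is None or "0" in call:
        return None
    return tuple(sorted(call))


def merge_consensus(call_a, call_b):
    """Merge two platform calls; conflicting non-missing calls become missing."""
    a, b = normalize(call_a), normalize(call_b)
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else None


def concordance(calls_a, calls_b):
    """Fraction of identical calls among pairs where both platforms have a call."""
    pairs = [(normalize(x), normalize(y)) for x, y in zip(calls_a, calls_b)]
    pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    if not pairs:
        return float("nan")
    return sum(x == y for x, y in pairs) / len(pairs)
```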

Quality control at the individual level was performed separately by the two sites. Only individuals with genotype data on both platforms were kept in this release. The following criteria were used to keep SNPs in the QC+ data sets:

  • Hardy-Weinberg p>0.000001 (per population)
  • missingness <0.05 (per population)
  • <3 Mendel errors (per population; only applies to YRI, CEU, ASW, MEX, MKK)
  • SNP must have a rsID and map to a unique genomic location
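As a rough sketch, the four criteria above could be applied to per-population SNP summaries as follows. The `SnpStats` record and its field names are hypothetical, and testing for an rsID with a simple "rs" prefix is an assumption of this example, not something the release notes specify.

```python
from dataclasses import dataclass

# Populations to which the Mendel-error criterion applies, per the list above.
TRIO_POPULATIONS = {"YRI", "CEU", "ASW", "MEX", "MKK"}


@dataclass
class SnpStats:
    """Hypothetical per-population summary for one SNP."""
    snp_id: str           # e.g. "rs123"
    hwe_p: float          # Hardy-Weinberg p-value
    missing_rate: float   # fraction of missing calls
    mendel_errors: int    # number of Mendel errors
    n_map_locations: int  # number of genomic locations the SNP maps to


def passes_qc(stats, population):
    """Return True if the SNP meets all QC+ criteria in this population."""
    if not stats.snp_id.startswith("rs"):  # assumed rsID test
        return False
    if stats.n_map_locations != 1:         # must map to a unique location
        return False
    if stats.hwe_p <= 1e-6:                # Hardy-Weinberg p > 0.000001
        return False
    if stats.missing_rate >= 0.05:         # missingness < 0.05
        return False
    if population in TRIO_POPULATIONS and stats.mendel_errors >= 3:
        return False                       # fewer than 3 Mendel errors
    return True
```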

The "consensus" data set contains data for 1115 individuals (558 males, 557 females; 924 founders and 191 non-founders), keeping only SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus|polymorphic" data set has 35023 monomorphic SNPs (across the entire data set) removed.

In all genotype files, alleles are expressed as being on the (+/fwd) strand of NCBI build 36.

The sequence-based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the ENCODE 3 regions. After low-quality reads were filtered out, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. We also filtered out SNPs with low completeness in samples, or with too many conflicting genotype calls on the two strands.

In the QC+ data set, we filtered out samples with low completeness, and filtered out SNPs with low call rate in each population (<80%) and not in HWE (p<0.001). In the QC+ data set, the overall false positive rate is 3.2%, based on a limited number of validation assays.

Caveats In This Release

  • Missing from this release are Illumina SNPs that are A/T or C/G due to strandedness issues.
  • Missing from this release are Illumina SNPs that are mitochondrial (as they do not have rsIDs).
  • There may be a few remaining SNPs (Illumina) in this release that are still on the (-/rev) strand of NCBI build 36, but they are not A/T or C/G SNPs, so they are easy to identify downstream.

Not all variant calls have been validated yet: we estimate that there is currently a false positive rate of 12% among all calls, with a slightly higher rate (14%) if considering just the singletons. Additional validation is ongoing. PCR sequencing of additional samples (MKK) is also ongoing.

How To Download This Release

    - tarball of QC+ polymorphic genotype data per population, formatted as PLINK PED and MAP files [833 MB]
    - PED file of QC+ polymorphic genotype data (consensus) [738 MB]
    - MAP file of QC+ polymorphic genotype data (consensus) [11 MB]
    - family (pedigree) relationships and population labels for 1,301 HapMap 3 samples [37 KB]
    - list of the 270 samples used in Phase I and II of the International HapMap Project [2 KB]

To access the ENCODE III PCR resequencing data, please visit the BCM-HGSC public ftp site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Encode or download here:

    - README file [3 KB]
    - list of 712 unrelated samples sequenced [61 KB]
    - genotypes of 10,076 SNP sites by 712 samples [641 KB]
    - QC+ genotypes of 6,223 SNP sites by 692 samples [9 MB]

Analysis Plans

Listed below are the analysis plans that we are currently pursuing:

  • SNP allele frequency estimation
  • Population differentiation
  • Linkage disequilibrium analysis
  • SNP tagging
  • Imputation efficiency
  • Genomic locations of human CNVs
  • Genotypes for CNVs
  • Population genetic properties of CNVs (allele frequencies, population differentiation, etc.)
  • Mutation rate (frequency of de novo CNV) and potential mutational mechanisms
  • Linkage disequilibrium properties of CNVs
  • Tagging and imputation of CNVs
  • Signals of selection around CNVs
  • Association of SNPs and CNVs with expression phenotypes

Data Release Policy

The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the "Fort Lauderdale" meeting. An NHGRI policy statement based on the outcome of the meeting is on the NHGRI web site (http://www.genome.gov/10506537).

The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of "community resource projects", with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. "Responsible use" was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.


Data management

In contrast, the fileset left behind by --keep-autoconv is just the result of step 1.

--make-just-bim is a variant of --make-bed which only generates a .bim file, and --make-just-fam plays the same role for .fam files. Unlike most other PLINK commands, these do not require the main input to include a .bed file (though you won't have access to many filtering flags when using these in no-.bed mode).

Use these cautiously. It is very easy to desynchronize your binary genotype data and your .bim/.fam indexes if you use these commands improperly. If you have any doubt, stick with --make-bed.

Generate text fileset

--recode creates a new text fileset, after applying sample/variant filters and other operations. By default, the fileset includes a .ped and a .map file, readable with --file.

  • The '12' modifier causes A1 (usually minor) alleles to be coded as '1' and A2 alleles to be coded as '2', while '01' maps A1→0 and A2→1. (PLINK forces you to combine '01' with --[output-]missing-genotype when this is necessary to prevent missing genotypes from becoming indistinguishable from A1 calls.)
  • The '23' modifier causes a 23andMe-formatted file to be generated. This can only be used on a single sample's data (a one-line --keep file may come in handy here). There is currently no special handling of the XY pseudo-autosomal region.
  • The 'AD' modifier causes an additive (0/1/2) + dominant (het = 1, otherwise 0) component file, suitable for loading from R, to be generated. 'A' is the same, except without the dominance component.
    • By default, A1 alleles are counted; this can be customized with --recode-allele. --recode-allele's input file should have variant IDs in the first column and allele IDs in the second.
    • By default, the header line for .raw files only names the counted alleles. To include the alternate allele codes as well, add the 'include-alt' modifier.
    • Haploid additive components are 0/2-valued instead of 0/1-valued, to maintain a consistent scale on the X chromosome.
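To make the 'A'/'AD' coding concrete, here is a minimal Python sketch of the per-genotype values described in the list above. It illustrates the coding scheme only and does not reproduce PLINK's .raw writer; the function names and the use of '0' as the missing allele code are assumptions of the example.

```python
def additive_dominant(genotype, counted_allele, missing="0"):
    """Return (additive, dominance) for one diploid genotype.

    genotype is a pair of allele codes, e.g. ("A", "G").
    additive  = number of copies of the counted allele (0, 1 or 2)
    dominance = 1 for a heterozygote, otherwise 0
    A missing genotype yields (None, None).
    """
    a, b = genotype
    if a == missing or b == missing:
        return None, None
    additive = int(a == counted_allele) + int(b == counted_allele)
    dominance = 1 if a != b else 0
    return additive, dominance


def haploid_additive(allele, counted_allele):
    """Haploid calls use a 0/2 scale so they stay comparable with diploid ones."""
    return 2 if allele == counted_allele else 0
```

Dropping the second element of the returned pair gives the plain 'A' coding.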

    plink --bfile binary_fileset --recode --out new_text_fileset

    generates new_text_fileset.ped and new_text_fileset.map from the data in binary_fileset.bed + .bim + .fam, while

    plink --bfile binary_fileset --recode vcf-iid --out new_vcf

    generates new_vcf.vcf from the same data, removing family IDs in the process.

    Irregular output coding

    Normally, autosomal/sex/mitochondrial chromosome codes in PLINK output files are numeric, e.g. '23' for human X. --output-chr lets you specify a different coding scheme by providing the desired human mitochondrial code; supported options are '26' (default), 'M', 'MT', '0M', 'chr26', 'chrM', and 'chrMT'. (PLINK 1.9 correctly interprets all of these encodings in input files.)

    --output-missing-genotype <char>
    --output-missing-phenotype <string>

    --output-missing-genotype allows you to change the character (normally the --missing-genotype value) used to represent missing genotypes in PLINK output files, while --output-missing-phenotype changes the string (normally the --missing-phenotype value) representing missing phenotypes.

    Note that these flags do not affect --[b]merge/--merge-list or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-bed if you want to change missing genotype/phenotype coding when performing those operations.

    Set blocks of genotype calls to missing

    If clusters have been defined, --zero-cluster takes a file with variant IDs in the first column and cluster IDs in the second, and sets all the corresponding genotype calls to missing. See the PLINK 1.07 documentation for an example.

    This flag must now be used with --make-bed and no other output commands (since PLINK no longer keeps the entire genotype matrix in memory).

    Heterozygous haploid errors

    Normally, heterozygous haploid and nonmale Y chromosome genotype calls are logged to plink.hh and treated as missing by all analysis commands, but left undisturbed by --make-bed and --recode (since, once gender and/or chromosome code errors have been fixed, the calls are often valid). If you actually want --make-bed/--recode to erase this information, use --set-hh-missing. (The scope of this flag is a bit wider than for PLINK 1.07, since commands like --list and --recode-rlist which previously did not respect --set-hh-missing have been consolidated under --recode.)

    Note that the most common source of heterozygous haploid errors is imported data which doesn't follow PLINK's convention for representing the X chromosome pseudo-autosomal region. This should be addressed with --split-x below, not --set-hh-missing.

    Mitochondrial DNA is subject to heteroplasmy, so PLINK 1.9 permits 'heterozygous' genotypes and treats MT more like a diploid than a haploid chromosome. However, some analytical methods don't use mixed MT genotype calls, and instead assume that no 'heterozygous' MT calls exist. The --set-mixed-mt-missing flag can be used with --make-bed/--recode to export a dataset with mixed MT calls erased.
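The kind of call that gets logged can be illustrated with a small Python check. This is a simplified sketch under stated assumptions, not PLINK's logic: it uses the human numeric codes '23' for X and '24' for Y, sex code 1 for male, and '0' as the missing allele code, and it ignores the pseudo-autosomal region and MT entirely.

```python
def is_hh_or_nonmale_y(chrom, sex, genotype, missing="0"):
    """Return True for a heterozygous haploid or nonmale Y genotype call.

    chrom    : chromosome code as a string ('23' = X, '24' = Y; assumed)
    sex      : 1 = male, anything else = nonmale (female or unknown)
    genotype : pair of allele codes, e.g. ("A", "G")
    """
    a, b = genotype
    if a == missing or b == missing:
        return False               # no call, so nothing to flag
    heterozygous = a != b
    if chrom == "23":
        return sex == 1 and heterozygous   # male X is haploid
    if chrom == "24":
        return sex != 1 or heterozygous    # any nonmale Y call, or a het Y call
    return False
```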

    X chromosome pseudo-autosomal region

    --split-x <last bp position of head> <first bp position of tail> ['no-fail']
    --split-x <build code> ['no-fail']
    --merge-x ['no-fail']

    PLINK prefers to represent the X chromosome's pseudo-autosomal region as a separate 'XY' chromosome (numeric code 25 in humans); this removes the need for special handling of male X heterozygous calls. However, this convention has not been widely adopted, and as a consequence, heterozygous haploid 'errors' are commonplace when PLINK 1.07 is used to handle X chromosome data. The new --split-x and --merge-x flags address this problem.

    Given a dataset with no preexisting XY region, --split-x takes the base-pair position boundaries of the pseudo-autosomal region, and changes the chromosome codes of all variants in the region to XY. As (typo-resistant) shorthand, you can use one of the following build codes:

    • 'b36'/'hg18': NCBI build 36/UCSC human genome 18, boundaries 2709521 and 154584237
    • 'b37'/'hg19': GRCh37/UCSC human genome 19, boundaries 2699520 and 154931044
    • 'b38'/'hg38': GRCh38/UCSC human genome 38, boundaries 2781479 and 155701383

    By default, PLINK errors out if no variants would be affected by the split. This behavior may break data conversion scripts which are intended to work on e.g. VCF files regardless of whether or not they contain pseudo-autosomal region data; use the 'no-fail' modifier to force PLINK to always proceed in this case.

    Conversely, in preparation for data export, --merge-x changes chromosome codes of all XY variants back to X (and 'no-fail' has the same effect). Both of these flags must be used with --make-bed and no other output commands.
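The recoding performed by --split-x and --merge-x can be sketched in Python as follows. The boundary values are the ones listed above; everything else (function names, representing variants as (chromosome, position) pairs, treating both boundaries as inclusive, and raising ValueError for the no-op case) is an assumption of this example and not PLINK's implementation.

```python
# (last bp position of head, first bp position of tail) per build code
PAR_BOUNDARIES = {
    "b36": (2709521, 154584237), "hg18": (2709521, 154584237),
    "b37": (2699520, 154931044), "hg19": (2699520, 154931044),
    "b38": (2781479, 155701383), "hg38": (2781479, 155701383),
}
X_CODE, XY_CODE = "23", "25"  # human numeric codes for X and XY


def split_x(variants, build, no_fail=False):
    """Move X variants inside the pseudo-autosomal region to the XY code."""
    head_last, tail_first = PAR_BOUNDARIES[build]
    result, changed = [], 0
    for chrom, pos in variants:
        if chrom == X_CODE and (pos <= head_last or pos >= tail_first):
            chrom = XY_CODE
            changed += 1
        result.append((chrom, pos))
    if changed == 0 and not no_fail:
        raise ValueError("no variants would be affected by the split")
    return result


def merge_x(variants):
    """Change the chromosome code of all XY variants back to X."""
    return [(X_CODE if chrom == XY_CODE else chrom, pos)
            for chrom, pos in variants]
```

Running `merge_x` on the output of `split_x` returns the original chromosome codes, which mirrors the "in preparation for data export" use described above.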

    Mendel errors

    In combination with --make-bed, --set-me-missing scans the dataset for Mendel errors and sets implicated genotypes (as defined in the --mendel table) to missing.

      causes samples with only one parent in the dataset to be checked, while --mendel-multigen causes (great-)^n grandparental data to be referenced when a parental genotype is missing.
  • It is no longer necessary to combine this with e.g. "--me 1 1 " to prevent the Mendel error scan from being skipped.
  • Results may differ slightly from PLINK 1.07 when overlapping trios are present, since genotypes are no longer set to missing before scanning is complete.

    Fill in missing calls

    It can be useful to fill in all missing calls in a dataset, e.g. in preparation for using an algorithm which cannot handle them, or as a 'decompression' step when all variants not included in a fileset can be assumed to be homozygous reference matches and there are no explicit missing calls that still need to be preserved.

    For the first scenario, a sophisticated imputation program such as BEAGLE or IMPUTE2 should normally be used, and --fill-missing-a2 would be an information-destroying operation bordering on malpractice. However, sometimes the accuracy of the filled-in calls isn't important for whatever reason, or you're dealing with the second scenario. In those cases you can use the --fill-missing-a2 flag (in combination with --make-bed and no other output commands) to simply replace all missing calls with homozygous A2 calls. When used in combination with --zero-cluster/--set-hh-missing/--set-me-missing, this always acts last.

    You may want to combine this with --a2-allele below.
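What this replacement does to a single sample's calls can be illustrated in Python. This is a toy sketch of the operation as described, working on (allele, allele) pairs with '0' assumed as the missing code; it is not how PLINK operates on binary filesets.

```python
def fill_missing_a2(calls, a2_alleles, missing="0"):
    """Replace every missing call with a homozygous A2 call.

    calls      : list of (allele1, allele2) pairs, one per variant
    a2_alleles : list of the A2 allele for each variant, same order
    """
    filled = []
    for (first, second), a2 in zip(calls, a2_alleles):
        if first == missing and second == missing:
            filled.append((a2, a2))          # missing -> homozygous A2
        else:
            filled.append((first, second))   # existing calls are untouched
    return filled
```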

    Update variant information

    Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.

    --set-missing-var-ids provides one way to do this. The parameter taken by this flag is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one @ and one # must be present.) For example, given a .bim file starting with

    chr1 . 0 10583 A G
    chr1 . 0 886817 C T
    chr1 . 0 886817 CATTTT C
    chrMT . 0 64 T C

    "--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]' and the second variant 'chr1:886817[b37]', and then error out when naming the third variant, since it would be given the same name as the second variant. (Note that this position overlap is actually present in 1000 Genomes Project phase 1 data.)

    To maintain unique IDs in this situation, you can include '$1' and '$2' in your template string as well; these refer to the first and second allele names in ASCII-sort order. So, if we're using a bash shell, we can try again with

    which would name the first variant 'chr1:10583[b37]A,G', the second variant 'chr1:886817[b37]C,T', the third variant 'chr1:886817[b37]C,CATTTT', and the fourth variant 'chrMT:64[b37]C,T'. Note the extra backslashes: they are necessary in bash because '
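The template expansion itself is easy to mimic in Python, which may help when checking what names a given template would produce. The sketch below is an illustration rather than PLINK's code: the function name is made up, '.' is assumed to mark a missing ID as in the example .bim rows, and duplicate detection is assumed to cover every ID in the file.

```python
BIM_ROWS = [  # the example .bim rows: chrom, id, cM, bp, allele 1, allele 2
    ("chr1", ".", 0, 10583, "A", "G"),
    ("chr1", ".", 0, 886817, "C", "T"),
    ("chr1", ".", 0, 886817, "CATTTT", "C"),
    ("chrMT", ".", 0, 64, "T", "C"),
]


def name_variants(bim_rows, template, missing_id="."):
    """Fill in missing variant IDs from a template containing '@' and '#'.

    '$1' and '$2', if present, are replaced by the two allele names
    in ASCII-sort order. Raises ValueError on a duplicate ID.
    """
    seen, names = set(), []
    for chrom, var_id, _cm, bp, allele_a, allele_b in bim_rows:
        if var_id == missing_id:
            first, second = sorted([allele_a, allele_b])
            var_id = (template.replace("@", chrom)
                              .replace("#", str(bp))
                              .replace("$1", first)
                              .replace("$2", second))
        if var_id in seen:
            raise ValueError("duplicate variant ID: " + var_id)
        seen.add(var_id)
        names.append(var_id)
    return names
```

Calling `name_variants(BIM_ROWS, "@:#[b37]")` raises on the third row, while the template `"@:#[b37]$1,$2"` yields four distinct names. No backslashes are needed here because '$' has no special meaning inside a Python string.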

    How-to: Edit GWAS and Genomic Prediction File Formats

    Among the many tools used for GWAS and Genomic Prediction, there seem to be equally many file formats to navigate. Running several different algorithms then requires switching between file formats, which can be a tedious and time-consuming task. For both my future sanity as well as yours, I've provided a brief overview of some common formats and recommended routes for file conversion.

    To start, I highly recommend that all genotype files be originally generated in VCF format. If you can do so following the GATK best practices for variant calling, all the better. I recommend this because VCF files are generally the most descriptive of all the genotype formats, and conversion from VCF to other formats results in a loss of information.

    Once you have a VCF file, genotype file conversion is relatively simple. You can use Tassel which provides options to convert among several genotype file formats:

    VCF | HapMap | HapMap Diploid | Plink | HDF5 | Phylip (interleaved or sequential) | Table

    This may be done using the Tassel GUI (graphical user interface) or on command line.

    Alternatively, you can use vcftools to convert a VCF or BCF (binary VCF) to the following formats:

    BEAGLE | IMPUTE | LDhat | Plink | Plink TPED (transposed) | Numeric Matrix

    Conversion among genotype files is typically relatively simple but can be more complicated when converting between phased or unphased genotypes or when files with multi-allelic sites need to be converted to contain only bi-allelic sites.

    As for phenotype formats, the main issue is typically determining what format you need for your program of choice and then figuring out how best to convert it.

    Plink format for phenotypes has the following tab-delimited structure:

    In this example, I don't have FIDs so the IID is just repeated for both.

    FID IID Phenotype1 . PhenotypeN
    Indiv1 Indiv1 0 . 0
    Indiv2 Indiv2 3 . 2

    This is slightly different from the Tassel tab-delimited format for phenotypes:

    <Trait> Phenotype1 . PhenotypeN
    Indiv1 0 . 0
    Indiv2 3 . 2

    Tassel will allow you to save a phenotype file into either Tassel or Plink format.

    If you're running GAPIT, then the phenotype file is read into an R dataframe. That means that the delimiter used is less important since files can be read with "read.csv" or "read.table". However, you will generally want the file to look like this:

    Taxa Phenotype1 . PhenotypeN
    Indiv1 0 . 0
    Indiv2 3 . 2
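Since the Plink and GAPIT phenotype layouts shown above differ only in the FID column and the header, converting between them is a small text transformation. Here is a minimal Python sketch, assuming a tab-delimited Plink phenotype file with a header row; the function name is hypothetical.

```python
def plink_pheno_to_gapit(plink_text):
    """Convert Plink phenotype text (FID, IID, traits...) to GAPIT layout.

    The FID column is dropped and the IID header is renamed to 'Taxa'.
    """
    rows = [line.split("\t") for line in plink_text.strip().splitlines()]
    header, records = rows[0], rows[1:]
    converted = [["Taxa"] + header[2:]]      # keep only the trait names
    for fields in records:
        converted.append(fields[1:])         # drop FID, keep IID + traits
    return "\n".join("\t".join(row) for row in converted) + "\n"
```

For the Tassel layout, the same function would work with "<Trait>" in place of "Taxa" as the first header cell.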

    And, when you read the file, you want to designate that there are headers for the phenotype file ("head=TRUE"). Don't do this for the genotype file, though (in HapMap format). Even though there is a header, it should be read as "head=FALSE" so that the header occurs on the first row.

    It's also worth noting that while the first 11 HapMap columns are required for GAPIT, only three of them are used ("rs" a.k.a. SNP name, "chrom" and "pos"). So, the other eight columns may be filled with "NA".

    Now, let's assume instead that you want to run Plink or GEMMA. GEMMA can use Plink file formats, so let's use that common format. Conversion from VCF to Plink files is easily achieved using Tassel or vcftools.

    To begin, Plink requires both a PED (pedigree) and MAP (genetic map) file. Plink PED file format requires all markers be biallelic and the file look like so (header included here for clarity -- not in actual PED file):

    FamilyID IndividualID PaternalID MaternalID Sex Phenotype Snp1 . SnpN
    Indiv1 Indiv1 0 0 0 0 A . G
    Indiv2 Indiv2 0 0 0 0 A . G

    And the Plink MAP file looks like (again, header included for clarity only):

    chromosome SnpID GeneticDistance BasePairPosition
    1 S01_3918 0 3918
    1 S01_12215 0 12215
    1 S01_23664 0 23664
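For a quick sanity check of converted files, the two layouts can be parsed with a few lines of Python. This sketch assumes the standard PED layout of six leading columns followed by two allele columns per marker, and a four-column MAP file; the function and key names are made up for the example.

```python
def parse_map_line(line):
    """Parse one MAP line: chromosome, SNP ID, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, bp = line.split()
    return {"chrom": chrom, "id": snp_id,
            "cm": float(genetic_distance), "bp": int(bp)}


def parse_ped_line(line, n_snps):
    """Parse one PED line into its six leading fields and per-SNP genotypes."""
    fields = line.split()
    leading, alleles = fields[:6], fields[6:]
    if len(alleles) != 2 * n_snps:
        raise ValueError("expected two alleles per SNP")
    keys = ["fid", "iid", "father", "mother", "sex", "phenotype"]
    record = dict(zip(keys, leading))
    # Pair up consecutive allele columns: (SNP1 a1, SNP1 a2), (SNP2 a1, ...)
    record["genotypes"] = list(zip(alleles[0::2], alleles[1::2]))
    return record
```

Checking that every PED row yields exactly as many genotypes as the MAP file has lines is a cheap way to catch a desynchronized fileset.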

    The BED, BIM and FAM files are referred to as the Plink binary fileset. These are what you want to run GEMMA and Plink.

    The last style of genotype matrix typically seen is the numeric genotype matrix. Depending on the software, you may want this matrix in 0, 1, 2 format or in -1, 0, 1. This matrix format is useful in genomic prediction as a design matrix or during the calculation of a genomic relationship matrix.

    Conversion to 012 is easily accomplished using vcftools.

    Then, conversion to a -101 format simply requires that the matrix be read into either R or python and one be subtracted from the matrix. For example:

    Do note, however, that different programs require missing data in particular formats. Here, the 012 matrix has missing as -1 and the -101 has -2. So, change for NAs as necessary.
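The subtraction and the missing-data caveat can be sketched in plain Python as follows. This is a minimal illustration on nested lists (in practice a NumPy array or R matrix would be more typical); the function names are made up, and using None for missing values is just one possible choice.

```python
def recode_012_to_minus101(matrix):
    """Subtract one from every entry of a 0/1/2 genotype matrix.

    A missing value coded as -1 in the input becomes -2 in the output.
    """
    return [[genotype - 1 for genotype in row] for row in matrix]


def mask_missing(matrix, missing_code):
    """Replace the given missing code with None (stand-in for NA)."""
    return [[None if genotype == missing_code else genotype
             for genotype in row] for row in matrix]


genotypes_012 = [[0, 1, 2], [-1, 2, 0]]       # -1 marks a missing call
genotypes_m101 = recode_012_to_minus101(genotypes_012)
genotypes_clean = mask_missing(genotypes_m101, missing_code=-2)
```

Masking after the subtraction uses -2 as the missing code; masking before it would use -1 instead, in which case the subtraction has to skip None entries.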

    That should cover the majority of file conversions needed for performing GWAS or Genomic Prediction. Good luck!

