Last data update: 2014.03.03

R: Principal Component Analysis
pcaR Documentation

Principal Component Analysis

Description

The function pca performs a Principal Component Analysis of a genotypic matrix using the lfmm, geno, ancestrymap, ped or vcf format. The function computes eigenvalue, eigenvector, and standard deviation for each principal component and the projection of each individual on each component. The function pca returns an object of class "pcaProject" containing the output data and the input parameters.

Usage

pca (input.file, K, center = TRUE, scale = FALSE)

Arguments

input.file

A character string containg the path to the genotype input file, a genotypic matrix in the lfmm format.

K

An integer corresponding to the number of principal components calculated. By default, all principal components are calculated.

center

A boolean option. If true, the data matrix is centered (default: TRUE).

scale

A boolean option. If true, the data matrix is centered and scaled (default: FALSE).

Value

pca returns an object of class pcaProject containing the following components:

eigenvalues

The vector of eigenvalues.

eigenvectors

The matrix of eigenvectors (one column for each eigenvector).

sdev

The vector of standard deviations.

projections

The matrix of projections (one column for each projection).

The following methods can be applied to the object of class pcaProject returned by pca:

plot

Plot the eigenvalues.

show

Display information about the analysis.

summary

Summarize the analysis.

tracy.widom

Perform Tracy-Widom tests on the eigenvalues.

load.pcaProject(file.pcaProject)

Load the file containing a pcaProject object and return the pcaProject object.

remove.pcaProject(file.pcaProject)

Erase a pcaProject object. Caution: All the files associated with the object will be removed.

export.pcaProject(file.pcaProject)

Create a zip file containing the full pcaProject object. It allows to move the project to a new directory or a new computer (using import). If you want to overwrite an existing export, use the option force == TRUE.

import.pcaProject(file.pcaProject)

Import and load an pcaProject object from a zip file (made with the export function) into the chosen directory. If you want to overwrite an existing project, use the option force == TRUE.

Author(s)

Eric Frichot

See Also

lfmm.data snmf lfmm tutorial

Examples

# Creation of the genotype file "genotypes.lfmm"
# with 1000 SNPs for 165 individuals.
data("tutorial")
write.lfmm(tutorial.R,"genotypes.lfmm")

#################
# Perform a PCA #
#################

# run of PCA
# Available options, K (the number of PCs calculated), 
#                    center and scale. 
# Creation of   genotypes.pcaProject - the pcaProject object.
#               a directory genotypes.pca containing:
# Create files: genotypes.eigenvalues - eigenvalues,    
#               genotypes.eigenvectors - eigenvectors,
#               genotypes.sdev - standard deviations,
#               genotypes.projections - projections,
# Create a pcaProject object: pc.
pc = pca("genotypes.lfmm", scale = TRUE)

#######################
# Display Information #
#######################

# Display information about the analysis.
show(pc)

# Summarize the analysis.
summary(pc)

#####################
# Graphical outputs #
#####################

par(mfrow=c(2,2))

# Plot eigenvalues.
plot(pc, lwd=5, col="red",xlab=("PCs"),ylab="eigen")

# PC1-PC2 plot.
plot(pc$projections)
# PC3-PC4 plot.
plot(pc$projections[,3:4])

# Plot standard deviations.
plot(pc$sdev)

#############################
# Perform Tracy-Widom tests #
#############################

# Perfom Tracy-Widom tests on all eigenvalues.
# Create file: genotypes.tracyWidom - tracy-widom test information, 
#          in the directory genotypes.pca/.
tw = tracy.widom(pc)

# Plot the percentage of variance explained by each component.
plot(tw$percentage)

# Display the p-values for the Tracy-Widom tests. 
tw$pvalues

##########################
# Manage an pca project #
##########################

# All the file of pca for a given file are 
# automatically saved into a pca project directory and a file.
# The name of the pcaProject file is the same name as 
# the name of the input file with a .pcaProject extension 
# ("genotypes.pcaProject").
# The name of the pcaProject directory is the same name as
# the name of the input file with a .pca extension ("genotypes.pca/")
# There is only one pca Project for each input file including all the runs.

# An pcaProject can be load in a different session.
project = load.pcaProject("genotypes.pcaProject") 

# An pcaProject can be exported to be imported in another directory
# or in another computer
export.pcaProject("genotypes.pcaProject")

dir.create("test", showWarnings = TRUE)
#import
newProject = import.pcaProject("genotypes_pcaProject.zip", "test")
# remove
remove.pcaProject("test/genotypes.pcaProject")


# An pcaProject can be erased.
# Caution: All the files associated with the project will be removed.
remove.pcaProject("genotypes.pcaProject")

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(LEA)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/LEA/main_pca.Rd_%03d_medium.png", width=480, height=480)
> ### Name: pca
> ### Title: Principal Component Analysis
> ### Aliases: pca eigenvalues eigenvectors projections sdev load.pcaProject
> ###   eigenvalues,pcaProject-method eigenvectors,pcaProject-method
> ###   import.pcaProject import.pcaProject,character-method
> ###   export.pcaProject export.pcaProject,character-method
> ###   projections,pcaProject-method sdev,pcaProject-method
> ###   load.pcaProject,character-method tracy.widom,pcaProject-method
> ###   $,pcaProject-method remove.pcaProject
> ###   remove.pcaProject,character-method plot,pcaProject-method
> ###   show,pcaProject-method summary,pcaProject-method
> ### Keywords: pca tutorial
> 
> ### ** Examples
> 
> # Creation of the genotype file "genotypes.lfmm"
> # with 1000 SNPs for 165 individuals.
> data("tutorial")
> write.lfmm(tutorial.R,"genotypes.lfmm")
[1] "genotypes.lfmm"
> 
> #################
> # Perform a PCA #
> #################
> 
> # run of PCA
> # Available options, K (the number of PCs calculated), 
> #                    center and scale. 
> # Creation of   genotypes.pcaProject - the pcaProject object.
> #               a directory genotypes.pca containing:
> # Create files: genotypes.eigenvalues - eigenvalues,    
> #               genotypes.eigenvectors - eigenvectors,
> #               genotypes.sdev - standard deviations,
> #               genotypes.projections - projections,
> # Create a pcaProject object: pc.
> pc = pca("genotypes.lfmm", scale = TRUE)
[1] "******************************"
[1] " Principal Component Analysis "
[1] "******************************"
summary of the options:

        -n (number of individuals)          50
        -L (number of loci)                 400
        -K (number of principal components) 50
        -x (genotype file)                  /home/ddbj/DataUpdator-rgm3/target/genotypes.lfmm
        -a (eigenvalue file)                /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.eigenvalues
        -e (eigenvector file)               /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.eigenvectors
        -d (standard deviation file)        /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.sdev
        -p (projection file)                /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.projections
        -s data centered and scaled 

> 
> #######################
> # Display Information #
> #######################
> 
> # Display information about the analysis.
> show(pc)
* pca class *

project directory:               /home/ddbj/DataUpdator-rgm3/target/ 
pca result directory:            genotypes.pca/ 
input file:                      genotypes.lfmm 
eigenvalue file:                 genotypes.eigenvalues 
eigenvector file:                genotypes.eigenvectors 
standard deviation file:         genotypes.sdev 
projection file:                 genotypes.projections 
pcaProject file:                   genotypes.pcaProject 
number of individuals:           50 
number of loci:                  400 
number of principal components:  50 
centered:                        TRUE 
scaled:                          TRUE 
> 
> # Summarize the analysis.
> summary(pc)
Importance of components:
                             PC1        PC2        PC3        PC4        PC5
Standard deviation     6.4135800 5.78728000 4.15808000 3.69458000 3.47388000
Proportion of Variance 0.1049337 0.08544031 0.04410613 0.03482123 0.03078531
Cumulative Proportion  0.1049337 0.19037400 0.23448013 0.26930135 0.30008666
                              PC6        PC7        PC8        PC9      PC10
Standard deviation     3.31251000 3.23157000 3.18106000 3.13701000 3.0478600
Proportion of Variance 0.02799169 0.02664041 0.02581408 0.02510419 0.0236976
Cumulative Proportion  0.32807835 0.35471876 0.38053284 0.40563703 0.4293346
                             PC11       PC12       PC13      PC14       PC15
Standard deviation     3.03857000 2.94752000 2.92203000 2.8995500 2.86908000
Proportion of Variance 0.02355332 0.02216296 0.02178128 0.0214474 0.02099903
Cumulative Proportion  0.45288795 0.47505092 0.49683219 0.5182796 0.53927863
                             PC16       PC17       PC18       PC19       PC20
Standard deviation     2.85347000 2.77617000 2.73353000 2.66595000 2.65867000
Proportion of Variance 0.02077123 0.01966097 0.01906163 0.01813082 0.01803199
Cumulative Proportion  0.56004985 0.57971082 0.59877246 0.61690328 0.63493527
                             PC21       PC22       PC23       PC24      PC25
Standard deviation     2.63015000 2.61658000 2.57644000 2.56496000 2.5182600
Proportion of Variance 0.01764714 0.01746546 0.01693383 0.01678322 0.0161776
Cumulative Proportion  0.65258241 0.67004787 0.68698170 0.70376492 0.7199425
                             PC26       PC27       PC28       PC29       PC30
Standard deviation     2.50940000 2.47612000 2.46346000 2.44649000 2.38631000
Proportion of Variance 0.01606408 0.01564077 0.01548123 0.01526873 0.01452669
Cumulative Proportion  0.73600661 0.75164737 0.76712860 0.78239732 0.79692401
                             PC31       PC32      PC33       PC34       PC35
Standard deviation     2.38020000 2.33529000 2.3108400 2.23289000 2.21740000
Proportion of Variance 0.01445245 0.01391225 0.0136224 0.01271893 0.01254301
Cumulative Proportion  0.81137646 0.82528871 0.8389111 0.85163004 0.86417305
                             PC36       PC37       PC38       PC39       PC40
Standard deviation     2.18767000 2.14222000 2.11608000 2.10842000 2.05300000
Proportion of Variance 0.01220898 0.01170689 0.01142296 0.01134046 0.01075204
Cumulative Proportion  0.87638203 0.88808892 0.89951188 0.91085234 0.92160438
                             PC41      PC42        PC43       PC44        PC45
Standard deviation     2.02122000 1.9917300 1.975050000 1.93240000 1.821420000
Proportion of Variance 0.01042174 0.0101199 0.009951123 0.00952597 0.008463215
Cumulative Proportion  0.93202612 0.9421460 0.952097138 0.96162311 0.970086323
                              PC46        PC47        PC48        PC49 PC50
Standard deviation     1.808260000 1.715340000 1.697050000 1.622970000    0
Proportion of Variance 0.008341276 0.007506123 0.007346837 0.006719439    0
Cumulative Proportion  0.978427600 0.985933723 0.993280561 1.000000000    1
> 
> #####################
> # Graphical outputs #
> #####################
> 
> par(mfrow=c(2,2))
> 
> # Plot eigenvalues.
> plot(pc, lwd=5, col="red",xlab=("PCs"),ylab="eigen")
> 
> # PC1-PC2 plot.
> plot(pc$projections)
> # PC3-PC4 plot.
> plot(pc$projections[,3:4])
> 
> # Plot standard deviations.
> plot(pc$sdev)
> 
> #############################
> # Perform Tracy-Widom tests #
> #############################
> 
> # Perfom Tracy-Widom tests on all eigenvalues.
> # Create file: genotypes.tracyWidom - tracy-widom test information, 
> #          in the directory genotypes.pca/.
> tw = tracy.widom(pc)
[1] "*******************"
[1] " Tracy-Widom tests "
[1] "*******************"
summary of the options:

        -n (number of eigenvalues)          50
        -i (input file)                     /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.eigenvalues
        -o (output file)                    /home/ddbj/DataUpdator-rgm3/target/genotypes.pca/genotypes.tracywidom
> 
> # Plot the percentage of variance explained by each component.
> plot(tw$percentage)
> 
> # Display the p-values for the Tracy-Widom tests. 
> tw$pvalues
 [1] 8.000e-09 8.000e-09 8.000e-09 1.503e-04 3.152e-02 4.215e-01 6.565e-01
 [8] 6.859e-01 6.738e-01 9.363e-01 7.937e-01 9.827e-01 9.709e-01 9.425e-01
[15] 9.240e-01 8.119e-01 9.734e-01 9.870e-01 9.996e-01 9.972e-01 9.973e-01
[22] 9.904e-01 9.959e-01 9.835e-01 9.953e-01 9.777e-01 9.801e-01 9.354e-01
[29] 8.465e-01 9.499e-01 8.092e-01 8.500e-01 7.492e-01 9.638e-01 9.114e-01
[36] 8.908e-01 9.402e-01 9.173e-01 7.407e-01 8.460e-01 8.067e-01 7.215e-01
[43] 4.396e-01 2.428e-01 6.246e-01 2.789e-01 6.168e-01 3.909e-01 7.257e-01
> 
> ##########################
> # Manage an pca project #
> ##########################
> 
> # All the file of pca for a given file are 
> # automatically saved into a pca project directory and a file.
> # The name of the pcaProject file is the same name as 
> # the name of the input file with a .pcaProject extension 
> # ("genotypes.pcaProject").
> # The name of the pcaProject directory is the same name as
> # the name of the input file with a .pca extension ("genotypes.pca/")
> # There is only one pca Project for each input file including all the runs.
> 
> # An pcaProject can be load in a different session.
> project = load.pcaProject("genotypes.pcaProject") 
> 
> # An pcaProject can be exported to be imported in another directory
> # or in another computer
> export.pcaProject("genotypes.pcaProject")
  adding: genotypes.pcaProject (deflated 58%)
  adding: genotypes.pca/ (stored 0%)
  adding: genotypes.pca/genotypes.sdev (deflated 55%)
  adding: genotypes.pca/genotypes.eigenvalues (deflated 48%)
  adding: genotypes.pca/genotypes.projections (deflated 54%)
  adding: genotypes.pca/genotypes.tracywidom (deflated 52%)
  adding: genotypes.pca/genotypes.eigenvectors (deflated 60%)
  adding: genotypes.lfmm (deflated 89%)
An export of the pca project hase been created:  genotypes_pcaProject.zip 

> 
> dir.create("test", showWarnings = TRUE)
Warning message:
In dir.create("test", showWarnings = TRUE) : 'test' already exists
> #import
> newProject = import.pcaProject("genotypes_pcaProject.zip", "test")
The project has been imported into directory,  test 

> # remove
> remove.pcaProject("test/genotypes.pcaProject")
[1] TRUE
> 
> # An pcaProject can be erased.
> # Caution: All the files associated with the project will be removed.
> remove.pcaProject("genotypes.pcaProject")
[1] TRUE
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>