Manipulation of genetic data (SNPs), computation of Genetic Relationship Matrix, Linkage Disequilibrium, etc.
Efficient algorithms for Linear Mixed Model (AIREML, diagonalisation trick).

Introducing gaston

Gaston offers functions for efficient manipulation of
large genotype (SNP) matrices, and state-of-the-art implementation of algorithms
to fit Linear Mixed Models, that can be used to compute heritability
estimates or to perform association tests.

Thanks to the packages Rcpp,
RcppParallel,
RcppEigen, gaston
functions are mainly written in C++.

Many functions are multithreaded;
the number of threads can be setted through RcppParallel
function setThreadOptions.
It is advised to try several values for the number of threads, as
using too many threads might be conterproductive due to an important
overhead.

Some functions have a verbose argument, which controls the
function verbosity. To mute all functions at once you can use
options(gaston.verbose = FALSE).

Genotype matrices

An S4 class for genotype matrices is defined, named bed.matrix.
Each row corresponds to an individual, and each column to a SNP. They can
be read from files using read.bed.matrix
and saved using write.bed.matrix. The function read.vcf reads
VCF files; it relies on the package WhopGenome.

In first approach, a bed.matrix behaves as a "read-only" matrix containing only
0, 1, 2 and NAs, unless the genotypes are standardized (use standardize<-).
They are stored in a compact form, each genotype being coded on 2 bits (hence
4 genotypes per byte).

Bed.matrices can be converted to numerical matrices with as.matrix,
and multiplied with numeric vectors or matrices with %*% (this
feature can be used e.g. to simulate quantitative phenotypes, see a basic example in the example
section of association.test).

It is possible to subset bed.matrices just as base matrices, writing e.g.
x[1:100,] to extract the first 100 individuals, or x[1:100,1000:1999]
for extract the SNPs 1000 to 1999 for these 100 individuals. The use of logical
vectors for subsetting is allowed too. The functions
select.inds and select.snps can also be used for
subsetting with a nice syntax.

Some basic descriptive statistics can be added to a bed.matrix with set.stats (since
gaston 1.4, this function is called by default by all functions that create a bed.matrix, unless
options(gaston.auto.set.stats = FALSE) was set.
Hardy-Weinberg Equilibrium can be tested for all SNPs with set.hwe.

Crossproducts of standardized matrices

If X is a standardized n x q genotype matrix, a Genetic Relationship Matrix
(GRM) of the individuals can be computed as

GRM = XX<c3><a2><c2><80><c2><99>/(q-1)

where q is the number of SNPs.
This computation is done by the function GRM. The eigen decomposition of the GRM produces
the Principal Components (PC) of the data set. If needed, the loadings
corresponding to the PCs can be retrieved using bed.loadings.

Doing the above crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium:

LD = X<c3><a2><c2><80><c2><99>X/(n-1)

where n is the number of individuals. This computation is done by the function
LD (usually, only parts of the whole LD matrix is computed). This method is
also used by LD.thin to extract a set of SNPs in low linkage disequilibrium
(it is often recommended to perform this operation before computing the GRM).

Linear Mixed Models

lmm.aireml is a function for linear mixed models parameter estimation
and BLUP computations.

The model considered is of the form

Y = X beta + omega_1 + ... + omega_k + epsilon

with omega_i ~ N(0, tau_i K_i) for i = 1, ..., k and
epsilon ~ N(0, sigma^2 I_n).

Note that very often in genetics a mixed model is written as

Y = X beta + Zu + epsilon

with Z a standardized genotype matrix, and u ~ N(0, tau I_q). In that case,
denoting omega = Zu, omega ~ N(0,tau ZZ')
and letting K=ZZ' we get a mixed model of the previous form.

When k=1 in the above general model (only one random term omega), the likelihood
can be computed very efficiently using the eigen decomposition of
K = var(omega). This "diagonalization trick"
is used in lmm.diago.likelihood and lmm.diago, to compute
the likelihood and for parameter estimation, respectively.

Two small functions complete this set of functions: lmm.simu, to
simulate data under a linear mixed model, and random.pm, to generate
random positive matrices. Both are used in examples and can be useful for data simulation.

Author(s)

Herv<c3><83><c2><a9> Perdry and Claire Dandine-Roulland