Research Article Volume 8 Issue 1
Maracaibo Territorial Polytechnic University of Maracaibo, Venezuela
Correspondence: Pérez González José Raúl, Maracaibo Territorial Polytechnic University of Maracaibo, Venezuela
Received: February 20, 2024 | Published: March 20, 2024
Citation: Raúl PGJ,Valladares M, Daniel D. Theory of estimation of parameters and genetic values under mixed models. Int J Avian & Wildlife Biol. 2024;8(1):27-33. DOI: 10.15406/ijawb.2024.08.00210
In animal breeding, it is essential to know genetic parameters such as heritability, with the aim of being able to predict genetic values (GV) and efficiently direct selection programs. A mixed model refers to those cases where the researcher considers fixed and random factors in a statistical model. Models widely used in the area of animal genetic improvement are the reproductive model and the animal model, which consider the reproductive or animal factor as random and a group of non-genetic effects as fixed. These mixed models allow us to obtain both heritability values (h2) for a trait, as well as genetic predictions such as the expected progeny difference (EPDs) or the predicted transmission ability (PTA) for each animal. An example of birth weight (BW) in cattle was used to calculate the VG, h2 and e2 using a mixed model, with a fixed and a random factor. The ANOVA, ML and REML methods were used to calculate h2, e2 and the VG first using all the information and subsequently assuming the last lost data, under a reproductive model and an animal model. The results found using the 3 methods were the same for REML and ANOVA in balanced data and different for the 3 methods in unbalanced data, where in the unbalanced case the ANOVA estimated a negative variance component, therefore, it can be concluded that estimate genetic values and parameters using ANOVA, ML and REML, but with the risk of estimating negative variance components using ANOVA or null (or overestimated) heritabilities with likelihood-based methods when the data structure or model is not the same correct.
Keywords: heritability, ANOVA, REML, ML, mixed models
In animal breeding, it is essential to know genetic parameters such as heritability in order to be able to predict genetic values (GV) and efficiently conduct selection programs. Genetic parameters are ratios between estimated population variances, known as variance components, which are calculated using linear models containing fixed and random factors, generally known as mixed models.1 For the correct estimation of parameters and genetic values, it is necessary to have a broad knowledge of estimation using mixed models. Therefore, this article reviews the estimation of variance components and genetic values using ANOVA, ML and REML under a reproductive model and an animal model, explaining the virtues and limitations of each method in balanced and unbalanced data.
Theoretical framework mixed models
A mixed model refers to those cases where the researcher considers both fixed and random factors in a statistical model.2 A model widely used in the area of animal breeding is the reproductive model or Sire Model, which considers the reproductive factor as random and a group of non-genetic effects as fixed.2 The reproductive model allows obtaining both heritability values (h2) for a trait, as well as genetic predictions such as the expected difference of progeny (DEPs) or the predicted transmission ability (PTA) for each breeder.3 In matrix algebra the reproductive model takes the following form:
y=Xb+Zs+ey=Xb+Zs+e
Where 𝑦 is a vector for the data, 𝑋 is an incidence matrix relating the data to the fixed effects, 𝑏 is a vector of unknown parameters for the fixed effects, 𝑍 is an incidence matrix relating the data to the random effects, 𝑠 is a vector of unknown predictions for each player, and 𝑒 is a vector of residuals.
The covariance structure of the above model is:
VAR[se]=[Iσ2s00Iσ2s]
Where 𝐼 is an identity matrix, 𝜎𝑠2 is the variance between breeders and 𝜎𝑒2 is the residual variance.
The Henderson normal equations, necessary to find the genetic values of the breeders, for the above model are given by3:
[X'XX'ZZ'XZ'Z+Iα][bisi]=[X′yZ'y]
Where 𝛼 is a ratio of the residual variance to the variance between breeders:
α=σ2eσ2s
According to Román and Aranguren,4 it is possible to substitute 𝐼𝛼 by 𝐴−1𝛼 in the normal Henderson equations, with the objective of improving predictions using all the parentage information between males, therefore, the new equations are:
[X'XX'ZZ'XZ'Z+A−1α][bisi]=[X'yZ'y]
Where 𝐴−1 is the inverse of the kinship matrix.
And the covariance structure taking into account the introduction of 𝐴 is5:
VAR[se]−[Aσ2s00Iσ2e]
Another model widely used in genetic evaluation is the animal model, which uses all the parentage information in the pedigree, and unlike the reproductive model, allows obtaining genetic predictions of all the animals in the herd, whether or not data is present or not:
y=Xb+Za+e
Where 𝑎 is a vector of genetic predictions for each animal the covariance structure of the above model is as follows:
VAR[ae]=[Aσ2a00Iσ2e]=[G00R]
Where 𝐺 is a variance and covariance matrix for the random effects and 𝑅 is a matrix of residuals.
The Henderson normal equations for this model are given by:6
[X′XX′ΖΖ′XΖ′Ζ+Α−1α][biai]=[X′yZ′y] >
Where 𝛼 in this model is a ratio of the residual variance to the additive variance:
α=σ2eσ2a
Where 𝜎𝑎2 is additive genetic variance
Genetic parameters
Using mixed models, it is possible to estimate the variance components, and from them calculate the hereability, which is given by:5
h2=σ2aσ2p
Where ℎ2 is heritability, and 𝜎𝑝2 is phenotypic variance, therefore, heritability is defined as a quotient between the additive variance and the phenotypic variance. The additive component (additive variance) of the numerator of the formula of ℎ2 can be estimated using several procedures, a well-known one is to use a reproductive model to estimate the variance between breeders, which is ¼ of the additive variance, therefore, a formula to estimate 𝜎𝑎2 is:
σ2a=4σ2s
Where 4𝜎𝑠2 is four times the variance among breeders, therefore, heritability can be calculated as:7
h2=42sσ2p
If the heritability is known, the heritability component can be 𝜎𝑎2 component can be calculated using the following formula:8
σ2a=h2σ2p
Another parameter of interest is the environmental proportion coefficient, which indicates how much of the differences observed in the phenotype (data) of the animals are due to non-genetic (environmental) factors, this coefficient has the following mathematical formula:9
e2=σ2enσ2p
Where 𝜎𝑒𝑛2 is the environmental variance. The variance component 𝜎𝑒𝑛2 is calculated using the difference between 𝜎𝑝2 𝜎𝑎2 therefore, the formula for 𝜎𝑒𝑛2 is:8
σ2en=σ2p−σ2a
Finally, the variance 𝜎𝑝2 is the sum of the variance components:
σ2p=σ2a+σ2en=σ2s+σ2e
Variance component estimation using a reproductive model analysis of variance
There are several classical methods for estimating the variance components needed to compute ℎ2 y 𝑒2including analysis of variance (ANOVA), maximum likelihood (ML) and restricted maximum likelihood (REML).
ANOVA is a technique that attempts to separate out different sources of variability. 𝜎𝑝2 into different sources of variability, this involves the separation of sums of squares (SC), degrees of freedom (GL) and mean squares (MS) for each source of variation. Variance components estimated using ANOVA are calculated by equating the expected values of the CM (E (CM)) for each source of variation, with their respective CM and solving the resulting system of equations.10 CMs are a ratio of SC to GLs for each source of variation:10
CM=SCGL
In the case of a fixed factor and a random factor, without interaction, the reproductive model, in elementary algebra, is given by:
yijk=μ+si+bj+eijk
And the ANOVA square for the above model is presented in Table 1.
FV |
SC |
GL |
CM |
E ( CM) |
Factor Fijo |
SCb |
nfijo -1 |
SCbnfijo−1 |
|
Padres |
SCs |
ns -1 |
SCsns−1 |
E(CMs)=σ2e+nfijokσ2s |
residual |
SCtotal−∑SCresto | GLtotal−∑GLresto | SCeGLtotal−GLresto |
E(CMe)=σ2e |
Total |
y′y−R(μ) |
n-1 |
|
|
Table 1 ANOVA for Henderson's method III
Where 𝑘 is the number of replicates of the design, 𝑛𝑓𝑖𝑗𝑜 is the number of levels of the fixed effect and 𝑛𝑠 is the number of levels of the random factor. The variance components are calculated by equating the CM to their E (CM):
CMs=σ2e+nfijokσ2s
CMe=σ2e
And the unique solution of this system of equations is:
σ2s=CMs−CMenfijok
σ2e=CMe
In balanced data, the CS can be estimated directly without the need for adjustment, for a model with two non-interacting factors:
SCs=∑y2s.k−(∑y)2n
SCb=∑y2b.w−(∑y)2n
Where ∑ 𝑦𝑠2. is the sum of the sum of the sum of the data for each player squared, ∑ 𝑦𝑏2. is the sum of the sum of the sum of the data for each level of the fixed effect and w is the number of replicates for the fixed effect. For the unbalanced case, the SCs have to be calculated using the type III SCs for the random factor (sire), since type III calculates the SCs of an effect by correcting them with respect to any other effect that does not contain it and orthogonal to any effect (if it exists) that contains it. Type III CS can be expressed as:11
SCs=SC(μ,s,b)−SC(μ,b)
The 𝑆𝐶𝑠 is corrected for the effects of 𝜇 𝑦 𝑏where 𝜇 is the intercept or herd mean effect. In order to find the values of 𝑆𝐶𝑠 it is necessary to fit a complete model and calculate (𝜇, 𝑠, 𝑏, ) and subtract 𝑆𝐶(𝜇, 𝑏) a reduced model.
Maximum likelihood
The maximum likelihood (ML) method is a classical method of parameter estimation proposed by Fisher,12 but it was not until Hartley and Rao,13 that it was used for mixed models in general. Knowing the likelihood function as a function of the parameters of a statistical model given some data, in ML we try to obtain estimators of the variance components that maximize the likelihood function, that is, that have the maximum probability of representing the population parameters.
The likelihood function is defined as the product of the likelihood function of the data, but in practice, the natural logarithm of the likelihood function is used because it is more manageable, if the distribution of the data is normal, in matrix algebra the natural logarithm of the likelihood function is defined as:11
Ln(L)=−0.5(n).In(2π)−0.5In|V|−0.5(y−Xb)′V−1(y−Xb)
Where (𝐿) is the natural logarithm of the likelihood function and 𝑉 = 𝑍𝐺𝑍′ + 𝑅 is the variance and phenotypic covariance matrix of the model. To find the estimators that maximize the likelihood, we need to find the maximum of equation (𝐿)This is achieved with different methodologies, for example, if the data structure is balanced and we have a mixed model, with a random effect and a fixed one with no interaction, the derivative of 𝐿𝑛(𝐿)with respect to the parameters to be estimated σ2s y σ2e will lead us to a system of equations whose solution is:
σ2s=SCsns−σ2enfijok
σ2e=[1−nfijo−1ns(nfijok−1)]CMe
An important point of ML estimation, for this model, is that even with balanced data, it is possible to find estimators different from the ones presented above, since these solutions will be valid if the inequality 𝐶𝑀𝑠 > 𝐶𝑀𝑒is met, but on the other hand, if the inequality is 𝐶𝑀𝑠 < 𝐶𝑀𝑒 ML estimates for this model and balanced data are given by:11
σ2e=σ2s=0SCtotal(ns)(nfijo)(k)
That is all phenotypic variability is residual, which may indicate that the model used is incorrect or that the number of data is insufficient, thus increasing the variability of the error. The variance σ2p is the sum of the variance components σ2e y σ2s whose sum gives an estimate of σ2p given mathematically by:
σ2p=σ2s+σ2e=(y−Xb)′(y−Xb)n
Which is biased, since it is associated with n degrees of freedom. If the structure of the daros is unbalanced, the partial derivatives of (𝐿) lead to nonlinear maximum likelihood equations for the parameters to be estimated, therefore, the system of equations cannot be solved with direct methods. Faced with this problem, iterative number methods are used to try to approximate the maximum of (𝐿) which are applied to the logarithmic likelihood itself and not to the equations resulting from its first derivative, in order to be able to simultaneously calculate the variance components and (𝐿)which we can use to find fit criteria for our model, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
Restricted maximum likelihood
The restricted maximum likelihood method (REML) is a method proposed by Paterson and Thompson,13 which takes into account the loss of degrees of freedom by including fixed effects in the statistical model, therefore, the estimation of variability components are unbiased, since they are associated to degrees of freedom, which leads to an estimation of variance of the model. 𝑛 − (𝑋) degrees of freedom, which leads to an estimate of the variance, which is defined as σ2pwhich is defined as:
σ2p=(y−Xb)′(y−Xb)n−Rango(X)
Where (𝑋) is the rank of the incidence matrix for the fixed effects of the model. For the case where the only fixed effect is 𝜇the variance σ2p is associated with 𝑛 − 1 degrees of freedom.
As in ML, in REML, the objective is to maximize the logarithm of a function of the parameters, but in this case restricted, which is known as restricted likelihood function, which in matrix algebra is defined as:14
Ln(Lr)=−0.5(n−p).In(2π)−0.5In|V|−0.5|X′V−1X|−0.5(y−Xb)′V−1(y−Xb)
Where (𝐿𝑟) is the logarithm of the restricted likelihood function. If we have a balanced data structure and by deriving (𝐿𝑟) as a function of the variability components of the model (model above), we can solve a system of equations that give rise to estimates given by:11
σ2s=CMs−CMenfijok
σ2e=CMe
These are identical to estimates using an ANOVA, since a property of REML is that in a balanced data structure, REML estimates = ANOVA as long as the inequality is satisfied. 𝐶𝑀𝑠 > 𝐶𝑀𝑒Otherwise, estimates via ANOVA would be negative and in REML all phenotypic variability is residual. In unbalanced data structure, the derivative of (𝐿𝑟) with respect to the variance components, gives rise to nonlinear equations, which cannot be solved directly, therefore, in these cases, as in ML, iterative numerical methods are used to approximate the value of the variance components.
REML estimates using kinship information in an animal model
In the case of a simple animal model, where each animal has only one data (and there are animals without data), the ANOVA method cannot be applied, since it is not possible to estimate the variation within groups using this methodology, because the classification variable is each animal that has a unique record, but the ML and REML estimations are applicable since they allow introducing kinship information in the matrix. 𝐴. In a mixed model, maximizing (𝐿𝑟) is equivalent to minimize −2 (𝐿𝑟) Therefore, the objective function to be minimized, in matrix algebra, can be defined as:14
−2Ln(Lr)=(n−p).In(2π)+Ln|R|+Ln|G|+Ln|C|+y′Py
Where 𝐿𝑛|𝐶| is the natural logarithm of the determinant of the coefficient matrix of the normal Henderson equations and ′𝑃𝑦 is the generalized residual sum of squares. Obviously to minimize −2𝐿𝑛(𝐿𝑟) iterative numerical methods are needed, but it has the advantage that it is easier than maximizing 𝐿𝑛(𝐿𝑟) Therefore, most specialized REML programs use sparse matrix algorithms and numerical methods to try to find estimators resulting from the minimization of −2𝐿𝑛(𝐿𝑟).
An example of birth weight (BW) in cattle was used to calculate the VG, ℎ2 y 𝑒2 using a mixed model, with a fixed and a random factor. The database is presented in Table 2. In this problem, we want to eliminate the variability that exists between the sexes, therefore, the sex factor is considered as fixed and the father factor as random, which leads us to the statistical model for this problem:
PN=media+padre+sexo+error
Father |
Animal |
Sex |
y |
1 |
3 |
Male |
36 |
1 |
4 |
Male |
35 |
1 |
5 |
Female |
33 |
1 |
6 |
Female |
28 |
2 |
7 |
Female |
31 |
2 |
8 |
Female |
29 |
2 |
9 |
Male |
28 |
2 |
10 |
Male |
36 |
3 |
11 |
Male |
38 |
3 |
12 |
Male |
37 |
3 |
13 |
Female |
29 |
3 |
14 |
female |
35 |
Table 2 Database of animal records, sex and NP
ANOVA, ML, and REML methods were used to calculate ℎ2 , 𝑒2 and GVs using the data in Table 2, first using all the information and then assuming the last missing data. For the animal model, a similar model was used:
PN=media+animal+sexo+error
Where all the kinship information and the value of the variance components found in the previous model were used to solve the Henderson normal equations.
Balanced data in a reproductive model
To calculate the CM, it is necessary to calculate the SC and GL for each source of variation, for this model and our data structure, we can calculate them using the formulas in Table 1.
GLs=3−1=2GLsexo=2−1=1GLtotal=12−1=11GLe=11−1−2=11−3=8
And since the design is balanced, the SCs are:
SCtotal=362+352+332+…+352−395212=152.916SCs=(132)2+(124)2+(139)24−395212=28.166SCsexo=(185)2+(210)26−395212=52.083SCe=152.916−(28.166+52.083)=72.667
And the CMs come from:
CMS=28.1662=14.083CMe=72.6678=9.083
And from the CM we can calculate the variance components:
σ2s= 14.083−9.0832(2)=1.25
Therefore, ℎ2 using ANOVA is:
h2=4(1.25)1.25+9.083=0.483e2=10.33−510.33=0.515
And these ANOVA estimates, too, are REML, since the data structure is balanced and the 𝐶𝑀𝑠 > 𝐶𝑀𝑒. Now the calculation of the GVs, using the REML estimates, comes from the solutions of the normal Henderson equations, using the estimated value of the variance components, and calculating the value of 𝛼 we have that:
α=9.0831.25=7.2664
Therefore, the equations are:
(60222062222222224+7.2660004+7.2660004+7.266)[b1b2s1s2s3]=[185210132124139]→[602220622222222211.26600011.26600011.266][b1b2s1s2s3]=[185210132124139]
In the previous equations the value of 𝜇 was forced to be zero in order to break the linear dependence between the rows and columns of the coefficient matrix. The solution of this system of equations is given by:
[b1b2s1s2s3]=[[602220622222222211.26600011.26600011.266]]−1[185210132124139]=[30.83350.029−0.6800.650]
ML estimates are:
σ2s=28.1663−8.0732(2)=0.328σ2e= [1−2−13(2(2)−1)]9.083=8.073
Y ℎ2y 𝑒2using ML is:
h2=4(0.328)0.328+8.073=0.156e2=8.401−1.32128.401=0.842
And the equations using ML are:
[602220622222222228.61200028.61200028.612][b1b2s1s2s3]=[185210132124139]
Therefore the solution is:
[b1b2s1s2s3]=[602220622222222228.61200028.61200028.612]−1[185210132124139]=[30.83350.011−0.2670.256]
Introduction of the parentage matrix in a reproductive model
Now it is assumed that animal 1 is the father of animal 2, therefore, the equations take into account all the genealogy between males. First we have to calculate 𝐴−1. Applying Henderson's rules,5 we have:
A−1=[1+1/3−2/30−2/31/40001]=[1/4−2/30−2/31/40001]
Therefore,
A−1α=[14−230−23140001](7.2664)=[1.8166−4.84420−4.84421.81660007.2664]
And adding the Z'Z matrix:
Z′Z+A−1α=[400040004]+[1.8166−4.84420−4.84421.81660007.2664]=[5.8166−0.84420−0.84425.816600011.2664]
Therefore, the Henderson normal equations are:
[602.00002.00002.0000062.00002.00002.00002222225.8166−0.84420.0000−.084425.81660.00000.00000.000011.2664][b1b2s1s2s3]=[185210132124139]
And the solution of these equations is:
[b1b2s1s2s3]=[602.00002.00002.0000062.00002.00002.00002222225.8166−0.84420.0000−0.84425.81660.00000.00000.000011.2664]−1[185210132124139]=[31.628535.7952−0.7765−1.97760.3685]
In this solution, the fixed effect (sex) is adjusted for the random example, and the random effect is adjusted for the fixed effect.
Unbalanced data in a reproductive model
Assuming the last missing data, the 𝑆𝐶𝑡𝑜𝑡𝑎𝑙 is:
SCtotal=SCtotal=362+352+332+…+292−360211=148.1818182
The GLs are:
GLs=3−1=2GLsexo=2−1=1GLtotal=11−1=10GLe=10−1−2=11−3=7
To calculate the ordinary least squares solutions for the reduced model (without the sire factor) must be estimated:
b=[1156550606]−1[3601500]=[0.1666−0.16660−0.16660.36660000][3601500]=[35−50]
Therefore, the 𝑆𝐶𝑠𝑒𝑥𝑜 for the reduced model is:
SCsexo(reducido)=[35−5 0][360150210]−360211=(35)(360)+(−5)(150)−360211=68.1818
To obtain the adjusted SCs, we calculate the SC for the full model and subtract from it, respectively (𝑟𝑒𝑑𝑢𝑐𝑖𝑑𝑜):
SCs=SC(μ,s,sexo,)−SC(μ,sexo)=83.6818182−68.1818=15.5
And the 𝑆𝐶𝑒 is:
SCe=148.1818182−(15.5+68.1818)=64.5
Therefore, the CMs are:
CMs=15.52=7.75CMe=64.57=9.214286
Finally, the variance components are:
σ2s= 7.75−9.2142863.6=−0.4067σ2e= 9.214286
The component σ2s component is a negative estimate of the variance, because the 𝐶𝑀𝑠 < 𝐶𝑀𝑒is negative, therefore, the ML and REML estimates for σ2s are:
σ2s = 0
In other words, all the total variability is residual. Table 3 shows the ML and REML iterations for the calculation of the residual variance with the Newton-Rapson method: As shown in Table 3, when an unbalanced database is used, the estimates for the variance components are different in ANOVA and REML, since with ANOVA we have σ2e = 9.2142 and with REML σ2e = 8.88. For this particular case, ℎ2 = 0 because the variance of the numerator is zero, obviously to find a more credible estimate, one should increase the number of data used in the genetic evaluation or try another model and compare the AIC.
ML |
REML |
||||||
Iteración |
-2ln(L) |
σs2 | σe2 |
iteración |
-2ln() |
σs2 | σe2 |
1 |
53.0420 |
0 |
7.2727 |
1 |
48.6053 |
0 |
8.88 |
2 |
53.0420 |
0 |
7.2727 |
2 |
48.6053 |
0 |
8.88 |
Table 3 Iteration of variance components for ML and REML
Animal model
The solutions of the Henderson normal equations, using the values of σ2e = 9.083 y σ2a = 5 are presented in Table 4 Henderson's normal equations are not presented for this case due to its large dimensions.
Animal |
VG |
1 |
0.422982 |
2 |
-0.984574 |
3 |
1.10566 |
4 |
0.217214 |
5 |
0.809321 |
6 |
-0.651756 |
7 |
-0.273233 |
8 |
-0.857664 |
9 |
-2.32642 |
10 |
0.113062E-01 |
11 |
1.33545 |
12 |
1.04324 |
13 |
-0.117947 |
14 |
1.63535 |
Table 4 Genetic values using an animal model
Genetic values and parameters can be estimated using ANOVA, ML and REML, but with the risk of estimating negative variance components using ANOVA or zero (or overestimated) heritabilities with likelihood-based methods when the data structure or model is not correct. When the data structure is unbalanced, mathematical calculations with ANOVA, ML and REML are more complex and require computational algorithms with higher performance.
None.
The authors declared that there are no conflicts of interest.
©2024 Raúl, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.