Improving Prediction of Genotypic Values in Historical Crop Trial Data via Stepwise Adjustment Method ()
1. Introduction
Plant researchers have been spending decades of efforts to conduct regional and national crop trials before entries of interest are released to the market. However, these historical crop trial data have been underutilized because processing long-term crop trial data simultaneously is still statistically challenging due to high unbalanced data structures. Revisiting long-term historical crop trial data with appropriate statistical methods will help us reveal more desirable genetic information that can be used for crop improvement.
Multi-environment crop trial aims to determine those entries with high yield potential and stable performance across a wide range of environmental conditions. With multi-environment trials, genotype-by-environment (GE) interaction, which is relevant to yield stability, is one of most important parameters of interest to many statisticians [1]-[4]. Many approaches have been proposed and applied to multi-environment trial data analyses. Linear regression-based yield stability analysis has been a major focus in multi-environment crop yield trials [5]-[9]. Additive main and multiplicative interaction (AMMI) method [10] has gained more popularity because it targets both mean performance and environment-specific performance. GGEbiplot, a graphical tool, can be used for multi-environment trial data for yield stability or GE interaction [11]. Linear mixed model (LMM) approaches have also been applied to analyze multi-environment trial data to explore genotypic effects, environmental effects, and GE interaction effects with its flexibility for unbalanced data structure and/or missing data points [12].
It is unlikely that the above-mentioned methods can be used to conduct yield stability or GE interaction across years for a crop trial data set spanning years because of serious imbalance in the data [13] [14]. Thus, a different focus such as genetic progress among years rather than GE or yield stability investigation could be more appropriate [4] [15]. Estimation of historical genetic gain/progress will be impacted by the degree of confounding of genotypic and environmental effects. It is statistically impossible to compare genotypes developed today and 20 years ago without adjusting environmental conditions among years or without the same checks being used. Likely, this is one of key factors preventing extensive investigation on long-term crop trial data across years. One way to investigate genetic progress is to select several varieties released from different time periods and evaluate them under one or a few environments [15]-[17] so that these varieties can be evaluated under the same environmental conditions. However, with this approach, there could be several potential issues: 1) Only a small portion of varieties from the historical trials could be evaluated; 2) The representative environmental conditions could be narrow, not representing a wide range of crop growing regions where crop trials have been conducted for decades; 3) additional huge cost for labor and land is needed. The second approach is to use overlapped/carryover standards (or checks) to conduct environmental adjustment from year to year [18]. For example, in the national cotton variety trial, the same three or four standards are used within each cycle (equivalent to every three years, while there is one or two standards that roll over to the next cycle). With this approach, it is possible to re-visit the long-term crop trial data without adding field trial cost. However, with only one standard rolled over to the following cycle such as in cotton trial [18], it is statistically unknown if a small number of standards like one or two are sufficient for adjustment. Additionally, if one standard in a particular growing season or location is missing, then the adjustment could cause a critical issue.
It is well-known that good performers from the current year trial will likely remain the following year. In other words, these good performers could be repeated for at least one more year. We observed that a range of genotypes were overlapped/repeated between every two consecutive years in various national and regional crop trial data (interested readers can refer to published crop trial reports from various institutes). Therefore, it will enhance the value of long-term crop trial data when a new method can be provided with its efficiency validated.
In this study, our first objective was to propose a stepwise method to adjust environmental effects among years using overlapped entries between every two consecutive years. Our second objective was to evaluate the adjustment efficiency of this new method by Monte Carlo simulation. Our third objective was to integrate this method to adjust a 16-year soybean trial data set from South Dakota and to re-estimate annual genetic progress accordingly. The purpose of this study was to provide a new method to better predict genotype performance from long-term historical crop trial data so that desirable genetic information can be further utilized for genetic study and crop improvement
2. Methods and Simulations
2.1. Stepwise Adjustment Method
Assuming there is at least overlapping entries between any two consecutive years, the procedure for the stepwise adjustment method can be described as follows,
Step 1: Assume that
is overlapped entries between years 1 and 2. Calculate the mean values of
overlapped entries for years 1 and 2,
and
, respectively.
Step 2: Calculate the mean difference between the first year and the second year using the equation (1)
(1)
where,
is the mean difference between the first year and the second year based on
overlapped entries. If
is positive, it indicates that year 2 is higher than year 1. In the same manner, if
is negative, it indicates that year 2 is lower than year 1.
Step 3: Adjust phenotypic values for year 2 with the equation (2)
(2)
where,
is the vector of the adjusted phenotypic values for year 2 and
is the vector of the non-adjusted phenotypic values for year 2. The adjusted values
are comparable to the phenotypic performance under the mean environmental conditions in year 1.
Step 4: Adjust phenotypic values for the year
when
.
Given
and there are
overlapped entries between consecutive year
and year
, the adjustment for year
can be conducted using the following equation 3:
(3)
where,
is the vector of the adjusted phenotypic values for year
;
is the vector of the non-adjusted phenotypic values for year
; and
can be calculated by the following equation (4),
(4)
where,
is the non-adjusted mean value of
overlapped genotypes for year
;
is the adjusted mean value of
overlapped genotypes for year
.
is the mean difference of phenotypic values of
overlapped genotypes between years
(adjusted mean) and
(non-adjusted mean). The adjustment process is recursive until the last year/season in the trial. All adjusted phenotypic value vectors from the second year are statistically comparable to the mean environmental conditions under year 1.
2.2. Simulation Study
It will be helpful to numerically evaluate the effectiveness of the above stepwise adjustment method. An effective way to evaluate a statistical method and/or model is Monte Carlo simulation technique [3] [19] [20]. However, it is important that the simulated data should be generated from a linear model representing a multi-year and multi-location crop trial that can be described as follows:
(5)
where
= phenotypic value for
genotype for
block under
location and
year;
= population mean;
= year effect;
= location effect;
= genotypic effect;
= year-by-location interaction effect;
= genotype-by-year interaction effects;
=genotype-by-location interaction effect;
= genotype-by-year-by-location interaction effects;
= block effect within each location and year; and
= random error. In our simulation study, we set all effects as random except for population mean.
If all entries are grown in all years and all locations, then all effects in the above linear model can be dissected either by ANOVA (analysis of variance) or linear mixed model approaches. However, as mentioned above, many genotypes are not repeated among years, genotypic effects cannot be separated from environmental effects among years or genotype-by-environment interaction effects, which could impact the efficiency of the stepwise adjustment method. Therefore, our simulated crop trial data were generated reflecting real-world scenarios from the linear model (Eq. 1) and nine settings of variance components (Table 1) were applied to reflect the potential impacts from year effects and genotype-by-environmental effects. However, interested readers may use other settings to evaluate the outcomes for other particular purposes.
Other parameters used for simulation study are as follows. Within each year 20 entries are repeated four times with an RCB design under each location. The numbers of overlapping entries for simulation are 1, 2, 4, 6, 8, 10, and 15. The process is repeated five (5), 10, and 15 years, respectively. The simulation for each of 189 (9 × 7 × 3 = 189) settings was repeated 100 times.
Table 1. Nine (9) sets of variance components used for simulation study.
Variance component† |
Set1 |
Set2 |
Set3 |
Set4 |
Set5 |
Set6 |
Set7 |
Set8 |
Set9 |
|
40 |
40 |
40 |
40 |
40 |
40 |
40 |
40 |
40 |
|
40 |
40 |
40 |
60 |
60 |
60 |
80 |
80 |
80 |
|
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
|
20 |
10 |
0 |
20 |
10 |
0 |
20 |
10 |
0 |
|
20 |
10 |
0 |
20 |
10 |
0 |
20 |
10 |
0 |
|
20 |
10 |
0 |
20 |
10 |
0 |
20 |
10 |
0 |
|
20 |
10 |
0 |
20 |
10 |
0 |
20 |
10 |
0 |
|
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
|
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
20 |
†:
= variance component for genotype;
= variance component for year;
= variance component for location;
= variance component for year-by-location interaction;
= variance component for genotype-by-year interaction;
= variance component for genotype-by-location interaction;
= variance component for genotype-by-year-by-location interaction;
= variance component for block; and
= variance component for random error respectively.
Once data were generated, genotypic values for all entries were calculated from control group data (
), non-adjusted data (
), and adjusted data (
). Correlation coefficients were calculated between
and
and between
and
, respectively. The higher a correlation coefficient is, the better predicted genotypic values are. The function lmm. simudata available in the R library, minque [21] was used to generate simulated trial data. The simulation study was conducted by the R scripts, which were developed by the senior author of this study and all data analyses including simulation and application were conducted at the platform RStudio [22] [23].
3. Results
3.1. Simulation Study
In our simulation study we focused on determining factors that could impact prediction of genotype values. The first factor is numbers of overlapped entries (1, 2, 4, 6, 8, 10, and 15 used in this study) between every two consecutive years/seasons. The second factor is length of trial years (5, 10, and 15 used in this simulation study). Because both year effects and genotype-by-environment interaction effects could impact the prediction of genotypic effects, we also consider variation of years (sets 1 - 3 → 4 - 6 → and 7 - 9, low → high) and variances of two- and three-way interactions (sets 1, 4, 7→ 2, 5, 8 → 3, 6, 9, high → low) (Table 1). As mentioned in Methods, genotypic values were calculated:
,
, and
for control group data, non-adjusted trial data, and adjusted trial data, respectively. Parameters
and
are defined as correlation coefficients between
and
and between
and
, respectively. A high correlation coefficient with control group
implies the genotypic values are better predicted to those for the control group. The simulation results provided in Tables 2-4 are mean correlation coefficients based on 100 simulations of data for each setting.
Table 2. Summarized simulation results for nine (9) sets of variance components settings with seven (7) different overlapped entry numbers with five (5) years of trial.
|
|
Overlapped entry number |
|
|
Correlation coefficient† |
1 |
2 |
4 |
6 |
8 |
10 |
15 |
Mean |
Set1 |
|
0.693 |
0.699 |
0.710 |
0.730 |
0.748 |
0.775 |
0.829 |
0.741 |
|
0.689 |
0.757 |
0.795 |
0.838 |
0.860 |
0.870 |
0.915 |
0.818 |
Set2 |
|
0.739 |
0.713 |
0.728 |
0.743 |
0.771 |
0.786 |
0.868 |
0.764 |
|
0.772 |
0.837 |
0.867 |
0.888 |
0.905 |
0.919 |
0.954 |
0.877 |
Set3 |
|
0.773 |
0.755 |
0.768 |
0.777 |
0.805 |
0.834 |
0.863 |
0.796 |
|
0.974 |
0.983 |
0.987 |
0.989 |
0.991 |
0.993 |
0.995 |
0.987 |
Set4 |
|
0.652 |
0.663 |
0.657 |
0.691 |
0.728 |
0.748 |
0.821 |
0.709 |
|
0.698 |
0.751 |
0.807 |
0.824 |
0.859 |
0.885 |
0.917 |
0.820 |
Set5 |
|
0.677 |
0.677 |
0.678 |
0.739 |
0.719 |
0.764 |
0.839 |
0.728 |
|
0.782 |
0.832 |
0.876 |
0.891 |
0.900 |
0.920 |
0.953 |
0.879 |
Set6 |
|
0.701 |
0.707 |
0.701 |
0.722 |
0.752 |
0.793 |
0.847 |
0.746 |
|
0.973 |
0.981 |
0.986 |
0.989 |
0.991 |
0.993 |
0.995 |
0.987 |
Set7 |
|
0.639 |
0.636 |
0.635 |
0.643 |
0.679 |
0.710 |
0.792 |
0.676 |
|
0.715 |
0.742 |
0.803 |
0.829 |
0.848 |
0.872 |
0.920 |
0.818 |
Set8 |
|
0.652 |
0.643 |
0.675 |
0.669 |
0.710 |
0.713 |
0.815 |
0.697 |
|
0.781 |
0.841 |
0.866 |
0.890 |
0.900 |
0.915 |
0.952 |
0.878 |
Set9 |
|
0.670 |
0.663 |
0.674 |
0.701 |
0.743 |
0.759 |
0.804 |
0.716 |
|
0.971 |
0.981 |
0.987 |
0.989 |
0.991 |
0.992 |
0.995 |
0.987 |
Mean |
|
0.690 |
0.686 |
0.693 |
0.714 |
0.741 |
0.767 |
0.832 |
0.732 |
|
0.833 |
0.869 |
0.896 |
0.912 |
0.924 |
0.935 |
0.959 |
0.904 |
†:
and
are correlation coefficients between predicted genotype values from control group and data without adjustment and between predicted genotype values from control group and adjusted data.
Table 3. Summarized simulation results for nine (9) sets of variance components settings with seven (7) different overlapped entry numbers with 10 years of trial.
|
|
Overlapping entry number |
|
|
Correlation coefficient† |
1 |
2 |
4 |
6 |
8 |
10 |
15 |
Mean |
Set1 |
|
0.652 |
0.669 |
0.685 |
0.696 |
0.721 |
0.761 |
0.833 |
0.717 |
|
0.597 |
0.675 |
0.735 |
0.771 |
0.813 |
0.835 |
0.910 |
0.762 |
Set2 |
|
0.686 |
0.686 |
0.715 |
0.729 |
0.752 |
0.790 |
0.856 |
0.745 |
|
0.682 |
0.754 |
0.835 |
0.854 |
0.886 |
0.892 |
0.943 |
0.835 |
Set3 |
|
0.720 |
0.741 |
0.744 |
0.763 |
0.772 |
0.822 |
0.872 |
0.776 |
|
0.953 |
0.971 |
0.979 |
0.985 |
0.988 |
0.989 |
0.995 |
0.980 |
Set4 |
|
0.619 |
0.642 |
0.645 |
0.654 |
0.672 |
0.720 |
0.801 |
0.679 |
|
0.579 |
0.669 |
0.727 |
0.774 |
0.804 |
0.838 |
0.908 |
0.757 |
Set5 |
|
0.630 |
0.630 |
0.663 |
0.675 |
0.696 |
0.742 |
0.807 |
0.692 |
|
0.692 |
0.781 |
0.828 |
0.859 |
0.877 |
0.894 |
0.947 |
0.840 |
Set6 |
|
0.654 |
0.665 |
0.700 |
0.699 |
0.736 |
0.758 |
0.823 |
0.719 |
|
0.953 |
0.969 |
0.979 |
0.986 |
0.988 |
0.990 |
0.995 |
0.980 |
Set7 |
|
0.574 |
0.586 |
0.600 |
0.621 |
0.649 |
0.703 |
0.785 |
0.645 |
|
0.577 |
0.675 |
0.745 |
0.775 |
0.798 |
0.837 |
0.905 |
0.759 |
Set8 |
|
0.589 |
0.625 |
0.611 |
0.627 |
0.651 |
0.697 |
0.788 |
0.655 |
|
0.684 |
0.770 |
0.831 |
0.844 |
0.879 |
0.896 |
0.946 |
0.836 |
Set9 |
|
0.590 |
0.618 |
0.633 |
0.656 |
0.670 |
0.712 |
0.794 |
0.668 |
|
0.952 |
0.969 |
0.980 |
0.985 |
0.987 |
0.990 |
0.995 |
0.980 |
Mean |
|
0.637 |
0.653 |
0.670 |
0.682 |
0.705 |
0.746 |
0.818 |
0.702 |
|
0.762 |
0.820 |
0.862 |
0.882 |
0.901 |
0.915 |
0.954 |
0.871 |
†:
and
are correlation coefficients between predicted genotype values from control group and data without adjustment and between predicted genotype values from control group and adjusted data.
Table 4. Summarized simulation results for nine (9) sets of variance components settings with seven (7) different overlapped entry numbers with 15 years of trial.
|
|
Overlapped entry number |
|
|
Correlation coefficient† |
1 |
2 |
4 |
6 |
8 |
10 |
15 |
Mean |
Set1 |
|
0.695 |
0.715 |
0.719 |
0.719 |
0.743 |
0.777 |
0.822 |
0.741 |
|
0.688 |
0.759 |
0.799 |
0.829 |
0.848 |
0.873 |
0.915 |
0.816 |
Set2 |
|
0.728 |
0.740 |
0.740 |
0.762 |
0.764 |
0.812 |
0.858 |
0.772 |
|
0.801 |
0.842 |
0.870 |
0.886 |
0.902 |
0.925 |
0.952 |
0.883 |
Set3 |
|
0.768 |
0.757 |
0.791 |
0.786 |
0.818 |
0.826 |
0.868 |
0.802 |
|
0.973 |
0.981 |
0.987 |
0.989 |
0.991 |
0.993 |
0.995 |
0.987 |
Set4 |
|
0.647 |
0.649 |
0.671 |
0.675 |
0.732 |
0.767 |
0.811 |
0.707 |
|
0.707 |
0.752 |
0.792 |
0.839 |
0.851 |
0.879 |
0.914 |
0.819 |
Set5 |
|
0.685 |
0.679 |
0.689 |
0.711 |
0.742 |
0.770 |
0.816 |
0.727 |
|
0.799 |
0.841 |
0.868 |
0.884 |
0.909 |
0.922 |
0.949 |
0.882 |
Set6 |
|
0.711 |
0.697 |
0.716 |
0.740 |
0.751 |
0.768 |
0.841 |
0.746 |
|
0.972 |
0.983 |
0.987 |
0.989 |
0.991 |
0.993 |
0.995 |
0.987 |
Set7 |
|
0.621 |
0.631 |
0.649 |
0.661 |
0.680 |
0.713 |
0.775 |
0.676 |
|
0.702 |
0.761 |
0.794 |
0.831 |
0.845 |
0.884 |
0.914 |
0.819 |
Set8 |
|
0.639 |
0.663 |
0.655 |
0.675 |
0.679 |
0.736 |
0.784 |
0.690 |
|
0.783 |
0.841 |
0.872 |
0.894 |
0.905 |
0.924 |
0.949 |
0.881 |
Set9 |
|
0.658 |
0.627 |
0.681 |
0.694 |
0.714 |
0.739 |
0.793 |
0.701 |
|
0.970 |
0.982 |
0.986 |
0.989 |
0.991 |
0.992 |
0.995 |
0.986 |
Mean |
|
0.686 |
0.686 |
0.703 |
0.716 |
0.737 |
0.767 |
0.821 |
0.731 |
|
0.837 |
0.873 |
0.894 |
0.912 |
0.922 |
0.938 |
0.957 |
0.905 |
†:
and
are correlation coefficients between predicted genotype values from control group and data without adjustment and between predicted genotype values from control group and adjusted data.
Overall, as the number of overlapping entries between consecutive years/seasons increases, both correlation coefficients increase with a few exceptions for non-adjusted group when overlapping number is two or small (last two rows in Tables 2-4), suggesting that overlapping entry number is a key factor to increase the accuracy of predicted genotypic values. Predicted genotypic values from the adjusted data had higher correlation values with genotypic values from control groups than predicted genotypic values from non-adjusted data with genotypic values from control groups with a few exceptions (last two rows and last columns in Tables 2-4), suggesting that the difference caused by the environmental conditions among years can be adjusted by the method proposed in this study. For example, based on the simulation data on five years of trials, the difference in correlation coefficients ranged from 0.127 to 0.203, peaked at 6 to 8 overlapped entries, resulting in 15.3 to 29.3% increase compared to the non-adjusted group (last two rows in Table 2). Similar results can be found for the simulated data with 10 and 15 years of trial (Table 3 and Table 4). The predicted efficiency is related to environmental variations among years. With adjustment, prediction efficiency is better when variance is high (80) compared to smaller year variation (40 and 60, refer to the last columns in Tables 2-4). In addition, as an example shown in Figure 1, the adjusted predicted genotypic values were highly correlated with the preset genotypic values compared to the non-adjusted ones with a year variation of 80, five years of trials, and 15 overlapped genotypes. It is also noticeable that low GE interaction variation results in improved prediction when adjustment is applied (set 3 > 2 > 1; set 6 > 5 > 4, or set 9 > 8 > 7, the last columns in Tables 2-4). Additionally, it appears that prediction efficiency remains about the same for different numbers of trial years when adjustment is applied, suggesting that this method is suitable to adjust long-term historical trial data to predict genotype values across trial years.
![]()
Figure 1. Predicted genotypic values from preset, non-adjusted, and adjusted groups (one case from the simulated data based on set 9 in Table 1).
Overall, with our simulation data, we can conclude that this stepwise adjustment method can help adjust the difference caused by environmental differences among years. Its efficiency is related to the number of overlapping entries between two consecutive years and year variance. As expected, high GE interaction will decrease the adjustment efficiency.
3.2. Demonstration
Though this stepwise adjustment method could be applied to different crop trial data, we choose to use the soybean trial data from six locations in eastern South Dakota covering 2001 through 2016 as our application because soybean production was significantly impacted by the weather conditions among years. The soybean trial data are available at the website http://igrow.org for more detailed information regarding the trial data.
A total 2946 different soybean genotypes were grown over 16 years (2001-2016) [4]. The numbers of overlapping entries between 15 pairs of consecutive years are provided in Table 5. The overlapped entries ranged from 36 to 119 among these 15 pairs of consecutive years (Table 5). The proportion of overlapped genotypes between each two consecutive years varied from 9 to 26%. These overlapped soybean genotypes could be a useful leverage for our new method to adjust the predicted genotypic values.
Table 5. Numbers of overlapping entries between two consecutive years for soybean trials in South Dakota from 2001-2016.
Year |
Overlapped† |
Combined‡ |
Proportion* |
2001 |
416 |
416 |
1.0000 |
2002 |
110 |
699 |
0.1574 |
2003 |
119 |
595 |
0.2000 |
2004 |
98 |
567 |
0.1728 |
2005 |
92 |
546 |
0.1685 |
2006 |
95 |
465 |
0.2043 |
2007 |
81 |
424 |
0.1910 |
2008 |
90 |
388 |
0.2320 |
2009 |
78 |
395 |
0.1975 |
2010 |
36 |
387 |
0.0930 |
2011 |
60 |
338 |
0.1775 |
2012 |
77 |
329 |
0.2340 |
2013 |
60 |
303 |
0.1980 |
2014 |
50 |
290 |
0.1724 |
2015 |
76 |
289 |
0.2630 |
2016 |
41 |
301 |
0.1362 |
†: Number of overlapped genotypes between current and previous years; ‡: total genotypes between current and previous years; *: proportion = overlapped genotypes/total genotypes.
The adjusted and non-adjusted annual means (bushel/acre) are provided in Figure 2. The difference between these two means for a particular year suggests the impact of weather conditions for soybean production. A lower adjusted mean indicates more suitable weather conditions compared to the weather conditions in 2001. The results showed that the non-adjusted annual yield was higher than the adjusted annual yield for most years, for example 2005, 2007, 2009, and 2013 to 2016, indicating that the weather conditions in these years were more suitable for soybean production compared to the weather conditions in 2001 (Figure 2). However, the non-adjusted annual yield means were much lower than the adjusted annual yield means for 2003 and 2012, suggesting that weather conditions in these two years were not suitable for soybean production as compared to those in 2001. One major reason for these two low yielding years (2003 and 2012) was caused by early frost-kill in September while weather conditions were better for soybean seed growth and development maturement in other years. The adjusted and non-adjusted means were similar in 2006, indicating that the weather conditions were similar between 2001 and 2006.
![]()
Figure 2. Non-adjusted and adjusted annual means for soybean yield from South Dakota soybean yield trial in 2001-2016.
After data adjustment, the trend for annual means was smoother than that for the non-adjusted data. The adjustment also resulted in an improved model fitness for the genetic gain model over these 16 years compared to that for non-adjusted data (Table 6). The annual genetic gain estimated from non-adjusted data was 1.35 bushel per acre (one bushel is equivalent to 60 lbs.) while the annual genetic gain for adjusted data was 0.72 bushel per acre, equivalent to about 10 bushels per acre over 16 years. It appears that favorable weather conditions in 2014-2016 led to historically high yield and thus inflated the annual genetic gain if data were not adjusted. The adjusted annual gain was more consistent to the nation-wide annual increase 0.53 bushel per acre based on soybean production from 1981 to 2024 [24]. On average, the annual increase in South Dakota was 0.47 bushel per acre based on the soybean production in the similar areas during the period of 1987 to 2011 [25].
Table 6. Annual genetic gain and model fitness expressed as adjusted coefficient of determination for non-adjusted and adjusted soybean trial data from eastern south Dakota State environment (2001-2016).
|
Unadjusted |
Adjusted |
Genetic gain |
1.35 (bushel/ac) |
0.72 (bushel/ac) |
|
0.48 |
0.85 |
Based on the annual soybean production reports [24] [25], the weather conditions among years in South Dakota played a major impact on soybean production, suggesting the degree of adjustment on genotypic values would be considerably large. The results were consistent with our simulations as shown in Tables 2-4.
4. Discussion
National crop variety trial is a dynamic process with test entries and locations varying from year to year. However, such a dynamic process causes a serious need to better predict genotypic values which were not repeated across years and thus limits full utilization of long-term historical trial data. Because of this data structure issue, within-year data or balanced data across a few years were more popularly analyzed and utilized for selection and stability determination [6] [12] while long-term crop trial data have remained underutilized [26]. To better predict genotype values, a proper adjustment is needed to make the long-term crop trial data statistically comparable to the phenotypic performance under a particular year condition. That was our major goal of this study.
From many annual trial reports published online, we observed that it is a common pratice for a number of entries, especially good performers, to be available for two consecutive years though they vary from year to year. This study thus aimed to use the information of these overlapped entries between every two consective years to adjust environmental impacts aross years. As addressed in Methods section, such a process is a step-by-step (year-by-year) adjustment. The efficiency for this stepwise method was numerically evaluated through Monte Carlo simulation. Our simulation showed that the adjustment method can enhance prediction of genotypic values as evidenced by the improved correlation coefficient with control group compared to non-adjusted results (Tables 2-4). As expected, the adjustment is more significant when year variance is larger, suggesting a key objective in this study was achieved. The simulation also showed that a higher number of overlapped entries between two consecutive years tends to have a higher adjustment efficiency compared to a smaller overlapped entry number. Thus, using more overlapped entries every two consecutive years can improve adjustment efficiency compared to using standards only, given the same data set. The conclusion is desired by researchers. Number of years in trials doesn’t show numerical impact on the adjustment efficiency, indicating that this method can be used for long-term historical trial data adjustment. The statistical model used for our simulation study was representative for a multi-year and multi-location trial, suggesting the adjustment method is practical. Thus, this adjustment method makes the simultanaous analysis of long-term historical data possible and provides a way to make different genotypes grown in different years comparable. However, our simulation showed that the adjustment efficiency can be negatively impacted by interaction effects especially by genotype-by-environment interaction effects but the impact can be reduced by using more overlapping entries. Such a conclusion is expected because interaction effects are confounded with environmental effects among years. Using more overlapped entries could help reduce the impact from GE interactions and thus increase adjustment efficiency.
Based on our Monte Carlo simulations, we conclude that this adjustment method can better predict genotypic values from historical trial data without regrowing the entries in the past years of trials. Therefore, it can lead to several important applications. One major application is to adjust annual genetic gain over trial years. In the application to a 16-year soybean trial data collected in South Dakota environments, we observed that a large number of overlapped genotypes were available every two consective years (Table 5), these overlapped genotypes helped better predict genotypic values as evidenced by an improved genetic gain model fitness of 85% compared to that without adjustment with a low model fitness of 48%. In addition, the annual genetic gain from the trial data without adjustment was 1.35 bushel/ac and was overestimated because high yield in the last few years from favorable weather condition in September. However, the adjusted annua yield gain was 0.72 bushel/ac with much improved model fitness. The adjusted annual gain appeared to be more in line with the national and the state annual increase over decades [24] [25]. Similarly, this adjustment method can be applied to other trial data targeting a particular location, area, and/or region.
The entries used in long-term historical crop trial are a useful germplasm collection that can be used to identify QTLs associated with traits of importance. Because these entries could be very diversified, multiple desirable QTLs will be likely identified. With the rapid development of genotyping technologies, these entries can be genotyped at a low cost within a short timeframe. However, the cost for re-phenotyping on the traits of interest could be high, especially for those traits measured in the field. Using historical crop trial data will save a great cost associated with re-phenotyping [26]. Application of the method in this study to genetic association studies with historical data could improve the power of identifing QTLs and also reduce the false descovery rate. Predicted genotypic values can be further used to identify favorable loci associated with traits of interest. That is another way to better utilize long-term historical crop trial data.
Using overlapped entries between every two consecutive years to adjust long-term historical data has been validated to be statistically sound. Therefore, the method proposed in this study is a useful way to improve utilization of the long-term historical data. This method appears to be applicable to national cotton variety trials because the two or three same standards are unsually applied in every cycle (three years) with one standard rollovered to the following cycle [18]. We also noticed that the same standards/check have been used in national winter wheat trials for years. These same standards used among trial years could be another valuable resource to control the environmental effects among years. Therefore, there is a great need to numerically investigate the statistical properties regarding this type of crop trial design so a better crop trial design can be recommended with consideration of efficiency of both environmental control among years and land cost.
This study showed that overlapped geneotypes between each two consecutive years/seasons can be used to better predict genotype values in long-term historical trial data with the use of the method proposed in this study. However, this stepwise adjustment method requires at least one overlaping genotype to be available between any two consecutive years.
Disclaimer
Mention of trade names or commercial products in this publication is solely for the purpose of providing specifc information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.