Author: Jose Crossa

Replication Data for: Genomic prediction of kernel zinc content in multiple maize populations using genotyping-by-sequencing and repeat amplification sequencing markers

Thanda Dhliwayo Edna Mageto Michael Olsen Jose Crossa Prasanna Boddupalli XUECAI ZHANG (2020)

An association-mapping panel (DTMA) and two doubled-haploid (DH) populations (DH1 and DH2) were used in this study, for a total of 487 lines. The dataset includes three types of files: the genotypes of the 487 lines sequenced by GBS (DTMA_DH2_DH3-955690.hmp.txt); the genotypes of the 487 lines sequenced by rAmpSeq (genotype-rAmpSeq.csv); and the phenotypic data files (DH1-phenotype.csv, DH2-phenotype.csv and DTMA-phenotype.csv).

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY

Prediction of multiple-trait and multiple-environment genomic data using recommender systems

Osval Antonio Montesinos-Lopez Jose Crossa Ravi Singh Suchismita Mondal Philomin Juliana (2017)

In genomic-enabled prediction, the task of improving the accuracy of the prediction of lines in environments is difficult because the available information is generally sparse and usually has low correlations between traits. In current genomic selection, while researchers have a large amount of information and appropriate statistical models to process it, there is still limited computing efficiency to do so. Although statistical models are usually mathematically elegant, they are also computationally inefficient, and they are impractical for many traits, lines, environments, and years because they need to sample from huge normal multivariate distributions. For these reasons, this study explores two recommender systems: a) item-based collaborative filtering (IBCF; method M1) and b) the matrix factorization algorithm (method M2) in the context of multiple traits and multiple environments. The IBCF and matrix factorization methods were compared with two conventional methods on simulated and real data. Results of the simulated and real data sets show that the IBCF technique (method M1) was slightly better in terms of prediction accuracy than the two conventional methods and the matrix factorization method when the correlation was moderately high. The IBCF technique is very attractive because it produces good predictions when there is high correlation between items (environment-trait combinations) and its implementation is computationally feasible, which can be useful for plant breeders who deal with very large data sets.
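As an illustration of the item-based collaborative filtering idea described in this abstract, here is a minimal sketch, not the authors' implementation. It assumes a matrix whose rows are lines and whose columns are items (environment-trait combinations), with missing cells stored as NaN and values already on a comparable scale. The function names `item_cosine_similarity` and `ibcf_predict` and the toy matrix are hypothetical.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between columns (items), using only rows where both are observed."""
    observed = ~np.isnan(R)
    filled = np.nan_to_num(R, nan=0.0)
    n_items = R.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            both = observed[:, i] & observed[:, j]
            if not both.any():
                continue  # no co-observed rows: leave similarity at 0
            a, b = filled[both, i], filled[both, j]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            sim[i, j] = (a @ b) / denom if denom > 0 else 0.0
    return sim

def ibcf_predict(R, sim):
    """Fill each missing cell with a similarity-weighted average of that row's observed items."""
    observed = ~np.isnan(R)
    pred = R.copy()
    for u, i in zip(*np.where(~observed)):
        js = observed[u]                  # items observed for this line
        weights = sim[i, js]
        denom = np.abs(weights).sum()
        if denom > 0:
            pred[u, i] = (weights @ R[u, js]) / denom
    return pred

# Toy example: 3 lines x 2 items, one missing cell (hypothetical values).
R_demo = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, np.nan]])
pred_demo = ibcf_predict(R_demo, item_cosine_similarity(R_demo))
```

The prediction for a missing cell depends only on item-to-item similarities and the values already observed for that line, which is why the approach stays cheap when items are highly correlated.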

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY

Replication Data for: Approximate kernels for large data sets in genome-based prediction

Osval Antonio Montesinos-Lopez Johannes Martini Paulino Pérez-Rodríguez Jose Crossa (2020)

The rapid development of molecular markers and sequencing technologies has made it possible to use genomic selection (GS) and genomic prediction (GP) in animal and plant breeding. However, computational difficulties arise when the number of observations is large. The five datasets provided here were used to support a comparative analysis of two genomic-enabled prediction models: the full genomic method for a single environment (FGSE) and the approximate kernel method for a single environment (APSE). The data were also used to compare the full genomic method with genotype × environment interaction (FGGE) to the approximate kernel method with genotype × environment interaction (APGE). The results of the analyses are described in the related publication.

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY

Using an incomplete block design to allocate lines to environments improves sparse genome-based prediction in plant breeding

Osval Antonio Montesinos-Lopez ABELARDO MONTESINOS LOPEZ RICARDO ACOSTA DIAZ Rajeev Varshney Jose Crossa ALISON BENTLEY (2022)

Genomic selection (GS) is a predictive methodology that trains statistical machine-learning models with a reference population that is used to perform genome-enabled predictions of new lines. In plant breeding, it has the potential to increase the speed and reduce the cost of selection. However, to optimize resources, sparse testing methods have been proposed. A common approach is to guarantee a proportion of nonoverlapping and overlapping lines allocated randomly in locations, that is, lines appearing in some locations but not in all. In this study we propose using incomplete block designs (IBD), principally, for the allocation of lines to locations in such a way that not all lines are observed in all locations. We compare this allocation with a random allocation of lines to locations guaranteeing that the lines are allocated to the same number of locations as under the IBD design. We implemented this benchmarking on several crop data sets under the Bayesian genomic best linear unbiased predictor (GBLUP) model, finding that allocation under the principle of IBD outperformed random allocation by between 1.4% and 26.5% across locations, traits, and data sets in terms of mean square error. Although a wide range of performance improvements were observed, our results provide evidence that using IBD for the allocation of lines to locations can help improve predictive performance compared with random allocation. This has the potential to be applied to large-scale plant breeding programs.

Article

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY; Bayes Theorem; Genome; Inflammatory Bowel Diseases; Models, Genetic; Plant Breeding

Replication Data for: Genomic prediction within and across families in wheat pre-breeding populations

Johannes Martini Fernando Henrique Toledo Carolina Sansaloni Jose Crossa Jaime Cuevas Sivakumar Sukumaran (2020)

The genetic diversity housed in germplasm banks may provide valuable contributions to breeding efforts. It is important to understand the best way to introduce this diversity into elite breeding materials. The files in this dataset provide phenotypic and genotypic data used to compare genomic prediction approaches and different cross-validation scenarios on a set of wheat families obtained from crosses between elite materials and diverse germplasm bank accessions. The linked top cross population (LTP) materials analyzed in the study were screened under yield potential, drought, and heat stress conditions.

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY

Genomic and pedigree prediction with genotype × environment interaction in spring wheat grown in South and Western Asia, North Africa, and Mexico

Sivakumar Sukumaran Jose Crossa Carlos Jara Marta Lopes Matthew Paul Reynolds (2016)

Increases in genetic gains in grain yield can be accelerated through genomic selection (GS). In the present study, seven genomic prediction models under two cross-validation scenarios were evaluated on the Wheat Association Mapping Initiative population of 287 advanced elite lines phenotyped for grain yield (GY), thousand grain weight (GW), grain number (GN), and thermal time for flowering (TTF) in 18 environments (year-location combinations) in major wheat-producing countries in 2010 and 2011. Of the seven genomic prediction models tested, four (model 1 (L+E), model 2 (L+E+G), model 3 (L+E+A), and model 4 (L+E+A+G)) included only main effects (lines (L), environments (E), genetic relationship matrix (G), and pedigree-derived matrix (A)), and three (model 5 (L+E+A+AE), model 6 (L+E+G+GE), and model 7 (L+E+G+A+AE+GE)) added the A×E interaction, the G×E interaction, or both to the main effects. Two cross-validation (CV) schemes were applied: (1) predicting lines' performance at untested sites (CV1) and (2) predicting the lines' performance at some sites from their performance at other sites (CV2). Under CV1, the models with interaction terms had the highest average prediction accuracy: models 6 and 7 for GY (0.31) and GN (0.30), and model 5 for TTF (0.26). Models 3, 7, and 2 were the best models for GW (0.45 each) under CV1. Under CV2, prediction accuracy was generally high for the models with interaction terms: models 5, 6, and 7 for GY (0.39) and models 5 and 7 for GN (0.43). For GW and TTF, the models' prediction accuracies were similar. The results indicate that genomic selection can be used to predict genotype-by-environment (G×E) interaction in multi-environment trials, both to select varieties for release and to accelerate breeding.
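The two cross-validation schemes named in this abstract can be sketched as fold assignments over line-by-environment records. This is a minimal sketch under a stated assumption: it follows a commonly used reading in which CV1 holds out whole lines (so a held-out line has no records in training) and CV2 holds out individual line-environment records (so a line is usually still observed at other sites). The abstract does not give the exact partitioning, and the function names and toy data below are hypothetical.

```python
import numpy as np

def cv1_folds(lines, n_folds, rng):
    """CV1-style folds: every record of a given line lands in the same fold."""
    unique_lines = np.unique(lines)
    fold_ids = rng.permutation(len(unique_lines)) % n_folds
    fold_of_line = dict(zip(unique_lines, fold_ids))
    return np.array([fold_of_line[line] for line in lines])

def cv2_folds(n_records, n_folds, rng):
    """CV2-style folds: individual line-environment records are assigned at random."""
    return rng.permutation(n_records) % n_folds

# Toy layout: 10 hypothetical lines, each observed in 3 environments.
rng = np.random.default_rng(0)
lines = np.repeat(np.arange(10), 3)
envs = np.tile(np.arange(3), 10)

folds_cv1 = cv1_folds(lines, n_folds=5, rng=rng)
folds_cv2 = cv2_folds(len(lines), n_folds=5, rng=rng)
```

With purely random record-level folds, a line can occasionally lose all of its records to a single fold by chance; a stricter CV2 variant would need an extra check for that.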

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY

Replication Data for: Allocation of wheat lines in sparse testing for genome-based multi-environment prediction

Leonardo Abdiel Crespo Herrera Ravi Singh Suchismita Mondal Philomin Juliana DIEGO JARQUIN Jose Crossa (2021)

Sparse testing can be used in plant breeding and genome-based prediction. In sparse testing, not all of the lines are sown in all environments. The phenotypic and genotypic data files provided in this dataset were used to analyze three general cases of the composition of the sparse testing allocation design for wheat breeding.

Dataset

AGRICULTURAL SCIENCES AND BIOTECHNOLOGY
