Introduction

Over half of the world’s population depend on rice for survival as it serves as a major staple food for an estimated 66% of people in almost all regions. Therefore, increased rice production will play a crucial role in improving food security, eradicating poverty and supplementing energy demands as it currently provides over 20% of energy. Predictions have shown that when compared with 1995, 60% more rice will have to be produced by 2030 (Saleh et al., 2020).

Influenced by quantitative trait loci (QTL) genes and environmental factors, yield is one of the most significant and complex trait in rice. In rice species, yield is determined by certain factors, classified as either direct or indirect. Direct determinants of yield include 1000 or 100 – grain weight depending on the size of the grain, filled grains per panicle and panicle number per unit area and/or per plant, while indirect determinants include traits like plant height, growth period, tillering ability, panicle length, seed length, seed setting rate, and grains per panicle (Li et al., 2019).

One major predictor of milling yield is test weight, a component used for the comparison of grain densities and it is derived by weighing a known grain volume. It is dependent on embryo size and reserved nutrient quantity which is used for germination and growth. Lower test weight scores occur due to unfavorable field environmental conditions or poor grain filling prior to harvest. Depending on grain size, the 100 or 1000 grain weight is an important measure of seed quality which has a positive effect on seedling growth, seed potential, sprouting and overall plant performance (Deivasigamani S & Swaminathan C, 2018).

Over the years, studies have focused on overall rice yield prediction using agronomic traits (Miedaner et al., 2019; Prakash Poudel, 2018; Puji Lestari et al., 2015; Venkateswarlu et al., n.d.), where test weight has been the leading determinant of grain yield. However, the realization of yield potential which is measured by multiplying the number of spikelets per unit area by average grain weight is highly dependent on some latent factors (Venkateswarlu et al., n.d.). Considering the strong association between test weight and increased yield potential, it will be relatively important to explore factors that might influence or control test weight in rice cultivars. Therefore, this study seeks to;

Explore trait variation across subpopulations and regions.
Examine the extent of association between test weight (100 seed weight) and yield attributing traits across rice subpopulations and geographical regions.
Identify direct and latent factors influencing test weight across subpopulations.

Materials and Methods

Rice Diversity Panel 1 Dataset

The RDP1 (Reg. No. MP-6, NSL 500357 MAP) dataset is used in this study here. It broadly represents the wide range of phenotypic and genotypic variation within O. sativa. It consists of 421 purified homozygous rice (Oryza sativa L.) accessions (GSOR 301001 through GSOR 301421; GSOR 312001 through 312020) which include both landraces and elite rice cultivars. Accessions are classified into five subpopulation groups namely; indica (95 accessions) and aus (60), which belong to the Indica varietal group, and tropical japonica (106), temperate japonica (111), and aromatic (Group V) (16) which comprise the Japonica varietal group. Thirty-three accessions shared <60% ancestry with a single group and were classified as admixtures (Eizenga et al., 2014). Data can be downloaded here (usda.gov). The data selected for the analysis included the following trait data; 100 seed weight (HHGW), panicle number per unit area (PNN), filled grain number per panicle (FGN), plant height (PTH), panicle length (PNL), total florets per panicle (FLN).

Statistical Analysis

All statistical analysis were carried out in R version 4.0.2. Data was loaded into R studio and all missing values and unknown groups were filtered out. For descriptive statistics, boxplots were plotted using the ‘ggplot2’ package to show the distribution of traits across all regions and subpopulations. Before inferential statistical analysis were performed, data was preprocessed and normalized using the caret package after which density plots were drawn to visualize normality. PCA analysis was performed using the ‘ggbiplot’ package. Standardized major axis (SMA) slopes were used for the summary of bivariate trait relationships and to test whether the regression of test weight and other agronomic traits differed among subpopulations and regions. SMA regression was carried out using the ‘smart’ package. Multivariate causal network for all selected traits was quantified with a Structural Equation Model (SEM) using the ‘lavaan’ package. SEM combines multiple regression analysis and factor analysis for the evaluation of structural relationships, estimating interrelated dependence in a single analysis (Li et al., 2019; Regression & Models, 2005). A general linear regression was performed, where the most correlated variable, FLN was used as predictor for test weight across subpopulation and region. All relevant assumptions were tested and post hoc analyses were carried out.

Analysis and Results

# load all required libraries
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(smatr)
library(Cairo)
library(extrafont)

## Registering fonts with R

library(clusterSim)

## Loading required package: cluster

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(grid)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggpubr)
library(lavaan)

## This is lavaan 0.6-7

## lavaan is BETA software! Please report any bugs.

library(semPlot)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

## Registered S3 methods overwritten by 'BDgraph':
##   method    from
##   plot.sim  lava
##   print.sim lava

## Registered S3 methods overwritten by 'huge':
##   method    from   
##   plot.roc  pROC   
##   plot.sim  BDgraph
##   print.roc pROC   
##   print.sim BDgraph

library(devtools)

## Loading required package: usethis

library(multcomp)

## Loading required package: mvtnorm

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: TH.data

## 
## Attaching package: 'TH.data'

## The following object is masked from 'package:MASS':
## 
##     geyser

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(emmeans)

## 
## Attaching package: 'emmeans'

## The following object is masked from 'package:devtools':
## 
##     test

library(modelr)
#library(ggbiplot)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Distribution of subpopulations across regions

# load data and clean names
rice <- read.csv("./raw_data/rice_phenotype_data.csv") %>%
  janitor::clean_names() %>%
  filter(!subpopulation_group_pca %in% "N/A") %>%
  filter(!region %in% "Unknown") %>%
  na.omit()
  
# view data
head(rice)

##   nsftv_id gsor_id irgc_id          name orignal_providing_country
## 1        1  301001  126380      Agostano                     Italy
## 2        3  301003  117636 Ai-Chiao-Hong                     China
## 3        6  301006  117603      ARC 7229                     India
## 4        7  301007  126381         Arias                 Indonesia
## 5        8  301008  117643   Asse Y Pung               Philippines
## 6        9  301009  117647         Baber                     India
##           region subpopulation_group_structure subpopulation_group_pca
## 1    West Europe                           TEJ                     TEJ
## 2      East Asia                           IND                     IND
## 3     South Asia                           AUS                     AUS
## 4 Southeast Asia                           TRJ                     TRJ
## 5 Southeast Asia                           TRJ                     TRJ
## 6     South Asia                       ADMIX_J                     TEJ
##   subpopulation_group_fast_structure amycn alkdig dhulprotcn  dthd  ptht flflg
## 1                 temperate-japonica  15.6    6.1        8.4  75.1 110.9  28.4
## 2                             indica  23.1    5.8        9.2  89.5 143.5    39
## 3                                aus  23.2    5.5        8.5  89.1 148.3  36.9
## 4                  tropical-japonica  20.1    6.3       11.8   105 119.6    37
## 5                  tropical-japonica  22.2    6.6        8.7 101.6 147.4    42
## 6                 temperate-japonica  22.1    5.4        9.9    73 128.9  24.4
##   flfwd pnnb pnlg pbrnb flnbppn unfilgrnb filgrnb hulgrlg hulgrwd hulgrlgwdro
## 1   1.3 21.5 20.5   9.3   136.3      16.5   119.8     8.1     3.7         2.2
## 2     1 57.5 26.8   9.2     113      28.3    84.8     7.7       3         2.6
## 3   1.8 17.4 30.9  11.2   291.9      37.6   254.3     7.1     3.3         2.2
## 4   1.6 17.4 24.6  13.3   229.5      21.4   208.1     7.3     2.5         2.9
## 5   1.9 13.8 31.5  12.2   328.6      72.1   256.4     8.6     2.7         3.2
## 6   1.2 18.6 20.2   7.9   114.4      12.9   101.5     7.7     3.4         2.3
##   hulgrvol hhulgrwt dhulgrlg dhulgrwd dhulgrvol culmhab sdsh awnplu lflpubes
## 1     13.3      3.1      5.8      3.1       7.7       4  yes      0      yes
## 2      8.3      2.3      5.6      2.6       5.1       8  yes      0      yes
## 3        9      2.2      5.1      2.8       5.6       6  yes      0      yes
## 4      5.9      2.2      5.5      2.2       3.6       3  yes      1      yes
## 5      7.7        2      6.5      2.2       4.5       3  yes      0      yes
## 6     10.5      2.7      5.5      2.9       6.4       7   no      5      yes
##   hulcl dhulgrcl lfblrs strthead
## 1     2        2      6      4.8
## 2     2        1      3      7.8
## 3     4        2      0      8.2
## 4     2        2    1.5      8.6
## 5     3        2    1.5        7
## 6     6        2    6.5      7.2

# check data structure
str(rice)

## 'data.frame':    383 obs. of  38 variables:
##  $ nsftv_id                          : chr  "1" "3" "6" "7" ...
##  $ gsor_id                           : chr  "301001" "301003" "301006" "301007" ...
##  $ irgc_id                           : chr  "126380" "117636" "117603" "126381" ...
##  $ name                              : chr  "Agostano" "Ai-Chiao-Hong" "ARC 7229" "Arias" ...
##  $ orignal_providing_country         : chr  "Italy" "China" "India" "Indonesia" ...
##  $ region                            : chr  "West Europe" "East Asia" "South Asia" "Southeast Asia" ...
##  $ subpopulation_group_structure     : chr  "TEJ" "IND" "AUS" "TRJ" ...
##  $ subpopulation_group_pca           : chr  "TEJ" "IND" "AUS" "TRJ" ...
##  $ subpopulation_group_fast_structure: chr  "temperate-japonica" "indica" "aus" "tropical-japonica" ...
##  $ amycn                             : chr  "15.6" "23.1" "23.2" "20.1" ...
##  $ alkdig                            : chr  "6.1" "5.8" "5.5" "6.3" ...
##  $ dhulprotcn                        : chr  "8.4" "9.2" "8.5" "11.8" ...
##  $ dthd                              : chr  "75.1" "89.5" "89.1" "105" ...
##  $ ptht                              : chr  "110.9" "143.5" "148.3" "119.6" ...
##  $ flflg                             : chr  "28.4" "39" "36.9" "37" ...
##  $ flfwd                             : chr  "1.3" "1" "1.8" "1.6" ...
##  $ pnnb                              : chr  "21.5" "57.5" "17.4" "17.4" ...
##  $ pnlg                              : chr  "20.5" "26.8" "30.9" "24.6" ...
##  $ pbrnb                             : chr  "9.3" "9.2" "11.2" "13.3" ...
##  $ flnbppn                           : chr  "136.3" "113" "291.9" "229.5" ...
##  $ unfilgrnb                         : chr  "16.5" "28.3" "37.6" "21.4" ...
##  $ filgrnb                           : chr  "119.8" "84.8" "254.3" "208.1" ...
##  $ hulgrlg                           : chr  "8.1" "7.7" "7.1" "7.3" ...
##  $ hulgrwd                           : chr  "3.7" "3" "3.3" "2.5" ...
##  $ hulgrlgwdro                       : chr  "2.2" "2.6" "2.2" "2.9" ...
##  $ hulgrvol                          : chr  "13.3" "8.3" "9" "5.9" ...
##  $ hhulgrwt                          : chr  "3.1" "2.3" "2.2" "2.2" ...
##  $ dhulgrlg                          : chr  "5.8" "5.6" "5.1" "5.5" ...
##  $ dhulgrwd                          : chr  "3.1" "2.6" "2.8" "2.2" ...
##  $ dhulgrvol                         : chr  "7.7" "5.1" "5.6" "3.6" ...
##  $ culmhab                           : chr  "4" "8" "6" "3" ...
##  $ sdsh                              : chr  "yes" "yes" "yes" "yes" ...
##  $ awnplu                            : chr  "0" "0" "0" "1" ...
##  $ lflpubes                          : chr  "yes" "yes" "yes" "yes" ...
##  $ hulcl                             : chr  "2" "2" "4" "2" ...
##  $ dhulgrcl                          : chr  "2" "1" "2" "2" ...
##  $ lfblrs                            : chr  "6" "3" "0" "1.5" ...
##  $ strthead                          : chr  "4.8" "7.8" "8.2" "8.6" ...
##  - attr(*, "na.action")= 'omit' Named int [1:22] 9 10 38 48 55 72 74 80 91 114 ...
##   ..- attr(*, "names")= chr [1:22] "9" "10" "38" "48" ...

# convert characters to class numeric
rice2 <- rice %>%
  mutate(hhulgrwt1 = as.numeric(hhulgrwt),
         hulgrvol1 = as.numeric(hulgrvol),
         ptht1 = as.numeric(ptht),
         pnlg1 = as.numeric(pnlg),
         pnnb1 = as.numeric(pnnb),
         dthd1 = as.numeric(dthd),
         filgrnb1 = as.numeric(filgrnb),
         flnbppn1 = as.numeric(flnbppn))

# str data
str(rice2)

## 'data.frame':    383 obs. of  46 variables:
##  $ nsftv_id                          : chr  "1" "3" "6" "7" ...
##  $ gsor_id                           : chr  "301001" "301003" "301006" "301007" ...
##  $ irgc_id                           : chr  "126380" "117636" "117603" "126381" ...
##  $ name                              : chr  "Agostano" "Ai-Chiao-Hong" "ARC 7229" "Arias" ...
##  $ orignal_providing_country         : chr  "Italy" "China" "India" "Indonesia" ...
##  $ region                            : chr  "West Europe" "East Asia" "South Asia" "Southeast Asia" ...
##  $ subpopulation_group_structure     : chr  "TEJ" "IND" "AUS" "TRJ" ...
##  $ subpopulation_group_pca           : chr  "TEJ" "IND" "AUS" "TRJ" ...
##  $ subpopulation_group_fast_structure: chr  "temperate-japonica" "indica" "aus" "tropical-japonica" ...
##  $ amycn                             : chr  "15.6" "23.1" "23.2" "20.1" ...
##  $ alkdig                            : chr  "6.1" "5.8" "5.5" "6.3" ...
##  $ dhulprotcn                        : chr  "8.4" "9.2" "8.5" "11.8" ...
##  $ dthd                              : chr  "75.1" "89.5" "89.1" "105" ...
##  $ ptht                              : chr  "110.9" "143.5" "148.3" "119.6" ...
##  $ flflg                             : chr  "28.4" "39" "36.9" "37" ...
##  $ flfwd                             : chr  "1.3" "1" "1.8" "1.6" ...
##  $ pnnb                              : chr  "21.5" "57.5" "17.4" "17.4" ...
##  $ pnlg                              : chr  "20.5" "26.8" "30.9" "24.6" ...
##  $ pbrnb                             : chr  "9.3" "9.2" "11.2" "13.3" ...
##  $ flnbppn                           : chr  "136.3" "113" "291.9" "229.5" ...
##  $ unfilgrnb                         : chr  "16.5" "28.3" "37.6" "21.4" ...
##  $ filgrnb                           : chr  "119.8" "84.8" "254.3" "208.1" ...
##  $ hulgrlg                           : chr  "8.1" "7.7" "7.1" "7.3" ...
##  $ hulgrwd                           : chr  "3.7" "3" "3.3" "2.5" ...
##  $ hulgrlgwdro                       : chr  "2.2" "2.6" "2.2" "2.9" ...
##  $ hulgrvol                          : chr  "13.3" "8.3" "9" "5.9" ...
##  $ hhulgrwt                          : chr  "3.1" "2.3" "2.2" "2.2" ...
##  $ dhulgrlg                          : chr  "5.8" "5.6" "5.1" "5.5" ...
##  $ dhulgrwd                          : chr  "3.1" "2.6" "2.8" "2.2" ...
##  $ dhulgrvol                         : chr  "7.7" "5.1" "5.6" "3.6" ...
##  $ culmhab                           : chr  "4" "8" "6" "3" ...
##  $ sdsh                              : chr  "yes" "yes" "yes" "yes" ...
##  $ awnplu                            : chr  "0" "0" "0" "1" ...
##  $ lflpubes                          : chr  "yes" "yes" "yes" "yes" ...
##  $ hulcl                             : chr  "2" "2" "4" "2" ...
##  $ dhulgrcl                          : chr  "2" "1" "2" "2" ...
##  $ lfblrs                            : chr  "6" "3" "0" "1.5" ...
##  $ strthead                          : chr  "4.8" "7.8" "8.2" "8.6" ...
##  $ hhulgrwt1                         : num  3.1 2.3 2.2 2.2 2 2.7 2.6 1.6 4.4 2.4 ...
##  $ hulgrvol1                         : num  13.3 8.3 9 5.9 7.7 10.5 9.1 6.1 10 7.4 ...
##  $ ptht1                             : num  111 144 148 120 147 ...
##  $ pnlg1                             : num  20.5 26.8 30.9 24.6 31.5 20.2 21.1 17.6 22.5 30.6 ...
##  $ pnnb1                             : num  21.5 57.5 17.4 17.4 13.8 18.6 36.7 39 26.7 30.9 ...
##  $ dthd1                             : num  75.1 89.5 89.1 105 101.6 ...
##  $ filgrnb1                          : num  119.8 84.8 254.3 208.1 256.4 ...
##  $ flnbppn1                          : num  136 113 292 230 329 ...
##  - attr(*, "na.action")= 'omit' Named int [1:22] 9 10 38 48 55 72 74 80 91 114 ...
##   ..- attr(*, "names")= chr [1:22] "9" "10" "38" "48" ...

# Distribution of subpopulations across all regions

# create an object for total region counts
totals <- rice2 %>% 
  count(region)

# create a barplot object
barplot <- rice2 %>%
  count(subpopulation_group_pca, region) %>%
  filter(n !=10) %>%
  ggplot(aes(y = region, x = n)) +
  geom_bar(stat = "identity", aes(fill = subpopulation_group_pca)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.3, hjust=1)) +
  geom_text(data = totals, aes(label = n), cex = 3) +
  labs(y ="Regions", title = "Distribution of subpopulations across regions") 

# visulaize barplot
barplot

East Asia had the highest number of accessions (76) consisting of four sub-populations, with Temperate Japonica (TEJ) and Indica varieties (IND) occupying a significant proportion. The Admixed (ADMIX) subpopulation can be found in almost all regions except Oceania, Central Asia, Central America and the Caribbean regions with the lowest number of accessions (6)

Trait differences across regions and sub-populations

# trait differences across regions and across sub-populations
hhulgrwt <- ggplot(rice2,
       aes(y = hhulgrwt1, fill = region)) +
  geom_boxplot() +
  labs(y ="100-Seed weight") 


ptht <- ggplot(rice2,
       aes(y = ptht1, fill = region)) +
  geom_boxplot() +
  labs(y ="Plant Height") 


pnlg <- ggplot(rice2,
       aes(y = pnlg1, fill = region)) +
  geom_boxplot() +
  labs(y ="Panicle length
Inflorescence length") 


pnnb <- ggplot(rice2,
       aes(y = pnnb1, fill = region)) +
  geom_boxplot() +
  labs(y ="Panicles per plant
Inflorescence per plant") 


dthd <- ggplot(rice2,
       aes(y = dthd1, fill = region)) +
  geom_boxplot() +
  labs(y ="Days to heading") 


filgrnb <- ggplot(rice2,
       aes(y = filgrnb1, fill = region)) +
  geom_boxplot() +
  labs(y ="Seeds per panicle
Filled florets per panicle")


flnbppn <- ggplot(rice2,
       aes(y = flnbppn1, fill = region)) +
  geom_boxplot() +
  labs(y ="Total Florets per panicle")



ggarrange(hhulgrwt, ptht, pnlg, pnnb, dthd, filgrnb, flnbppn  + rremove("x.text"), common.legend = TRUE, ncol = 2, nrow = 2)

## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

#------------------------------------------------------------------------------


# trait differences across subpopulations
hhulgrwt <- ggplot(rice2,
       aes(y = hhulgrwt1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="100-Seed weight") 


ptht <- ggplot(rice2,
       aes(y = ptht1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Plant Height") 


pnlg <- ggplot(rice2,
       aes(y = pnlg1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Panicle length
Inflorescence length") 


pnnb <- ggplot(rice2,
       aes(y = pnnb1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Panicles per plant
Inflorescence per plant") 


dthd <- ggplot(rice2,
       aes(y = dthd1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Days to heading") 


filgrnb <- ggplot(rice2,
       aes(y = filgrnb1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Seeds per panicle
Filled florets per panicle")


flnbppn <- ggplot(rice2,
       aes(y = flnbppn1, fill = subpopulation_group_pca)) +
  geom_boxplot() +
  labs(y ="Total Florets per panicle")


# arrange all plots side by side using ggarrange
ggarrange(hhulgrwt, ptht, pnlg, pnnb, dthd, filgrnb, flnbppn, rremove("x.text"), common.legend = TRUE, ncol = 2, nrow = 2)

## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Data normalization for regression analysis

# scale and center data using Caret library
preproc1 <- preProcess(rice2[,39:46], method=c("center", "scale"))

# normalize data with mean 0 and standard deviation 1 
norm1 <- predict(preproc1, rice2[,39:46]) %>%
  mutate(region = rice2$region,
         subpopulation =rice2$subpopulation_group_pca) %>%
  na.omit() %>%
  dplyr::rename(HHGW = hhulgrwt1,
         HGV = hulgrvol1,
         PTH = ptht1,
         PNL = pnlg1,
         PNN = pnnb1,
         DTH = dthd1,
         FGN = filgrnb1,
         FLN = flnbppn1)

# summary of normalized data 
summary(norm1)

##       HHGW               HGV                PTH                 PNL          
##  Min.   :-3.13283   Min.   :-1.85980   Min.   :-2.404944   Min.   :-2.46837  
##  1st Qu.:-0.63691   1st Qu.:-0.69574   1st Qu.:-0.781939   1st Qu.:-0.61774  
##  Median :-0.22092   Median :-0.07214   Median :-0.017243   Median :-0.04441  
##  Mean   : 0.02193   Mean   : 0.01695   Mean   : 0.006418   Mean   : 0.01261  
##  3rd Qu.: 0.61105   3rd Qu.: 0.64501   3rd Qu.: 0.674510   3rd Qu.: 0.63778  
##  Max.   : 3.93895   Max.   : 3.17060   Max.   : 3.746671   Max.   : 3.36654  
##       PNN                DTH                FGN                 FLN           
##  Min.   :-1.80455   Min.   :-2.41050   Min.   :-2.293937   Min.   :-1.929527  
##  1st Qu.:-0.74798   1st Qu.:-0.60661   1st Qu.:-0.626699   1st Qu.:-0.610398  
##  Median :-0.22914   Median :-0.15163   Median :-0.039020   Median :-0.116219  
##  Mean   :-0.02606   Mean   :-0.02416   Mean   : 0.003102   Mean   : 0.006202  
##  3rd Qu.: 0.56907   3rd Qu.: 0.52461   3rd Qu.: 0.526551   3rd Qu.: 0.521558  
##  Max.   : 3.07712   Max.   : 5.11907   Max.   : 3.608639   Max.   : 3.564864  
##     region          subpopulation     
##  Length:364         Length:364        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

# View distributions after data normalization}

norm1 %>%
  group_by(subpopulation) %>%
  summarise(subpopulation, HHGW) %>%
  ggplot(aes(x = HHGW, fill = subpopulation)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~subpopulation, ncol = 2) +
  labs(title = "Distribution of grain weight after normalization")

## `summarise()` regrouping output by 'subpopulation' (override with `.groups` argument)

Principal Component Analysis

# I encountered an error downloading the ggbiplot package from GitHub so I copied the function code from the GitHub repo.

ggbiplot <- function(pcobj, choices = 1:2, scale = 1, pc.biplot = TRUE, 
                     obs.scale = 1 - scale, var.scale = scale, 
                     groups = NULL, ellipse = FALSE, ellipse.prob = 0.68, 
                     labels = NULL, labels.size = 3, alpha = 1, 
                     var.axes = TRUE, 
                     circle = FALSE, circle.prob = 0.69, 
                     varname.size = 3, varname.adjust = 1.5, 
                     varname.abbrev = FALSE, ...)
{
  library(ggplot2)
  library(plyr)
  library(scales)
  library(grid)
  
  stopifnot(length(choices) == 2)
  
  # Recover the SVD
  if(inherits(pcobj, 'prcomp')){
    nobs.factor <- sqrt(nrow(pcobj$x) - 1)
    d <- pcobj$sdev
    u <- sweep(pcobj$x, 2, 1 / (d * nobs.factor), FUN = '*')
    v <- pcobj$rotation
  } else if(inherits(pcobj, 'princomp')) {
    nobs.factor <- sqrt(pcobj$n.obs)
    d <- pcobj$sdev
    u <- sweep(pcobj$scores, 2, 1 / (d * nobs.factor), FUN = '*')
    v <- pcobj$loadings
  } else if(inherits(pcobj, 'PCA')) {
    nobs.factor <- sqrt(nrow(pcobj$call$X))
    d <- unlist(sqrt(pcobj$eig)[1])
    u <- sweep(pcobj$ind$coord, 2, 1 / (d * nobs.factor), FUN = '*')
    v <- sweep(pcobj$var$coord,2,sqrt(pcobj$eig[1:ncol(pcobj$var$coord),1]),FUN="/")
  } else if(inherits(pcobj, "lda")) {
    nobs.factor <- sqrt(pcobj$N)
    d <- pcobj$svd
    u <- predict(pcobj)$x/nobs.factor
    v <- pcobj$scaling
    d.total <- sum(d^2)
  } else {
    stop('Expected a object of class prcomp, princomp, PCA, or lda')
  }
  
  # Scores
  choices <- pmin(choices, ncol(u))
  df.u <- as.data.frame(sweep(u[,choices], 2, d[choices]^obs.scale, FUN='*'))
  
  # Directions
  v <- sweep(v, 2, d^var.scale, FUN='*')
  df.v <- as.data.frame(v[, choices])
  
  names(df.u) <- c('xvar', 'yvar')
  names(df.v) <- names(df.u)
  
  if(pc.biplot) {
    df.u <- df.u * nobs.factor
  }
  
  # Scale the radius of the correlation circle so that it corresponds to 
  # a data ellipse for the standardized PC scores
  r <- sqrt(qchisq(circle.prob, df = 2)) * prod(colMeans(df.u^2))^(1/4)
  
  # Scale directions
  v.scale <- rowSums(v^2)
  df.v <- r * df.v / sqrt(max(v.scale))
  
  # Change the labels for the axes
  if(obs.scale == 0) {
    u.axis.labs <- paste('standardized PC', choices, sep='')
  } else {
    u.axis.labs <- paste('PC', choices, sep='')
  }
  
  # Append the proportion of explained variance to the axis labels
  u.axis.labs <- paste(u.axis.labs, 
                       sprintf('(%0.1f%% explained var.)', 
                               100 * pcobj$sdev[choices]^2/sum(pcobj$sdev^2)))
  
  # Score Labels
  if(!is.null(labels)) {
    df.u$labels <- labels
  }
  
  # Grouping variable
  if(!is.null(groups)) {
    df.u$groups <- groups
  }
  
  # Variable Names
  if(varname.abbrev) {
    df.v$varname <- abbreviate(rownames(v))
  } else {
    df.v$varname <- rownames(v)
  }
  
  # Variables for text label placement
  df.v$angle <- with(df.v, (180/pi) * atan(yvar / xvar))
  df.v$hjust = with(df.v, (1 - varname.adjust * sign(xvar)) / 2)
  
  # Base plot
  g <- ggplot(data = df.u, aes(x = xvar, y = yvar)) + 
    xlab(u.axis.labs[1]) + ylab(u.axis.labs[2]) + coord_equal()
  
  if(var.axes) {
    # Draw circle
    if(circle) 
    {
      theta <- c(seq(-pi, pi, length = 50), seq(pi, -pi, length = 50))
      circle <- data.frame(xvar = r * cos(theta), yvar = r * sin(theta))
      g <- g + geom_path(data = circle, color = muted('white'), 
                         size = 1/2, alpha = 1/3)
    }
    
    # Draw directions
    g <- g +
      geom_segment(data = df.v,
                   aes(x = 0, y = 0, xend = xvar, yend = yvar),
                   arrow = arrow(length = unit(1/2, 'picas')), 
                   color = muted('red'))
  }
  
  # Draw either labels or points
  if(!is.null(df.u$labels)) {
    if(!is.null(df.u$groups)) {
      g <- g + geom_text(aes(label = labels, color = groups), 
                         size = labels.size)
    } else {
      g <- g + geom_text(aes(label = labels), size = labels.size)      
    }
  } else {
    if(!is.null(df.u$groups)) {
      g <- g + geom_point(aes(color = groups), alpha = alpha)
    } else {
      g <- g + geom_point(alpha = alpha)      
    }
  }
  
  # Overlay a concentration ellipse if there are groups
  if(!is.null(df.u$groups) && ellipse) {
    theta <- c(seq(-pi, pi, length = 50), seq(pi, -pi, length = 50))
    circle <- cbind(cos(theta), sin(theta))
    
    ell <- ddply(df.u, 'groups', function(x) {
      if(nrow(x) <= 2) {
        return(NULL)
      }
      sigma <- var(cbind(x$xvar, x$yvar))
      mu <- c(mean(x$xvar), mean(x$yvar))
      ed <- sqrt(qchisq(ellipse.prob, df = 2))
      data.frame(sweep(circle %*% chol(sigma) * ed, 2, mu, FUN = '+'), 
                 groups = x$groups[1])
    })
    names(ell)[1:2] <- c('xvar', 'yvar')
    g <- g + geom_path(data = ell, aes(color = groups, group = groups))
  }
  
  # Label the variable axes
  if(var.axes) {
    g <- g + 
      geom_text(data = df.v, 
                aes(label = varname, x = xvar, y = yvar, 
                    angle = angle, hjust = hjust), 
                color = 'darkred', size = varname.size)
  }
  # Change the name of the legend for groups
  # if(!is.null(groups)) {
  #   g <- g + scale_color_brewer(name = deparse(substitute(groups)), 
  #                               palette = 'Dark2')
  # }
  
  # TODO: Add a second set of axes
  
  return(g)
}

# compute PCA 
rice.pca <- prcomp(norm1[,1:8], center = TRUE,scale. = TRUE)

# view pca summary
summary(rice.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7248 1.4214 1.1228 0.80165 0.66286 0.63019 0.43245
## Proportion of Variance 0.3719 0.2525 0.1576 0.08033 0.05492 0.04964 0.02338
## Cumulative Proportion  0.3719 0.6244 0.7820 0.86234 0.91726 0.96690 0.99028
##                            PC8
## Standard deviation     0.27883
## Proportion of Variance 0.00972
## Cumulative Proportion  1.00000

# view pca structure
str(rice.pca)

## List of 5
##  $ sdev    : num [1:8] 1.725 1.421 1.123 0.802 0.663 ...
##  $ rotation: num [1:8, 1:8] -0.3657 -0.3812 0.3156 0.4178 0.0747 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "HHGW" "HGV" "PTH" "PNL" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:8] 0.02193 0.01695 0.00642 0.01261 -0.02606 ...
##   ..- attr(*, "names")= chr [1:8] "HHGW" "HGV" "PTH" "PNL" ...
##  $ scale   : Named num [1:8] 0.989 0.995 0.984 0.994 0.954 ...
##   ..- attr(*, "names")= chr [1:8] "HHGW" "HGV" "PTH" "PNL" ...
##  $ x       : num [1:364, 1:8] -2.227 0.552 3.9 2.64 4.948 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:364] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

P <- ggbiplot(rice.pca,
         obs.scale = 1, 
         var.scale=1,
         ellipse=F,
         circle=F,
         varname.size=4,
         var.axes=T,
         groups=norm1$subpopulation, #no need for coloring, I'm making the points invisible
         alpha=0) #invisible points, I add them below

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:ggpubr':
## 
##     mutate

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

P$layers <- c(geom_point(aes(color=norm1$subpopulation), cex=3), P$layers) #add geom_point in a layer underneath (only way I have to change the size of the points in ggbiplot)

P

For the principal component plot, PC1 accounted for 37.2% of the total variation and was associated with FGN, FLN, PNL, DTH and PNN. The first quadrat of PC1 formed a cluster comprising ADMIX, AUS and IND scoring high for FGN, FLN, PNL and DTH, while IND and AUS formed the major cluster in the second quadrat scoring low for PTH and PNN. TEJ and ADMIX were the major clusters found in PC2 which accounted for 25.3%. It was associated with HHGW and HGV, which scored high on the first quadrat, while the second quadrat contained some clusters of TEJ, ADMIX and AUS.

Standardized Major Axis (SMA) regression

Relationship between Test weight and other agronomic traits

# To plot relationship between seed per panicle and test weight across subpopulations

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$subpopulation), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~FGN*subpopulation, data=subset(norm1, subpopulation == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(subpopulation = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between seed per panicle and test weight across subpopulations
seeds_per_panicle <- ggplot(norm1) +
  geom_point(aes(x=FGN, y=HHGW, color=subpopulation)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Seeds per panicle", y = "100 seed weight",
       title = "Relationship between test weight and seeds per panicle across subpopulations")

# visualize relationship
seeds_per_panicle

# make a facet-wrap to view relationship across groups
seeds_per_panicle + facet_wrap(~subpopulation, ncol=2)

# Perform sma analysis for seed per panicle across subpopulations
filgrnb_sma <-sma(HHGW~FGN*subpopulation, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
filgrnb_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")       # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
ADMIX	56	0.0121245	0.4191634	-1.0579973	-1.3831041	-0.8093088	0.1553618	-0.2030996	0.5138232	NA	NA	NA	NA
ARO	7	0.2221491	0.2856813	-3.3189881	-8.0916075	-1.3613713	-3.0181351	-5.9428513	-0.0934189	NA	NA	NA	NA
AUS	51	0.0722018	0.0565743	-0.6931603	-0.9108492	-0.5274981	-0.5983793	-0.8477217	-0.3490369	NA	NA	NA	NA
IND	73	0.0003167	0.8812093	0.8160154	0.6454752	1.0316138	-0.5489209	-0.8172319	-0.2806098	NA	NA	NA	NA
TEJ	92	0.0000000	0.9992490	0.9523719	0.7735907	1.1724705	0.7747140	0.5149058	1.0345222	NA	NA	NA	NA
TRJ	85	0.2515723	0.0000010	-0.9518213	-1.1484191	-0.7888790	0.7758076	0.5609367	0.9906784	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To plot the relationship between Total Florets per panicle and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$subpopulation), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~FLN*subpopulation, data=subset(norm1, subpopulation == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(subpopulation = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between seed per panicle and test weight across subpopulations
florets_per_panicle <- ggplot(norm1) +
  geom_point(aes(x=FLN, y=HHGW, color=subpopulation)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Total Florets per Panicle", y = "100 seed weight",
       title = "Relationship between test weight and total florets per panicle across subpopulations")

# visualize relationship
florets_per_panicle

# make a facet-wrap to view relationship across groups
florets_per_panicle + facet_wrap(~subpopulation, ncol=2)

# Perform sma analysis for Total Florets per panicle across subpopulations
flnbppn_sma <-sma(HHGW~FLN*subpopulation, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
flnbppn_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")        # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
ADMIX	56	0.1613282	0.0021530	-1.0019881	-1.2831417	-0.7824390	0.4253753	0.1263962	0.7243544	NA	NA	NA	NA
ARO	7	0.3560280	0.1572758	-3.3027580	-7.5404121	-1.4466332	-2.9293795	-5.4986269	-0.3601320	NA	NA	NA	NA
AUS	51	0.0767219	0.0490988	-0.7737894	-1.0161392	-0.5892401	-0.8708629	-1.1211628	-0.6205630	NA	NA	NA	NA
IND	73	0.0280656	0.1565744	-0.8576659	-1.0807839	-0.6806086	-0.5036280	-0.7507534	-0.2565026	NA	NA	NA	NA
TEJ	92	0.0000873	0.9295503	1.0138130	0.8235053	1.2480998	0.8539348	0.5901924	1.1176772	NA	NA	NA	NA
TRJ	85	0.3162021	0.0000000	-0.9617684	-1.1509440	-0.8036867	0.8723962	0.6641858	1.0806066	NA	NA	NA	NA

#------------------------------------------------------------------------------


# To estimate the relationship between panicles per plant and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$subpopulation), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~FLN*subpopulation, data=subset(norm1, subpopulation == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(subpopulation = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between panicles per plant and test weight across subpopulations
panicles_per_plant <- ggplot(norm1) +
  geom_point(aes(x=PNN, y=HHGW, color=subpopulation)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Panicle per plant", y = "100 seed weight",
       title = "Relationship between test weight and panicles per plant across subpopulations")

# visualize relationship
panicles_per_plant

# make a facet-wrap to view relationship across groups
panicles_per_plant + facet_wrap(~subpopulation, ncol=2)

# Perform sma analysis for panicles per plant across subpopulations
pnnb_sma <-sma(HHGW~PNN*subpopulation, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
pnnb_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")            # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
ADMIX	56	0.1381634	0.0047924	-1.3066327	-1.678839	-1.0169464	-0.2327722	-0.5533354	0.0877910	NA	NA	NA	NA
ARO	7	0.0148219	0.7948473	3.1650317	1.190632	8.4135364	-0.3755597	-1.8688436	1.1177243	NA	NA	NA	NA
AUS	51	0.0877770	0.0347723	-0.8982791	-1.177736	-0.6851323	-0.1998707	-0.4796015	0.0798600	NA	NA	NA	NA
IND	73	0.0212668	0.2182957	-1.0111003	-1.275145	-0.8017317	0.5352962	0.1808400	0.8897523	NA	NA	NA	NA
TEJ	92	0.0681682	0.0119433	-1.4595375	-1.784107	-1.1940148	0.1097310	-0.1148818	0.3343438	NA	NA	NA	NA
TRJ	85	0.0135029	0.2895732	-2.6793730	-3.322629	-2.1606505	-2.0072094	-2.5961890	-1.4182299	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To estimate the relationship between plant height and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$subpopulation), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~PTH*subpopulation, data=subset(norm1, subpopulation == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(subpopulation = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between plant height and test weight across subpopulations
plant_height <- ggplot(norm1) +
  geom_point(aes(x=PTH, y=HHGW, color=subpopulation)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Plant height", y = "100 seed weight",
       title = "Relationship between test weight and plant height across regions")

# visualize relationship
plant_height

# make a facet-wrap to view relationship across groups
plant_height + facet_wrap(~subpopulation, ncol=2)

# Perform sma analysis for panicles per plant across subpopulations
ptht_sma <-sma(HHGW~PTH*subpopulation, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
ptht_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")            # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
ADMIX	56	0.0118435	0.4246377	-0.9903552	-1.2947248	-0.7575381	0.2082924	-0.1502028	0.5667876	NA	NA	NA	NA
ARO	7	0.0128631	0.8086956	-1.2907307	-3.4336862	-0.4851887	1.7019968	-1.0996128	4.5036064	NA	NA	NA	NA
AUS	51	0.0263307	0.2552656	-1.4129056	-1.8687461	-1.0682576	0.3222571	-0.0715528	0.7160670	NA	NA	NA	NA
IND	73	0.0482383	0.0618992	-0.7246572	-0.9110112	-0.5764232	-0.3988271	-0.6400974	-0.1575568	NA	NA	NA	NA
TEJ	92	0.0282577	0.1092141	1.2331440	1.0045823	1.5137078	1.2298064	0.9489479	1.5106649	NA	NA	NA	NA
TRJ	85	0.0402848	0.0654966	1.1154818	0.9021361	1.3792814	0.4182288	0.1609324	0.6755252	NA	NA	NA	NA

#------------------------------------------------------------------------------


# To estimate the relationship between days to heading and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$subpopulation), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~DTH*subpopulation, data=subset(norm1, subpopulation == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(subpopulation = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between days to heading and test weight across subpopulations
days_to_heading <- ggplot(norm1) +
  geom_point(aes(x=DTH, y=HHGW, color=subpopulation)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Days to heading", y = "100 seed weight",
       title = "Relationship between test weight and Days to Heading across subpopulations")

# visualize relationship
days_to_heading

# make a facet-wrap to view relationship across groups
days_to_heading + facet_wrap(~subpopulation, ncol=2)

# Perform sma analysis for days to heading across subpopulations
dthd_sma <-sma(HHGW~DTH*subpopulation, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
dthd_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")            # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
ADMIX	56	0.0570007	0.0763853	-1.1270561	-1.4645429	-0.8673391	0.1121810	-0.2200131	0.4443752	NA	NA	NA	NA
ARO	7	0.4841719	0.0825206	-1.3669631	-2.9013274	-0.6440459	-0.4074222	-1.2867789	0.4719346	NA	NA	NA	NA
AUS	51	0.0192425	0.3316554	1.1706593	0.8842261	1.5498786	-0.8271417	-1.0979854	-0.5562980	NA	NA	NA	NA
IND	73	0.0295793	0.1456815	-0.6419890	-0.8088564	-0.5095464	-0.2516907	-0.5068615	0.0034801	NA	NA	NA	NA
TEJ	92	0.0186075	0.1947779	-0.9552551	-1.1737686	-0.7774210	-0.1698103	-0.4337191	0.0940985	NA	NA	NA	NA
TRJ	85	0.0007544	0.8029571	-1.5460831	-1.9198803	-1.2450635	0.7033897	0.4122409	0.9945385	NA	NA	NA	NA

# To relationship between seed per panicle and test weight across regions

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$region), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~FGN*region, data=subset(norm1, region == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(region = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between seed per panicle and test weight across regions
seeds_per_panicle <- ggplot(norm1) +
  geom_point(aes(x=FGN, y=HHGW, color=region)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Seeds per panicle", y = "100 seed weight",
       title = "Relationship between test weight and seeds per panicle across regions")

# visualize relationship
seeds_per_panicle

# make a facet-wrap to view relationship across groups
seeds_per_panicle + facet_wrap(~region, ncol=2)

# Perform sma analysis for seed per panicle across regions
filgrnb1_sma <-sma(HHGW~FGN*region, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
filgrnb1_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")      # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
Africa	37	0.1038062	0.0518053	-1.1395415	-1.5683439	-0.8279783	0.7981303	0.4009233	1.1953373	NA	NA	NA	NA
Caribbean	6	0.0685432	0.6162613	-0.6306579	-1.8993192	-0.2094063	0.5911244	-0.1275153	1.3097641	NA	NA	NA	NA
Central America	6	0.4047575	0.1744462	-0.8800483	-2.2321099	-0.3469744	0.8680227	0.1612131	1.5748324	NA	NA	NA	NA
Central Asia	6	0.1373502	0.4695396	-1.0977920	-3.2067302	-0.3758181	-0.9904780	-2.4043457	0.4233898	NA	NA	NA	NA
East Asia	74	0.0007424	0.8177540	0.8618698	0.6829122	1.0877233	-0.0957317	-0.3602370	0.1687735	NA	NA	NA	NA
East Europe	15	0.0045454	0.8113138	-0.6786493	-1.1963794	-0.3849656	0.3255896	-0.3758267	1.0270060	NA	NA	NA	NA
North America	43	0.0158674	0.4208852	-0.7349767	-1.0000780	-0.5401487	0.2770053	-0.0168251	0.5708357	NA	NA	NA	NA
Oceania	6	0.3608716	0.2073036	-0.7256853	-1.8894745	-0.2787120	-0.3652131	-1.2710989	0.5406727	NA	NA	NA	NA
South America	30	0.0610612	0.1880170	-1.2935255	-1.8667460	-0.8963234	0.0973052	-0.3656056	0.5602159	NA	NA	NA	NA
South Asia	63	0.0494662	0.0797754	-0.7445339	-0.9532235	-0.5815328	-0.5703187	-0.8336862	-0.3069512	NA	NA	NA	NA
Southeast Asia	46	0.0105168	0.4976533	-0.8542723	-1.1506184	-0.6342513	-0.1038009	-0.4983028	0.2907009	NA	NA	NA	NA
West Asia	11	0.0089617	0.7818864	-0.9335995	-1.8681939	-0.4665511	-0.6187488	-1.5346333	0.2971357	NA	NA	NA	NA
West Europe	21	0.0023361	0.8351923	-0.9910938	-1.5745282	-0.6238484	0.9547820	0.4382989	1.4712650	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To estimate the relationship between Total Florets per panicle and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$region), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~FLN*region, data=subset(norm1, region == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(region= x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between seed per panicle and test weight across regions
florets_per_panicle <- ggplot(norm1) +
  geom_point(aes(x=FLN, y=HHGW, color=region)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Total Florets per Panicle", y = "100 seed weight",
       title = "Relationship between test weight and total florets per panicle across regions")

# visualize relationship
florets_per_panicle

# make a facet-wrap to view relationship across groups
florets_per_panicle + facet_wrap(~region, ncol=2)

# Perform sma analysis for Total Florets per panicle across regions
flnbppn1_sma <-sma(HHGW~FLN*region, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
flnbppn1_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")       # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
Africa	37	0.1824934	0.0083630	-1.2247427	-1.6623304	-0.9023444	0.9025426	0.5346671	1.2704181	NA	NA	NA	NA
Caribbean	6	0.0058945	0.8850630	-0.6217080	-1.9221328	-0.2010896	0.4849523	-0.2637442	1.2336489	NA	NA	NA	NA
Central America	6	0.5675139	0.0837613	-0.9795673	-2.2206821	-0.4320979	0.7868129	0.2336541	1.3399717	NA	NA	NA	NA
Central Asia	6	0.2078680	0.3634980	-0.9024660	-2.5495213	-0.3194501	-0.7732753	-1.9320294	0.3854787	NA	NA	NA	NA
East Asia	74	0.0000042	0.9862292	-0.8872140	-1.1198035	-0.7029346	-0.3062358	-0.5754577	-0.0370139	NA	NA	NA	NA
East Europe	15	0.0024563	0.8607635	-0.7441415	-1.3125404	-0.4218892	0.2996701	-0.4128345	1.0121747	NA	NA	NA	NA
North America	43	0.0952134	0.0440929	-0.7525207	-1.0114193	-0.5598938	0.4403637	0.1625446	0.7181829	NA	NA	NA	NA
Oceania	6	0.0381806	0.7106325	-0.9618998	-2.9344850	-0.3153028	-0.3175809	-1.5986496	0.9634879	NA	NA	NA	NA
South America	30	0.2966206	0.0018599	-1.2485895	-1.7181157	-0.9073753	0.2048406	-0.1542327	0.5639140	NA	NA	NA	NA
South Asia	63	0.0774085	0.0272509	-0.7995030	-1.0199343	-0.6267120	-0.7127386	-0.9639363	-0.4615409	NA	NA	NA	NA
Southeast Asia	46	0.0531385	0.1232532	-0.7999274	-1.0706494	-0.5976596	0.0055228	-0.3641482	0.3751938	NA	NA	NA	NA
West Asia	11	0.0015596	0.9082219	1.0253296	0.5112472	2.0563454	0.9416803	-0.0834793	1.9668400	NA	NA	NA	NA
West Europe	21	0.0047711	0.7660808	-1.0065084	-1.5981729	-0.6338858	0.8928619	0.3737050	1.4120189	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To estimate the relationship between panicles per plant and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$region), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~PNN*region, data=subset(norm1, region == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(region = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between panicles per plant and test weight across regions
panicles_per_plant <- ggplot(norm1) +
  geom_point(aes(x=PNN, y=HHGW, color=region)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Panicle per plant", y = "100 seed weight",
       title = "Relationship between test weight and panicles per plant across regions")

# visualize relationship
panicles_per_plant

# make a facet-wrap to view relationship across regions
panicles_per_plant + facet_wrap(~region, ncol=2)

# Perform sma analysis for panicles per plant across regions
pnnb1_sma <-sma(HHGW~PNN*region, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
pnnb1_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")          # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
Africa	37	0.0982540	0.0588716	-1.2364862	-1.7033950	-0.8975593	0.2075433	-0.2294960	0.6445827	NA	NA	NA	NA
Caribbean	6	0.0276460	0.7528919	-0.4763870	-1.4597221	-0.1554711	0.1371494	-0.6263329	0.9006317	NA	NA	NA	NA
Central America	6	0.2714903	0.2891587	3.2463278	1.1868026	8.8798628	3.0949157	-0.0113749	6.2012063	NA	NA	NA	NA
Central Asia	6	0.1525534	0.4439209	1.2164528	0.4193714	3.5285131	0.4598387	-0.8589333	1.7786107	NA	NA	NA	NA
East Asia	74	0.1473644	0.0007344	-0.8401930	-1.0420001	-0.6774705	0.0796029	-0.1373261	0.2965319	NA	NA	NA	NA
East Europe	15	0.2781364	0.0433516	-1.6136370	-2.6321650	-0.9892330	0.0772868	-0.4913607	0.6459344	NA	NA	NA	NA
North America	43	0.0003400	0.9065756	1.2394581	0.9087676	1.6904832	0.7702358	0.3952447	1.1452269	NA	NA	NA	NA
Oceania	6	0.4271472	0.1592367	-0.5701319	-1.4260224	-0.2279420	-0.2029340	-1.0481804	0.6423123	NA	NA	NA	NA
South America	30	0.2715049	0.0031527	-1.0937199	-1.5132481	-0.7905004	0.0680064	-0.3023325	0.4383453	NA	NA	NA	NA
South Asia	63	0.0551354	0.0639668	-0.9367392	-1.1984358	-0.7321880	-0.2173011	-0.5045463	0.0699441	NA	NA	NA	NA
Southeast Asia	46	0.1993628	0.0018681	-0.8358917	-1.0934764	-0.6389849	-0.2066121	-0.5153668	0.1021427	NA	NA	NA	NA
West Asia	11	0.3602667	0.0508948	-0.9851505	-1.7446157	-0.5562953	0.3023817	-0.2227482	0.8275117	NA	NA	NA	NA
West Europe	21	0.1134283	0.1354687	-1.2945534	-2.0060117	-0.8354231	0.6785127	0.1948718	1.1621536	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To estimate the relationship between plant height and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$region), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~PNN*region, data=subset(norm1, region == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(region = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between plant height and test weight across regions
plant_height <- ggplot(norm1) +
  geom_point(aes(x=PNN, y=HHGW, color=region)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Plant height", y = "100 seed weight",
       title = "Relationship between test weight and plant height across regions")

# visualize relationship
plant_height

# make a facet-wrap to view relationship across regions
plant_height + facet_wrap(~region, ncol=2)

# Perform sma analysis for panicles per plant across regions
ptht1_sma <-sma(HHGW~PNN*region, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
ptht1_sma$groupsummary%>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")          # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
Africa	37	0.0982540	0.0588716	-1.2364862	-1.7033950	-0.8975593	0.2075433	-0.2294960	0.6445827	NA	NA	NA	NA
Caribbean	6	0.0276460	0.7528919	-0.4763870	-1.4597221	-0.1554711	0.1371494	-0.6263329	0.9006317	NA	NA	NA	NA
Central America	6	0.2714903	0.2891587	3.2463278	1.1868026	8.8798628	3.0949157	-0.0113749	6.2012063	NA	NA	NA	NA
Central Asia	6	0.1525534	0.4439209	1.2164528	0.4193714	3.5285131	0.4598387	-0.8589333	1.7786107	NA	NA	NA	NA
East Asia	74	0.1473644	0.0007344	-0.8401930	-1.0420001	-0.6774705	0.0796029	-0.1373261	0.2965319	NA	NA	NA	NA
East Europe	15	0.2781364	0.0433516	-1.6136370	-2.6321650	-0.9892330	0.0772868	-0.4913607	0.6459344	NA	NA	NA	NA
North America	43	0.0003400	0.9065756	1.2394581	0.9087676	1.6904832	0.7702358	0.3952447	1.1452269	NA	NA	NA	NA
Oceania	6	0.4271472	0.1592367	-0.5701319	-1.4260224	-0.2279420	-0.2029340	-1.0481804	0.6423123	NA	NA	NA	NA
South America	30	0.2715049	0.0031527	-1.0937199	-1.5132481	-0.7905004	0.0680064	-0.3023325	0.4383453	NA	NA	NA	NA
South Asia	63	0.0551354	0.0639668	-0.9367392	-1.1984358	-0.7321880	-0.2173011	-0.5045463	0.0699441	NA	NA	NA	NA
Southeast Asia	46	0.1993628	0.0018681	-0.8358917	-1.0934764	-0.6389849	-0.2066121	-0.5153668	0.1021427	NA	NA	NA	NA
West Asia	11	0.3602667	0.0508948	-0.9851505	-1.7446157	-0.5562953	0.3023817	-0.2227482	0.8275117	NA	NA	NA	NA
West Europe	21	0.1134283	0.1354687	-1.2945534	-2.0060117	-0.8354231	0.6785127	0.1948718	1.1621536	NA	NA	NA	NA

#------------------------------------------------------------------------------

# To estimate the relationship between days to heading and 100 seed weight

# assembly sma model coefficients into smaller objects using do.call and rbind
do.call(rbind, lapply(unique(norm1$region), function(x) {
  # make object for sma model and subset population
  obj <- sma(HHGW~DTH*region, data=subset(norm1, region == x), multcomp=TRUE,multcompmethod="adjusted")
  data.frame(region = x, 
             intercept=obj$coef[[1]][1, 1],
             slope=obj$coef[[1]][2, 1])
})) -> fits

# plot relationship between days to heading and test weight across regions
days_to_heading <- ggplot(norm1) +
  geom_point(aes(x=DTH, y=HHGW, color=region)) +
  geom_abline(data=fits, aes(slope=slope, intercept=intercept)) +
  labs(x = "Days to heading", y = "100 seed weight",
       title = "Relationship between test weight and Days to Heading across regions")

# visualize relationship
days_to_heading

# make a facet-wrap to view relationship across regions
days_to_heading + facet_wrap(~region, ncol=2)

# Perform sma analysis for Days to heading across regions
dthd1_sma <-sma(HHGW~DTH*region, data=norm1,multcomp=TRUE,multcompmethod="adjusted")

# view group summary
dthd1_sma$groupsummary %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")          # Using KableExtra table formatting

group	n	r2	pval	Slope	Slope_lowCI	Slope_highCI	Int	Int_lowCI	Int_highCI	Slope_test	Slope_test_p	Elev_test	Elev_test_p
Africa	37	0.0146114	0.4760598	0.8908774	0.6376806	1.2446081	0.5296634	0.0713506	0.9879763	NA	NA	NA	NA
Caribbean	6	0.0131383	0.8288195	1.0412775	0.3377972	3.2097924	0.6798661	-0.1550681	1.5148004	NA	NA	NA	NA
Central America	6	0.1222994	0.4968149	1.2897887	0.4385342	3.7934441	0.0422902	-1.0048885	1.0894690	NA	NA	NA	NA
Central Asia	6	0.2522037	0.3100299	-0.5894852	-1.6286757	-0.2133591	-0.7191027	-1.7961556	0.3579501	NA	NA	NA	NA
East Asia	74	0.0144835	0.3070841	-0.8577000	-1.0807544	-0.6806813	-0.2869803	-0.5394275	-0.0345331	NA	NA	NA	NA
East Europe	15	0.0133345	0.6819456	-0.9361115	-1.6465107	-0.5322193	-0.3903905	-1.3054272	0.5246461	NA	NA	NA	NA
North America	43	0.0183218	0.3867970	1.2024663	0.8840452	1.6355785	0.1531816	-0.1332323	0.4395954	NA	NA	NA	NA
Oceania	6	0.0674391	0.6192211	-1.4786715	-4.4553515	-0.4907513	-0.3157077	-1.5444846	0.9130691	NA	NA	NA	NA
South America	30	0.1428722	0.0394516	-1.0683908	-1.5178379	-0.7520295	0.1522019	-0.2676111	0.5720149	NA	NA	NA	NA
South Asia	63	0.0662274	0.0417346	-0.8890929	-1.1358622	-0.6959348	-0.4514204	-0.7147067	-0.1881341	NA	NA	NA	NA
Southeast Asia	46	0.0644515	0.0886649	-1.0929733	-1.4603881	-0.8179953	0.2561081	-0.1278416	0.6400579	NA	NA	NA	NA
West Asia	11	0.0259993	0.6357593	0.8877299	0.4459374	1.7672086	0.3375603	-0.4217561	1.0968766	NA	NA	NA	NA
West Europe	21	0.0052198	0.7556367	-1.0477267	-1.6634589	-0.6599089	0.3486499	-0.2971674	0.9944671	NA	NA	NA	NA

There was a strong relationship between HHGW - FGN for all rice subpopulations (Table 1). Negative associations were observed in ADMIX (slope = -1.06, p < 0.01), ARO (slope = -3.32, p < 0.01), AUS (slope = -0.69, p < 0.01) and TRJ (slope = -0.95, p < 0.01), while IND (slope = 0.82, p < 0.01) and TEJ (slope = 0.95, p < 0.01) showed positive associations. Across regions, only Africa (slope = -1.14, p = 0.052) showed a significant negative association between these traits.

For HHGW - FLN, all subpopulations showed a strong negative association except TEJ (slope = 1.01, p < 0.01) which had a positive association. Africa (slope = -1.22, p < 0.01), North America (slope = -0.75, p = 0.04), South America (slope = -1.25, p < 0.01) and South Asia (slope = -0.80, p = 0.03) were the only regions with significant relationships and they all exhibited a negative association.

ADMIX (slope = -1.31, p < 0.01), AUS (slope = -0.90, p = 0.03) and TEJ (slope = -1.46, p = 0.01) all showed strong negative associations for HHGW – PNN relationship. East Asia (slope = -0.84, p < 0.01), East Europe (slope = -1.61, p = 0.04), South America (slope = -1.09, p < 0.01), Southeast Asia (slope = -0.84, p < 0.01) and West Asia (slope = -1.00, p = 0.05) all showed a strong negative association.

No significant association was observed across all subpopulations for HHGW- PTH relationship. East Asia (slope = -0.84, p < 0.01), East Europe (slope = -1.61, p < 0.04), South America (slope = -1.09, p < 0.01), Southeast Asia (slope = -0.84, p < 0.01) and West Asia (slope = -0.99, p = 0.05) regions all showed strong negative associations.

Significant associations were not found across subpopulations for HHGW – DTH relationship. Nevertheless, significant negative associations were found only in South America (slope = -1.07, p = 0.04) and South Asia (slope = -0.89, p = 0.04).

Structural Equation Modeling (SEM)

#sem for ADMIX
data<-subset(norm1,subpopulation=='ADMIX')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 25 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                            56
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                61.587
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               137.088
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.520
##   Tucker-Lewis Index (TLI)                      -1.399
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -409.267
##   Loglikelihood unrestricted model (H1)       -378.474
##                                                       
##   Akaike (AIC)                                 854.535
##   Bayesian (BIC)                               890.991
##   Sample-size adjusted Bayesian (BIC)          834.418
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.591
##   90 Percent confidence interval - lower         0.467
##   90 Percent confidence interval - upper         0.723
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.324
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN               0.403    0.130    3.096    0.002
##     PNN              -0.472    0.150   -3.143    0.002
##     FLN              -0.774    0.110   -7.014    0.000
##   FGN ~                                               
##     DTH               0.233    0.165    1.414    0.157
##     PTH              -0.028    0.145   -0.196    0.844
##   PNN ~                                               
##     DTH               0.123    0.148    0.836    0.403
##     PTH               0.074    0.130    0.570    0.569
##   FLN ~                                               
##     DTH               0.501    0.162    3.090    0.002
##     PTH              -0.075    0.142   -0.527    0.598
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN              -0.413    0.112   -3.694    0.000
##  .PNN ~~                                              
##    .FLN               0.150    0.084    1.791    0.073
##   DTH ~~                                              
##     PTH               0.463    0.131    3.520    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.579    0.109    5.292    0.000
##    .FGN               0.829    0.157    5.292    0.000
##    .PNN               0.665    0.124    5.361    0.000
##    .FLN               0.803    0.152    5.292    0.000
##     DTH               0.763    0.144    5.292    0.000
##     PTH               0.988    0.187    5.292    0.000

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("ADMIX")

#------------------------------------------------------------------------------

#sem for ARO
data<-subset(norm1,subpopulation=='ARO')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 60 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                             7
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 9.832
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.020
## 
## Model Test Baseline Model:
## 
##   Test statistic                                46.483
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.783
##   Tucker-Lewis Index (TLI)                      -0.085
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)                -11.625
##   Loglikelihood unrestricted model (H1)         -6.709
##                                                       
##   Akaike (AIC)                                  59.249
##   Bayesian (BIC)                                58.276
##   Sample-size adjusted Bayesian (BIC)            5.594
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.570
##   90 Percent confidence interval - lower         0.199
##   90 Percent confidence interval - upper         0.985
##   P-value RMSEA <= 0.05                          0.022
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.094
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN               1.570    0.893    1.758    0.079
##     PNN               3.189    0.912    3.498    0.000
##     FLN              -4.850    1.094   -4.435    0.000
##   FGN ~                                               
##     DTH               0.329    0.189    1.734    0.083
##     PTH              -0.165    0.179   -0.923    0.356
##   PNN ~                                               
##     DTH              -0.101    0.206   -0.491    0.623
##     PTH               0.345    0.195    1.773    0.076
##   FLN ~                                               
##     DTH               0.310    0.127    2.452    0.014
##     PTH               0.044    0.119    0.369    0.712
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN              -0.029    0.021   -1.370    0.171
##  .PNN ~~                                              
##    .FLN               0.029    0.018    1.595    0.111
##   DTH ~~                                              
##     PTH               0.400    0.255    1.567    0.117
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.215    0.115    1.871    0.061
##    .FGN               0.059    0.032    1.871    0.061
##    .PNN               0.070    0.034    2.068    0.039
##    .FLN               0.026    0.014    1.871    0.061
##     DTH               0.514    0.275    1.871    0.061
##     PTH               0.577    0.308    1.871    0.061

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("ARO")

#------------------------------------------------------------------------------

#sem for AUS
data<-subset(norm1,subpopulation=='AUS')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 31 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                            51
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               200.907
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               290.159
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.281
##   Tucker-Lewis Index (TLI)                      -2.596
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -297.642
##   Loglikelihood unrestricted model (H1)       -197.188
##                                                       
##   Akaike (AIC)                                 631.283
##   Bayesian (BIC)                               666.056
##   Sample-size adjusted Bayesian (BIC)          609.543
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          1.137
##   90 Percent confidence interval - lower         1.007
##   90 Percent confidence interval - upper         1.273
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.392
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN               0.460    0.090    5.102    0.000
##     PNN              -0.562    0.119   -4.732    0.000
##     FLN              -0.982    0.094  -10.395    0.000
##   FGN ~                                               
##     DTH              -0.414    0.193   -2.149    0.032
##     PTH               1.084    0.233    4.657    0.000
##   PNN ~                                               
##     DTH              -0.103    0.156   -0.659    0.510
##     PTH              -0.729    0.188   -3.882    0.000
##   FLN ~                                               
##     DTH              -0.343    0.169   -2.031    0.042
##     PTH               1.035    0.204    5.081    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN              -0.183    0.083   -2.200    0.028
##  .PNN ~~                                              
##    .FLN              -0.054    0.066   -0.811    0.417
##   DTH ~~                                              
##     PTH               0.009    0.043    0.209    0.834
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.326    0.065    5.050    0.000
##    .FGN               0.703    0.139    5.050    0.000
##    .PNN               0.458    0.091    5.056    0.000
##    .FLN               0.539    0.107    5.050    0.000
##     DTH               0.371    0.073    5.050    0.000
##     PTH               0.255    0.050    5.050    0.000

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("AUS")

#------------------------------------------------------------------------------

#sem for IND
data<-subset(norm1,subpopulation=='IND')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 29 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                            73
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               129.492
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               184.562
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.254
##   Tucker-Lewis Index (TLI)                      -2.730
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -582.198
##   Loglikelihood unrestricted model (H1)       -517.452
##                                                       
##   Akaike (AIC)                                1200.397
##   Bayesian (BIC)                              1241.625
##   Sample-size adjusted Bayesian (BIC)         1184.906
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.760
##   90 Percent confidence interval - lower         0.651
##   90 Percent confidence interval - upper         0.875
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.442
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN               0.708    0.083    8.538    0.000
##     PNN              -0.352    0.119   -2.954    0.003
##     FLN              -0.958    0.106   -9.045    0.000
##   FGN ~                                               
##     DTH              -0.001    0.091   -0.007    0.995
##     PTH              -0.179    0.103   -1.746    0.081
##   PNN ~                                               
##     DTH              -0.129    0.076   -1.707    0.088
##     PTH               0.161    0.086    1.886    0.059
##   FLN ~                                               
##     DTH               0.152    0.085    1.788    0.074
##     PTH              -0.166    0.096   -1.730    0.084
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN               0.082    0.078    1.054    0.292
##  .PNN ~~                                              
##    .FLN              -0.395    0.097   -4.060    0.000
##   DTH ~~                                              
##     PTH               0.174    0.166    1.051    0.293
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.489    0.081    6.042    0.000
##    .FGN               0.943    0.156    6.042    0.000
##    .PNN               0.655    0.108    6.061    0.000
##    .FLN               0.827    0.137    6.042    0.000
##     DTH               1.588    0.263    6.042    0.000
##     PTH               1.246    0.206    6.042    0.000

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("IND")

#------------------------------------------------------------------------------

#sem for TEJ
data<-subset(norm1,subpopulation=='TEJ')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 28 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                            92
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               196.728
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               334.740
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.394
##   Tucker-Lewis Index (TLI)                      -2.029
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -573.818
##   Loglikelihood unrestricted model (H1)       -475.454
##                                                       
##   Akaike (AIC)                                1183.636
##   Bayesian (BIC)                              1229.028
##   Sample-size adjusted Bayesian (BIC)         1172.210
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.838
##   90 Percent confidence interval - lower         0.741
##   90 Percent confidence interval - upper         0.939
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.169
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN              -0.016    0.107   -0.150    0.881
##     PNN              -0.424    0.156   -2.711    0.007
##     FLN              -0.073    0.126   -0.580    0.562
##   FGN ~                                               
##     DTH               0.533    0.096    5.531    0.000
##     PTH               0.124    0.125    0.996    0.319
##   PNN ~                                               
##     DTH               0.087    0.079    1.103    0.270
##     PTH              -0.353    0.102   -3.454    0.001
##   FLN ~                                               
##     DTH               0.511    0.089    5.777    0.000
##     PTH               0.146    0.114    1.275    0.202
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN               0.081    0.039    2.066    0.039
##  .PNN ~~                                              
##    .FLN              -0.194    0.045   -4.336    0.000
##   DTH ~~                                              
##     PTH               0.286    0.070    4.109    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.658    0.097    6.782    0.000
##    .FGN               0.518    0.076    6.782    0.000
##    .PNN               0.350    0.051    6.845    0.000
##    .FLN               0.436    0.064    6.782    0.000
##     DTH               0.780    0.115    6.782    0.000
##     PTH               0.468    0.069    6.782    0.000

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("TEJ")

#------------------------------------------------------------------------------

#sem for TRJ
data<-subset(norm1,subpopulation=='TRJ')
sitemod <- 'HHGW~FGN+PNN+FLN
            FGN~~PNN
            PNN~~FLN
            FGN~DTH+PTH
            PNN~DTH+PTH
            FLN~DTH+PTH
            DTH~~PTH'        

# model/sem anylysis
sitemod.fit <- lavaan::sem(sitemod, data = data)

#summary of the results
summary(sitemod.fit, fit.measures=TRUE)

## lavaan 0.6-7 ended normally after 28 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         18
##                                                       
##   Number of observations                            85
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               130.740
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               226.559
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.396
##   Tucker-Lewis Index (TLI)                      -2.019
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -517.290
##   Loglikelihood unrestricted model (H1)       -451.920
##                                                       
##   Akaike (AIC)                                1070.580
##   Bayesian (BIC)                              1114.548
##   Sample-size adjusted Bayesian (BIC)         1057.762
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.708
##   90 Percent confidence interval - lower         0.607
##   90 Percent confidence interval - upper         0.814
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.216
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   HHGW ~                                              
##     FGN              -0.051    0.079   -0.645    0.519
##     PNN              -0.938    0.234   -4.002    0.000
##     FLN              -0.611    0.081   -7.525    0.000
##   FGN ~                                               
##     DTH               0.558    0.189    2.957    0.003
##     PTH              -0.092    0.136   -0.675    0.500
##   PNN ~                                               
##     DTH               0.058    0.067    0.856    0.392
##     PTH              -0.087    0.049   -1.789    0.074
##   FLN ~                                               
##     DTH               0.349    0.191    1.823    0.068
##     PTH               0.030    0.138    0.215    0.829
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .FGN ~~                                              
##    .PNN              -0.049    0.033   -1.474    0.140
##  .PNN ~~                                              
##    .FLN              -0.071    0.034   -2.069    0.039
##   DTH ~~                                              
##     PTH               0.233    0.060    3.868    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .HHGW              0.500    0.077    6.519    0.000
##    .FGN               0.864    0.133    6.519    0.000
##    .PNN               0.110    0.017    6.528    0.000
##    .FLN               0.890    0.136    6.519    0.000
##     DTH               0.363    0.056    6.519    0.000
##     PTH               0.698    0.107    6.519    0.000

#plot our CFA. you can change layout with layout = argument. see ?semPaths() for more. 
semPlot::semPaths(sitemod.fit, "par", edge.label.cex = 1.2, fade = FALSE)
title("TRJ")

Based on the Root Mean Square Error of Approximation (RMSEA > 0.5) and Standardized Root Mean Square Residual (SRMSR < 0.5), the model was a good fit for the data. Test weight was significantly affected in all rice sub-populations.

Direct and Indirect Effects

For all subpopulations, a negative direct effect on HHGW was observed in FLN. In ADMIX subpopulation, FGN, PNN and FLN accounted for 58% (R2 = 0.58) of the variation in HHGW. FLN had the strongest negative effect on HHGW (β = -0.77, p < 0.05), this implies that if FLN was increased by one standard deviation with FGN and PNN held constant, HHGW will decrease by 0.77 standard deviations. PNN negatively affected HHGW (β = -0.47), while FGN positively affected HHGW (β = 0.40) at p < 0.05. There was a significant indirect effect (p < 0.05) of PTH on HHGW via DTH and FLN. The double-headed arrow between DTH and PTH indicates a significant relationship (p < 0.05) but assumes no causality.

In ARO subpopulation, FLN, PNN and FGN accounted for 22 % (R2 = 0.22) of the variation in HHGW. The strongest negative direct effect was observed in FLN (β = -0.85, p < 0.05), while PNN and FGN had positive effects on HHGW. All indirect effects of PTH and DTH via FGN, PNN and FLN were all insignificant (p > 0.05). In AUS subpopulation, all agronomic traits accounted for 33% (R2 = 0.33) of the variation in HHGW. The strongest direct effect on HHGW was observed in FLN (β = -0.98, p < 0.05). PNN had a direct negative effect (β = -0.56, p < 0.05) while FGN had a direct positive effect (β = 0.46, p < 0.05). PTH and DTH had significant indirect effects on HHGW. PTH had effect on HHGW via FLN, PNN and FGN while DTH effect was via FGN and FLN.

For IND, the only significant direct effects on HHGW were observed in FLN (β = -0.96, p < 0.05) and FGN (β = 0.71, p < 0.05) which had negative and positive respectively. These traits accounted for 49% (R2 = 0.49) of the variation in HHGW. All indirect effects were insignificant (p > 0.05). For TEJ, only PNN accounted for all the variation in HHGW (R2 = 0.66) and was the only trait with a significant direct effect on HHGW (β = -0.42, p < 0.05). A significant indirect effect on HHGW was observed in PTH via DTH, FLN and PNN and also a negative effect via PNN (p < 0.05). DTH also had a significant indirect effect via FLN and PNN (p < 0.05).

For TRJ, FLN and PNN accounted for the total variation in HHGW (R2 = 0.50). Both traits had a negative direct effect on HHGW, (β = -0.61, p < 0.05) and (β = -0.94, p < 0.05) for FLN and PNN respectively. A significant indirect effect was observed in PTHT via DTH and FLN (p < 0.05), while other indirect effects were insignificant (p > 0.05).

Path Summary

# Path summary
# I created a result summary table on excel
sem <- read.csv("./raw_data/SEM_result.csv")

sem %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")          # Using KableExtra table formatting

Effect	ADMIX	ARO	AUS	IND	TEJ	TRJ	Proposed.interpretation
FLN_HHG	-0.77	-4.88	0.98	-0.98	NS	-0.61	HHG variation is strongly influenced by FLN; Decreasing FLN increases grain weight except in TEJ
PNN_HHG	-0.44	0.18	-0.55	NS	-0.42	-0.9	HHG variation is strongly influenced by PNN except in IND
FGN_HHG	-0.42	1.57	0.46	0.71	NS	NS	Increase in filled grain per panicle increases test weight except in TEJ and TRJ
DTH_PTH	-0.46	NS	NS	NS	0.21	0.23	There is a relationship between DTH and PTH in ADMIX, TEJ and TRJ, although causality is not asumed
DTH_FGN	NS	NS	-0.41	NS	0.53	0.56	DTH either increases or decreses FGN depending on the rice subpopulation
DTH_PNN	NS	NS	NS	NS	NS	NS	Panicle number is not dependent on days to heading
DTH_FLN	0.5	NS	NS	NS	0.53	0.35	DTH regulates FLN
PTH_FGN	NS	NS	NS	NS	NS	NS	PTH controls FGN in AUS
PTH_FLN	NS	NS	NS	NS	0.5	NS	PTH controls FLN in TEJ
PTH_PNN	NS	NS	NS	NS	-0.35	NS	PTH controls PNN in TEJ

Predicting test weight with Florescence per panicle

# Does florescence per panicle predict test weight?

# Explore data
ggplot(norm1,
       aes(x = HHGW, y = FLN)) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

# Fit a model
rice_mlr <- glm(FLN ~ HHGW + region,
                 data = norm1,
                 family = gaussian(link = "identity"))

# assumptions
plot(rice_mlr, which = 1)

plot(rice_mlr, which = 2)

shapiro.test(residuals(rice_mlr))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(rice_mlr)
## W = 0.96349, p-value = 6.921e-08

plot(rice_mlr, which = 4)

# view residuals
residualPlots(rice_mlr)

##      Test stat Pr(>|Test stat|)
## HHGW    0.0214           0.8838

# include residuals into data
norm1 <- norm1 %>%
  add_residuals(rice_mlr)

# plot residuals 
qplot(subpopulation, resid, data = norm1, geom = "boxplot")

# test of interaction between mass and species
rice_int <- glm(FLN ~ HHGW * region * subpopulation,
                 data = norm1,
                 family = gaussian(link = "identity"))

residualPlots(rice_int)

##      Test stat Pr(>|Test stat|)
## HHGW    0.0028            0.958

Anova(rice_int)

## Analysis of Deviance Table (Type II tests)
## 
## Response: FLN
##                           LR Chisq Df Pr(>Chisq)    
## HHGW                        22.643  1  1.950e-06 ***
## region                      10.697 12    0.55507    
## subpopulation               41.480  5  7.504e-08 ***
## HHGW:region                  8.996 12    0.70330    
## HHGW:subpopulation           6.178  5    0.28931    
## region:subpopulation        45.525 30    0.03447 *  
## HHGW:region:subpopulation   16.529 19    0.62177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova(rice_mlr)

## Analysis of Deviance Table (Type II tests)
## 
## Response: FLN
##        LR Chisq Df Pr(>Chisq)    
## HHGW     19.288  1  1.124e-05 ***
## region   40.182 12  6.708e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# post-hocs what is the effect of subpopulation and region at the average level of HHGW
rice_em <- emmeans(rice_mlr, specs = ~ region,
                    at = list(HHGW = 4))  # for specificity


# check which slopes are statistically different
rice_em %>%
  contrast(method = "tukey") %>%
  data.frame() %>%
  kbl(format = 'html') %>%
  kable_paper() %>%
  scroll_box(width = "700px",
             height = "350px")

contrast	estimate	SE	df	z.ratio	p.value
Africa - Caribbean	0.0337395	0.4092806	Inf	0.0824361	1.0000000
Africa - Central America	-0.0732953	0.4088549	Inf	-0.1792698	1.0000000
Africa - Central Asia	0.9839374	0.4125749	Inf	2.3848698	0.4546777
Africa - East Asia	0.5121375	0.1945159	Inf	2.6328831	0.2902848
Africa - East Europe	0.6681605	0.2842514	Inf	2.3505973	0.4795893
Africa - North America	-0.1902123	0.2119240	Inf	-0.8975497	0.9995857
Africa - Oceania	0.4180241	0.4131480	Inf	1.0118022	0.9986235
Africa - South America	0.2539267	0.2304817	Inf	1.1017221	0.9968816
Africa - South Asia	0.4909530	0.2105321	Inf	2.3319626	0.4932626
Africa - Southeast Asia	0.0977997	0.2125899	Inf	0.4600395	0.9999997
Africa - West Asia	1.1566380	0.3214016	Inf	3.5987313	0.0193922
Africa - West Europe	0.3249966	0.2551319	Inf	1.2738379	0.9884219
Caribbean - Central America	-0.1070348	0.5362086	Inf	-0.1996141	1.0000000
Caribbean - Central Asia	0.9501979	0.5372101	Inf	1.7687641	0.8653087
Caribbean - East Asia	0.4783980	0.3954042	Inf	1.2098963	0.9926346
Caribbean - East Europe	0.6344211	0.4489458	Inf	1.4131352	0.9725897
Caribbean - North America	-0.2239518	0.4050367	Inf	-0.5529173	0.9999978
Caribbean - Oceania	0.3842846	0.5374829	Inf	0.7149708	0.9999625
Caribbean - South America	0.2201873	0.4154044	Inf	0.5300552	0.9999987
Caribbean - South Asia	0.4572135	0.4017533	Inf	1.1380455	0.9957791
Caribbean - Southeast Asia	0.0640602	0.4044683	Inf	0.1583814	1.0000000
Caribbean - West Asia	1.1228985	0.4716060	Inf	2.3810096	0.4574661
Caribbean - West Europe	0.2912571	0.4326647	Inf	0.6731706	0.9999806
Central America - Central Asia	1.0572327	0.5379487	Inf	1.9653040	0.7557753
Central America - East Asia	0.5854329	0.3963321	Inf	1.4771271	0.9612775
Central America - East Europe	0.7414559	0.4486327	Inf	1.6527013	0.9130482
Central America - North America	-0.1169170	0.4055887	Inf	-0.2882649	1.0000000
Central America - Oceania	0.4913194	0.5382968	Inf	0.9127296	0.9995083
Central America - South America	0.3272221	0.4157816	Inf	0.7870046	0.9998949
Central America - South Asia	0.5642483	0.4034667	Inf	1.3985001	0.9747789
Central America - Southeast Asia	0.1710950	0.4054356	Inf	0.4220030	0.9999999
Central America - West Asia	1.2299333	0.4720944	Inf	2.6052697	0.3068046
Central America - West Europe	0.3982919	0.4316315	Inf	0.9227592	0.9994506
Central Asia - East Asia	-0.4717999	0.3941696	Inf	-1.1969464	0.9933115
Central Asia - East Europe	-0.3157769	0.4516975	Inf	-0.6990892	0.9999707
Central Asia - North America	-1.1741497	0.4050387	Inf	-2.8988582	0.1593394
Central Asia - Oceania	-0.5659133	0.5361288	Inf	-1.0555548	0.9979246
Central Asia - South America	-0.7300107	0.4159540	Inf	-1.7550275	0.8716449
Central Asia - South Asia	-0.4929844	0.3977910	Inf	-1.2393052	0.9908876
Central Asia - Southeast Asia	-0.8861377	0.4030556	Inf	-2.1985494	0.5922718
Central Asia - West Asia	0.1727006	0.4715593	Inf	0.3662330	1.0000000
Central Asia - West Europe	-0.6589408	0.4379017	Inf	-1.5047686	0.9554425
East Asia - East Europe	0.1560230	0.2677101	Inf	0.5828058	0.9999961
East Asia - North America	-0.7023499	0.1786253	Inf	-3.9319726	0.0056072
East Asia - Oceania	-0.0941134	0.3942204	Inf	-0.2387331	1.0000000
East Asia - South America	-0.2582108	0.2020626	Inf	-1.2778753	0.9881016
East Asia - South Asia	-0.0211846	0.1623590	Inf	-0.1304798	1.0000000
East Asia - Southeast Asia	-0.4143378	0.1743633	Inf	-2.3762901	0.4608816
East Asia - West Asia	0.6445004	0.3003711	Inf	2.1456808	0.6311814
East Asia - West Europe	-0.1871409	0.2433461	Inf	-0.7690318	0.9999179
East Europe - North America	-0.8583729	0.2807830	Inf	-3.0570690	0.1055434
East Europe - Oceania	-0.2501364	0.4521913	Inf	-0.5531651	0.9999978
East Europe - South America	-0.4142338	0.2951159	Inf	-1.4036310	0.9740272
East Europe - South Asia	-0.1772076	0.2791821	Inf	-0.6347383	0.9999899
East Europe - Southeast Asia	-0.5703608	0.2810882	Inf	-2.0291166	0.7136716
East Europe - West Asia	0.4884774	0.3704347	Inf	1.3186602	0.9844510
East Europe - West Europe	-0.3431639	0.3153776	Inf	-1.0881050	0.9972271
North America - Oceania	0.6082364	0.4052299	Inf	1.5009663	0.9562811
North America - South America	0.4441391	0.2209937	Inf	2.0097362	0.7267339
North America - South Asia	0.6811653	0.1894020	Inf	3.5963995	0.0195514
North America - Southeast Asia	0.2880121	0.1976721	Inf	1.4570191	0.9651515
North America - West Asia	1.3468503	0.3137507	Inf	4.2927399	0.0012543
North America - West Europe	0.5152090	0.2559775	Inf	2.0127116	0.7247432
Oceania - South America	-0.1640974	0.4162045	Inf	-0.3942709	1.0000000
Oceania - South Asia	0.0729289	0.3975169	Inf	0.1834611	1.0000000
Oceania - Southeast Asia	-0.3202244	0.4030811	Inf	-0.7944416	0.9998839
Oceania - West Asia	0.7386139	0.4717179	Inf	1.5657958	0.9403280
Oceania - West Europe	-0.0930275	0.4386902	Inf	-0.2120573	1.0000000
South America - South Asia	0.2370262	0.2126491	Inf	1.1146353	0.9965213
South America - Southeast Asia	-0.1561270	0.2191544	Inf	-0.7124065	0.9999640
South America - West Asia	0.9027112	0.3273863	Inf	2.7573277	0.2225328
South America - West Europe	0.0710699	0.2708806	Inf	0.2623661	1.0000000
South Asia - Southeast Asia	-0.3931532	0.1824945	Inf	-2.1543297	0.6248588
South Asia - West Asia	0.6656850	0.3068328	Inf	2.1695369	0.6136967
South Asia - West Europe	-0.1659563	0.2596732	Inf	-0.6390970	0.9999891
Southeast Asia - West Asia	1.0588382	0.3120686	Inf	3.3929662	0.0387769
Southeast Asia - West Europe	0.2271969	0.2582749	Inf	0.8796709	0.9996630
West Asia - West Europe	-0.8316413	0.3520620	Inf	-2.3622016	0.4711173

# plot comparison
multcomp::cld(rice_em, adjust="tukey") %>%
  ggplot(aes(x = region, y = emmean, 
             ymin = asymp.LCL, ymax = asymp.UCL,
             color = factor(.group))) +
  geom_pointrange() +
  coord_flip() +
  labs(title = "Slope comparison across regions")

## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons

# visualize!
rice_newdat <- modelr::data_grid(norm1,
                                  region = unique(region),
                                  HHGW = seq_range(HHGW, n = 100))


# create data predictions
rice_predict <- predict(rice_mlr,
                         newdata = rice_newdat,
                         type = "response",
                         interval = "confidence")

# add data predictions
rice_newdat <- rice_newdat %>%
  mutate(FLN = rice_predict)

# plot regression on predicted data
ggplot(norm1,
       aes(x = HHGW, y = FLN, color = region)) +
  geom_point()+
  geom_line(data = rice_newdat) +
  labs(title = "Regression on predicted data")

# compare means of new fit
rice_newfit <- emmeans(rice_mlr, specs = ~ region + HHGW,
                        at = list(HHGW = seq(1,3, length.out = 100))) %>%
  as_tibble() %>%
  mutate(FLN = emmean)

# plot regresion slopes with prediction intervals
ggplot(norm1,
       aes(x = HHGW, y = FLN, color = region)) +
  geom_point() +
  geom_line(data = rice_newfit) +
  geom_ribbon(data = rice_newfit,
              aes(ymin = asymp.LCL, ymax = asymp.UCL, group = region, colour = region),
              alpha = 0.1, color = "lightgrey") +
  theme_classic() +
  labs(title = "Regression on predicted data showing prediction intervals")

The linear regression passed all necessary assumptions. All residuals were clustered close to the regression line and the cook distance was 0.04 indicating normality of data. Comparison of slopes with ‘emmeans’ showed that slopes were not statistically different. Regression with the predicted data revealed slope alignment within the predicted values while all regression lines fell within the prediction interval.

Discussion

Total florets per panicle has the most significant association with test weight which agrees with the result of (Ren et al., 2020), where altering meristem determinacy and restoring lateral floret formation significantly increased grains per panicle and subsequently yield. Plant height is a significant morphological trait influencing yield performance. Molecular cloning and functional analyses of several genes associated with PTH in rice have revealed connections to the synthesis and regulation of the phytohormone (gibberellin) (Miedaner et al., 2019). TEJ was observed to have the highest test weight which shows a significant relationship with plant height based on the SEM analysis. Indirect effect traits like DTH affects test weight via FLN and should be fully considered for breeding purposes in TEJ and TRJ subpopulations.

Conclusion

This research was able to assess the relationship between test weight and other agronomic traits and identify significant associations. The major determinant of test weight is the total floret per panicle except in Temperate japonica, where test weight is determined by plant height. Also the prediction of test weight across regions using floret number gave a positive result. Results obtained from this study will improve crop management techniques in rice breeders.

References

Deivasigamani S, & Swaminathan C. (2018). Evaluation of Seed Test Weight on Major Field Crops. Www.Arcjournals.Org International Journal of Research Studies in Agricultural Sciences, 4(1), 2454–6224. crossref

Eizenga, G. C., Ali, M. L., Bryant, R. J., Yeater, K. M., McClung, A. M., & McCouch, S. R. (2014). Registration of the Rice Diversity Panel 1 for Genomewide Association Studies. Journal of Plant Registrations, 8(1), 109–116. crossref

Li, R., Li, M., Ashraf, U., Liu, S., & Zhang, J. (2019). Exploring the relationships between yield and yield-related traits for rice varieties released in china from 1978 to 2017. Frontiers in Plant Science. crossref

Miedaner, T., Laidig, F., Zhao, J., Zhang, J., Li, R., Li, M., Ashraf, U., & Liu, S. (2019). Exploring the Relationships Between Yield and Yield-Related Traits for Rice Varieties Released in China From 1978 to 2017. crossref

Prakash Poudel, A. (2018). Analysis of Growth and Yield Attributing Characteristic of Direct Seeded Upland Rice Genotypes in Western Hills of Nepal. In Acta Scientific Agriculture (Vol. 2).

Puji Lestari, A., Sopandie, D., & Aswidinnoor, H. (2015). Panicle Length and Weight Performance of F3 Population from Local and Introduction Hybridization of Rice Varieties. HAYATI Journal of Biosciences April, 22(2), 87–92. crossref

Regression, M., & Models, E. (2005). c ontRibutions Commentary. Commentary, 86(October), 283–295. [crossref](http://www.esajournals.org/doi/abs/10.1890/0012-9623(2007)88[72:AHOTES]2.0.CO;2

Ren, D., Li, Y., He, G., & Qian, Q. (2020). Multifloret spikelet improves rice yield. New Phytologist, 225(6), 2301–2306.[crossref] (https://doi.org/10.1111/nph.16303)

Saleh, M. M., Salem, K. F. M., & Elabd, A. B. (2020). Definition of selection criterion using correlation and path coefficient analysis in rice (Oryza sativa L .) genotypes.

Venkateswarlu, B., Prasad, G. S. V, & Prasad, A. S. R. (n.d.). STUDIES ON THE NATURE OF RELATIONSHIPS BETWEEN GRAIN SIZE , SPIKELET NUMBER , GRAIN YIELD AND SPIKELET FILLING IN LATE DURATION VARIETIES OF RICE ( OR YZA SA TIVA L .) Grain number Grain size Productivity Rate of grain growth Rice Spikelet filling. 130(1981), 123–130.

Assessing variability and association of rice agronomic traits across subpopulations and regions

Uzezi Okinedo

11/27/2020