Results - Data Analysis

Principal Component Analysis (PCA)

1.       Character Analysis

During data collection, and subsequently during initial data analysis, it was noted that the B1 characters were unreliable as a consequence of inadequate specimen collections.  On this basis, combined with the evidence of preliminary box plots and Score Plots, several characters were dropped (Table RD1) to obtain the results which follow.

Character                                   Character No.
B1 Width                                    (1)
R1&2 – Ratio Length / Width                 -
Spikelet Colour                             (52)
Spikelet Width                              (23)
R22&23 – Ratio Spikelet Length / Width      -
Glume Laterals                              (31)
Colour of Anther Connective                 (35)
Stigma Colour                               (37)

Table RD1.  Characters omitted from statistical analysis.  The remainder are “Selected Characters”.

2.       Eigenvalues

The eigenvalues from the PCA, Table RD2, show that 11 components are needed to represent 91.7% of all the variation in the data, whereas 4 represent only 61%.

Normally the first two components are used in further analyses, since they usually represent a substantial proportion of the overall variation.  In these results, however, they represent only 41.3%; clearly, therefore, other components are making a significant contribution and may not be ignored.

Component     1        2        3        4        5        6        7        8
Eigenvalue    6.5437   3.3647   2.6636   2.0757   1.6815   1.5759   1.2746   0.9821
Proportion    0.273    0.140    0.111    0.086    0.070    0.066    0.053    0.041
Cumulative    0.273    0.413    0.524    0.610    0.680    0.746    0.799    0.840

Component     9        10       11       12       13       14       15       16
Eigenvalue    0.7288   0.5660   0.5504   0.4414   0.3549   0.3341   0.2447   0.1505
Proportion    0.030    0.024    0.023    0.018    0.015    0.014    0.010    0.006
Cumulative    0.870    0.894    0.917    0.935    0.950    0.964    0.974    0.981

Component     17       18       19       20       21       22       23       24
Eigenvalue    0.1275   0.1097   0.1014   0.0655   0.0394   0.0111   0.0089   0.0041
Proportion    0.005    0.005    0.004    0.003    0.002    0.000    0.000    0.000
Cumulative    0.986    0.990    0.995    0.997    0.999    0.999    1.000    1.000

Table RD2.  Eigenvalues calculated for PCA.  The first 24 components are shown, of which 21 contribute to the overall variation; the best number of components to use can be judged from the scree plot (see below).
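The proportions and cumulative totals in Table RD2 follow directly from the eigenvalues, and can be checked with a short calculation.  The sketch below uses Python, which is an assumption of this example (the text mentions Minitab and MVSP as the packages used), and the function name `components_needed` is hypothetical.  The eigenvalues sum to approximately 24, which is what would be expected if the PCA was run on a correlation matrix of 24 characters.

```python
from itertools import accumulate

# Eigenvalues for the first 24 components, copied from Table RD2.
EIGENVALUES = [
    6.5437, 3.3647, 2.6636, 2.0757, 1.6815, 1.5759, 1.2746, 0.9821,
    0.7288, 0.5660, 0.5504, 0.4414, 0.3549, 0.3341, 0.2447, 0.1505,
    0.1275, 0.1097, 0.1014, 0.0655, 0.0394, 0.0111, 0.0089, 0.0041,
]

# Each proportion is the eigenvalue divided by the total of all eigenvalues.
total = sum(EIGENVALUES)
proportions = [ev / total for ev in EIGENVALUES]
cumulative = list(accumulate(proportions))


def components_needed(threshold):
    """Smallest number of components whose cumulative proportion reaches threshold."""
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    return len(cumulative)


if __name__ == "__main__":
    rows = zip(EIGENVALUES, proportions, cumulative)
    for number, (ev, prop, cum) in enumerate(rows, start=1):
        print(f"{number:2d}  {ev:7.4f}  {prop:5.3f}  {cum:5.3f}")
    print("Components needed for 90%:", components_needed(0.90))
```

With the tabulated values, the first two components account for 0.413 of the variation and 11 components are the fewest that exceed 0.90, in agreement with the figures quoted in the text.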

3.       PCA Graphs

The scree plot, Fig. RD1, for quantitative characters only, confirms that while there is one notable inflection point at five components, there is arguably another at 10, which is more consistent with the eigenvalues.  In subsequent processing 11 components are therefore used, to incorporate >90% of the total variation.


Fig. RD2a shows the PCA Outlier Plot for quantitative characters; the characters are adjusted for scale in this analysis.  There is a consistent spread amongst the sheets, with none clearly outstanding.  This suggests that no Sheets are anomalous compared to the others; consequently, succeeding analyses may be undertaken without re-checking any Characters or Sheets.

It can be seen from the Score Plot above, Fig. RD2b, that there is a significant skew in the clustering of Sheets.  However, the outlier on the first component axis is Sheet 37, which has visibly different characteristics; this is therefore likely to be a valid finding.  The result does not indicate any strong groupings determined by these two axes.

Score and Loading Plots with and without the omitted characters in Table RD1 were compared; the omitted characters had little observable impact on the first two component axes, which further supports their exclusion.

The Loading Plot, Fig. RD3, highlights numerous characters with negative correlations on both component axes.  Table RD3 lists those characters contributing most significantly to the variation on the first axis.  The second axis is notable for a very strong –ve correlation between character 49 (number of medium length R2s) and several others, in particular character 12 (number of spikelets down one side of spike), B1 length and R1 length.

Strong -ve on Axis 1                                    Strong +ve on Axis 1
30 – Glume length                                       46 – R1 count (categorical)
48 – R2, no. short                                      19 – B2 Length
22 – Spikelet length                                    11 – Spike length
36 – Stigma depth of divisions                          R19&20 – Ratio B2 length / width
R22&25 – Ratio spikelet length / florets along spike    21 – P2 length
25 – Florets along Spikelet axis                        17 – B2 number

Table RD3.  PCA results, Axis 1.  Characters within a column are strongly +vely correlated with each other; characters in col. 1 are -vely correlated with those in col. 2.

There is a clear clustering of characters which are +vely correlated on the first axis and intermediate on the second axis.  These are listed in Table RD4.

Characters
18 – no. thin B2s
19 – B2 length
20 – B2 width
50 – R2, no. long
51 – R2 max. length
9 – P1 length

Table RD4.  PCA results.  Grouped characters operating on the variation in a single direction.

However, it is clear from this result and the table of eigenvalues that the variation in the specimens is not explained by a small number of clear components.

Multivariate Cluster Analysis (MCA)

1.       Clustering (Groups)

Multivariate cluster analysis was undertaken to determine the extent to which groupings in the specimens might inform later analyses.  Dendrograms were generated using quantitative characters only, Fig. RD4, and all the characters, Fig. RD5; quite different results were achieved.  The squared Euclidean distance applied to the Ward method served to increase the differences between groups.  In the results, clusters (groups) are numbered left to right.  They show high consistency in three groups, 4, 5 and 6, with 5 and 6 being most separated from the others.  Distances between groups 1, 2 and 3 are generally consistent for both analyses; however, groups 1 and 2 have low consistency.  Group 2 is significantly larger, at the expense of group 1, in the analysis that includes non-quantitative characters.  The overall distances between groups are greatest using the quantitative and non-quantitative characters together (“all Characters”), with 392.32 units of distance compared to 292.75 with quantitative characters only.  On this evidence, and following comparisons of box and whisker plots generated from all the characters using both cluster arrangements, the clusters from all characters were selected for all further analyses.  Both clusterings require further analysis, since the differences are notable.  Clusterings with 4, 5 and 7 groups were also generated, but without improving the results.
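As a reading aid for the linkage described above, the following minimal Python sketch computes the quantity that Ward's method minimises at each merge: the increase in within-cluster sum of squares, which equals a size-weighted squared Euclidean distance between the two cluster centroids.  The measurements are invented for illustration and are not taken from the herbarium sheets, the function names are hypothetical, and the scaling of the distances reported by the software used in this study may differ from this raw value.

```python
def centroid(rows):
    """Mean of each character (column) across the sheets (rows) in a cluster."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters are merged."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    weight = n_a * n_b / (n_a + n_b)
    return weight * squared_euclidean(centroid(cluster_a), centroid(cluster_b))


# Hypothetical standardised measurements: two characters per sheet.
group_x = [[0.0, 0.0], [2.0, 0.0]]
group_y = [[1.0, 4.0], [1.0, 6.0]]
group_z = [[1.0, 1.0]]

print(ward_merge_cost(group_x, group_y))  # distant groups: large merge cost
print(ward_merge_cost(group_x, group_z))  # nearby groups: small merge cost
```

Because the cost depends on the square of the centroid separation, well-separated groups are pushed apart much more strongly than neighbouring ones, which is consistent with the observation that this combination increased the differences between groups.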

 

2.       Box and Whisker (Box) Plots

Boxplot graphs of all selected (remaining) characters are presented for the 6 groups generated by MCA, Fig. RD6. 

From these graphs the outliers, marked “*”, can be reviewed.  It is clear that groups 1, and to a lesser extent 2, have the greatest preponderance of outliers, representing measurements outside of the 75th percentile.  Characters 46, 47, 48, 21 and 45 are indicative of this (a significant proportion of the observations for character 45 are missing due to their absence on the herbarium sheets, which probably explains the unusual distribution of these character values).  This finding may be due to the larger number of sheets in each of these groups, compared to groups 3-6.  Ideally each of these outliers requires checking for atypical material or erroneous measurements.
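Box plots conventionally flag a value as an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile.  The Python sketch below applies that rule as an aid to the checking suggested above.  It assumes that this convention applies to the plots in Fig. RD6; the quartile interpolation used by a given statistics package may differ slightly, the function name is hypothetical, and the data values are invented.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return values lying more than whisker * IQR beyond the quartiles.

    Entries of None (characters that could not be scored) are skipped.
    """
    observed = sorted(v for v in values if v is not None)
    q1, _median, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in observed if v < lower or v > upper]


# Hypothetical measurements for one character within one group;
# None marks an observation missing from the herbarium sheet.
measurements = [2.1, 2.4, 2.5, None, 2.6, 2.8, 3.0, 3.1, 6.9]
print(boxplot_outliers(measurements))
```

In this invented example only the value 6.9 is flagged; a sheet flagged in this way would then be re-examined for atypical material or a measurement error.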

There is considerable variation in the types of spread and skew in the values for different Characters.  Character 4, R1 length, shows un-skewed, normal distributions in all groups except 6, which contains only one sheet.  Character 51, R2 maximum length, by contrast shows a skew, with several Sheets clustered in a very close range with short R2s.  Neither of the binary characters, 18 (no. of thin B2s) and 45 (stem shape), is properly represented by these graphs.

A number of characters are helpful in distinguishing the groups, but none clearly separates all the groups; it is therefore necessary to regard them in combination to adequately “identify” a sheet.  The key, Tab Results – Key, uses the most informative characters.


1.       Scatter Plots

PCA, an eigenanalysis of patterns of correlation between variables, and PCO, based on a matrix of distances between specimens measured using a Gower General Similarity Coefficient, are presented on scatter plots.  Results represent the likely positions of the specimens in relation to each other along the axes shown, depending on the analysis used.  PCA using a standard correlation matrix in Minitab was compared to results from the Gower coefficient using MVSP; the latter produced the better clustering of the groups identified by MCA.
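For readers unfamiliar with it, Gower's general similarity coefficient averages a per-character similarity over the characters recorded for both specimens, so that quantitative and categorical characters can be mixed and missing values tolerated.  The following Python sketch illustrates the idea only.  The character names, ranges and values are hypothetical, categorical characters are scored here by simple matching, and MVSP's exact handling of binary characters and missing data is not reproduced.

```python
def gower_similarity(spec_a, spec_b, ranges):
    """Gower similarity between two specimens given as dicts of character -> value.

    ranges maps each quantitative character to its range (max - min) across all
    specimens; characters absent from ranges are compared by simple matching.
    A value of None means the character could not be scored on that sheet.
    """
    scores = []
    for character, value_a in spec_a.items():
        value_b = spec_b.get(character)
        if value_a is None or value_b is None:
            continue  # skip characters missing on either sheet
        if character in ranges:
            # Quantitative: 1 minus the difference scaled by the character's range.
            scores.append(1.0 - abs(value_a - value_b) / ranges[character])
        else:
            # Categorical: 1 for a match, 0 otherwise.
            scores.append(1.0 if value_a == value_b else 0.0)
    if not scores:
        raise ValueError("no characters recorded for both specimens")
    return sum(scores) / len(scores)


# Hypothetical values for three characters on two sheets.
ranges = {"spike_length": 20.0, "glume_length": 2.0}
sheet_1 = {"spike_length": 12.0, "glume_length": 1.5, "stem_shape": "trigonous"}
sheet_2 = {"spike_length": 17.0, "glume_length": 2.0, "stem_shape": None}

similarity = gower_similarity(sheet_1, sheet_2, ranges)
print(similarity, 1.0 - similarity)  # similarity and one possible derived distance
```

Here the missing stem shape on the second sheet is simply left out of the average, so the similarity rests on the two characters scored on both sheets.  Converting the similarity to a distance as one minus the similarity is one common choice and is shown only as an assumption.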

The two plots show the first and second PCA axes of variation, Fig. RD7, and the first and third PCO axes, Fig. RD8.  The symbols used to display the relative locations of specimens on both plots are those determined by MCA using all characters for the analysis.  The specimens are labelled in the format An1-Sn2, where An1 is the Kew Herbarium Area and Sn2 is the specimen (BRAHMS Accession) number.

These results have many similarities, and give indications of how the specimens are related.

·      The MCA results are clearly correlated with both PCA and PCO: clusters are reflected in the distribution of specimens in the scatter plots.  This suggests that there are some genuine relationships between the specimens, i.e. there is not uniform variation across the species.

·      Groups 6 and 5 are related in all the analyses conducted, including those not shown, strongly along the first axis.  While they are scattered along the second and, to a lesser extent, the third axis, these specimens seem to be related: S37, S33 and S32 were all collected in Madagascar.  S37 is visually quite different from other specimens, so this separation is realistic.  S15 was collected in Botswana; however, the collectors have noted (Heath & Heath 2009) that it might be an introduction there.

·      S35, also from Madagascar, is grouped in group 4 with S17 from Angola; in other analyses, not shown, they probably lie in closer proximity to group 1.  They do not appear to form a consistently natural group.  N.B. S17 is unique amongst those sampled in having B3s which are longer than the spikes and attenuated (green arrows), and is also notable for two very different sizes and shapes of spikelets (pink arrows).  See Fig. RD11.

·      S29 and S31, from eastern South Africa, are in group 1 but consistently lie closer to group 5 than other group 1 specimens do, and may therefore be intermediate between these groups.  S30, also from eastern South Africa, and S11, from eastern Mozambique but located close to the others, show similar correlations, but less strongly.

·      Group 3 is not coherent along any of the axes observed during the analysis (1-10) and therefore appears to be a grouping based upon .  S14 was collected in Zambia and S7 in Chad.  The former is reported by the collector to be “much reduced following fire after droughts” and may therefore be atypical; if so, then S7 appears to cluster with group 2 consistently.

·      Groups 1 and 2 cluster in both analyses, but best in the PCA.  The clusters are not clearly separated; it appears that there are specimens intermediate between them.  PCO shows specimens S2 and S19 likely to be closer to group 1 than 2 on component axis 3, and they are clearly intermediate in PCA, along with S26.

·      The PCA shows two very strong clusterings within group 2, in addition to the specimens closer to group 1.

·      Neither the Sicilian specimens, S1 and S2, nor the Israeli ones, S3 and S4, are identified as distant from groups 1 or 2, nor do they cluster in any analyses.

2.       Distribution of Clusters

Fig. RD9.  Distribution of specimens using All Character clusters.  Fig. RD10.  Distribution of specimens using Quantitative Characters.

Mapping the MCA results for both all characters and the quantitative characters, Fig. RD9 and Fig. RD10, allows spatial relationships in the clusters to be visualised; in many respects they reflect the PCA and PCO results.  Groups 1 and 2 are notably different, with the smaller group 2 identified by just the quantitative characters clustered around Uganda and Tanzania, whereas when all characters are analysed, specimens from Sudan and Zambia are drawn into it.  Notably the Madagascan specimens are shown to be clustered; one specimen from the inconsistent group 4 is also Madagascan.


Species

Table RD5 shows the characters Kükenthal used to separate C. papyrus L. from other species in his section Papyrus, and the findings from this study.

Kükenthal character: Bracts (B1) brown and leathery, compared to shorter or longer but rigid, green and foliaceous.
Observations: This character was seen on some sheets; however, bracts of various lengths, some seemingly green and rigid, were also noted.
Reference photos: See Characters media gallery

Kükenthal character: Prophylls (P1) truncate, compared to oblique with two teeth posteriorly.
Observations: Most sheets showed oblique prophylls which were bi-dentate.
Reference photos: See Characters media gallery

Kükenthal character: Stem robust, to 5m, compared to slender, 45-60cm.
Observations: Generally this character held, in so far as it could be determined; however, frequently the size of the plant was not recorded.  At least one plant was noted to be only
Reference photos: See Characters media gallery

Kükenthal character: Inflorescence complex, diffuse, compared to simple.
Observations: Inflorescences varied enormously, but were both diffuse and constricted and sometimes with very few rays; however, they were always complex in respect of compound arrangements with spikes and spikelets.
Reference photos: See Characters media gallery

Kükenthal character: Rays (R1) numerous, subequal, to 45cm, compared to 6-7, unequal, 2-6cm.
Observations: Rays were neither equal nor always numerous; however, compared to the alternative of 2-6cm, they were usually considerably longer.
Reference photos: See Characters media gallery

Table RD5.  Analysis of Species level characters defined by Kükenthal.

This is further evidence that the species delimitation is not robust, and needs to be reviewed.  While characters should be used in combination rather than individually, there were several sheets showing multiple differences from Kükenthal’s key characters.

Other authors report that confusion with other species can occur, despite the very robust size of C. papyrus.  Wheeler Haines and Lye (1983), for example, cite C. penzoanus as a species sometimes confused with it.  While it is never quite as robust (they report a maximum size of 2-3m), there is an overlap in range.  The most clearly distinguishing feature is the presence of leaf blades, which are absent in C. papyrus according to all accounts (Chiovenda 1931, Kükenthal 1935).  In herbarium material, neither the size nor the presence / absence of leaf blades is consistently recorded, the latter never.  This may be an indication that identification in the field was positive, but it might also allow mis-identifications if the specimen was later determined from dried material.  The photograph below clearly shows leaf blades on young shoots of C. papyrus.
