Feasibility study into combining data resources to explore the drivers of crop yield in winter wheat and oil seed rape

Project report from FERA (James Rainford, Glyn Jones, Roy Macarthur & David Garthwaite) for Defra exploring 'big data' analysis of yields.

Summary:

Analysis of the drivers of yield

This work represents an initial exploration of fusing several large-scale datasets relating to UK agricultural practice to examine the drivers of farm-level yields of focal crops. The following datasets were identified for combination, as relevant to understanding crop yields in the UK:

  • FERA pesticide usage survey (PUS), aggregated to farm level
  • June farm survey
  • DEFRA crop pest survey
  • Weather data from the Meteorological Office
  • Soil type information from the National Soil Map
  • Annual economic information from the John Nix farm management pocketbook

Datasets were combined for the biennial period between 2000 and 2010, based on the associated labelling information and geolocation via the June survey. The combined data represent an analytical dataset through which the potential drivers of yield in winter wheats and oilseed rape (OSR) could be examined.

The analytical approach combines parametric statistical analysis with high-flexibility machine learning techniques (Random Forest) to provide hypothesis-driven and heuristic approaches to revealing the key factors that may be used to predict yield. We defined a core feature set (a limited number of factors that we expected to be important based on existing literature) and an expanded feature set (all the relevant variables within the study). We undertook statistical hypothesis testing on the core feature set, and used machine learning to describe the potential shapes and observed importance of the significant drivers identified. Machine learning was also used to explore the expanded feature set, to identify factors that may have been missed in the definition of the core set and which might warrant further investigation into their impacts on yield.

Wheat varieties were divided into two categories (bread wheats and feed wheats) based on the grouping system provided by the National Association of British and Irish Flour Millers (“nabim”).

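The dataset combination described above can be pictured as a keyed join. The following is a minimal sketch only, not the procedure used in the study: it assumes hypothetical record layouts in which pesticide usage and June survey rows share a `holding_id` and `year`, and weather rows are linked through a `grid_cell` carried by the June survey. All field names and values are invented for illustration.

```python
# Hypothetical records: every field name and value here is illustrative only.
SURVEY_YEARS = range(2000, 2011, 2)  # biennial years: 2000, 2002, ..., 2010


def index_by_key(records, key_fields):
    """Index a list of dict records by a tuple of key fields."""
    return {tuple(r[f] for f in key_fields): r for r in records}


def fuse(pus, june, weather):
    """Inner-join holding-level rows, keeping only rows found in every source.

    pus:     pesticide usage rows keyed by (holding_id, year)
    june:    June survey rows keyed by (holding_id, year), carrying a grid cell
    weather: weather rows keyed by (grid_cell, year)
    """
    june_idx = index_by_key(june, ("holding_id", "year"))
    weather_idx = index_by_key(weather, ("grid_cell", "year"))
    fused = []
    for row in pus:
        if row["year"] not in SURVEY_YEARS:
            continue  # outside the biennial study period
        june_row = june_idx.get((row["holding_id"], row["year"]))
        if june_row is None:
            continue  # no geolocation available for this holding
        weather_row = weather_idx.get((june_row["grid_cell"], row["year"]))
        if weather_row is None:
            continue  # no weather record for this location and year
        fused.append({**row, **june_row, **weather_row})
    return fused


pus = [
    {"holding_id": "H1", "year": 2004, "n_actives": 14, "yield_t_ha": 8.1},
    {"holding_id": "H2", "year": 2004, "n_actives": 9, "yield_t_ha": 7.2},
    {"holding_id": "H1", "year": 2005, "n_actives": 11, "yield_t_ha": 7.9},
]
june = [{"holding_id": "H1", "year": 2004, "grid_cell": "G7", "area_ha": 120.0}]
weather = [{"grid_cell": "G7", "year": 2004, "rain_mm_jul_dec": 310.0}]

fused = fuse(pus, june, weather)
```

An inner join of this kind silently drops holdings that are missing from any one source, which is one way a fused dataset can become less representative than its inputs.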
For bread wheats, the most important predictors from the core feature set include:

  • the major variety on the holding,
  • the area of the holding (larger holdings having higher yield),
  • proportion of own seed used (lower yields for more own seed),
  • increased yields under dry conditions during the pre-frost period (July to December of the year prior to harvest),
  • number of unique active compounds applied, and total mass of pesticide associated with the holding (both having positive effects on yield).

Machine learning using the expanded feature set reinforced these conclusions. In addition, a strong positive signal associated with the application of growth regulators was observed. Machine learning also revealed a stepwise relationship between yield and the number of compounds: applying 12 or more active compounds was associated with a step up in expected yields.

Different factors were associated with yield change in feed wheats compared with bread wheats:

  • the strongest yield effect was associated with the diversity of applied compounds (positive effect),
  • conditions during the pre-frost period (low rainfall and high humidity being associated with increased yield),
  • a small positive effect of the number of spray rounds conducted, replacing the effect of total pesticide load,
  • an absence of clear effects associated with primary variety.
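The stepwise relationship with the number of compounds is the kind of pattern that tree-based methods pick up naturally, because each tree node splits the data at a threshold. The sketch below shows that elementary step, a single best split chosen by squared error, on synthetic data constructed to step up at 12 compounds. It is an illustration of the general technique, not the Random Forest fitted in the study, and the data are invented.

```python
def best_split(xs, ys):
    """Return (threshold, sse) for the single split minimising squared error.

    This is the elementary step of a regression tree; a Random Forest
    averages many such trees grown on resampled data.
    """
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    candidates = sorted(set(xs))
    best = None
    # Candidate thresholds lie midway between adjacent observed values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Synthetic holdings (invented): yield steps up at 12 or more compounds,
# with a small deterministic wobble standing in for noise.
n_compounds = list(range(6, 18))
yields = [(7.0 if n < 12 else 8.0) + 0.1 * ((n % 3) - 1) for n in n_compounds]

threshold, error = best_split(n_compounds, yields)
```

On this constructed data the chosen threshold falls between 11 and 12 compounds. On real survey data a step found this way would still need confirming, since a split can equally reflect a confounded variable such as holding size.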

Machine learning analysis of the complete feature set indicated, in common with bread wheats, that growth regulators are associated with yields. The recorded proportion of land set aside as grassland was also highlighted for further investigation.

For OSR, the key drivers in the statistical analysis include the diversity of compounds applied and the year of sampling (an overall upward trend in yields during the period, with lower values reported in 2004 and 2008). By far the largest and most important identified driver was latitude, with a noticeable stepped effect of increasing yield north of 52 degrees north (approximately the latitude of Ipswich). The causes of this are unclear and may relate to the distribution of economically important pest species. This interpretation is reinforced by analysis of the expanded feature set, which also revealed the quantity of fungicide applied on a holding as a key determinant of yields in OSR crops.

The applied modelling indicated large amounts of (apparently) random variation in yields which could not be accounted for within the fitted functions. This is likely the result of a combination of intrinsic noise in the combined datasets and limited predictive power in the included variables. Confirmatory studies are advised to further investigate the effects identified and their impacts on crop yield across holdings.

We found that, for all three crops, there was a potential sub-population of sites with unusually low yields compared with their predicted yield. These might represent either unidentified failed crops (with resulting differences in farmer behaviour) or holdings where specific socio-economic factors are associated with reduced yield values (e.g. failure to incorporate innovation). None of the examined drivers, including the expanded feature set, showed useful correlation with the observation of low yield, suggesting a role for other potential drivers outside the scope of this analysis.

In conclusion, our results suggest that diversity of agrochemical inputs is a key component of observed landscape-level yields in both wheats and OSR. We also show that climatic conditions during the growth year are key predictors for wheat, while OSR appears largely driven by latitude. A less consistent effect of total pesticide load is estimated (positive in bread wheats, non-significant for feed wheats, and negative for OSR). We did not find evidence of a relation between soil types or disease prevalence and yield. This may be because our measures of disease prevalence were not suitable for finding a relationship. The use of growth regulators in wheats and fungicides in OSR were identified for potential further study, particularly in the context of standardised field trials.
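One simple way to pick out a sub-population of sites yielding well below prediction is to look at residuals (observed minus predicted yield) on a robust scale. The sketch below is an assumption-laden illustration rather than the report's method: the cut-off `k`, the use of the median absolute deviation (MAD), and all the numbers are hypothetical.

```python
from statistics import median


def flag_low_yielders(observed, predicted, k=3.0):
    """Flag sites whose residual lies more than k robust SDs below the median.

    The scale is the median absolute deviation (MAD), multiplied by 1.4826 so
    that it matches the standard deviation for normally distributed residuals.
    If the MAD is zero, every site below the median residual is flagged, so
    the result needs checking by eye in that case.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    centre = median(residuals)
    mad = median([abs(r - centre) for r in residuals])
    scale = 1.4826 * mad
    return [r < centre - k * scale for r in residuals]


# Invented yields in t/ha: two sites fall far short of their prediction.
observed = [8.2, 7.9, 8.4, 8.0, 3.1, 8.1, 7.8, 2.9]
predicted = [8.0, 8.0, 8.3, 8.1, 8.0, 8.0, 7.9, 8.2]

flags = flag_low_yielders(observed, predicted)
low_sites = [i for i, is_low in enumerate(flags) if is_low]
```

A robust scale is used here because the low-yielding sites themselves would inflate an ordinary standard deviation and so mask the group being looked for.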

Recommendations for future work

The modelling conducted here is preliminary and subject to constraints relating to the content and structure of the underlying datasets. Areas of concern include the representativeness of the yield data available in the PUS (many farms do not provide yield estimates and the remainder may not be fully representative of variation across the UK), as well as the representation of some of the variables, particularly soil type and disease prevalence (which were subject to sampling and aggregation constraints within this study). Some of these issues could be minimised by combining similar analytical approaches with data from standardised plots, such as those run by the AHDB for variety-level yield assessment. These static sites, with standardised input regimes across years, may help in characterising the abiotic components of crop yield (e.g. weather, soil type and latitude).

Modelling the sub-population of low-yielding sites discussed above remains an outstanding challenge for statistical analysis. Hurdle modelling, based on whether localities achieved a commercially viable yield, or similar techniques, may help to refine the models fitted here. Alternatively, increased understanding of how this population might be characterised, e.g. based on socio-economic factors, may provide insight into how it should be represented in any future modelling, and into the consequences for policy decisions around yield.

This work highlights the opportunities and challenges that arise from combining related datasets in the agrarian sector. We provide a framework and discussion relating to the overlapping use of statistical and machine learning based techniques in the numeric analysis of fused datasets. To summarise, statistical procedures are based on an explicit model of the system under study, which depends on knowledge of the relation between the studied factors and of the way in which observations may vary. Where the model is judged to be adequate, this provides greater power and interpretability when testing hypotheses about the workings of the system. The flexibility of machine learning makes it better able to reflect complex relationships that may be present within data (e.g. for forecasting future states) and which may reflect system processes. However, this same flexibility can lead to over-reliance on random variation in the data used to fit the function, and so undermine the generality of the resulting model. These contrasting strengths are important considerations in how methods are to be used in numeric analyses, and in how to structure similar studies in other relevant policy areas.
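The trade-off between an explicit model and a flexible learner can be shown on a toy problem. In the sketch below a straight-line fit stands in for the statistical model and a one-nearest-neighbour predictor stands in for a highly flexible learner. Neither is the method used in the study, and the data are a deterministic construction chosen so that the example is reproducible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns a predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """One-nearest-neighbour: memorises the training data exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# True relation y = 2x, observed with +/-1 "noise" in two different patterns.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [i + 0.25 for i in range(10)]
test_y = [2 * x + (1.0 if (i // 2) % 2 == 0 else -1.0) for i, x in enumerate(test_x)]

line, nearest = fit_line(train_x, train_y), fit_nearest(train_x, train_y)
line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
nn_train, nn_test = mse(nearest, train_x, train_y), mse(nearest, test_x, test_y)
```

The nearest-neighbour predictor reproduces the training data with zero error but does worse than the line on the held-out points, because it has fitted the noise as well as the signal. Methods such as Random Forest reduce this effect by averaging over many trees, but do not remove it.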

 

 

PDF report (2.35 MB)