# Methods for Computing Water Quality Using Regression Analysis

## Table of Contents

## Overview: Multiple Linear Regression

Water-quality measurements often are related to one another, and in many cases, linear regression models can be used to describe the relation. Linear regression relates one or more explanatory variables to a response variable using linear coefficients (Helsel and Hirsch, 2002). Continuously measured values can be regressed with concentrations of discrete, manually-collected water samples to develop regression models for computing continuous constituent concentrations and loads.

The first step in developing an effective regression model for a specific surface-water site is to plot each possible explanatory variable against the response variable and examine patterns in the data. For example, consider how the concentration of suspended sediment (SSC) can be directly proportional to turbidity (Christensen and others, 2000; Uhrich and Bragg, 2003; Lietz and Debiak, 2005; and Rasmussen and others, 2005) or turbidity and discharge (Miller and others, 2008).

**Figure 1.** Graphs showing the relation between turbidity and suspended-sediment concentration
and streamflow and suspended-sediment concentration for Little Arkansas River at
Sedgwick, Kansas, 1999-2005.

Scatterplots and correlation coefficients of explanatory variables (turbidity and streamflow) to response variables (suspended-sediment) are a simple way to identify which of the variables are closely related and whether data transformation might improve the relation between explanatory and response variables. Generally, the closer the correlation coefficient is to 1 (perfect positive correlation), the stronger the association between variables.

A regression model can be created that relates turbidity and discharge (explanatory variables) to suspended sediment (response variable) (Eq 1).

SSC_{i} =
β_{0} + β_{1}T_{i} + β_{2}Q_{i} + ε_{i}

(Eq 1)

where: | |
---|---|

SSC_{i} |
is the response variable, regression computed suspended-sediment concentration, |

i | is 1, 2, ...n measurements of variables, |

T_{i} |
is measured turbidity, |

Q_{i} |
is measured streamflow, |

β_{0}, β_{1}, β_{2} |
are constant coefficients calculated by regression analysis, and |

ε_{i} |
is the random error of the regression, which has a mean value of zero. |

Optimal values for β coefficients are calculated using n simultaneous measurements of both the explanatory and response variables. The least squares method is used to calculate optimal coefficient values.

A regression model (Eq 1) also can be expressed in matrix notation (Eq 2; Draper and Smith, 1998). This notation is convenient and concise for more than one explanatory variable, and is used in the section "Quantifying Uncertainty."

Y =
X β + ε

(Eq 2)

where | |
---|---|

Y | is a vector of measurements of the response variable SSC, suspended-sediment concentration, |

X | is a matrix of all n measurements of the explanatory variables T and Q, turbidity and streamflow, |

β | is a vector of constants calculated by regression analysis, and |

ε | is a vector of random errors. |

It is common to log transform explanatory and response variables of hydrologic data in order to eliminate curvature and simplify analysis (Ott, 1993, p. 454). Log transformation can help meet linear regression assumptions that residuals are normally distributed and of constant variance.

**Figure 2.** Graphs showing the relation between turbidity and suspended-sediment concentration and
streamflow and suspended-sediment concentration on log_{10} scale for Little Arkansas River
at Sedgwick, Kansas, 1999-2005.

The graphical plots of turbidity and suspended-sediment concentrations and streamflow and
suspended-sediment concentration in log10 space indicates improved correlation and
suggests that log_{10} transformation may improve the regression model. Measurements
such as turbidity and discharge can be used to compute water-quality constituents such as
suspended-sediment concentration. In this example, all the variables are log_{10}
transformed resulting in the following model form:

log_{10}SSC =
β_{0} + β_{1} log_{10}T + β_{2} log_{10}Q

(Eq 3)

where | |
---|---|

SSC | are the computed values of the response variable, suspended-sediment concentration, and |

T and Q | are measurements of the explanatory variables, turbidity and discharge, respectively. |

Log_{10} transformation of the response variable (SSC) has a consequence that must be
considered when computing suspended-sediment concentrations. However, computed values must be
retransformed back to the original units, a step that introduces a bias (typically negative)
in computed suspended-sediment concentration values (Miller, 1951; Koch and Smillie, 1986)
unless the data are perfectly and positively correlated. The bias arises because regression
predicts the mean of a normal distribution in log units, and the retransformed value
results in a value closer to the median value than the mean in linear space.
To correct for retransformation bias, the retransformed SSC must be multiplied by a bias
correction factor (BCF; equation 4).

SSC =
10^{ [ β0 + β1 log10T + β2 log10Q ]}
x BCF

(Eq 4)

where | |
---|---|

BCF | is the bias correction factor. |

The random error term (ε_{i}) does not appear in equations for
computing mean concentration of suspended sediment. This is because a regression-computed
concentration assumes the mean random error is zero.
However, any computed value has associated error caused by measurement uncertainty,
the inability to fully capture natural variability in a mathematical equation, and other factors.
This is considered separately in the section "Quantifying Uncertainty."

Duan (1983) introduced a nonparametric bias-correction factor called the "smearing" estimator (Helsel and Hirsch, 2002, p. 256). The smearing factor is calculated from the mean of the residual values (Eq. 5; Duan, 1983)

(Eq 5)

where | |
---|---|

SSC_{i} |
is the i^{th} measured suspended-sediment concentration, |

SSC_{i} |
is the i^{th} regression-computed suspended-sediment concentration, and |

n | is the number of measured suspended-sediment concentrations in the model-calibration dataset. |

Graphical plots of the actual and regression computed values indicate the performance of the model.

**Figure 3.** Graph showing comparison of observed and computed suspended-sediment concentration in log space,
Little Arkansas River near Sedgwick, KS. 1999-2005.

All equations shown in this example using suspended sediment, turbidity, and discharge. This approach to regression analysis can be used for any type and number of variables.

## Regression Model Development

Regression analysis is used to develop relations between discrete laboratory analyses of manually collected water samples (response variables) and continuously measured water-quality data (explanatory variables).

Stepwise-regression analysis sometimes is used to help determine the optimal
selection of explanatory variables.
The appropriate form of each model can be determined by log_{10}-transforming all the variables
(Eq 3), starting with a formula using all possible explanatory variables, and then
removing them one at a time, both in forwards and backwards steps.

There are many choices for explanatory variables, including: continuous real-time measurements of specific conductance, pH, water temperature, turbidity, dissolved oxygen, and streamflow. Explanatory variables are included in models only if there is some physical basis or explanation for their inclusion; also, the addition of a variable to a model must make a significant improvement in model performance.

Additional factors also are used to guide the model creation process, which can be subjective and involves the best professional judgment of the researcher. For example, addition of an explanatory variable can cause a potentially erroneous coefficient to shrink, thus suggesting multi-collinearity, which also must be taken into consideration when refining a model. Correlation (Pearson's r) coefficients muticolinearity between explanatory variables greater than 0.95 indicates multicolineatriy and suggests that only one variable is necessary. Selection of an appropriate model is a somewhat subjective process that relies on the best professional judgment of the modeler. Helsel and Hirsch (2002) provide a more detailed summary of techniques for model development.

The multiple R^{2} (coefficient of determination),
RMSE (root-mean-squared error), Cp, PRESS (prediction error sum of sqaures),
and other measurements are used as guides for optimal selection of explanatory variables.
The R^{2} gives the proportion of variance in the
response variable that is explained by the explanatory variables.
To add value to a model, adding a variable should substantially increase
the R^{2}, decrease RMSE and minimize Cp and PRESS while
decreasing the residual standard deviation
(Helsel and Hirsch, 2002).
Finally, regression coefficients must be sufficiently different from zero and substantially
large enough to define a statistical relation.

Regression models (of the form Eq 3) are loaded into underlying website software that computes concentrations of response variables and generates plots and data tables. These computed concentration values also are multiplied by the computed streamflow value to obtain instantaneous computations of the loads or mass of a constituent per unit of time.

## Quantifying Uncertainty

All regression models have uncertainty inherently associated with each computation. Uncertainty can be defined in a number of different ways, including relative percent difference, absolute error, and prediction intervals. Prediction intervals, used by this website, define a range of values for the response variable for a given level of certainty. The level of certainty presented with the modeled data is the 90-percent prediction interval. The larger the range, the more uncertainty there is associated with the regression computed value.

Calculating prediction intervals for regression models with two or more explanatory variables involves matrices (Eq 6; Draper and Smith, 1998).

(Eq 6)

where | |
---|---|

t | is the value of Student's t-distribution having n-3 degrees of freedom with an exceedance probability of α/2, |

α | is the level of certainty for the prediction interval (1-0.9, or 0.1 for this website), |

p | is the number of explnatory variables plus one, |

s | is the variance of the residuals calculated during model development, |

(X'X)^{-1} |
is the X prime X inverse matrix calculated during model development — an expression of the covariance among all explanatory variables, and |

X | is a vector of the explanatory variable measurements. |

## Probability of Criteria Exceedance

Although prediction intervals are good indicators of uncertainty, they can be difficult to compare against water-quality criteria. Water-quality criteria or guidelines are values defined by best-available scientific evidence indicating some measurable detrimental effect on the biota or water consumers. While a maximum criterion is likely exceeded when the lower bound of a 90-percent prediction interval is above it, interpretation is more difficult when a prediction interval spans, or encompasses, a criterion value.

A convenient way to interpret regression model computed concentrations in the context of water-quality criteria is the probability of exceedance. Probability of exceedance is a single value representing the percent likelihood that a criterion has been exceeded (Eq 7). This approach assumes that a normal distribution centered on the mean computed value describes the uncertainty distribution. For log-transformed response variables, this equation is applied in logarithmic space.

(Eq 7)

where | |
---|---|

Pr | is the probability that the criterion has been exceeded (0 < Pr < 1), |

D | is the cumulative distribution function for the standard normal curve — values for it are obtained from equations that approximate the exact values, and |

RMSE | is the root-mean-squared-error, or standard error of the regression, or standard deviation of the residuals. |

## Limitations of Regression Analysis

A regression model between the response and explanatory variables generally is site-specific and may change over time if changes occur in the sources of the constituent or an improved sensor becomes available. This is because of simplifying assumptions implicitly built into the regression analysis. For example, turbidity measurements are affected by physical properties of suspended-sediment particles such as size, color, and density (Ziegler, 2003; Anderson, 2005). In turn, these physical properties are affected by complex watershed properties such as source sediment lithology, stream morphology, land use distribution, and many other things. All of these complexities are simplified and "lumped" into the values of β developed by regression analysis. It is very unlikely that any two streams (or even one stream at two locations) would have identical suspended-sediment properties. In fact, sediment properties at even one location on a stream will change over time. Therefore, regression analysis is site-specific, and the regression model must be verified annually by continued sample collection and refined as needed.

## Calculation of Constituent Loads

Annual mean daily loads (AMDL) of constituents are calculated by summation of available hourly (or more frequent time steps) loads for a given year, which is calculated using equation 8:

(Eq 8)

where | |
---|---|

C_{i} |
is the instantaneous, hourly concentration at the i^{th} time, |

Q_{i} |
is the instantaneous, hourly streamflow at the i^{th} time, |

CF | is a conversion factor from Table 1, and |

n | is the number of available hourly values for a given year (a maximum of 8,760). |

Multiply | by CF | and by | to obtain |
---|---|---|---|

colonies per 100 milliliters (col/100 mL) | 0.02447 | streamflow, in ft^{3}/s |
billion colonies per day |

micrograms per liter (μg/L) | 0.00539 | streamflow, in ft^{3}/s |
pounds per day |

milligrams per liter (mg/L) | 5.39 | streamflow, in ft^{3}/s |
pounds per day |

Load computations assume that there is no error or uncertainty in the computation of streamflow. Although this is not true, the typical measurement uncertainty of streamflow is less than 5 percent (Carter and Anderson, 1963; Sauer and Meyer, 1992). Measurement uncertainty for the sensor values typically are less than 10 percent if common protocols for operation of water-quality monitors are followed (Wagner and others, 2006).

## Calculation of Geometric Mean for Indicator Bacteria

Computed instantaneous indicator-bacteria-concentration data can be viewed using a modified 30-day moving geometric mean (Eq 9). This approach, a type of smoothing function, helps reveal the "central tendency" of the data (Helsel and Hirsch, 2002, p. 285), and also allows for more direct comparisons with some water-quality criteria. A comparison between instantaneous and geometric-mean values is shown in Figure 4.

The position of the 30-day "window" used in these geometric means differs from some approaches -- the window is not centered around the value being calculated. Rather, the calculated value is at the far right of the window. This is consistent with definitions of biological water-quality criteria used by some regulatory agencies, such as the Kansas Department of Health and the Environment.

(Eq 9)

where | |
---|---|

C_{0,GM} |
is the calculated 30-day moving geometric mean for one moment in time, |

n | is equal to 720, which is 24 instantaneous values per day multiplied by 30 days, |

C_{0} through C_{0-i} |
are the current instantaneous value, and the preceding 719 instantaneous values. |

**Figure 4.** Comparison of time-series plots showing computed concentrations of fecal coliform
bacteria (black line), Little Arkansas River near Sedgwick, KS, 2006. (a) Instantaneous values change rapidly and
cannot be compared directly to water-qualtiy criteria
(b) A 30-day moving geometric mean applied to the instantaneous values produces a dataset which can
be compared to water-quality criteria.

## References

- Anderson, C.W., 2005, Turbidity (ver. 2.1): U.S. Geological Survey Techniques of Water-Resources Investigations, book 9, chap. A6, section 7.
- Carter, R.W., and Anderson, I.E., 1963, Accuracy of current meter measurements: Am Soc. Civil Engineers Journal, v. 89 no HY4, p 105-115.
- Christensen, V.G., Jian, X., and Ziegler, A.C., 2000, Regression analysis and real-time water-quality monitoring to estimate constituent concentrations, loads, and yields in the Little Arkansas River, South-Central Kansas, 1995-99: U.S. Geological Survey Water-Resources Investigations Report 00-4126, 36 p.
- Draper, N.R., and Smith, H., 1998, Applied regression analysis, 3rd edition: John Wiley & Sons, New York, New York, 706 p.
- Duan, N. 1983, Smearing estimate--a nonparametric retransformation method: Journal of the American Statistical Association, v.78, p.605-610.
- Helsel, D.R., and Hirsch, R.M., 2002, Statistical methods in water research, Elsevier, Amsterdam, 529 p.
- Insightful Corporation, 2007, S-PLUS 7--Delivering the power of predictive analytics across the enterprise. Accessed 26 Apr 2007.
- Koch, R.W., and Smillie, G.M., 1986, Bias in hydrologic prediction using log-transformed regression models: American Water Resources Association, Water Resources Bulletin, v. 22, no. 5, p. 717-732.
- Miller, C.R., 1951, Analysis of flow-duration sediment rating curve method of computing sediment yield: U.S. Bureau of Reclamation, 55 p.
- Miller, C.V., Gutierrez-Magness, A., Feit-Majedi, B.L., and Foster, G.D., 2007, Water quality in the Upper Anacostia River, Maryland: Continuous and discrete monitoring with simulations to estimate concentrations and yields: U.S. Geological Survey Scientific Investigation Report 2007-5142, 43 p.
- Ott, R.L., 1993, An introduction to statistical methods and data analysis: Duxbury Press, Belmont, California, 1051 p.
- Sauer, V.B., and Meyer, R.W., 1992, Determination of error in individual discharge measurements: U.S. Geological Survey Open-File Report 92-144, 21 p.
- Uhrich, M.A., and Bragg, H.M., 2003, Monitoring instream turbidity to estimate continuous suspended-sediment loads and yields and clay-water volumes in the Upper North Santiam River Basin, Oregon, 1998–2000: U.S. Geological Survey Water-Resources Investigations Report 03–4098, 43 p.
- U.S. Geological Survey, variously dated, National field manual for the collection of water-quality data: U.S. Geological Survey Techniques of Water-Resources Investigations, book 9, chaps. A1-A9.
- Wagner, R.J., Boulger, R.W., Oblinger, C.J., and Smith, B.E., 2006, Guidelines and standard procedures for continuous water-quality monitors—station operation, calibration, record computation, and data reporting: U.S. Geological Survey Techniques and Methods 1-D3, 96 p.
- Ziegler, A.C., 2003, Issues related to use of turbidity measurements as a surrogate for suspended sediment, Accessed 26 Apr 2007.