TE | Variation.com Dr. Wayne Taylor - Taylor Enterprises, Inc. |
Applied Statistics for Engineers and Quality in the FDA Regulated Industries |
A Pattern Test for Distinguishing Between Autoregressive and Mean-Shift Data
Statistical methods such as control charts and change-point analysis are commonly used to determine whether the mean has shifted. Such methods assume independent errors around a possibly changing mean. When such techniques are applied to autoregressive data, erroneous conclusions can result. However, shifts of the mean create autocorrelation between the observations, making it difficult to distinguish mean-shift data from autoregressive data. A pattern test has been devised that can reliably distinguish between these two important cases.
Introduction
Look at Figures 1-3. Which two sets of data are most similar in structure?
Figure 1: Mean-Shift Model
Figure 2: First Order Autoregressive Model - Positive Correlation
Figure 3: First Order Autoregressive Model - Negative Correlation
Would you be surprised to find out it is the plots in Figures 2 and 3? Both were generated using a first order autoregressive model. The plot in Figure 1 was generated using a different model, called the mean-shift model. When analyzing data collected over time, it is important to be able to distinguish between these two important cases. Visual inspection of such data is unreliable. A pattern test has been developed which can reliably distinguish between these two models.
The Mean-Shift Model
Statistical methods such as control charts and change-point analysis assume a series of independent observations collected over time. At one or more points in time the mean may shift. Let X_{1}, X_{2}, ... represent the data in time order. The mean-shift model can be written as
X_{i} = m_{i} + e_{i}
where m_{i} is the average at time i. Generally m_{i} = m_{i-1} except for a small number of values of i called the change-points. e_{i} is the random error associated with the i-th value. It is assumed that the e_{i} are independent and identically distributed with means of zero. Other assumptions including normality may also be made by some of these statistical methods but are not required for the proposed pattern test.
The data shown in Figure 1 was generated using the following model:
e_{i} ~ N(0,1) and independent
m_{1}, m_{21}, m_{41}, m_{61}, m_{81} ~ N(10,1) and independent
For all other i, m_{i} = m_{i-1}
N(m,s) means normally distributed with mean m and standard deviation s. This model could result from a process where the mean shifts as a result of periodic material changes. It could also result from a process subject to both setup and within setup variation. In other cases, the mean-shifts could occur at random times. The proposed pattern test works for any of these situations.
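The Figure 1 model can be simulated in a few lines. This is a minimal sketch using only the Python standard library; the function name `simulate_mean_shift`, its parameters and the seed are illustrative choices, not part of the original analysis, so the output will not reproduce Figure 1 exactly.

```python
import random

def simulate_mean_shift(n=100, segment=20, seed=0):
    """Generate n values whose mean is redrawn from N(10, 1) every `segment` points.

    With the defaults, the mean changes at points 1, 21, 41, 61 and 81
    (1-based), matching the model described for Figure 1.
    """
    rng = random.Random(seed)
    data = []
    mean = None
    for i in range(n):
        if i % segment == 0:                      # a change-point: draw a new mean
            mean = rng.gauss(10.0, 1.0)
        data.append(mean + rng.gauss(0.0, 1.0))   # X_i = m_i + e_i
    return data

series = simulate_mean_shift()
print(len(series), round(min(series), 2), round(max(series), 2))
```

Setting `segment` to a value larger than `n` gives a single mean for the whole series, which is the white noise special case.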
The First Order Autoregressive Model
The data shown in Figures 2 and 3 were generated using the first order autoregressive model:
e_{i} ~ N(0,1) and independent
r_{i} = f r_{i-1} + e_{i}
r_{0} = 0
X_{i} = 10 + r_{i}
f is a constant between -1 and 1. The above model results in a correlation between successive values of:
Corr{X_{i}, X_{i-1}} = f
Values of f=0.7 and f=-0.7 were used respectively in Figures 2 and 3. When f=0, the autoregressive model reduces to what is called the white noise model where X_{i} ~ N(10,1) and independent. This is also a special case of the mean-shift model with no shifts.
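The autoregressive model above can be sketched the same way. The helper names `simulate_ar1` and `lag1_correlation` are hypothetical, and the lag-1 estimator shown is the usual sample autocorrelation, which is an assumption since the text does not say how the correlations were computed.

```python
import random

def simulate_ar1(n=100, f=0.7, seed=0):
    """Generate X_i = 10 + r_i with r_i = f * r_{i-1} + e_i and r_0 = 0."""
    rng = random.Random(seed)
    r = 0.0                                  # r_0 = 0
    data = []
    for _ in range(n):
        r = f * r + rng.gauss(0.0, 1.0)      # e_i ~ N(0, 1)
        data.append(10.0 + r)
    return data

def lag1_correlation(x):
    """Sample correlation between consecutive values."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# For a long series the estimate should settle near f.
print(round(lag1_correlation(simulate_ar1(n=5000, f=0.7)), 2))
print(round(lag1_correlation(simulate_ar1(n=5000, f=-0.7)), 2))
```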
When checking for an autoregressive model, one frequently calculates the autocorrelations and displays them in the form of a correlogram. However, this is only useful for distinguishing between an autoregressive model and white noise. The mean-shift model also results in autocorrelations between the values. In Figure 1 the correlation between consecutive values is 0.43. Looking at the autocorrelations will not allow one to distinguish between these two models.
The Pattern Test
Figure 4 shows the six possible patterns that can result from plotting three consecutive points when there are no ties. Pattern 1 is called the double up pattern and Pattern 6 is called the double down pattern. The other 4 patterns will be referred to as reversal patterns. For the autoregressive model, the double up and double down patterns are most common when there is a positive autocorrelation as in Figure 2. The reversal patterns are most common when there is a negative correlation as in Figure 3.
When the means of the 3 points are the same, all six patterns are equally likely. In this case, the double up and double down patterns should occur 1/3 of the time and the reversal patterns should occur 2/3 of the time. The pattern test involves counting the number of times, S, that the double up/down patterns occur. This count is slightly biased when the mean shifts or there is an outlier. However, the bias is small and easily compensated for, making this count useful for distinguishing between mean-shift and autoregressive data. If S is significantly greater than a third of the number of values, the data is autoregressive with positive correlation. If S is significantly less than a third, the data is autoregressive with negative correlation. Otherwise the mean-shift model fits the observed data.
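Counting the double up/down patterns is straightforward. The sketch below assumes no ties, as in this section; the function name `count_double_patterns` is illustrative.

```python
def count_double_patterns(x):
    """Count the triples (X_{i-2}, X_{i-1}, X_i) that are strictly increasing
    (double up) or strictly decreasing (double down)."""
    s = 0
    for i in range(2, len(x)):
        a, b, c = x[i - 2], x[i - 1], x[i]
        if a < b < c or a > b > c:
            s += 1
    return s

# Triples of [5, 4, 3, 1, 2]: (5,4,3) down, (4,3,1) down, (3,1,2) reversal.
print(count_double_patterns([5, 4, 3, 1, 2]))
```

A series of n values yields n - 2 triples, so the count is compared against roughly (n - 2)/3.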
Figure 4: Six Patterns for Three Consecutive Points
Table 1 gives critical values for S for a 2-sided test with α=0.05 for n between 10 and 200. If S ≤ s_{lower}, the data is autocorrelated with negative correlation. If S ≥ s_{upper}, the data is autocorrelated with positive correlation. Otherwise, the data is consistent with the mean-shift model. These critical values and the approximations given below are all based on the assumption that the number of shifts and outliers is less than 1 per 20 data points. This assumption should rarely restrict the use of this procedure.
Table 1: Two-Sided Critical Values for S = Number of Double Up/Down Patterns (α=0.05)
Note: n = sample size. If S ≤ s_{lower}, the data is autocorrelated with negative correlation. If S ≥ s_{upper}, the data is autocorrelated with positive correlation. Otherwise, the data is consistent with the mean-shift model.
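The decision rule in the note can be written as a small function. This is only a sketch: the function name `classify` is hypothetical, and the critical values must be supplied by the caller. The example uses the n=100 values s_{lower}=24 and s_{upper}=44 quoted in the Applications section.

```python
def classify(s, s_lower, s_upper):
    """Apply the two-sided pattern test given critical values for the sample size."""
    if s <= s_lower:
        return "autocorrelated, negative correlation"
    if s >= s_upper:
        return "autocorrelated, positive correlation"
    return "consistent with mean-shift model"

# n = 100 critical values as quoted in the text for Figures 1-3.
for s in (38, 46, 19):
    print(s, classify(s, s_lower=24, s_upper=44))
```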
Formulas 1 and 2 can also be used to calculate significance levels. If α_{lower} ≤ 0.025, the data is autocorrelated with negative correlation. If α_{upper} ≤ 0.025, the data is autocorrelated with positive correlation. Otherwise, any correlation in the data is the result of mean shifts.
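α_{lower} = I_{p}(a, b)   (1)
where p = (16n - 29)/(30n - 60), a = 10(n - 2)^{2}/(14n - 31) - S and b = S + 1
α_{upper} = I_{p}(a, b)   (2)
where p = (147n - 310)/(315n - 600), a = S and b = (21n - 40)^{2}/(4(147n - 310)) - S + 1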
I_{p}(a,b) is the incomplete beta function. The derivation of these formulas is given in Appendix A. They are within 2% of the true value for 0.01 ≤ α ≤ 0.1 and n ≥ 10. Formulas 3 and 4 give a second less accurate approximation that can be used when n ≥ 100.
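α_{lower} = Φ((S + 0.5 - (n - 2)/3) / \sqrt{(16n - 29)/90})   (3)
α_{upper} = 1 - Φ((S - 0.5 - (21n - 40)/60) / \sqrt{(16.8n - 29)/90})   (4)
where Φ is the standard normal cumulative distribution function.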
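The normal approximation can be sketched with the standard library alone. The moments used here are assumptions worked out from the P_{i} moments listed in Appendix A: E{S} = (n - 2 + t)/3 and Var{S} = (16n - 29 + 16t)/90, with t = 0 for the lower level and t = n/20 for the upper level. The function names are illustrative, and the approximation is only intended for n ≥ 100.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def pattern_test_normal(s, n, shift_rate=1 / 20):
    """Approximate (alpha_lower, alpha_upper) with a continuity correction.

    Assumed moments (derived from Appendix A, not quoted from it):
    E{S} = (n - 2 + t)/3 and Var{S} = (16n - 29 + 16t)/90.
    """
    mean0 = (n - 2) / 3                       # lower level uses t = 0
    var0 = (16 * n - 29) / 90
    alpha_lower = normal_cdf((s + 0.5 - mean0) / math.sqrt(var0))

    t = n * shift_rate                        # upper level uses t = n/20
    mean_t = (n - 2 + t) / 3
    var_t = (16 * n - 29 + 16 * t) / 90
    alpha_upper = 1.0 - normal_cdf((s - 0.5 - mean_t) / math.sqrt(var_t))
    return alpha_lower, alpha_upper

for s in (19, 38, 46):
    lower, upper = pattern_test_normal(s, n=100)
    print(s, round(lower, 4), round(upper, 4))
```

With n=100, S=19 gives a lower level below 0.025, S=46 gives an upper level below 0.025, and S=38 gives neither, in line with the conclusions reported for Figures 1-3.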
Applications of the Pattern Test
Table 2 shows the results of applying the pattern test to the three sets of generated data in Figures 1-3 plus the three real sets of data shown in Figures 5-7. In Figures 1-3, n=100, resulting in critical values s_{lower}=24 and s_{upper}=44. For the mean-shift data in Figure 1, S=38, which falls between the two critical values. This is consistent with a mean-shift model. For the Figure 2 autoregressive data with positive correlation, S=46. This exceeds the upper critical value, indicating the data is not consistent with a mean-shift model. For the Figure 3 autoregressive data with negative correlation, S=19. This is below the lower critical value, again indicating the data is not consistent with a mean-shift model. The α values from Equations 1-4 support these same conclusions. Also shown are the true α values obtained through simulation. All four approximations are accurate to three digits when n=100.
Table 2: Analysis of Example Data Sets
Figure 5 shows the number of sunspots for a 50 year period of time. This data is Series E from Box and Jenkins (1976). The number of double up/down patterns is S=38. This exceeds the upper critical value s_{upper}=23, indicating the data is autoregressive with positive correlation. The α values from Equations 1-4 support this same conclusion.
Figure 5: Wölfer Sunspot Data
Figure 6 shows the yields from 70 consecutive batches of a chemical process. This data is Series F from Box and Jenkins (1976). The number of double up/down patterns is S=9. This is below the lower critical value s_{lower}=15, indicating the data is autoregressive with negative correlation. The α values from Equations 1-4 support this same conclusion.
Figure 6: Batch Yields
Figure 7 shows part strength readings taken once an hour over 52 consecutive hours. The number of double up/down patterns is S=19. This is between the lower critical value s_{lower}=10 and the upper critical value s_{upper}=24, indicating the data is consistent with the mean-shift model. The α values from Equations 1-4 support this same conclusion.
Figure 7: Part Strength
Handling Ties
When ties are possible, two new patterns can occur: the single tie and the double tie. In this case, let P_{i} be defined in terms of X_{i-2}, X_{i-1}, X_{i} as follows:
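P_{i} = 1 if X_{i-2} < X_{i-1} < X_{i} or X_{i-2} > X_{i-1} > X_{i} (double up/down)
P_{i} = 1/2 if exactly one of X_{i-2} = X_{i-1} and X_{i-1} = X_{i} holds (single tie)
P_{i} = 1/3 if X_{i-2} = X_{i-1} = X_{i} (double tie)
P_{i} = 0 otherwise (reversal)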
Further, let S be defined as:
S = \sum_{i=3}^{n} P_{i}
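A tie-aware score can be sketched as follows. The scoring used here is an assumption: 1 for double up/down, 1/2 for a single tie, 1/3 for a double tie and 0 for a reversal. These are the values that keep the expected score at 1/3 when the three points are identically distributed and independent, which is the property stated for P_{i}. Function names are illustrative.

```python
from fractions import Fraction

def pattern_score(a, b, c):
    """Score one triple; the tie scores 1/2 and 1/3 are assumed values."""
    if a == b == c:
        return Fraction(1, 3)        # double tie
    if a == b or b == c:
        return Fraction(1, 2)        # single tie between neighbouring points
    if a < b < c or a > b > c:
        return Fraction(1)           # double up or double down
    return Fraction(0)               # reversal (includes a == c with b different)

def pattern_scores(x):
    """Return the scores P_3, ..., P_n for a series x."""
    return [pattern_score(x[i - 2], x[i - 1], x[i]) for i in range(2, len(x))]

scores = pattern_scores([1, 2, 2, 3, 1])
print(scores, sum(scores))
```

Exact fractions are used so that the sum S is not affected by rounding of 1/3.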
When X_{i-2}, X_{i-1}, X_{i} are identically distributed, E{P_{i}} = 1/3. Again a test for autoregression can be constructed based on S averaging above or below 1/3 the number of patterns. If the number of ties is small, Table 1 and Equations 1-4 may still be used. But if ties are more common, Table 1 and Equations 1-4 can no longer be used because the ties reduce the variation of S. Instead Equations 5-8 should be used:
_{} (5)
where _{},
_{} (6)
where _{},
_{} (7)
_{} (8)
Estimates of Var{P_{i}}, Cov{P_{i},P_{i+1}} and Cov{P_{i},P_{i+2}} can be obtained from the data. A special case with numerous ties is pass/fail data. In this case:
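X_{i} = 1 with probability p
X_{i} = 0 with probability 1 - p
Then:
P_{i} = 1/3 if X_{i-2} = X_{i-1} = X_{i}
P_{i} = 1/2 if X_{i-2} = X_{i-1} ≠ X_{i} or X_{i-2} ≠ X_{i-1} = X_{i}
P_{i} = 0 if X_{i-2} = X_{i} ≠ X_{i-1}
This gives:
Pr{P_{i} = 1/3} = p^{3} + (1-p)^{3}
Pr{P_{i} = 1/2} = 2p(1-p)
Pr{P_{i} = 0} = p(1-p)
For pass/fail data, the variance and covariances of P_{i} are:
Var{P_{i}} = p(1-p)/6   (9)
Cov{P_{i},P_{i+1}} = -p(1-p)[1 - 5p(1-p)]/9   (10)
Cov{P_{i},P_{i+2}} = p(1-p)(1-2p)^{2}/36   (11)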
For pass/fail data, an estimate of p can be obtained from the data and substituted into Equations 9-11 to estimate Var{P_{i}}, Cov{P_{i},P_{i+1}} and Cov{P_{i},P_{i+2}}. These estimates can then be plugged into Equations 5-8 to obtain approximate α levels.
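The moments of P_{i} for pass/fail data can be checked by brute force. The sketch below enumerates all 32 outcomes of five consecutive independent pass/fail values and computes the mean, variance and lag-1 and lag-2 covariances exactly. It assumes the tie scoring of 1/3 for a double tie, 1/2 for a single tie and 0 for a reversal; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def score(a, b, c):
    """Tie-aware score for one triple of 0/1 values (assumed scoring)."""
    if a == b == c:
        return Fraction(1, 3)
    if a == b or b == c:
        return Fraction(1, 2)
    return Fraction(0)

def pass_fail_moments(p):
    """Return (E{P_i}, Var{P_i}, Cov lag 1, Cov lag 2) for independent 0/1 data."""
    p = Fraction(p)
    e = e_sq = e_lag1 = e_lag2 = Fraction(0)
    for x in product((0, 1), repeat=5):
        prob = Fraction(1)
        for v in x:
            prob *= p if v == 1 else 1 - p
        p1 = score(x[0], x[1], x[2])
        p2 = score(x[1], x[2], x[3])
        p3 = score(x[2], x[3], x[4])
        e += prob * p1
        e_sq += prob * p1 * p1
        e_lag1 += prob * p1 * p2
        e_lag2 += prob * p1 * p3
    return e, e_sq - e * e, e_lag1 - e * e, e_lag2 - e * e

print(pass_fail_moments(Fraction(1, 4)))
```

In practice p would be replaced by the observed fraction of 1s in the data.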
Other Applications of P_{i}
An example of a data set with ties is shown in Figure 8, which plots 197 chemical concentration readings. This data is Series A from Box and Jenkins (1976).
Figure 8: Chemical Concentration Data
From this data P_{3}, ..., P_{197} can be calculated. The P_{i} values are time ordered data that react to changes in the autoregressive behavior of the data. A CUSUM chart of the P_{i} values is shown in Figure 9. The sudden change in direction in the CUSUM chart indicates a sudden change in the autoregressive behavior of this data.
Figure 9: CUSUM Chart of P_{i} for Chemical Concentration Data
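A CUSUM series of this kind can be sketched as the running sum of deviations from the overall average. That form is an assumption, since the text does not give the formula used for Figure 9, and the function name `cusum` is illustrative.

```python
def cusum(values):
    """Cumulative sum of deviations from the overall mean, starting at 0."""
    mean = sum(values) / len(values)
    total = 0.0
    path = [0.0]
    for v in values:
        total += v - mean
        path.append(total)
    return path

# A level change after the third value shows up as a change in direction.
path = cusum([1, 1, 1, 3, 3, 3])
print(path)
print(path.index(min(path)))   # position where the direction changes
```

Applied to the P_{i} values, a downward-sloping stretch corresponds to scores below the overall average and an upward-sloping stretch to scores above it.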
A change-point analysis was then performed on the P_{i} using Taylor (2000). This software performs a bootstrap analysis on the CUSUM chart to obtain confidence levels and confidence intervals for the change. The results of this analysis are shown in Figure 10. It verifies a change occurred with 98% confidence. The change is estimated to have occurred just prior to point 145. With 95% confidence it occurred between points 83 and 179.
Figure 10: Results of Change-Point Analysis of P_{i} for Chemical Concentration Data
The average P_{i} before the change is 0.326, which is close to 1/3, indicating a lack of autoregressive behavior. The average P_{i} following the change is 0.542, indicating autoregression with a positive correlation. Separate tests for autoregression were performed on points 1-144 and points 145-197. The results are shown in Table 3. These tests confirm that following the change, the data is autoregressive with positive correlation, while before the change, the data is consistent with the mean-shift model.
Table 3: Pattern Test for Chemical Concentration Data
Conclusion
The pattern test has proven to be useful for distinguishing between two very important models: the mean-shift model and the first order autoregressive model. The pattern test can be used to detect a violation of the assumption of independent errors when control charting data and performing a change-point analysis. The series P_{i} can also be used to detect changes in the autoregressive behavior of the data. It provides a useful new tool for helping to analyze complicated time series data.
Appendix A
The distribution of the test statistic S will be derived assuming no mean shifts or ties. Assume that a series of n data points X_{1}, X_{2}, ..., X_{n} has been collected in time order. Let P_{i} be an indicator function of whether the double up/down pattern occurred for points X_{i-2}, X_{i-1}, X_{i}. Further let:
S = \sum_{i=3}^{n} P_{i}
The average and variance of S are:
E{S} = \sum_{i=3}^{n} E{P_{i}}   (12)
Var{S} = \sum_{i=3}^{n} Var{P_{i}} + 2 \sum_{i<j} Cov{P_{i},P_{j}}   (13)
Assuming no ties or mean shifts, the P_{i} are identically distributed with:
E{P_{i}} = 1/3
Var{P_{i}} = 2/9
Cov{P_{i},P_{i+1}} = -1/36
Cov{P_{i},P_{i+2}} = 1/180
All other covariances are zero. The above moments were calculated by generating the 5!=120 possible patterns for 5 points. Substituting the moments of P_{i} into Equations 12 and 13 gives the following moments for S:
E{S} = (n - 2)/3   (14)
Var{S} = (16n - 29)/90   (15)
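The moments of P_{i} can be reproduced by enumerating the 120 orderings of five points, as described above. The sketch below does this with exact fractions and then sums the terms of Var{S}, assuming n - 2 variance terms, n - 3 lag-1 pairs and n - 4 lag-2 pairs. Function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def is_double(a, b, c):
    """True for the double up or double down pattern (no ties)."""
    return a < b < c or a > b > c

def pattern_moments():
    """Exact moments of P_i from all 5! = 120 orderings of five points."""
    perms = list(permutations(range(5)))
    total = len(perms)
    n1 = sum(is_double(*p[0:3]) for p in perms)
    n12 = sum(is_double(*p[0:3]) and is_double(*p[1:4]) for p in perms)
    n13 = sum(is_double(*p[0:3]) and is_double(*p[2:5]) for p in perms)
    mean = Fraction(n1, total)
    var = mean - mean ** 2                      # P_i is a 0/1 indicator
    cov1 = Fraction(n12, total) - mean ** 2     # Cov{P_i, P_{i+1}}
    cov2 = Fraction(n13, total) - mean ** 2     # Cov{P_i, P_{i+2}}
    return mean, var, cov1, cov2

def var_s(n):
    """Var{S} for n points with no shifts or ties."""
    _, var, cov1, cov2 = pattern_moments()
    return (n - 2) * var + 2 * (n - 3) * cov1 + 2 * (n - 4) * cov2

print(pattern_moments())
print(var_s(100))
```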
When the mean shifts between time i-1 and i, the following values change:
E{P_{i}} = E{P_{i+1}} = 1/2
Var{P_{i}} = Var{P_{i+1}} = 1/4
Cov{P_{i-1},P_{i}} = 0
Cov{P_{i},P_{i+1}} = 0
Cov{P_{i+1},P_{i+2}} = 0
Cov{P_{i-2},P_{i}} = 0
Cov{P_{i-1},P_{i+1}} = 0
Cov{P_{i},P_{i+2}} = 0
Cov{P_{i+1},P_{i+3}} = 0
All other values are as before. The above moments were calculated by generating the (4!)^{2}= 576 possible patterns for 8 points where the first 4 points are all less than the last four points. Let t be the number of shifts. When t shifts occur:
E{S} = (n - 2 + t)/3   (16)
Var{S} = (16n - 29 + 16t)/90   (17)
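The effect of a single shift can be checked with the enumeration described above: 8 points where the first four are all less than the last four, giving 576 orderings. The sketch below computes the exact mean and variance of the six pattern indicators in that window. For comparison, the no-shift values for 8 points are 6/3 = 2 and (16·8 - 29)/90 = 11/10, so the differences give the per-shift increases. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def is_double(a, b, c):
    """True for the double up or double down pattern (no ties)."""
    return a < b < c or a > b > c

def shifted_window_moments():
    """Mean and variance of P_3 + ... + P_8 for 8 points with one large shift
    between the 4th and 5th points."""
    total = total_sq = count = 0
    for low in permutations(range(4)):            # points before the shift
        for high in permutations(range(4, 8)):    # points after the shift
            x = low + high
            s = sum(is_double(x[i - 2], x[i - 1], x[i]) for i in range(2, 8))
            total += s
            total_sq += s * s
            count += 1
    mean = Fraction(total, count)
    var = Fraction(total_sq, count) - mean ** 2
    return count, mean, var

count, mean, var = shifted_window_moments()
print(count, mean - 2, var - Fraction(11, 10))    # increases caused by the shift
```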
Shifts increase both E{S} and Var{S}. To see what effect this has on the critical values, take E{S} ± 2 SD{S} as approximate critical values. Both upper and lower critical values increase as t increases. Figure 11 shows the percentage increase in these approximate critical values as t ranges from 0% to 10% of n. When t is 5% of n, i.e. a change occurs once every 20 points, the critical values increase only 5%.
Figure 11: Approximate Percent Increase in Critical Values As t Increases
Since the number of changes is not known, one cannot exactly determine the distribution of S. However, by assuming an upper bound on the number of changes, one can bound its distribution. It would seem reasonable to expect no more than one change per twenty points (t ≤ n/20). A lower critical value is then calculated based on t=0 changes while the upper critical value is based on t=n/20 changes.
If the P_{i} were uncorrelated, S would follow the binomial distribution. Since the correlations are small, one would expect the binomial distribution to provide a close approximation. The binomial distribution B(x|n_{b},p_{b}) has parameters n_{b} and p_{b}. It has a mean of n_{b}p_{b} and variance n_{b}p_{b}(1-p_{b}). Setting E{S} = n_{b}p_{b} and Var{S} = n_{b}p_{b}(1-p_{b}) and solving for n_{b} and p_{b} gives:
n_{b} = E{S}^{2} / (E{S} - Var{S})   (18)
p_{b} = 1 - Var{S}/E{S}   (19)
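The moment matching can be illustrated numerically. The sketch below solves n_{b}p_{b} = E{S} and n_{b}p_{b}(1-p_{b}) = Var{S} for a given n and number of shifts t. The moments E{S} = (n - 2 + t)/3 and Var{S} = (16n - 29 + 16t)/90 are assumptions worked out from the P_{i} moments listed in this appendix, and the function name is hypothetical.

```python
def matched_binomial(n, t=0.0):
    """Return (n_b, p_b) of the binomial whose mean and variance match S."""
    mean = (n - 2 + t) / 3                 # assumed E{S}
    var = (16 * n - 29 + 16 * t) / 90      # assumed Var{S}
    p_b = 1 - var / mean
    n_b = mean / p_b
    return n_b, p_b

n_b, p_b = matched_binomial(100)
print(round(n_b, 2), round(p_b, 4))
print(round(n_b * p_b, 4), round(n_b * p_b * (1 - p_b), 4))   # recovers the moments
```

Note that n_{b} is generally not an integer, which is why the incomplete beta function is used in place of a binomial sum.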
Since n_{b} may not be an integer as required by the binomial distribution, the more general incomplete Beta function, I_{p}(a,b), will be used. Assuming t changes, the upper and lower significance levels for S can be approximated by:
α_{lower} = I_{1-p_{b}}(n_{b} - S, S + 1)   (20)
α_{upper} = I_{p_{b}}(S, n_{b} - S + 1)   (21)
Equation 1 was obtained from Equation 20 by substituting Equations 18 and 19 and setting t=0. Equation 2 was derived from Equation 21 by substituting Equations 18 and 19 and setting t=n/20. Equation 5 was obtained from Equation 20 by substituting Equations 13 and 16 and setting t=0. Equation 6 was derived from Equation 21 by substituting Equations 13 and 16 and setting t=n/20. Simulations indicate that Equations 20 and 21 are accurate to within 2% of the true value for 0.01 ≤ α ≤ 0.1 and n ≥ 10.
A second less accurate estimate can be obtained by approximating the distribution of S using the normal distribution with continuity correction. This results in Equations 22 and 23. Equation 3 was derived from Equation 22 by substituting Equations 16 and 17 and setting t=0. Equation 4 was derived from Equation 23 by substituting Equations 16 and 17 and setting t=n/20. Equation 7 was derived from Equation 22 by substituting Equations 13 and 16 and setting t=0. Equation 8 was derived from Equation 23 by substituting Equations 13 and 16 and setting t=n/20. These approximations should only be used when n ≥ 100.
α_{lower} = Φ((S + 0.5 - E{S}) / \sqrt{Var{S}})   (22)
α_{upper} = 1 - Φ((S - 0.5 - E{S}) / \sqrt{Var{S}})   (23)
where Φ is the standard normal cumulative distribution function.
References
Box, George E. P. and Jenkins, Gwilym (1976). Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, California.
Taylor, Wayne (2000). Change-Point Analyzer 2.0 software package, Taylor Enterprises, Libertyville, Illinois. WEB: www.variation.com/cpa
Key Words: Mean-Shift, Autoregression, Change-Point Analysis, Control Chart, Time Series
Citation: Taylor, Wayne A. (2000), "A Pattern Test for Distinguishing Between Autoregressive and Mean-Shift Data," WEB: www.variation.com/cpa/tech/pattern.html.
Copyright © 1997-2017 Taylor Enterprises, Inc.
Last modified: September 08, 2017