[Focus 11]

Simple, two-variable Linear Regression

A note on testing the significance of a regression line

 

Regression analysis might be more accurately called 'preliminary prediction analysis'. By generating a 'line of best fit' we can use that line to find a value for the 'y' variable given a specific value for the 'x' variable. All data used must be parametric.

Two useful ways to analyse a relationship are Correlation and Regression. Correlation sets out to determine if there is a relationship between two variables and also measures the strength of any discovered relationship but regression goes one step further.

Linear regression is used to find a model (a formula for the 'line of best fit') for that relationship so that new values of one variable (y) might be predicted given a value for the other (x)...(note that it is never the other way round). The data used must be parametric and normally distributed.

The intention is to place a 'line of best fit' through the plotted points on a scattergraph. This line has a number of interesting properties....it is always straight, and the steepness of its slope (an indication of the way that one variable changes as the other changes) may suggest that the relationship is proportionally balanced or very disproportionately balanced. A line at 45 degrees (where both axes use the same scale) would indicate equal proportionality.

Both procedures require a scattergraph as their starting point. As with correlation (remember though: for correlation no line is drawn), the direction of the slope (up or down) will indicate whether the relationship is positive (directly proportional), i.e. as one variable increases, so does the other.... or negative (inversely proportional), i.e. as one variable increases, the other decreases.

We will be looking at the dependence of one variable upon the other and this in turn may indicate that a 'cause and effect' relationship exists. However, be very cautious about saying that "changes in one variable have caused changes in the other"; causation is very hard to prove.

So whereas correlation looks for an association, regression goes further and utilises any association in order to make predictions of a value in 'y' given a specific value in 'x'.

The predictor (what has actually been measured) is always placed upon the X axis and the variable upon which we wish to make value predictions is always placed upon the Y axis. Put another way: the independent variable (the error-free variable) should be plotted on the X axis and the dependent variable on the Y axis. Remember also that we are always using paired data and, as such, values must not become uncoupled from each other.

What does the term "The Error Free variable on the X axis" mean?

A scattergraph, with a line of best fit inserted, will show each point situated a determinable distance from the line. We only take account of that distance in the vertical direction, i.e. on the Y axis: we record the vertical distance of each point from the line. The horizontal distance, the error in x, is therefore treated as zero.

Look at the following chart.....


so the error in y (when x = 2.8) is 3.6 - 2.7 = +0.9

 

 

The next consideration is how to create this 'line of best fit'...it is not simply a 'by eye' exercise but has to be done mathematically. The line must pass through the mean value of 'x' and the mean value of 'y' but that is insufficient information because it would fail to tell us about the angle or slope of the line. Simply passing through the two mean values would merely give a point of rotation for the line.

We need an equation for the line that takes into account the location of all the intersecting points on the scattergraph....

y = a + bx

where....
'y' is the dependent variable,
'x' is the independent variable,
'a' is the line's intercept on the Y axis (the regression constant),
'b' is the regression coefficient and is a measure of the steepness of the slope.

If the line cuts the Y axis below zero (i.e. the value of 'y' when x = 0 is negative), then 'a' is negative. Furthermore, 'b' can be either positive or negative depending upon which direction the slope of the line takes.

If the scattergraph appears to suggest a curved relationship, the data could be transformed first. Transforming the data (for example, by taking the log of the values of one or both variables) can produce a straight line from a curved one.
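As a rough illustration of that idea (outside SPSS), here is a minimal Python sketch; it assumes numpy is available and uses made-up values for a curved, roughly exponential relationship. Taking the log of 'y' straightens the pattern so that an ordinary straight line can then be fitted.

    import numpy as np

    # Hypothetical curved data: y roughly doubles each time x increases by 1
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.3, 8.9, 17.8, 36.2, 71.5])

    # A straight line fitted to the raw values describes the curve poorly...
    b_raw, a_raw = np.polyfit(x, y, 1)

    # ...but after a log transform the relationship is close to linear
    b_log, a_log = np.polyfit(x, np.log(y), 1)

    # Predictions are made on the log scale and then back-transformed
    print(np.exp(a_log + b_log * 4.5))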

One special case should be mentioned and that is where the regression line passes through the origin, i.e. both x and y values = zero at that point. Special formulae exist for these cases and the procedures outlined here do not apply.


Before dealing with the mathematics, let us consider the following situation....

An international cosmetics company has embarked upon a long-term sales campaign of some new products and this initiative has been supported by an advertising campaign on television. Here are the 24 monthly results for the 2 year period of the campaign. The money spent can be viewed as the independent variable and the sales achieved as the dependent variable. Is the relationship proportional, disproportional or inversely proportional?

Advertising costs (£'000) (x)    Sales value (£'000) (y)
18      50
30      202
31      210
39      287
40      107
42      163
50      303
50      219
55      199
58      140
58      189
60      175
61      265
62      330
66      330
68      245
75      722
78      689
78      405
81      300
83      310
90      491
99      358
100     860
Totals: 1472    7549

Remember the values are paired.

Mean value for x = 1472 ÷ 24 = 61.333

Mean value for y = 7549 ÷ 24 = 314.542

As usual, the first task is to produce a scattergraph to look for patterns in the data. [The SPSS routine to do this is explained later]

There is quite a strong pattern showing, which suggests that a relationship does exist between these two variables. We have calculated a mean value for 'x' and a mean value for 'y'. Any 'line of best fit' is going to have to pass through the special point where those two values intersect, but calculating the slope of the line about that intersection point requires the use of the y = a + bx formula.

[The 'R Sq Linear' figure is part of the SPSS requested output.
The Sq root of the figure shown is the Pearson correlation value (rp) for the dataset, in this case, = 0.704; a fairly strong positive correlation]

When using SPSS for Regression analysis, a very large and comprehensive output is generated.
Of the 7 outputs, the sixth is the most important because it gives us the exact values for the formula of the 'line of best fit'...

The regression constant ('a') is shown as -79.758 (very confusing that SPSS refers to this as 'B') and our regression coefficient is shown as 6.429 (i.e. 'b' in our basic equation). [The best way to remember which figure is which is by inserting them in the equation in the order in which they are needed.]

Substitute the known mean value for 'x' (61.333) into the equation to cross-check the maths. [We know what the answer should be because the mean value for 'y' has already been derived from the table above, i.e. 314.542]


Deriving the mean value for 'y' using the 'straight line formula' route...

y = a + bx

y = -79.758 + [6.429 x 61.333]

y = -79.758 + 394.310

y = 314.55 (rounded)...Correct!
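If you would like to cross-check these SPSS figures outside the package, the short Python sketch below (it assumes numpy is installed) reproduces 'a', 'b' and the Pearson correlation directly from the 24 advertising/sales pairs in the table above.

    import numpy as np

    # The 24 paired monthly observations from the table above
    advertising = np.array([18, 30, 31, 39, 40, 42, 50, 50, 55, 58, 58, 60,
                            61, 62, 66, 68, 75, 78, 78, 81, 83, 90, 99, 100], dtype=float)
    sales = np.array([50, 202, 210, 287, 107, 163, 303, 219, 199, 140, 189, 175,
                      265, 330, 330, 245, 722, 689, 405, 300, 310, 491, 358, 860], dtype=float)

    b, a = np.polyfit(advertising, sales, 1)   # regression coefficient and constant
    r = np.corrcoef(advertising, sales)[0, 1]  # Pearson correlation

    print(round(a, 3), round(b, 3), round(r, 2))   # approx. -79.758, 6.429 and 0.70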


Interpolation and Extrapolation

Now that we have proved the formula is correct, we can insert any value for 'x' and calculate y' (a predicted value).

Never reverse this procedure because that would suggest that the dependent variable was able to influence the independent one. In the above example it would mean that the sales value directly set the advertising budget. There are more advanced methods of regression analysis that do enable us to predict values for x' but we will not be dealing with them here.

Suppose the company wanted to know the predicted sales value when 'x' = £25,000 and when 'x' = £75,000. This process is known as Interpolation.

y' = -79.758 + [6.429 * 25] = 80.97, that is: £80,970

and

y' = -79.758 + [6.429 * 75] = 402.42, that is: £402,420

Go back to the graph and see if these results are reasonable.
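One way to make the interpolation step explicit in code is a small helper like the hypothetical predict_sales function below (the coefficients are simply the SPSS values quoted above); it also refuses any value of 'x' outside the observed range of 18 to 100, which anticipates the warning about extrapolation that follows.

    def predict_sales(advertising_k, a=-79.758, b=6.429, x_min=18, x_max=100):
        """Predicted sales (£'000) for a given advertising spend (£'000)."""
        if not (x_min <= advertising_k <= x_max):
            raise ValueError("outside the observed range of x: extrapolation not supported")
        return a + b * advertising_k

    print(predict_sales(25))   # approx. 80.97  -> about £80,970
    print(predict_sales(75))   # approx. 402.42 -> about £402,420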

There is a problem where we try to determine a figure outside of the range of collected values. Such a procedure is called 'extrapolation' and is not to be recommended. Simply because we have established a straight-line relationship between two variables over a finite range does not mean that this relationship continues above or below that tested range. The behaviour of stocks and shares over time is a good example where changes might occur suddenly and in a stepwise fashion.

We cannot say therefore, that an advertising expenditure of £125,000 would produce sales of £723,870 because we just do not have the evidence. Extrapolation using Time Series Analysis is discussed in a later Focus page.


The Manual method of calculating simple regression

This is a somewhat lengthy process but it is important to understand how the 'line of best fit' has been derived. It is the line that minimises the sum of the squared (vertical) differences between the data points and the line itself.

A small news production company needs to check the expense claims and recorded vehicle mileages of 8 of their regional teams. They want to introduce a more regularised allowance scheme and in order to do this, it is necessary to set up some standardised expenses for standardised journeys.

For example: What is currently the predicted claim for a 50 mile journey?

Here are the results:

The scattergraph seems to show a wide variation. However, we can see a pattern and the slope would appear to be a positive one. For all regression charts, we start from a known value on the X axis, move upwards and then read across for the predicted value on Y.

Q. How would you interpret this pattern? Calculate rp.

Q. What is the main weakness in using this data when trying to make predictions?

Q. Can you make an estimate of the claim for a 50 mile journey by just using the chart?


The 'line of best fit' manual calculation...

Firstly we need to calculate 'b', the slope (regression coefficient). The formula is not as frightening as it looks! However, we do need to sum 4 columns (x, y, x squared and xy) and to square each individual 'x' value. You would be wise to produce a table of the 5 totals required, to avoid confusion, before commencing with the equation...

To calculate 'b', the formula is:

b = [n(Σxy) - (Σx)(Σy)] ÷ [n(Σx²) - (Σx)²]

The totals required are:

Σx = 450
Σ(x²) = 34,500
(Σx)² = 202,500
Σy = 353
Σxy = 25,140

Check these calculations for yourself

n = 8

As with so many apparently complicated statistics formulae, once we have all the sigma totals done, it is only a question of substitution...

b = [(8 * 25,140) - (450 * 353)] ÷ [(8 * 34,500) - 202,500] = 42,270 ÷ 73,500

So the regression coefficient, b = 0.575

The line of best fit has to pass through the mean of both variables, so next we rearrange the equation to read:

a = (mean of y) - b * (mean of x)

The mean of x is (450 ÷ 8) = 56.25 and the mean of y is (353 ÷ 8) = 44.125.
So using the above formula to calculate 'a'......

a = 44.125 - (0.575*56.25)

a = 11.78

So now we have the full and final equation for the line of best fit for this particular set of data...

y = a + bx

y = 11.78 + 0.575x
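As a quick arithmetic cross-check on the manual route (a sketch in plain Python, no extra libraries needed), the sigma totals above give the same 'b' and 'a':

    # Sigma totals from the mileage/expenses table
    n = 8
    sum_x, sum_y = 450, 353
    sum_x2, sum_xy = 34500, 25140

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # 42,270 / 73,500
    a = (sum_y / n) - b * (sum_x / n)                              # mean of y - b * mean of x

    print(round(b, 3), round(a, 2))   # approx. 0.575 and 11.78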

Checking in SPSS..... (Remember that you can use the 'Transform' function to display the x squared
and y squared columns if you wish)

Let us go back to the original data and check out the claim for a 50 mile journey and for an already recorded journey of 40 miles.

y = 11.776 + ( 0.575*50) = 40.53 = £40.53

y = 11.776 + (0.575*40) = 34.78 = £34.78

Q. But y (for 40 miles) had an actual value of £50.00; why the difference? (Clue: read the text box at the top of this page about 'error free', 'error predictions' and 'residuals'.)

Now that we have the equation for the line, we can solve questions such as "If the mileage recorded is 72.5 miles, how much expenses should be claimed?"

y = 11.776 + (0.575*72.5) = £53.46.

Check against the graph....correct!!

Do not be tempted to interpolate in the other direction, i.e. given the amount of expenses claimed..can we estimate the mileage? No!

If you wished to do this it would be necessary to reverse the x & y variables but this is dangerous because you must not reverse the dependency and independence of the respective variables.

Q. Would we be justified in making any statement about 'cause and effect' in this example?

Task: Write a paragraph of explanation of the findings for the team accountants and make your own recommendations for a suitable expenses scheme.


An important note about 'residuals' and Residual Analysis

(see also the note earlier concerning error)

You will have noted that some data points lay much closer to the regression line than others.
Compare the last two data points in the above example for instance. These 'distances from the fitted line' are referred to as the 'residuals'. It is possible to work out both the 'x' and 'y' residuals for any point on the scattergraph but in practice (and for significance testing) we look only at the variation in 'y', that is, along the 'Y' axis as explained earlier.

Let us first take the data point where the mileage was 90 and the expenses were £60.

Just as before, we simply insert into the line formula the value for 'x' to produce a value for 'y' that
will sit on the line....

So when x = 90 what should the value of y be?

y' = 11.776 + (0.575*90) = 63.53 and so the residual (on y) is, in this case, quite small (60 - 63.53 = -3.53), and you will see that the point is below but very close to the line.

However let us now carry out the same calculation when the mileage was 40 and the claim was £50...

y' = 11.776 + (0.575*40) = 34.78 and here the original value for y was £50 and so the residual (on Y) is now...

50 - 34.78 = +15.22.....a much larger value

[Note that values below the line are negative and those above are positive.]
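In code a residual is simply 'observed minus predicted'. The minimal sketch below (Python, using the fitted coefficients quoted above) reproduces the two residuals just calculated.

    def predicted_claim(miles, a=11.776, b=0.575):
        """Claim (£) predicted from the fitted line for a given mileage."""
        return a + b * miles

    # Residual = observed claim - predicted claim
    print(round(60 - predicted_claim(90), 2))   # approx. -3.53 (just below the line)
    print(round(50 - predicted_claim(40), 2))   # approx. +15.22 (well above the line)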

There is another useful output that SPSS can yield which looks at the values of all the residuals on Y and compares them with the values derived from the line of best fit. It is the 'scatterplot of standardised residuals against predicted scores'. If these two sets are plotted against each other we should see a wide scatter of points with no obvious patterning. This confirms the linearity of our plot because the values above and below the line have 'balanced each other out'. If patterning is apparent then the dataset needs to be reexamined because the assumption of linearity would now be suspect. For this example it would look like this....

....and you can see that the scatter is random.
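Outside SPSS, a similar diagnostic plot can be sketched in Python (assuming numpy and matplotlib are installed). 'Standardising' here just means dividing each residual by the residuals' standard deviation, which is close to, though not necessarily identical to, what SPSS reports as ZRESID.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_standardised_residuals(x, y):
        """Scatter standardised residuals against the values predicted from the fitted line."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        b, a = np.polyfit(x, y, 1)
        predicted = a + b * x
        residuals = y - predicted
        standardised = residuals / residuals.std(ddof=1)
        plt.scatter(predicted, standardised)
        plt.axhline(0.0)                      # points should scatter randomly about this line
        plt.xlabel("Predicted value")
        plt.ylabel("Standardised residual")
        plt.show()

    # e.g. plot_standardised_residuals(mileage, claims) with the paired data used above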


The diagram below attempts to show how the variation in a residual value can be broken down into its component parts....


The SPSS procedure.... step by step

Sparky Audio plc build sound systems. In the annual report to shareholders they have included a dataset that showed how the company had monitored the cost of production. The data covered 9 months' production and was updated fortnightly.

To complete the Interpolation stage, we will need to estimate a value on the Y axis from a given value on the X axis and not the other way round. So it is essential to get the variables set up 'the right way round' to begin with. When using SPSS in this context it is important to note that the first window can be quite misleading...

 

For 'dependent' variable read: "the one to be predicted" and for independent variable read: " the one that was measured at source" i.e. the predictor. Production level will dictate the average unit cost so in this case, 'production level' will go on the 'x' axis.

Open SPSS in Variable View and type in 'xprodlevel' and 'yavecost'

Then type in the data as shown above

......or you may wish to use: Spex 51 Sparky regress

As a first step go to 'graphs', 'scattergraphs', 'simple' and place 'xprodlevel' in the x axis variable box and 'yavecost' in the y axis variable box.

Click 'OK'

When you have the scattergraph, double left click on it to open the Chart Editor

Click on the 'Insert text box' and give the chart a title

Right click on any one of the data points and a drop down window will appear. Click on 'Fit line at total'

The R Sq Linear value will also appear.....

Q's What deductions can you make about the relationship of these two variables so far? Would you say the line represents a good fit or not? Calculate rp from the chart details (Care with the sign!).

Go back to the data.

This is how to carry out a regression run using SPSS....

Go to 'Analyse', 'Regression', click 'Linear'. Place 'xprodlevel' in the independent variable box and 'yavecost' in the dependent variable box.

Click 'Statistics' and tick 'Estimates', 'Model Fit' and 'Descriptives'. Click 'Continue'...

Now click 'Plots' put 'ZRESID' in Y and 'ZPRED' in X. Click 'Continue'. Click 'OK'

A full output will appear but go to 'Coefficients' first.....

This shows that our formula for the line is: y = 55.432 + (-2.431x)

CHECK>>>>> From the descriptive output, we know what the mean of both x and y is....

Now substitute the mean of x into the formula:

y' = 55.432 - (2.431 * 9.933)

= 55.432 - 24.147 = 31.285 ... correct!

Finally check out the scatterplot of residuals to confirm that the assumptions of linearity have been
met (see green box above for explanation)...

Remember, if the plot shows no obvious pattern, then the assumptions about linearity and homogeneity of variance have been met and confirm that the analysis has been a valid one.

Q's What is the estimated average cost of production if the output level were fixed at 11,500 units per fortnight? Compare this figure with the actual average total cost recorded of £17500. Calculate the residual variation figure.


Complete this partially worked analysis for yourself....

The Bournemouth Bus Company (BBC) is trying to save money by rationalising their vehicle maintenance programme. As part of this exercise, they are monitoring tyre wear throughout the fleet. 24 of the coaches were inspected and the mean value for the remaining tyre tread was calculated and a note of the vehicle mileage travelled since the tyres were fitted was also recorded.

The current policy is to change tyres when they reach 2mm remaining.

Open: SPsmex 18 bus tyres to see the data....

Go to 'Analyse', 'Regression', click 'Linear'. Place 'mileage' in the independent variable box and 'tread' in the dependent variable box.

Click 'Statistics' and tick 'Estimates', 'Model Fit' and 'Descriptives'. Click 'Continue'...

Now click 'Plots' put 'ZRESID' in Y and 'ZPRED' in X. Click 'Continue'. Click 'OK'

A full output will appear but go to 'Coefficients' first.....

Q. What is the formula for the line?

Task: Produce the requisite scattergraph and insert a 'line of best fit' and a reference line at y = 2.0mm. The 'Graphs' 'drop down' menu is sufficient for this.

Select 'Scatter' and 'Simple'. Place 'mmtread' in the Y axis box and 'mileage' in the X axis box. Type in a suitable title. Click 'OK'

When the basic chart appears, right click the mouse and scroll down to 'Chart object / open'.
Now click on any dot on the chart. Scroll down to 'Add fit line at total', left click the mouse.


Close the window. Place cursor over the Y axis and right click, scroll down to 'Add Y axis reference line'.

Click tab at top of the window for 'reference line'. Type in '2' for the Y axis reference line position. Click 'Apply' and close

Click red 'X' to finish with the chart editor.


You should see this.....

Q. What are your immediate thoughts about this output? Clue: always look at a chart and ask yourself "does it make basic sense"? Would we expect to see a chart showing less tread correlating with increased mileage?

Task: Carry out a full SPSS regression analysis. Complete the regression equation and find 'y' when 'x' = 46,000. Clue: the chart should tell you approximately how much tread should be left.

Q. Does the 'residuals plot' confirm our assumptions concerning linearity.....Clue: refer back to your full SPSS output and remember that 'randomness' is what we are seeking from that particular chart.

Q. The 25th coach is examined at the 49,500-mile point and found to have a tread of 2.2mm; is this lower or higher than expected? Clue: first look at the chart and then use the formula for the line.

Q. If all coaches were inspected at 50,000 miles, how many would have been found with sub-2mm tyres? Clue: use the chart and refer to the Y axis reference line.

Q. How would you use this analysis to help to draft a new tyre maintenance policy for the company? Clue: the first coach to fail the 2mm test had travelled 44,000 miles and 4 of the 24 coaches failed to meet this level when inspected. Note that all 4 had travelled a significant distance with defective tyres. So how should the inspection regime be 'tightened up'? What should be the inspection interval for tyres?


Testing the significance of the line...

Just as we did with correlation, we have to ask the question "How significant is the relationship between the two variables?"

The null hypothesis in each case has to be: "there is no linear relationship between the dependent and the independent variable". In other words, H0 states that the regression coefficient (b) is zero (not significantly different from zero). The 'line of best fit' would then be horizontal, indicating that any change in x had no effect upon the value of y.

The above examples have shown that 'b' can be 'plus' or 'minus' and have a large or small value.

A common method for testing the significance of the line is to carry out a t-test to see if the gradient is significantly different from zero. In SPSS / Regression / Linear, this is done automatically for you and the significance is given in one of the outputs.

The t-statistic is actually testing the value of the regression coefficient, 'b', for significance.
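For readers working outside SPSS, the same test is available from scipy's linregress function, which returns the slope, intercept, Pearson r and the two-sided p-value for the null hypothesis that the slope is zero. The sketch below (assuming scipy and numpy are installed) applies it to the advertising/sales data used earlier, where the slope comes out significant well beyond the 1% level.

    import numpy as np
    from scipy.stats import linregress

    advertising = np.array([18, 30, 31, 39, 40, 42, 50, 50, 55, 58, 58, 60,
                            61, 62, 66, 68, 75, 78, 78, 81, 83, 90, 99, 100], dtype=float)
    sales = np.array([50, 202, 210, 287, 107, 163, 303, 219, 199, 140, 189, 175,
                      265, 330, 330, 245, 722, 689, 405, 300, 310, 491, 358, 860], dtype=float)

    result = linregress(advertising, sales)
    print(result.slope, result.intercept)   # the 'b' and 'a' of the fitted line
    print(result.pvalue)                    # p-value for H0: b = 0 (here well below .01)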

An example:

Is the sale of Wines linked to the amount of shelf space dedicated to these products? Metromart plc assessed the sales figures generated in May 2005 from 15 of their suburban stores. 'Shelf space' becomes the independent variable and 'sales' is the dependent variable.

Task: use dataset:SPsmex 19 Wine Sales and carry out your own analysis.

The Coefficients output looks like this. The 'Sig' column shows the p-value of t

The t statistic tests the regression coefficient for significance. Sig (lower row) is the p-value of t; in this instance it is <.05 but >.01, so the result is significant at the 5% level but not quite at the 1% level.

Q. Is the regression coefficient zero? Yes or No!

Task: Write out the equation for the line and interpret the figures for the Significance for yourself to make sure you are clear.

Caution: When you see .00 or .000 in a Sig column it does not mean that p is zero; it means the p-value is too small to display at that number of decimal places (p < .005 or p < .0005 respectively), so the significance is well beyond the 0.01 level. Students often misinterpret this output to mean either zero or "no output generated".

 


Other situations

Working with two comparative datasets:

It is quite common to plot two or more sets of data (e.g. from two related factories, shops etc) on one chart. [SPSS refers to this as 'overlay']. This will entail two separate regression analyses but does
allow a comparison of correlations (rp), and significance of the 'lines of best fit'.

Below, we have compared the performance of 10 similarly sized service-based companies with 10 equally similarly sized manufacturing companies.

Task: Construct 2 pairs of hypotheses, one for service industries and one for manufacturing.

Use the dataset: SPex 52 regression comparisons

Sector           Formula for the Line     R Sq Linear    rp      Sig
Services         y = 12.91 + 0.43x        0.781          .884    p < .01
Manufacturing    y = 178.47 + 0.026x      0.022          .149    NS

Note that the regression coefficient (b) [Manufacturing] is approaching zero and this can be verified by the fact that the blue line on the chart is approaching the horizontal. To check your equations, always choose a datapoint that falls near to the line. Your value for y will then be close to the actual value.

Q's Why would it be incorrect to undertake a regression analysis of say, both sets of share prices?

.......What can you conclude about the structure of the two sectors investigated?

.......If you invested in a service industry company that currently returned a profit of £0.75m, what share price might you expect to obtain?

.......What difference might you expect to see in your share value between a manufacturing company returning a profit of £250,000 and one returning a profit of £650,000?

------------------------------------------------------------------------------------------------------------------------------------------------------

We can also briefly consider what happens when a second independent variable is added to a single dataset. This suggests that two independent (predictor) factors are 'in play' and both may influence the values of the dependent factor. It will be important to be able to distinguish any proportional influence of these two independent variables. This type of analysis is referred to as multiple regression and will be more fully explored in Focus 12.

The formula for the line changes to: y = a + b₁x₁ + b₂x₂
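As a taster for Focus 12, the sketch below shows how such a two-predictor fit could be done in Python with numpy's least-squares routine. The numbers are purely hypothetical and serve only to illustrate the mechanics of the formula above.

    import numpy as np

    # Hypothetical data: two predictors (x1, x2) and one dependent variable (y)
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y = np.array([4.1, 5.4, 9.1, 10.4, 14.2, 15.3])

    # Design matrix: a column of ones for the constant 'a', then x1 and x2
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coeffs

    print(a, b1, b2)   # y = a + b1*x1 + b2*x2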


What is Analysis of Variance?

Another approach to estimating the strength of a relationship is to look at the total amount of variation existing in 'y' and then to see how much of this can be explained by the variation in 'x'. If you look back at the three charts we have created (mileage vs expenses, production level vs average cost, and mileage vs tyre wear) you may have noted that the 'spread' of the points about the regression line is quite different in each case. It is this property that we can use, by assessing the location of each point in relation to the line and to the mean value for 'y'.

Go back and look at the 'Looking at residuals' chart above.

The process covers a very wide range of modern test procedures and is known as the Analysis of Variance [ANOVA]. This large and important subject begins in Focus 13 .


Go on to 'Exploring advanced methods' ......before opening Focus 12
