A comprehensive glossary of 230 technical terms
(with additional links where applicable)


Note that there are a number of 'expanded definitions' which will be of use, especially if you are using Microsoft Excel. These and many more can be found in Excel by going to 'Insert function' (fx), then 'Statistical'; select the function you are interested in from the dropdown menu and highlight it, then select 'help on this function' in the bottom left of the window (versions vary).

Some extended definitions have been selected for inclusion in this package: Excel definitions



A

Absolute difference The difference between two numbers but ignoring the positive / negative sign.

Accuracy (as opposed to Precision) The closeness that a numerical value has to the true value.

Additive model A model that explains differences by adding up the component parts e.g. A = B+C+D and so B = A - (C+D) etc

Age / Sex pyramid A graphical representation of the composition of a population of a country and summarised as a frequency distribution / histogram. The histogram is shown as a series of horizontal bars with males on the left and females on the right.

Aggregate The value of a single variable resulting from the combination of values from two or more variables. Most index numbers are derived from aggregated numbers. An overall exam mark is usually an aggregate of a number of exam papers.

Algorithm A procedure or set of mathematical rules for performing a specific calculation

Alternative hypothesis (or 'working hypothesis') The alternative case to the null hypothesis; if accepted, it means that samples do differ or that an association or relationship has been shown to exist. Our starting point is always that the null hypothesis holds true, but if we cannot accept the null hypothesis then we must accept the alternative. Mathematically, the alternative hypothesis has to be accepted if the test statistic (calculated in a specific test) exceeds the appropriate critical value.

ANCOVA An acronym for ANalysis of COVAriance. This is a modification of ANOVA that can be utilised when there is a variable, such as IQ (when we are measuring reaction times) or altitude (when we are measuring seed production in a flowering plant species), that might affect the dependent variable we are measuring but is not a variable that we wish to assess directly. The routine is a combination of Regression analysis and ANOVA.

ANOVA An acronym for ANalysis Of Variance: a parametric procedure used to investigate the differences between three or more samples and to determine whether or not they come from a common, normally distributed population. We look at the levels of variation within each sample and also between the samples. If the between-samples variance is similar to the within-samples variance, it is likely that the samples have all been drawn from the same population. If the between-samples variance is much greater than the within-samples variance, then it is less likely that the samples have been drawn from the same population. Also known as the F-ratio test.
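The F-ratio described above can be sketched in plain Python; the three small samples below are purely illustrative:

```python
from statistics import mean

def one_way_anova_f(*samples):
    """Between-samples mean square divided by within-samples mean square."""
    k = len(samples)                               # number of samples
    n = sum(len(s) for s in samples)               # total observations
    grand = mean(x for s in samples for x in s)    # overall (grand) mean
    # Between-samples sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(s) * (mean(s) - grand) ** 2 for s in samples)
    # Within-samples sum of squares (n - k degrees of freedom)
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_ratio = one_way_anova_f([4, 5, 6], [6, 7, 8], [9, 10, 11])  # F = 19.0
```

A large F-ratio, as here, indicates that the variation between the samples dwarfs the variation within them, making a common parent population unlikely.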

Asymmetric Of unequal shape, unbalanced, the term could be applied to a histogram or a distribution for example.

Average A loose term that covers three measures of central tendency; the mode, the median and the mean *** Extended Definition>>>


-------------------------------------------------------------------------------------------------------------------------------------------------------

B


Bar chart A chart which displays (horizontally or vertically) a bar that has a length that is proportional to its value. It must not be confused with a histogram. The former is used to display independent groups or variables whereas a histogram is used to display the data derived from a continuous variable. The general convention is that the bars should not touch each other.

Base year In Index work, a base year value for the variable is replaced with the value of 100. Base-weighted calculations then use this anchor point for all future derivations within the set.

Bias A systematic error that does not diminish with repeated measures. Usually associated with human attitudes to the data being collected.

Binomial model A distribution model where there are two possible outcomes and all trials are independent of each other.
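Under these two assumptions, the probability of each possible outcome count follows directly; a minimal Python sketch:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

binomial_pmf(5, 10, 0.5)  # exactly 5 heads in 10 fair tosses, ~0.246
```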

Bivariate Data that involves two variables and is thus capable of being plotted on an x-y graph.

Block design A type of experimental design or survey layout in which individual treatments can be allocated and is just one attempt to reduce sampling error.

Box and Whisker plots A special type of graphical display that quickly indicates the median, upper and lower quartiles and the range of a dataset.

Break-even point A financial situation where total costs equals total revenue so that the profit is zero.


-------------------------------------------------------------------------------------------------------------------------------------------------------

C

Categorical data Data that can only be classified by being placed in groups e.g. hair colour. Often referred to as Nominal data.

Causation The production of an effect by a cause. This can never be fully proven statistically but may be strongly suggested.

Census A count of the members of a given population and any specified attributes. Often the starting point for any summary / descriptive statistics project.

Central limit theorem This theorem states that, as sample size increases, the distribution of sample means approaches a normal distribution (whatever the shape of the parent distribution), and the overall mean of those sample means approaches the true mean of the population.
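A quick simulation illustrates the idea (the uniform parent population, sample size and number of samples here are arbitrary choices):

```python
import random
from statistics import mean

random.seed(42)
# Parent population: uniform on [0, 1), true mean 0.5
sample_means = [mean(random.random() for _ in range(50)) for _ in range(2000)]
overall_mean = mean(sample_means)  # very close to the true mean of 0.5
```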

Central tendency An all-embracing term to cover the mean, mode and median values that might be extracted from a dataset.

Chi-square test A very flexible Nonparametric test used to test for an association between two frequency distributions. There are two versions; one compares the test set (Observed) with a theoretical set (Expected) and the other version compares two test sets and attempts to establish the probability that both samples were drawn from the same population (or not).
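The Observed-versus-Expected version reduces to a simple sum; a sketch with illustrative die-throw data:

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 throws of a supposedly fair die: expected frequency is 10 per face
chi_square_stat([8, 12, 9, 11, 10, 10], [10] * 6)  # = 1.0
```

The statistic is then compared with the critical chi-square value for the appropriate degrees of freedom.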

Class Interval The set of limits by which data is classified, e.g. 4-6, 6-8, 8-10. The exact boundaries of each class are called the class boundaries or limits.

Climograph A specific scattergraph plotting temperature on the X axis and rainfall on the Y axis.

Closed Question A question with a restricted number of possible responses. The opposite (an open question) allows the respondent to say what he/she likes.

Cluster analysis A statistical technique for classifying cases (usually individuals) using the attributes (and the similarities / differences) of that data and breaking it down into smaller and smaller relational groups. The relational connection between individuals is often displayed graphically using Dendrograms. The best example is the Linnean binomial classification of living things.

Cluster sampling one of a number of sampling techniques where firstly groups are randomly selected and then individuals are randomly selected from within.

Coefficient a constant within or derived through an equation, an index of measurement of a characteristic. E.g. Coefficient of Determination: Explained Variation ÷ Total Variation. (See also: Correlation Coefficient below)

Compound Interest Interest is earned on the money invested and subsequently, that interest also bears interest
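The growth follows the familiar formula A = P(1 + r)^n; a short Python sketch with illustrative figures:

```python
def compound(principal, rate, years):
    """Value after annual compounding at `rate` (0.05 means 5% per year)."""
    return principal * (1 + rate) ** years

compound(1000, 0.05, 10)  # 1000 invested at 5% for 10 years, ~1628.89
```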

Confidence interval The range within a distribution in which the mean lies with a stated probability (usually 95%). The upper and lower boundaries of the range are known as the Confidence Limits. ***Extended definition>>>>>
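For a large sample, a 95% interval for the mean is often approximated as the sample mean plus or minus 1.96 standard errors; a sketch (a small sample would need the t-distribution instead):

```python
from math import sqrt
from statistics import mean, stdev

def ci95(sample):
    """Approximate large-sample 95% confidence interval for the mean."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se

lower, upper = ci95(list(range(1, 101)))  # interval centred on the mean, 50.5
```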

Contingency table Simply a table listing and summarising the data from 2 or more frequency distributions.

Continuous data / variable Data on the Interval or Ratio scale. The data can be measured on a scale with an infinite number of points and is capable of infinite subdivision e.g. temperature, height, weight.

Control An integral part of any experiment whereby a nil treatment is included (as a measured variable) that will reflect all other possible influences upon the outcome other than the specific influence that the experiment is trying to investigate. In order that the experimental result only reflects the behaviour of the measured variable it is necessary to negate / eliminate all other possible influences.

Coordinates x and y values for the point on a graph where the two sets of values intersect.

Correlation The strength of the association or relationship between the set of values (Ordinal or Ratio) on two or more variables. If changes in one variable cause a similar change (i.e. in the same direction) in the other variable it is called a positive correlation. A perfect positive correlation has a correlation coefficient of +1. Conversely, if one gets larger as the other gets smaller, it is called negative correlation. A perfect negative correlation has a correlation coefficient of -1. If there is no clear relationship, it is called zero correlation. Such patterns of behaviour are best illustrated using scattergraphs. The actual strength of the association is given by the correlation coefficient (r), which can only range from -1 through zero to +1.

Critical value In a statistical test, this is the value that marks the boundary between the acceptance or non-acceptance of a Null hypothesis. The % probability level is set before any analysis begins and by convention the first level to be considered is usually the 95% level. For a 2-tailed test (Normal Distribution) at the 5% significance level, the tails of 2.5% each represent values more extreme than ±1.96 s.d. from the mean (z < -1.96 or z > +1.96). In many tests, the test statistic is compared with the given (tabular) critical value, and whether the test statistic is less than or exceeds the critical value determines whether or not the Null hypothesis can be accepted. (See also: Probability)
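The decision rule for the two-tailed z case above can be written in a couple of lines (a sketch; 1.96 is the 5% two-tailed critical value for the normal distribution):

```python
def two_tailed_decision(z, critical=1.96):
    """Compare a z test statistic with a two-tailed critical value."""
    return "reject H0" if abs(z) > critical else "cannot reject H0"

two_tailed_decision(2.30)   # 'reject H0'
two_tailed_decision(-1.20)  # 'cannot reject H0'
```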

Cumulative frequency When working with 'frequency of occurrence' data, we invariably have to have a cumulative frequency column, which is achieved by adding up each % occurrence sequentially across categories or class intervals until 100% is reached.
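The running total is a one-liner in Python (the percentages below are illustrative):

```python
from itertools import accumulate

freq_pct = [5, 15, 30, 25, 15, 10]      # % occurrence per class interval
cum_freq = list(accumulate(freq_pct))   # [5, 20, 50, 75, 90, 100]
```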

Cyclic Component A non-seasonal component in a time series.


-------------------------------------------------------------------------------------------------------------------------------------------------------

D


Data Information in the form of names, numerical scores, measurements or groups.

Deduction This is the logical conclusion that must follow from a set of premises without contradicting any of those premises.

Degrees of freedom The number of 'pieces of data' that are free to vary. The maximum number of observations or categories that can vary before the rest are determined. For example, if there are 60 observations and they have to be placed in one of three categories, once the first two have been fully assigned, there is no further choice open. The remaining observations have to go into that third category. Thus in this example there would be only 2 d.f.

Dendrogram A special form of 'tree' chart that indicates closeness of relationship between individuals or objects. The best example is the conventional 'family tree'. Used in Cluster analysis.

Denominator The divisor in a division calculation; the figure below the line.

Dependent variable A variable whose values are reliant upon the values of another variable i.e. the Independent variable. When plotted, it should be displayed on the Y axis.

Descriptive Statistics Independent statistics which describe data in quantitative terms, particularly with respect to their magnitude and range or spread. A necessary preliminary stage before inferential statistical procedures are used.

Direct correlation A positive correlation: when one variable increases / decreases, the other variable does the same.

Discount factor A method of determining the current 'time value' of money.

Discrete data / variable A variable whose values must be separated from each other by definite gaps because of the nature of the data e.g. number of people.

Discriminant analysis A data reduction technique that allows us to 'blend' many values from a selection of variables in such a way as to create a new variable; the discriminant factor. These new factors will allow us to predict which group an individual is most likely to belong to.

Dispersion The 'spread' of values in a set of data. If the data is Parametric, it is normal to express this as the Standard Deviation or as the variance. If the data is Nonparametric then the inter-quartile range is usually quoted.

Distribution The manner in which the values for a variable actually occur. May be Normal, Poisson, Binomial, Continuous or Empirical. There are also many sub-derivatives of these.


----------------------------------------------------------------------------------------------------------------------------------------------------

E


Eigen values (see Scree Plots)

Empirical Values derived from observation rather than from theory.

Error A specific term that is intended to convey the idea that perfect accuracy cannot be expected in statistical data. Error can arise through inaccurate measurements, miscalculations or errors in data collection (often called 'sampling error'). In hypothesis testing we find Type 1 errors, where a null hypothesis is not accepted when in fact it is true, and Type 2 errors, the reverse, i.e. we accept the null hypothesis when we should not have done so.

Exponential Smoothing A technique used in Time Series Analysis to forecast future values. The technique reduces irregularities in order to make long term trends easier to see.
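The simplest form weights each new observation by a smoothing constant alpha; a sketch (the alpha values are arbitrary choices):

```python
def exp_smooth(series, alpha=0.3):
    """Simple exponential smoothing: each smoothed value is
    alpha * latest observation + (1 - alpha) * previous smoothed value."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

exp_smooth([10, 20, 15, 25], alpha=0.5)  # [10, 15.0, 15.0, 20.0]
```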

Extrapolation An often misused term...It means a method of extending the known values of a variable beyond the actual limit of observation by extending the trend already indicated. Often used to predict and quantify future values such as population levels or sea level rise.


------------------------------------------------------------------------------------------------------------------------------------------------------

F


F-test See Analysis of Variance

Factor Analysis Multivariate Analysis methods used to 'reduce' or 'condense' the number of variables and to look for 'structure' between those variables.

Fixed cost A cost incurred whether or not any production takes place e.g. the rent on a factory building.

Fixed variable A variable that is set prior to an experiment or survey and is part of the planned design e.g. time intervals or incubation temperature etc.

Fractile diagrams Rarely used today. A paper method of plotting Cumulative %'s against observational values using specially drawn probability paper. A perfectly normal distribution will produce a perfectly straight line. Useful to compare a dataset with a standard normal curve to see how far they depart from perfect normality.

Frequency The number of times a value occurs in a dataset

Frequency distributions The number of times each and every value (or category) appears in a data set can be recorded and displayed in both graphic (bar charts and histograms mainly) and tabular formats. ***Extended definition>>>>>


------------------------------------------------------------------------------------------------------------------------------------------------------

G


G-test A slightly more powerful alternative to the traditional chi-square test.

Goodness of fit test A test to assess the difference between a measured set of frequencies and a theoretical frequency distribution.

Gradient (of the line) The slope of a line created when values for x and y are inserted

Graph theory A mathematical technique to quantify and illustrate the spatial arrangements between a set of intersecting and intervening points. Used in geographical / topographical studies. A simple column / row / total matrix can be constructed.

Grouped data Information that has been collected into groups or classes for ease of understanding, display or simplification.


------------------------------------------------------------------------------------------------------------------------------------------------------

H


H-test see Kruskal-Wallis H-test.

Histogram A very specific type of chart. A measured and continuous variable is plotted along the X axis and the frequency of occurrence is plotted on the Y. There must be no gaps between the bars on the X axis. The areas of the blocks should be proportionate to the frequencies. See also 'Class Interval'

Homogeneity Equality (usually between samples) and usually refers to variances.

Hypothesis A proposition (as yet unproven) which is tentatively accepted in order to test its accord with the known facts. The hypothesis may also specify the strength and / or direction of a relationship between two variables. It is an integral part of hypothesis testing that a Null hypothesis is constructed first. This proposes 'no association' or no significant difference, and the alternative hypothesis then proposes the alternative proposition. All inferential statistics tests set out to establish the validity of the null hypothesis (Ho).


--------------------------------------------------------------------------------------------------------------------------------------------------------

I


Independent variable A variable that can influence the value of another variable without being altered itself. For example wind speed may alter wave height but wave height is unlikely to alter wind speed. Sometimes referred to as 'error free' meaning that it is not possible for human intervention to alter the value as far as the experiment is concerned. When plotted, these variables should be displayed on the X axis.

Index Numbers A statistic giving the value of a quantity (e.g. share values) relative to a fixed level at a fixed point in time or place. Invariably, the fixed point is given the value of 100.

Induction A process of reasoning by which a general conclusion might be drawn from a set of premises drawn from experimental or experiential evidence.

Inference This is a process of reasoning that starts with an idea / premise and moves towards a conclusion.

Inferential Statistics That branch of statistics that uses observational data as a basis for further calculating estimates, exploring relationships between variables and for making predictions based on hypotheses.

Interaction A term used when using two-factor ANOVA's to indicate that the two factors interact in such a way that they influence the dependent variable according to the given levels of each factor. Interaction may be the main point of interest in a two-way ANOVA analysis. When the two factors are plotted separately on a profile chart (plotted against the dependent variable) the two lines will not be parallel (and may even cross over) if there is interaction present.

Intercept The exact point on a graph where an x value meets a y value. In regression analysis it is taken as the value of y at the point where the line cuts the y axis. ***Extended definition>>>>>

Interpolation The calculating of a quantity by using the adjacent values. That is the insertion of an estimated value between two known values. Not to be confused with extrapolation.

Inter-quartile range The mid portion of a distribution that covers the values from the 25%( lower) quartile through the 50% (median) quartile and on to the 75% (upper) quartile. Thus 50% of all values present will lie within this region.
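Python's standard library can produce the three quartiles directly (a sketch; note that different quartile conventions give slightly different values):

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]
q1, q2, q3 = quantiles(data, n=4)   # lower quartile, median, upper quartile
iqr = q3 - q1                       # spans the middle 50% of the values
```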

Interval data The third level of data. Such data has to have a precise numerical value along a continuous scale but there is no natural zero on that scale. The example often quoted is degrees Celsius because 0 degrees C is an arbitrary point and not a true zero. The scale is uniform (each degree is the same size) but, because there is no true zero, we cannot say that 40 degrees C is twice as hot as 20 degrees C.

Isometric line A line drawn on a graph that connects points of equal value.


-------------------------------------------------------------------------------------------------------------------------------------------------------

J

-------------------------------------------------------------------------------------------------------------------------------------------------------

K

Kendall's tau A rank correlation test (used on the Ordinal scale) that produces an index describing the direction and degree of association between the two ordinal variables. It is a rank correlation test that measures the disorder of the ranks of one variable when the other is placed in a natural (or numerical) sequence.

Kolmogorov-Smirnov test A Nonparametric test that can be deployed in different ways but generally classified as a 'goodness-of fit' test designed to establish whether data is consistent with a continuous distribution. The data must be ranked and is typically on the Ordinal scale.

Kruskal-Wallis H-test A Nonparametric test for determining whether or not there is a significant difference between 3 or more samples i.e. have the samples been taken from populations with identical distributions? Data must be on the Ordinal scale.

Kurtosis A descriptive term used to describe the 'peakedness' or flatness at the top of a frequency distribution curve.


-------------------------------------------------------------------------------------------------------------------------------------------------------

L


Laspeyre's Index A base-weighted series of index numbers
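A base-weighted index holds the base-year quantities fixed and revalues them at current prices; a minimal sketch with made-up figures for two goods:

```python
def laspeyres(prices_base, prices_now, quantities_base):
    """Base-weighted price index (base year = 100)."""
    cost_now = sum(p * q for p, q in zip(prices_now, quantities_base))
    cost_base = sum(p * q for p, q in zip(prices_base, quantities_base))
    return 100 * cost_now / cost_base

laspeyres([2.0, 5.0], [2.5, 6.0], [10, 4])  # = 122.5
```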

Latin square An experimental design layout. Such a design is stratified and random but each treatment will appear once in every row and once in every column. The minimum acceptable trial size would be 4 x 4 but 5 x 5 or 6 x 6 would be preferable.

Law of Large Numbers As the size of samples increases their means will tend towards the mean of the parent population.

LD50 In experimental test procedures (where treatments are imposed upon an organism), the LD50 is the median value for survival. Thus it is the treatment dosage that will destroy 50% of the organisms in a controlled experiment.

'Line of Best Fit' A mathematically calculated line on a regression plot which allows us to predict the value of the dependent variable (on the y axis) given a specific value of the independent value plotted on the X axis. ***Extended definition>>>>>

Linear correlation A relationship between two variables that is best expressed as a straight line.

Logistic curve A curve with 'time' on the X axis that shows slow initial growth, then a period of rapid increase, and finally a levelling off to a steady state. Such curves are often employed in population studies.

Log transformation On occasions a data set may not follow a normal distribution pattern or there may be some apparently atypical values present. However, by using the log of each individual value we may find that the new values do conform better and hence conventional parametric tests may then be used. Two other common transformations that can be carried out are square root transformations and angular (arcsine) transformations.


-------------------------------------------------------------------------------------------------------------------------------------------------------

M


Mann-Whitney U test A Nonparametric test that investigates differences in central tendency and requires the data to be on the Ordinal scale and to be in unmatched (independent) pairs. In essence, it compares the medians of two independent samples to see if the difference between them is large enough to reject Ho, i.e. the hypothesis that both come from the same population.

Matched data Data points that are linked as pairs of values. The data will have come from the same sampling unit and so each value is related to the other value within the pair.

Mean ("x bar") The arithmetic average: the sum of all the data values divided by the number of values being summed. The data must be on the Interval or Ratio scale.

Mean deviation A preliminary indication of dispersion: the average deviation of a set of values away from the mean. It is calculated by summing the absolute values of the individual deviations and dividing by the number of values. Not to be confused with 'standard deviation'.
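The calculation, using absolute deviations, in a few lines of Python:

```python
from statistics import mean

def mean_deviation(data):
    """Average of the absolute deviations from the mean."""
    m = mean(data)
    return mean(abs(x - m) for x in data)

mean_deviation([2, 4, 6, 8])  # deviations 3, 1, 1, 3 from the mean of 5 -> 2.0
```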

Mean square deviation (see Variance)

Median A measure of central tendency and simply being " the middle value" in a data set. Note however that the data must first be arranged in order of magnitude. Used primarily where ordinal data is involved. ***Extended definition>>>>>

Mode The most frequently occurring value in a data set. It is the simplest measure of central tendency and used primarily where nominal data is involved.***Extended definition>>>>>

Modal group The range of values that contain the Mode

Model A mathematical or logical representation of a relationship designed to help elucidate a theory or hypothesis.

Monte Carlo A method of estimating the probability distribution of the possible outcomes of a process by carrying out repeated simulations.
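A short simulation in this spirit (the two-dice example and trial count are arbitrary):

```python
import random

random.seed(1)
# Estimate P(sum of two dice equals 7); the exact value is 6/36, about 0.1667
trials = 100_000
hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
           for _ in range(trials))
estimate = hits / trials
```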

Monothetic The classification of objects based upon a single characteristic as opposed to 'Polythetic'....based upon a number of characteristics.

Moving Averages In Time Series work, a technique which allows a secondary graph to be drawn which takes the mean of a series of points but redefines that series as each new point is drawn in.
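A basic (unweighted) version in Python; the window of 3 is illustrative:

```python
def moving_average(series, window):
    """Mean of each successive run of `window` points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

moving_average([3, 5, 7, 9, 11], 3)  # [5.0, 7.0, 9.0]
```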

Midpoint The central point in a class interval.

Missing values A term used to indicate that there are gaps in the dataset. Depending upon the nature of the data, this fact may be ignored, or estimates may be inserted or calculated from the surrounding data (Interpolation). SPSS will not allow cells to be left empty and so either system-missing or user-missing values have to be incorporated.

Multiple regression When two or more x variables are being used to predict values for y

Multiplicative model A model that explains differences by multiplying the elements e.g. A = BxCxD

Multivariate analysis The analysis of the relationships between more than two variables or between case measurements involving more than two variables.


-------------------------------------------------------------------------------------------------------------------------------------------------------

N


Nearest Neighbour Analysis A spatial analysis test that compares the point pattern distribution with a theoretical distribution of points. The method involves measuring the straight line distance from each point to its nearest neighbour. The mean distance can then be calculated and compared with the mean distance of a completely random set of points. Thus an index that ranges from 0 (completely clustered) through 1 (completely random) to about 2.15 (uniform grid) can be produced.

Net present value (NPV) Method of assessing whether a financial investment (over the life of the investment) is worth doing or not. Competing projects can be assessed in this way also.
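Each cash flow is discounted back to the present and summed; a sketch with illustrative figures (year-0 outlay first, a 10% discount rate assumed):

```python
def npv(rate, cashflows):
    """Net present value: each cash flow discounted back to year 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

# Invest 1000 now for three annual returns of 400 at a 10% discount rate
npv(0.10, [-1000, 400, 400, 400])  # about -5.26, so marginally not worthwhile
```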

Nominal data Also known as categorical data. The lowest scale of data and exists only by name. Such data can be placed in categories (e.g. sex, hair colour, leaf colour etc.) but such data has no magnitude and no directional differences.

Non-parametric tests A suite of tests that do not require the data to be normally distributed. Such data must be at least on the ordinal scale (with the exception of tests of association or goodness of fit such as Chi square). They can be used on higher order data if there is some doubt over the normality of the data but these tests are inherently less powerful than their parametric cousins.

Non-sampling error Differences in results that are not explained by the sampling process.

Normal distribution A hypothetical frequency distribution which possesses perfect symmetry about the three measures of central tendency (mean, median and mode), all of which coincide. Often called the bell-shaped curve. Extreme values occur infrequently and the more frequently occurring values tend to surround the mean. The mathematical properties of this curve allow many predictions to be made e.g. 68.27% of all data points will lie within one standard deviation of the mean. ***Extended definition>>>>>
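The within-k-standard-deviations percentages quoted for the normal curve come from the error function; a one-line check in Python:

```python
from math import erf, sqrt

def within_sds(k):
    """Fraction of a normal distribution lying within k s.d. of the mean."""
    return erf(k / sqrt(2))

within_sds(1)  # about 0.6827, i.e. 68.27% within one standard deviation
```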

Normality An assumption that the data under test approximates to a normal distribution.

Null hypothesis The rationale used prior to any inferential test; it will always say that the data sets are not related or that the variables have no association. If the probability (P-) value calculated as part of the analysis is less than the chosen significance level (e.g. 0.05) then the null hypothesis has to be rejected in favour of the alternative hypothesis. Never use the phrase "accepting the Null Hypothesis" because you cannot "accept" something that is nothing!

Numerator The number to be divided in a fraction i.e. above the line.


-------------------------------------------------------------------------------------------------------------------------------------------------------

O


Observational data Data that has been collected from an experiment or survey and has not been manipulated in any way i.e. raw data. The collection and sorting of such data precedes any inferential work.

Odds The ratio of the probability of an event occurring to that of its not occurring.

Ogive A cumulative frequency distribution curve.

One-tailed test A test that only considers one tail or end of a probability distribution; thus only a directional hypothesis is being tested. In a two-tailed test, both ends are considered and the alternative hypothesis has to be worded accordingly. For a one-tailed test the alternative hypothesis might say "A is greater than B" whereas for a two-tailed test it would say "A is significantly different from B".

One-way Analysis of Variance A test to investigate the differences between two or more samples. The datasets have to be unmatched and parametric in nature.

Open questions The respondent (usually in a survey) has an opportunity to answer in any way that they wish as opposed to a closed question where the choices for an answer have been limited by the designer of the question.

Ordinal data One step up (in informational content) from nominal data. Such data can be ranked in some form of order hence they have direction but still no true mathematical magnitude e.g. cold, warm, hot.

Orthogonal In the context of Statistics; the term refers to the practice of drawing 3 lines at right angles to each other in order that a 3-dimensional graph can be displayed in a 2-dimensional plane.

Outlier A value that appears to be extreme in comparison with the remainder of the dataset. Outliers can heavily affect the results of analysis; the mean is particularly sensitive to outlier values. In some cases such values will be eliminated from the set because of the disproportionate effect they have on the outcome.


-------------------------------------------------------------------------------------------------------------------------------------------------------

P


Paasche's Index An index number series in which the weightings of the values are taken from the current year.

Paired Samples 2 samples in which the same attribute of each member of the sample is measured twice but under different circumstances; often that means over time.

Paired t-test (see t-test)

Parameter A general term for any summary measurement (such as the true mean or standard deviation) that characterises a given population.

Parametric tests A suite of tests that require the data to have all the attributes of a normal distribution. Such data must be at least on the Interval scale. Such tests are more powerful than their Nonparametric cousins. It is good practice to test a dataset for Normality before using any parametric test.

Payback time The time taken to recoup the original investment

Pearson's Product Moment Correlation Coefficient (PPMCC) A parametric test of correlation to investigate the relationship between two normally distributed variables. The data must be at least on the Interval scale. It measures to what degree, changes in magnitude / direction in one variable relates to changes in the other.
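The coefficient is the covariance of the two variables scaled by their standard deviations; a minimal Python sketch:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # = 1.0 (a perfect positive correlation)
```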

Percentage change The comparison of one value with another when the first is set at 100.

Percentile The value arrived at on a frequency distribution chart that represents the stated percentage of all values recorded between 0% and 100%. It is common practice to use the 25th, 50th (median value) and the 75th percentiles for most situations. ***Extended definition>>>>>

Pictograms A chart that uses pictures (often drawn to scale) instead of bars or lines to illustrate data. The pictures might be barrels, people, aircraft etc. or any other suitable image.

Pie chart A useful pictorial display for nominal data. Each category becomes an individual 'slice of the pie'.

Point Pattern analysis A data reduction technique that applies the Chi-square test to a 2-dimensional distribution of objects with a view to testing a constructed null hypothesis that the distribution is random.

Polythetic A classification of objects based upon a number of different characteristics.

Population In statistics this word has a specific meaning: every possible member of a group possessing the same defined characteristics and from which samples (with those same defined characteristics) are to be taken.

Power A specific term used to indicate the ability of a given test to find a difference or an association between variables.

Precision (as compared with 'Accuracy') Often interpreted as the number of decimal places to which a value is taken, but in statistics the term should more accurately mean the closeness of the results when a variable (in an experiment) is measured repeatedly.

Primary data Data collected first-hand at source and without any subsequent modification.

Principal Components Analysis A statistical procedure for analysing a matrix of correlation coefficients involving three or more variables. The resulting components are correlated with the original variables but uncorrelated with each other. PCA is frequently used as a dimension reduction technique.

Probability The likelihood of an event occurring; a foundation of most statistical analysis. The theory explores the likelihood of an event taking place under specific conditions. Given a large number of repeated observations, the relative frequency of a particular random outcome will stabilise to a final value. Probabilities are usually expressed as a % (or as fractions of 1), where 0% (or 0) means that the outcome is an absolute impossibility and 100% (or 1) indicates an absolute certainty. In tests, the test statistic is compared with a critical value, and the probability (P-value) quantifies how likely a result at least as extreme as the one observed would be if the Null hypothesis were true.
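The 'stabilising frequency' idea can be demonstrated with a simulated fair coin (Python; the seed value is arbitrary and only fixes the simulation so it is repeatable):

```python
import random

random.seed(42)  # fixed seed for a repeatable simulation

# Simulate 10,000 tosses of a fair coin (probability of heads = 0.5)
flips = [random.random() < 0.5 for _ in range(10_000)]
freq = sum(flips) / len(flips)
print(freq)  # the relative frequency settles close to 0.5
```

With only a handful of tosses the observed frequency can wander far from 0.5; over thousands of tosses it converges towards the theoretical probability.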


-------------------------------------------------------------------------------------------------------------------------------------------------------

Q


Q-Q Plot A plot of Quantiles of the empirical distribution of a data set plotted against the theoretical distribution being proposed as the model.

Quadrat The basic cell or unit used in field trials and delineated from all other quadrats within the trial area.

Qualitative data Data that can only be described in terms of its non-numerical characteristics e.g. colour, shape, name, emotions etc.

Quartiles The position within an ordinal data set that represents 25% of the samples (lower quartile: Q1) or 50% of the sample (median: Q2) or 75% of the sample (the upper quartile: Q3). Such measures are best represented on a frequency distribution graph or Ogive. ***Extended definition>>>>>

Quartile Deviation Half the inter-quartile range, i.e. (Q3 - Q1) ÷ 2; also known as the semi-interquartile range.
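A sketch of both measures using the standard library (the dataset is invented; note that different interpolation methods give slightly different quartile values):

```python
from statistics import median, quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]

# Cut the distribution into 4 equal parts; 'inclusive' interpolates
# between data points in the same way as many textbooks and spreadsheets.
q1, q2, q3 = quantiles(data, n=4, method='inclusive')

print(q1, q2, q3)          # 4.25  6.5  8.75
print(q2 == median(data))  # True: the second quartile is the median
print((q3 - q1) / 2)       # 2.25: the quartile deviation
```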

Quantitative data Data that can be measured by magnitude and thus has a mathematical component added to the description.

Quota sampling A sampling method in which, once a predetermined number of examples has been reached, no more individuals are selected for that category.


-------------------------------------------------------------------------------------------------------------------------------------------------------

R


Random Numbers Numbers generated by a random process in which each and every number is independent of every other.

Random sampling A sampling method where every member of the population under study has an equal chance of being selected.

Randomness A hypothetical condition in which there is an equal chance of any one of a number of outcomes occurring within the realistic confines of the experiment / survey being undertaken.

Randomised Block Design A field trial design where each row contains one example of each treatment and the positions within the trial block are randomly allocated.

Range The lowest to the highest values in a dataset. In terms of a frequency distribution it is the simplest measure of dispersion but is highly sensitive to extreme values.

Ranks The data in question must be at least on the Ordinal scale. Each value in the original dataset is replaced with a ranked score indicating its position overall. The set is always ranked from lowest values to the highest. Ties must be accounted for. Ranking is a common requirement in Nonparametric testing.***Extended definition>>>>>
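A minimal illustration of the usual 'average rank' treatment of ties (Python sketch; the quadratic scan is fine for small datasets):

```python
def rank_with_ties(values):
    # Each value receives the average of the rank positions it would occupy,
    # so tied values share the same (possibly fractional) rank.
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first rank position for this value
        count = ordered.count(v)       # number of tied copies
        ranks.append(first + (count - 1) / 2)
    return ranks

print(rank_with_ties([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

The two tied values of 20 would occupy ranks 2 and 3, so each is assigned the average, 2.5.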

Ratio scale This is the highest level on the scales of measurement and is characterised by the possession of an absolute and non-arbitrary zero point. A person's age would be on the Ratio scale because there is a true zero point (the date of birth).

Regression A general parametric technique that determines a precise mathematical function for the relationship between two variables. An assumption of linearity has to be invoked first of all. Then the relationship can be based on the formula for a straight line: y = a + bx (where a is the intercept and b the slope). The regression equation is solved using the 'least-squares' method. Both variables may be independent of each other, or one may be a dependent variable (always placed on the Y axis). The regression formula yields the "regression of 'y' on 'x'". The data have to be at least on the Interval scale. So regression determines the mathematical nature of the relationship whilst Correlation determines the strength and direction of any given relationship. Multiple regression examines the relationship between several independent variables and a single dependent variable. The formula simply expands to: y = a + b1x1 + b2x2.....
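The least-squares line can be found with two short formulas (Python sketch; the data points are invented and chosen to lie exactly on a line):

```python
def least_squares(x, y):
    # Slope:     b = sum((x - mx)(y - my)) / sum((x - mx)^2)
    # Intercept: a = my - b * mx
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0  ->  y = 1 + 2x, a perfect fit for these points
```

With real data the points will scatter about the fitted line, and the vertical distances from the points to the line are the residuals (see the entry below).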

Rebasing Moving the base year in an Index series to align with a second series.

Regression plot A scattergraph with the regression line superimposed upon it.

Retail Margin The percentage profit that a retailer gains over the costs of providing those goods or services.

Retail Price Index (RPI) The changes in the 'basket of prices' of the average shopper; used to make monthly comparisons and to monitor changes in price inflation / deflation.

Return on Capital (ROC) This is the percentage return that a firm is able to generate on the capital employed by the firm. This is sometimes referred to as Return on Investment (ROI).

Relative Frequency The ratio of the frequency of an outcome to the total number of times the routine was performed.

Replication The process whereby an investigative routine is repeated with no changes to the methodology being permitted (as far as is practicable).

Residuals The difference between the observed value and the value predicted by the regression (line of best fit) line (see above). A residual may be positive (the point lies above the line) or negative (below the line).

'Robust' An adjective used to describe a statistical test that remains reliable despite failings such as bias, outliers, poor quality data or inaccurate measurements.

Running Medians One of the many techniques used for 'smoothing' a Time Series.


-------------------------------------------------------------------------------------------------------------------------------------------------------

S


Sample A set of data (assumed to be random) that is taken from a population and then used to estimate the true parameters of that larger population. A sample is thus a subset of the larger population. It is important to realise that a sample cannot be expected to have exactly the same characteristics as the parent population, but as the sample size is increased, the descriptive statistics will yield a closer and closer approximation to the population's true parameters.

Sampling error The error attributable (to a statistic) to the fact that a sample has been used for the measurement rather than the population as a whole. As above, as the sample size increases the sampling error should reduce. The term is often confused with sampling bias, which refers to inadequacies or weaknesses in the chosen sampling method leading to inaccurate predictions.

Sampling methods A crucial first stage in any statistical analysis is to make sure that the data to be collected will accurately reflect the aims and objectives of the project. Sampling methods are concerned with ensuring that the samples taken during a trial, survey or experiment yield a truly representative picture of what is happening in the parent population. There are well laid down rules that govern how samples should be selected; the more common methods include 'random', 'systematic' and 'stratified random'.

Scatterplot (or scattergram) An invaluable first step in probing the relationship between two variables. It is a straightforward xy graph. The spatial arrangement of the plotted points will immediately suggest whether there is a relationship between the variables or not and (with further analysis) how strong that relationship is.

Scree Plot (and eigenvalues) In Principal Components Analysis, an eigenvalue is a unit-free measure or index of the total variance that is accounted for by a given component. The sum of all the eigenvalues will equal the number of components present. Eigenvalues may also be displayed graphically using the Scree plot, so called because of the characteristic shape usually displayed. The point at which the slope changes dramatically (often called the 'knee') will indicate the 'cut off' point between components that will contribute to the solution and those that will not.

Seasonal Component / Effect One of the components of variation in a Time Series that is dependent upon the time of year

Secondary data Data collected by others and therefore with no sure knowledge of its accuracy.

(The) Scientific method The classical steps by which scientific knowledge is acquired. First, the objective is identified; second, an hypothesis is formulated; then data are collected and rationalised prior to analysis. The results are tested against the original hypothesis, interpreted cautiously, and a conclusion is formulated that refers back to the original hypothesis. Where sufficient repeated work disagrees with current hypotheses, there may be grounds to formulate new natural laws.

Set A collection of items that have at least one characteristic in common. (See also: Venn diagrams)

Significance level The level (usually expressed as a %) at which we switch from accepting the Null hypothesis to accepting the Alternative hypothesis.

Simple random sampling A sampling method where all members / items within a population have an equal chance of selection.

Skewness A measure of the degree to which a distribution deviates from a standard normal distribution. In a standard normal distribution, the mean, mode and median would all be at the same point. With a positive skew, the right-hand tail is elongated and the mode and median will move to the left of the mean. With a negative skew, the left-hand tail is elongated etc.***Extended definition>>>>>

Slope The term applied to the inclination of the 'line of best fit' in Regression analysis. ***Extended definition>>>>>

Spatial Statistics / Analysis A large area of statistical testing that investigates the spatial patterns and / or 2D or 3D locations of places or objects. It is an interdisciplinary mix of mathematics, geometry and geography. A simple example is 'nearest neighbour analysis'.

Spearman's Rank Correlation Coefficient A non-parametric test where the data is either non-normal or both variables are on the Ordinal (ranked) scale. The test looks for a relationship (correlation) between the two variables.
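For tie-free data the coefficient has a familiar shortcut formula, rs = 1 - 6Σd²/(n(n²-1)), where d is the difference between paired ranks; a Python sketch with invented data:

```python
def spearman_rho(x, y):
    # Replace each value with its rank, then apply 1 - 6*sum(d^2)/(n(n^2-1)).
    # This shortcut assumes no tied values in either variable.
    rx = [sorted(x).index(v) + 1 for v in x]
    ry = [sorted(y).index(v) + 1 for v in y]
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(round(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 3))  # 0.8
```

When ties are present, the usual approach is to assign average ranks and compute Pearson's r on the ranks instead.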

Standard deviation An absolute measure of the spread of values (about the mean) in a frequency distribution. The data must be on the Interval or Ratio scale. It is the square root of the variance. It is the most commonly used measure of dispersion in which all values are taken into account. A low value suggests grouping whilst a high value indicates a wider spread within the data. The units are the same as those used for the measured variable. This is not the case with the Variance, which is expressed in squared units. ***Extended definition>>>>>
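A sketch with the standard library, using a dataset chosen so the arithmetic works out cleanly (the values are invented):

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5
# Squared deviations: 9+1+1+1+0+0+4+16 = 32, so the population
# variance is 32/8 = 4 and the standard deviation is its square root.
print(pvariance(data))  # 4
print(pstdev(data))     # 2.0
```

`pstdev`/`pvariance` divide by n (population); `stdev`/`variance` divide by n - 1 (sample) — see the Variance entry.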

Standard error (of the mean) This is a special form of Standard Deviation. It is used to gauge the reliability of an estimate based on a sample result. It is a measure of the average dispersion of all the sample means about the population mean. If there are enough of them, the pattern of all these sample means will form a Normal distribution of its own which, in turn, can have its standard deviation calculated; it is this specific S.D. that is known as the standard error.
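In practice the standard error of the mean is estimated from a single sample as s/√n (sample standard deviation over the square root of the sample size); a minimal sketch:

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    # Estimated standard error of the mean: s / sqrt(n)
    return stdev(sample) / sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(round(standard_error(sample), 3))  # 0.756
```

Note how the standard error shrinks as n grows, reflecting the increasing reliability of the sample mean.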

Standard score (see z-scores)

Standard normal distribution A normal distribution where the mean is zero and the s.d. is 1. Forms the basis for all normal distribution tables of areas under the graph.

Standardised distribution A derived distribution when all the original values have been converted into z-scores.

Statistic A measurement derived from a sample which estimates a population parameter.

Statistics (Statistical Analysis) The branch of mathematics that deals with the issue of drawing conclusions from numerical information. There are strict rules concerning the methods employed to collect, analyse and interpret this information. These methods also rely heavily upon probability theory.

Stem and Leaf Plots A chart used to illustrate data. Tens are used as the 'stems' and units are accumulated behind their respective tens as 'leaves' emanating from the ten. Best seen as an illustration!

Stratified sampling A sampling method that pre-selects a part of the population based on prior information.

Student's t-test (see 't'-tests)

Stochastic Process A random process, usually a variable measured at a set of intervals over time or space, which can exist in one of a number of states. The length of the queue at a bus stop would exemplify a stochastic process.

Sum of squares One of the preliminary indicators of variation. Each point deviation from the mean is recorded, squared and then summed.

Systematic sampling A sampling method where a level of regularity is superimposed on the data collecting process, e.g. every 10th person or every 5th tree. Within the test group, the choice of samples will be evenly spread across the whole group.


-------------------------------------------------------------------------------------------------------------------------------------------------------

T


t-tests A parametric test (applied in different ways) that measures the significance of the difference between two samples. The tests fit into one of three forms. One can accommodate unmatched (independent) pairs of data (the Student's two-sample t-test) whilst the second copes with matched pairs of data. Finally, the one-sample test compares the sample mean with the hypothesised true mean. These tests are particularly useful because they remain accurate even with relatively small sample sizes (say under 30). ***Extended definition>>>>>
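The two-sample (pooled-variance) statistic can be sketched in a few lines; in practice the result would then be compared with a t table at n1 + n2 - 2 degrees of freedom (the data here are invented):

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(a, b):
    # Pooled variance: weighted average of the two sample variances
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    # t = difference between the means / standard error of that difference
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

print(round(two_sample_t([5, 6, 7], [1, 2, 3]), 3))  # 4.899
```

This form assumes the two samples have roughly equal variances; Welch's variant relaxes that assumption.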

Tabulation Values Values set out in statistics tables that list critical values for comparison with calculated values, correlation coefficients, % points on a distribution, Random numbers etc.

Tail The tapering end of a distribution curve.

Test statistic The value calculated (obtained) at the end of a test procedure; it is this figure that must be compared with the critical value (or its probability examined) in order to establish the statistical significance (or not) of the result. It is the comparison of these two figures that allows us to accept either the Null hypothesis or the Alternative hypothesis.

Time-series A data collection regime in which observations (on one variable) are made regularly over a period of time. This is an area increasingly used in higher analysis, particularly when looking for trends. Within certain limits of acceptability, it is possible to extrapolate from a trend line to predict the values of a variable in the future.

Total cost All costs (fixed and variable) incurred in a given process.

Total revenue Unit price multiplied by the number of items

Transformation A mathematical device for converting data into a form that is more readily testable, e.g. to conform to the requirements of a normal distribution and so open up the range of tests that this permits. In the environmental sciences the most popular tests used are two-way and three-way ANOVA and regression. There are no straightforward non-parametric equivalents, and so transforming the data may be the best route to make the data usable for such analysis.

Treatment In experimentation it is often necessary to measure the effects that differing inputs (e.g. dose rates) might have on the output. A treatment is thus the deliberate manipulation of the input variable.

Trend component A long term movement in a time series.

Time series components A more specific term that is subdivided into four types: trend (see entry); cyclical (e.g. monthly, where fluctuations in the general trend recur at regular intervals); seasonal, where the fluctuations depend upon the time of year (e.g. the sale of Easter eggs or fireworks); and finally irregular, which is the component left over when all other components have been accounted for.

Trend surface analysis A special type of multiple regression analysis that allows a 3D surface to be 'condensed' into a two-dimensional graph.

Two-tailed test A test where the Alternative hypothesis does not indicate which direction any measured differences take.

Type 1 and 2 errors (see error)


-------------------------------------------------------------------------------------------------------------------------------------------------------

U


Unmatched data A situation where two data sets are to be compared but are independent of each other because they have been sampled from different individuals, groups or items. This means that the values derived from one sample have no effect on the values of the other, e.g. collecting data on levels of organic matter in two unconnected rivers.

Unit value A specific value given to a variable and fixed for the duration of the test e.g. if we wanted to measure the calorific value of a new range of animal feeds, we could give all the feeds a unit value of say 1kg so that any comparisons were then on an even basis.


-------------------------------------------------------------------------------------------------------------------------------------------------------

V


Variable A measurable property, a given characteristic that enables one member of a statistical population to be distinguished from another. A variable is then categorised as either Qualitative (then Nominal or Ordinal) or Quantitative (and then Interval or Ratio). Quantitative variables can be further subdivided into discrete or continuous variables.

Variable cost A cost that changes with the level of production, e.g. the cost of raw materials.

Variability The amount of 'spread' or dispersion in a dataset.

Variance A measure of variation within a sample. It is the sum of the squares of all the value deviations from the mean, divided either by n for the population variance or, more commonly, by n - 1 for the sample variance. The square root of the variance is the standard deviation. Variance values are expressed in squared units rather than the units of the original variable.
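The n versus n - 1 divisor is the only difference between the two standard library functions (the dataset is invented so the sums are simple):

```python
from statistics import pvariance, variance

data = [2, 4, 6, 8]   # mean = 5; squared deviations sum to 9+1+1+9 = 20
print(pvariance(data))           # 20 / 4 = 5      (population: divide by n)
print(round(variance(data), 3))  # 20 / 3 = 6.667  (sample: divide by n - 1)
```

Dividing by n - 1 corrects the tendency of a sample to underestimate the spread of its parent population.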

Variance-ratio tests Another name for analysis of variance (ANOVA)

Venn Diagrams A diagram that illustrates the relationship between sets. They are sometimes referred to as Euler diagrams. Two sets that share members are drawn as two overlapping circles; the overlapping region is called the 'intersection' of the two sets and contains the members common to both, while the 'union' (denoted by the 'u' symbol) contains every member of either set.


-------------------------------------------------------------------------------------------------------------------------------------------------------

W


Weighted Mean A mean that takes into account the relative importance of the constituent elements. Useful in customer survey work.
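A minimal sketch (Python; the survey figures are invented for illustration):

```python
def weighted_mean(values, weights):
    # Each value contributes in proportion to its weight.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical survey: scores weighted by how many customers gave each score
print(weighted_mean([5, 4, 2], [10, 6, 4]))  # (50 + 24 + 8) / 20 = 4.1
```

An ordinary (unweighted) mean of the three scores would be 3.67; weighting by respondent counts pulls the result towards the most common score.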

Wilcoxon matched-pairs test A non-parametric test for examining paired samples where the data is on the Ordinal scale. The data must be in matched (i.e. related) pairs. The difference within each pair is ranked. The rationale is similar to that used for the Mann-Whitney test where the datasets are unmatched.


-------------------------------------------------------------------------------------------------------------------------------------------------------

X

-------------------------------------------------------------------------------------------------------------------------------------------------------

Y


Yates's Correction A special modification to the Chi-square test when a 2 x 2 contingency table is being used. There is less risk of making a Type 1 error when this correction is applied.

Y prime A predicted value on the Y axis given a value on the X axis.


-------------------------------------------------------------------------------------------------------------------------------------------------------

Z


Z-score A useful method of standardising values in a dataset provided they are on the Interval or Ratio scale. The z-score is an individual value's deviation from the mean divided by the standard deviation. The advantage is that values from different data sets may then be directly compared with each other by way of their individual z-scores.
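The calculation is a one-liner; the exam marks below are invented to show how scores from two differently scaled papers become comparable:

```python
def z_score(x, mean, sd):
    # How many standard deviations x lies above (+) or below (-) the mean
    return (x - mean) / sd

print(z_score(70, 60, 5))     # 2.0: two s.d. above that paper's mean
print(z_score(140, 120, 20))  # 1.0: a less exceptional result, despite the bigger mark
```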

z-test A test to examine the differences between the means of two normally distributed samples where the data is unmatched and the sample size is large. ***Extended definition>>>>>


 
