Statistics at Square One





















Some reliable statistical analysis resources are now available free of charge on the web, but the process of organising such resources is still at an early stage. My only criticism relates to a table indicating which statistical test should be used with particular kinds of data, which confusingly fails to distinguish binary variables from nominal ones. Having said that, readers will find a great deal of valuable material here. The reviewers have been asked to rate these books on four items: readability, how up to date they are, accuracy and reliability, and value for money, using simple four-point scales.


We then order the leaves, as in Figure 1. The advantage of first setting the figures out in order of size, and not simply feeding them straight from notes into a calculator (for example, to find their mean), is that the relation of each figure to the next can be seen at a glance.

Is there a steady progression, a noteworthy hump, a considerable gap? Simple inspection can disclose irregularities. Furthermore, a glance at the figures gives information on their range.

The smallest value is 0. Note that the range can mean two numbers (smallest, largest) or a single number (largest minus smallest). We will usually use the former when displaying data, but when talking about summary measures (see Chapter 2) we will think of the range as a single number.

Median
To find the median, or midpoint, we need to identify the point which has the property that half the data are greater than it and half the data are less than it. For 15 points, the midpoint is clearly the eighth largest, so that seven points are less than the median and seven points are greater than it. This is easily obtained from Figure 1.

To find the median for an even number of points, the procedure is illustrated by an example. Suppose the pediatric registrar obtained a further set of 16 urinary lead concentrations from children living in the countryside in the same county as the hospital Table 1.

To obtain the median we average the eighth and ninth points. For example, accidentally writing 34 in place of the largest value would leave the median unaffected. An interesting property of the median is shown by first subtracting the median from each observation, and changing the negative signs to positive ones (taking the absolute difference).
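To make the rule concrete, here is a minimal Python sketch of the procedure just described: for an odd number of observations take the middle value, and for an even number average the two middle values. The example values are illustrative only and are not the urinary lead data from the tables.

```python
# Median rule: middle value for odd n, average of the two middle values for even n.
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # odd: single middle point
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # even: average the two middle points

print(median([0.6, 2.9, 1.8, 1.9, 0.4]))        # odd n  -> 1.8
print(median([0.6, 2.9, 1.8, 1.9, 0.4, 2.0]))   # even n -> 1.85
```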

For the data in Table 1. the sum of these can be calculated, and it can be shown that no other data point will give a smaller sum.

Measures of variation
It is informative to have some measure of the variation of observations about the median. A simple measure is the range, which is the difference between the maximum and minimum values, although in statistics it is usually given as two numbers: the minimum and the maximum.

The range is very susceptible to what are known as outliers, points well outside the main body of the data. For example, if we had made the mistake of writing 32 in place of the largest value, the range would be greatly inflated. A more robust approach is to use the three points that divide the ordered data into four equally sized groups. These are known as quartiles, and the median is the second quartile. The variation of the data can be summarized by the interquartile range, the distance between the first and third quartiles, often abbreviated to IQR.

With small data sets it may not be possible to divide the data set into exact quarters, and there are a variety of proposed methods to estimate the quartiles. For 15 observations, the quartiles are the 4th, 8th, and 12th points, which can be read from Figure 1. For 16 points, the quartiles correspond to fractional positions between observations. To estimate, say, the lower quartile, we find the 4th and 5th points, and then take a value which is one quarter of the distance from the 4th to the 5th.

Thus the 4th and 5th points are 0. For the upper quartile we want a point which is three quarters of the distance from the 12th to the 13th points, 2. The median is the second quartile and is calculated as before; in this way the three quartiles are obtained. An alternative method is to take the first quartile as the midpoint of the lower half of the data and the third quartile as the midpoint of the upper half, as can be seen from Figure 1.

Thus the midpoint lies between 0. This is the first quartile. Similarly the third quartile is midway between 1. Thus, by this method, the IQR is 0.
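The interpolation rule described above can be written directly in code. The sketch below is illustrative only: it uses the (n + 1)-based quartile positions mentioned in the text, and the sixteen sample values are made up rather than taken from the tables.

```python
# Quartile by linear interpolation at position p = q * (n + 1).
# For 16 points the lower-quartile position falls between the 4th and 5th observations,
# and the estimate is taken a quarter of the way from the 4th towards the 5th.
def quartile(values, q):
    ordered = sorted(values)
    n = len(ordered)
    pos = q * (n + 1)              # 1-based position of the quartile
    lower = int(pos)               # observation just below that position
    frac = pos - lower             # how far to move towards the next observation
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = [0.2, 0.4, 0.6, 0.8, 1.1, 1.3, 1.4, 1.6,
        1.7, 1.9, 2.0, 2.2, 2.4, 2.6, 2.9, 3.1]   # 16 illustrative values
print(quartile(data, 0.25), quartile(data, 0.50), quartile(data, 0.75))
```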

These values are given by OpenStat. For large data sets the two methods will agree but, as one can see, for small data sets they may differ.

Data display
The simplest way to show data is a dot plot, as in Figure 1. Take care if you use a scatterplot option in a computer program to plot these data: you may find that points with the same value are plotted on top of each other.

Sometimes the points in separate plots may be linked in some way; for example, the data in Tables 1. If possible the links should be maintained in the display, for example by joining matching individuals in Figure 1. This can lead to a more sensitive way of examining the data. When the data sets are large, plotting individual points can be cumbersome.

An alternative is a box-whisker plot. The box is marked by the first and third quartiles, and the whiskers extend to the range. The median is also marked in the box, as shown in Figure 1. A common modification is to let the whiskers extend only to the most extreme observations that lie within 1.5 times the interquartile range of the quartiles, and to plot any points beyond that individually; this way, outlying points are shown separately.

Histograms
Suppose the pediatric registrar referred to earlier extends the urban study to the entire estate in which the children live.

He obtains figures for the urinary lead concentration in children aged over 1 year and under a given upper age. We can display these data as a grouped frequency table, Table 1. These can also be displayed as a histogram, as in Figure 1. Note that one should always give the sample size on the histogram.

Bar charts
Suppose, of the children, 20 lived in owner-occupied houses, 70 lived in council houses, and 50 lived in private rented accommodation. Type of accommodation is a categorical variable, which can be displayed in a bar chart.

We then display the data as a bar chart. The sample size should always be given, as in Figure 1.

Common questions
What is the distinction between a histogram and a bar chart? Alas, with modern graphics programs, the distinction is often lost. A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars.

A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars. How many groups should I have for a histogram? In general one should choose enough groups to show the shape of a distribution, but not so many that the shape is lost in the noise. It is partly a matter of aesthetic judgement but, in general, between 5 and 15 groups, depending on the sample size, gives a reasonable picture. With equal intervals, the height of the bars and the area of the bars are both proportional to the number of subjects in the group.

With unequal intervals, this link is lost, and interpretation of the figure can be difficult. Within the constraints of legibility, show as much information as possible. Remember that there are only three quartiles, which divide the data into four quarters.
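The histogram/bar chart distinction described above can be seen directly in most plotting libraries. Below is a small, illustrative matplotlib sketch; the concentration values are invented, not those of the text. The histogram bars touch because the variable is continuous, while the bar chart leaves gaps between categories.

```python
import matplotlib.pyplot as plt
import random

# Histogram: a continuous variable, so adjacent bars touch.
lead = [random.uniform(0.1, 3.5) for _ in range(140)]   # invented concentrations
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(lead, bins=8, edgecolor="black")
ax1.set_title("Histogram (n = 140)")
ax1.set_xlabel("Urinary lead concentration")

# Bar chart: a categorical variable, so the bars are separated.
accommodation = ["Owner occupied", "Council", "Private rented"]
counts = [20, 70, 50]
ax2.bar(accommodation, counts)
ax2.set_title("Bar chart (n = 140)")
ax2.set_ylabel("Number of children")

plt.tight_layout()
plt.show()
```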

Another common technique is to treat the histograms as if they were bar charts, and plot the bars for each group adjacent to each other, as in Table 1.

Exercises
1. See how the statistics found in (a) are affected.

Change 0. What sort of data are the following: time in minutes waiting in the ER, triage outcome (no injury, minor injury, major injury), number of cases of road accident victims in the ER, type of accident in the ER (fall, road accident, assault)?

Reference
1. How to Display Data. Oxford: Wiley-Blackwell.

CHAPTER 2
Summary statistics for quantitative data
Summary statistics summarize the essential information in a data set into a few numbers, which, for example, can be communicated verbally.

The median and the interquartile range discussed in Chapter 1 are examples of summary statistics. Here we discuss summary statistics for quantitative data.

Mean and standard deviation
The median is known as a measure of location; that is, it tells us where the data are.

As stated in Chapter 1, we do not need to know all the data values exactly to calculate the median; if we made the smallest value even smaller or the largest value even larger, it would not change the value of the median.

Thus the median does not use all the information in the data, and so it can be shown to be less efficient than the mean, or average, which does use all the values of the data. To calculate the mean, we add up the observed values and divide by their number. The total of the values obtained in Table 1. , divided by their number, gives the mean. Unlike the median, the mean is affected by every observation: for example, replacing one value by a much larger one would change the mean markedly, whereas the median would be unchanged. A feature of the mean is that it is the value that minimizes the sum of the squares of the differences of the observations from a point, in contrast to the median, which minimizes the sum of the absolute differences from a point (Chapter 1).
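This property is easy to check numerically. Below is a minimal sketch, using illustrative values rather than the data from the tables, comparing the sum of squared differences and the sum of absolute differences about the mean, the median, and an arbitrary third point.

```python
from statistics import mean, median

x = [0.4, 0.6, 1.2, 1.8, 1.9, 2.0, 2.9]   # illustrative data only

def sum_sq(values, c):
    """Sum of squared differences about the point c."""
    return sum((v - c) ** 2 for v in values)

def sum_abs(values, c):
    """Sum of absolute differences about the point c."""
    return sum(abs(v - c) for v in values)

m, md = mean(x), median(x)
for point in (m, md, 1.0):
    print(f"about {point:.2f}: sum of squares = {sum_sq(x, point):.3f}, "
          f"sum of absolute differences = {sum_abs(x, point):.3f}")
# The sum of squares is smallest about the mean; the sum of absolute
# differences is smallest about the median.
```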

For these observations, no value other than the mean gives a smaller sum of squared differences. It is also true that the sum of the differences (each observation minus the mean) is zero. As well as measures of location, we need measures of how variable the data are. We met two of these measures, the range and the interquartile range, in Chapter 1. The range is an important measurement, for figures at the top and bottom of it denote the findings furthest removed from the generality.

However, they do not give much indication of the average spread of observations about the mean. This is where the standard deviation (SD) comes in. The theoretical basis of the standard deviation is complex and need not trouble the user.

We will discuss sampling and populations in Chapter 4. The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.

The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape. Many biological characteristics conform to a Normal distribution closely enough for it to be commonly used—for example, heights of adult men and women, blood pressures in a healthy population, random errors in many types of laboratory measurements, and biochemical data.
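As an aside, the practical usefulness of the Normal distribution can be illustrated with a few lines of code. The sketch below uses scipy (an assumption; any table of the Normal distribution would give the same values) to compute the proportion of a Normal population expected to fall within one, two, and three standard deviations of the mean.

```python
from scipy.stats import norm

# Proportion of a Normal distribution lying within k standard deviations of the mean.
for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD of the mean: {proportion:.3f}")
# Roughly 0.683, 0.954 and 0.997.

# The multiplier that captures the central 95% of the distribution:
print(norm.ppf(0.975))   # about 1.96
```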

Figure 2. illustrates this, and a more extensive set of values is given in Table A in the Appendix. Consequently, if we know the mean and standard deviation of a set of observations, we can obtain some useful information by simple arithmetic. To calculate the standard deviation, we first find the difference between each observation and the mean. If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.

Consequently the squares of the differences are added. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares, and the square root is taken to bring the measurements back to the units we started with. The divisor, the number of observations minus one, is known as the degrees of freedom; in these circumstances they are one less than the total number of observations. The theoretical justification for this need not trouble the user in practice. However, consider having only one observation in a data set.

In this case the mean of the data is just that point. You might say, well, that is true: the variance is zero, because there is no variability about that point. However, we are trying to estimate the variance of the population (see Chapter 4), and a single observation tells us nothing about that variability; with a divisor of n - 1 = 0 the sample variance is undefined, which reflects this.

To gain an intuitive feel for degrees of freedom, consider a row of n fence posts. How many fence panels would we need to make a fence? The answer is n - 1: once we know where the first fence post is, the panels determine where the others are.

The calculation of the standard deviation is illustrated in Table 2. The readings are set out in column 1. In column 2 the difference between each reading and the mean is recorded.

The sum of the differences is 0. In column 3 the differences are squared, and the sum of those squares is given at the bottom of the column. The sum of the squares of the differences (or deviations) from the mean is then divided by the number of observations minus one, and the square root of the result gives the standard deviation. This procedure illustrates the structure of the standard deviation, in particular that the two extreme values contribute most to the sum of squares.
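The column-by-column procedure can be mirrored directly in code. The sketch below uses illustrative readings rather than the values in Table 2, and checks the hand calculation against the library routine.

```python
import math
from statistics import mean, stdev

x = [0.4, 0.6, 1.2, 1.8, 1.9, 2.0, 2.9]     # illustrative readings only

m = mean(x)                                  # column (1): readings; their mean
diffs = [v - m for v in x]                   # column (2): differences from the mean
squares = [d ** 2 for d in diffs]            # column (3): squared differences

print(f"sum of differences = {sum(diffs):.6f}")     # balances out to (essentially) zero
variance = sum(squares) / (len(x) - 1)              # divide by n - 1 degrees of freedom
sd = math.sqrt(variance)                            # square root: back to original units
print(f"SD by hand = {sd:.4f}, SD from statistics.stdev = {stdev(x):.4f}")
```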

For example, in addition to studying the lead concentration in the urine of children, the pediatrician asked how often each of them had been examined by a doctor during the year. After collecting the information, he tabulated the data shown in Table 2. The mean is calculated by multiplying column (1) by column (2), adding the products, and dividing by the total number of observations.

As we did for continuous data, to calculate the standard deviation we subtract the mean from each of the observations in turn and then square the difference. In this case the observation is the number of visits, but because we have several children in each class, shown in column (2), each squared difference in column (4) must be multiplied by the number of children. The sum of squares is given at the foot of column (5). Note that although the number of visits is not Normally distributed, the distribution is reasonably symmetrical about the mean.
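The same arithmetic is easy to express in code. The sketch below works from an invented frequency table of visit counts (not the one in Table 2) and follows the recipe just described: multiply each value by its frequency for the mean, and weight each squared deviation by its frequency for the standard deviation.

```python
import math

# Invented frequency table: number of visits -> number of children.
visits =   [0, 1, 2, 3, 4, 5]
children = [2, 8, 27, 45, 38, 20]

n = sum(children)
# Mean: multiply column (1) by column (2), add the products, divide by n.
mean = sum(v * c for v, c in zip(visits, children)) / n

# SD: weight each squared deviation by the number of children in that class.
sum_sq = sum(c * (v - mean) ** 2 for v, c in zip(visits, children))
sd = math.sqrt(sum_sq / (n - 1))

print(f"n = {n}, mean = {mean:.2f} visits, SD = {sd:.2f} visits")
```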

It is common for discrete quantitative variables to have what is known as a skewed distribution, that is, they are not symmetrical. Sometimes a transformation will convert a skewed distribution into a symmetrical one.

When the data are counts, such as number of visits to a doctor, often the square root transformation will help, and if there are no zero or negative values a logarithmic transformation may render the distribution more symmetrical.

Data transformation
An anesthetist measures the pain of a procedure using a visual analogue scale (in mm) on seven patients. The results are given in Table 2. and the data are plotted in Figure 2. The mean is larger than the median; where the mean is bigger than the median, the distribution is positively skewed. For the logged data, the mean and median are much closer to each other.

Original scale: 1, 1, 2, 3, 3, 6, 56
Loge scale: 0, 0, 0.69, 1.10, 1.10, 1.79, 4.03
Thus it would be better to analyze the log-transformed data in statistical tests than to use the original scale. In reporting these results, the median of the raw data would be given, but it should be explained that the statistical test was carried out on the transformed data. Note that the median of the logged data is the same as the log of the median of the raw data; however, this is not true for the mean.

The mean of the logged data is not necessarily equal to the log of the mean of the raw data. The antilog (exp on a calculator) of the mean of the logged data is known as the geometric mean, and is often a better summary statistic than the mean for data from positively skewed distributions. For these data the geometric mean is about 3.5. Some variables, such as pH, are measured naturally on a log scale.
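A short sketch of the transformation and the geometric mean, using the seven pain scores quoted above:

```python
import math
from statistics import mean, median

scores = [1, 1, 2, 3, 3, 6, 56]          # pain scores from the example above
logged = [math.log(s) for s in scores]   # natural (loge) scale

print(f"raw:    mean = {mean(scores):.1f}, median = {median(scores):.1f}")
print(f"logged: mean = {mean(logged):.2f}, median = {median(logged):.2f}")

# The geometric mean is the antilog (exp) of the mean of the logged data.
geometric_mean = math.exp(mean(logged))
print(f"geometric mean = {geometric_mean:.2f}")   # approximately 3.5

# The log of the median equals the median of the logs...
print(math.log(median(scores)), median(logged))
# ...but the log of the mean does not equal the mean of the logs.
print(math.log(mean(scores)), mean(logged))
```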

Between subjects and within subjects standard deviation
If repeated measurements are made of, say, blood pressure on an individual, these measurements are likely to vary.

This is within subject, or intrasubject, variability, and we can calculate a standard deviation of these observations.

If the observations are close together in time, this standard deviation is often described as the measurement error. Measurements made on different subjects vary according to between subject, or intersubject, variability. If many observations were made on each individual, and the average taken, then we can assume that the intrasubject variability has been averaged out and the variation in the average values is due solely to the intersubject variability.

Single observations on individuals clearly contain a mixture of intersubject and intrasubject variation, but we cannot separate the two, since the within subject variability cannot be estimated with only one observation per subject. The within subject standard deviation divided by the mean, usually expressed as a percentage, is known as the coefficient of variation (CV). It is often quoted as a measure of repeatability for biochemical assays, when an assay is carried out on several occasions on the same sample. It has the advantage of being independent of the units of measurement, but it also has numerous theoretical disadvantages.

It is usually nonsensical to use the coefficient of variation as a measure of between subject variability.

The mode
The mode is the most common value and, along with the mean and the median, is a measure of location.

It can be used for grouped continuous data, for count data and for categorical data. For example, in Table 1. The mode is only used for describing data.

Common questions
When should I quote the mean and when should I quote the median to describe my data? It is a commonly held misapprehension that for Normally distributed data one uses the mean, and for non-Normally distributed data one uses the median. Alas, this is not so: if the data are approximately Normally distributed, the mean and the median will be close; if the data are not Normally distributed, then both the mean and the median may give useful information.

Consider a variable that takes the value 1 for males and 0 for females. This is clearly not Normally distributed, yet the mean of this variable is simply the proportion of males, which is a useful summary; the median is not helpful here. Similarly, the mean from ordered categorical variables can be more useful than the median, if the ordered categories can be given meaningful scores.

For example, a lecture might be rated as 1 (poor) to 5 (excellent). The usual statistic for summarizing the result would be the mean. For some outcome variables, such as cost, one might be interested in the mean whatever the distribution of the data, since the total cost for a group can be derived from the mean.

However, in the situation where there is a small group at one extreme of a distribution (e.g. a few very large values), the median may be the more representative summary. When should I use a standard deviation to summarize variability? The standard deviation is only interpretable as a summary measure for variables that have approximately symmetric distributions. It is often used to describe the characteristics of a group, for example in the first table of a paper describing a clinical trial.

It is often used, in my view incorrectly, to describe variability for measurements that are not plausibly normal, such as age. For these variables, the range or interquartile range is a better measure. The standard deviation should not be confused with the standard error, which is described in Chapter 4 and where the distinction between the two is spelled out.

Formula appreciation
We can see from formula 2. This is confirmed in Table 2. Note that in Table 2. Try to avoid the temptation of spurious accuracy offered by computer printouts and calculator displays! Hint: if the mean and standard deviation are about the same size, and if the observations must be positive, then the distribution is likely to be skewed.

Exercises
2. He obtained the following figures: never, 12 people; once, 24; twice, 42; three times, 38; four times, 30; five times, 4. What is the mean number of times those people had been vaccinated, and what is the standard deviation? Is the standard deviation a good measure of variation in this case?

What proportion of the data is excluded? What are the best ways of graphically displaying the summaries of these data?

CHAPTER 3
Summary statistics for binary data
Binary data take one of two states, and we assign the values 0 and 1 to the two states. For a single variable there are two ways of summarizing the information: proportions and odds. Proportions can be classified as risks or rates. Consider 10 observations, of which 5 take the value 1 and 5 the value 0. We could say that 5 out of 10 observations were 1, that is, a proportion of 0.5.

A proportion that is common in medicine is a prevalence. This is defined as the number of people in a population with a particular condition divided by the number of people in the population.

This is sometimes multiplied by a round number such as 1000, so we have the prevalence per thousand, which is easier to understand.

For example, the prevalence of type II diabetes is currently 0. A proportion is a special sort of ratio, in that it must lie between 0 and 1. Another sort of ratio is a rate. This is the proportion of events that occur within a given time period. For example, the population of the UK is approximately 60 million, and every year about 600 000 people die.

This is often expressed per thousand, so that we say the crude mortality rate for the UK is about 10 per thousand per year. If the data referred to earlier arose because we followed up a group of 10 people for, say, a year, then the proportion observed would be a rate. Strictly speaking, epidemiologists would call this an incidence rate and would require a time period to be specified. When one hears a risk quoted, always ask over what period of time. After all, in the long run, the risk of death is one!

An alternative way of looking at the 10 observations is to say that, out of the 10 observations, 5 were 1 and 5 were 0, that is, a ratio of 5 to 5, or what is known as an odds of 1 to 1.

We might say something has a fifty-fifty chance, meaning a probability of 0.5. Odds are commonly used amongst the horse racing fraternity, where odds of 10 to 1 mean that out of 11 races they would expect a horse to win only once. Usually in betting, odds are bigger than one, since bookmakers would not quote you odds on something they thought was likely to happen.

However, odds can be less than one, and so, unlike proportions, their only restriction is that they must be positive. It is a simple matter to relate odds o to proportions p: the odds are o = p/(1 - p), and conversely p = o/(1 + o). Thus an odds of 1 implies a proportion of 0.5. Consider, as an example, a trial of isoniazid against placebo; the results are given in Table 3. To compare the two proportions, what we really want is to look at the contrast between the differing therapies. We can do this by looking at either the difference in risks or the ratio of risks.

Consider the difference in risk first. If we ignore the sign, this is sometimes known as the absolute risk difference (ARD) or, if the risk in the intervention group is lower than in the control group, as the absolute risk reduction.

One way of thinking about this is that if 100 patients were treated with placebo and 100 with isoniazid, we would expect 16 to have died on placebo and about 8 on isoniazid; thus an extra 7 or so patients per 100 would survive on isoniazid. Another way of looking at this is to ask: how many patients would need to be treated for one extra person to be saved by isoniazid? Roughly, if 13 patients were treated with placebo and 13 with isoniazid, we would expect 1 fewer patient to die on isoniazid. This is known as the number needed to treat (NNT) or, if the treatment is harmful, the number needed to treat for harm (NNTH), and is simply the inverse of the absolute risk difference.
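The arithmetic behind the ARD and the NNT is straightforward to code. The sketch below uses made-up event counts (not the actual counts of the isoniazid trial, which are not reproduced here) simply to show the relationships.

```python
# Illustrative counts only - not the actual results of the isoniazid trial.
deaths_control, n_control = 16, 100          # e.g. placebo arm
deaths_treat, n_treat = 8, 100               # e.g. isoniazid arm

risk_control = deaths_control / n_control    # risk (proportion) in the control arm
risk_treat = deaths_treat / n_treat          # risk in the treated arm

ard = risk_control - risk_treat              # absolute risk difference (risk reduction)
nnt = 1 / ard                                # number needed to treat = 1 / ARD

print(f"control risk = {risk_control:.3f}, treated risk = {risk_treat:.3f}")
print(f"ARD = {ard:.3f}, NNT = {nnt:.1f}")   # ARD 0.080 -> treat about 13 patients
```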

For example, in a clinical trial of pravastatin against usual therapy to prevent coronary events in men with moderate hypercholesterolaemia and no history of myocardial infarction, the NNT is 42. Thus you would have to treat 42 men with pravastatin to prevent one extra coronary event, compared with the usual therapy.

However, it is important to realize that comparisons between NNTs can only be made if the baseline risks are similar. We can also express the outcome from the isoniazid trial as a risk ratio, or relative risk (RR), which is the ratio of the two risks: experimental risk divided by control risk. With a relative risk less than one, we can also consider the relative risk reduction (RRR), which is one minus the relative risk; in an epidemiological context the RRR is known as the prevented fraction in the exposed (see Chapter 14). We can also summarize the trial in terms of odds.

In this case the odds ratio has almost the same value as the relative risk. This demonstrates an important fact: the odds ratio is a close approximation to the relative risk when the baseline risk is low, but is a poor approximation if the baseline risk is high.
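A sketch of the risk-ratio and odds-ratio calculations, again with invented counts, which also shows why the odds ratio approximates the relative risk only when the baseline risk is low:

```python
def odds(p):
    """Convert a proportion (risk) to an odds."""
    return p / (1 - p)

def summarise(events_treat, n_treat, events_control, n_control):
    risk_t = events_treat / n_treat
    risk_c = events_control / n_control
    rr = risk_t / risk_c                      # relative risk
    or_ = odds(risk_t) / odds(risk_c)         # odds ratio
    print(f"risks {risk_t:.3f} vs {risk_c:.3f}: RR = {rr:.2f}, OR = {or_:.2f}")

# Low baseline risk: the OR is close to the RR.
summarise(10, 1000, 20, 1000)
# High baseline risk: the OR is a poor approximation to the RR.
summarise(80, 100, 40, 100)
```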

Treatments can do harm as well as good. As an example, consider the trial reported by Kennedy et al. of acetazolamide and furosemide in post-haemorrhagic ventricular dilatation in infancy. Here the risk in the control group is 0. The relative risk of death or shunt in the intervention group compared with standard therapy is 1.

This can be seen in Table 3. As a ratio of two numbers, the relative risk hides the actual size of the numbers.

Thus a relative risk of 2 could be 8 people out of 10 having an event compared with 4 people out of 10, or it could be 2 people having an event compared with 1 person, out of many thousands in each group. These have completely different interpretations. Thus, when a relative risk is quoted, always ask about the absolute risks as well, so that a proper interpretation can be made. For example, the rate of deep vein thrombosis in women on a new type of contraceptive might be 30 per a given number of women-years, compared with 15 per the same number of women-years on the standard type.

Thus the relative risk is 2, which makes it appear that the new type of contraceptive carries quite a high risk of deep vein thrombosis. However, an individual woman need not be unduly concerned, since her absolute probability of a deep vein thrombosis in any one year remains very small.

Relative risks versus odds ratios
The odds ratio may not seem like an intuitively obvious statistic, but it has some useful properties. Clearly one cannot simply multiply the relative risk by the incidence in the unexposed to get the risk in the exposed, since one might get a risk greater than 1, i.e. an impossibility. However, there are no such problems with the odds ratio. A further use for the odds ratio arises when the data come from a cross-sectional study or a case-control study (see Chapter 14).

In a case-control study it is not possible to calculate a relative risk directly, but one can use the odds ratio to estimate the relative risk. Suppose there are two conditions, A and B, which are present or absent, and we wish to see if there is an association between the two. We rewrite the data as Table 3. and calculate the odds ratio across the rows; however, we can also argue down the columns. Thus the odds ratio for A given B is the same as the odds ratio for B given A.

To illustrate this consider Table 3. Thus the relative risk of having eczema, given that a child has hay fever, is 0. We can also find the odds ratio of having eczema given that a child has hay fever as 0. We can consider the table the other way around, and ask what is the risk of hay fever given that a child has eczema. Thus the relative risk of hay fever given that a child has eczema is 3.

Thus we can either say the children with hay fever have five times the odds of getting eczema, or that children with eczema have five times the odds of getting hay fever. This will be approximately true for risks because hay fever and eczema are quite rare in the population, but would not be true if the incidence was higher. Another useful property of the odds ratio is that the odds ratio for an event not happening is just the inverse of the odds ratio for it happening.
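The symmetry of the odds ratio is easy to verify on a 2 x 2 table. The counts below are invented, purely to demonstrate that arguing across the rows (eczema given hay fever) and down the columns (hay fever given eczema) gives the same odds ratio, whereas the two relative risks differ.

```python
# Invented 2 x 2 table of counts (not the data from Table 3):
#                 eczema   no eczema
# hay fever          a=30      b=170
# no hay fever       c=70      d=1730
a, b, c, d = 30, 170, 70, 1730

# Relative risks depend on which way round the table is read.
rr_eczema_given_hayfever = (a / (a + b)) / (c / (c + d))
rr_hayfever_given_eczema = (a / (a + c)) / (b / (b + d))
print(rr_eczema_given_hayfever, rr_hayfever_given_eczema)   # different values

# The odds ratio is the same whichever way the table is read: a*d / (b*c).
or_rows = (a / b) / (c / d)
or_cols = (a / c) / (b / d)
print(or_rows, or_cols)   # identical
```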

Odds ratios and case-control studies
The design of case-control and cohort studies will be discussed in Chapter 14. These relate exposure to some hazard to outcome in the form of disease or death.

A cohort study measures exposure and then observes events to answer the question: if one is exposed to a hazard E, what is the probability of disease D, i.e. Prob(D given E)? A case-control study argues the other way around.

It measures events and looks backwards for exposure, i.e. Prob(E given D). One might hope to estimate Prob(D given E) from such a study; however, we will demonstrate that this cannot be the case. Similarly, the usual measure of the relative risk no longer holds. Suppose the investigator decided to double the number of controls, as shown in Table 3. The apparent risk of disease among the exposed then changes simply because more controls were sampled, so a risk calculated directly from a case-control study is not meaningful.

However, when the assumption of a low absolute risk holds true (which is usually the situation for case-control studies), the odds ratio and the relative risk are close, and so in this case the odds ratio is assumed to approximate the relative risk that would have been obtained if a cohort study had been conducted instead of a case-control study.

Paired alternatives
Sometimes it is possible to record the results of treatment or some sort of test or investigation as one of two alternatives.

For instance, two treatments or tests might be carried out on pairs obtained by matching individuals, or the pairs might consist of successive treatments of the same individual. These types of studies may be crossover trials or matched case-control studies, and are described in a later chapter. The results can be set out as shown in Table 3. Only the pairs whose two members give different results (the discordant pairs) tell us anything about which treatment is better. This seems rather counterintuitive.

However, consider a situation where we want to know whether one of two drugs is better. We use a crossover design in which each patient gets both drugs in random order.

Then if the patient responds to both, this does not tell us which of the two is better. Similarly, if the patient does not respond to either, we learn nothing about their relative merits. It is only when a patient responds to one and not the other that we glean any information as to which drug is better.
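For a matched or crossover study the summary statistic is therefore built from the discordant pairs only. The sketch below uses invented pair counts (not the ulcer-trial results discussed next) to show how the paired odds ratio, discordant pairs favouring A divided by discordant pairs favouring B, is obtained.

```python
# Invented counts of pairs from a matched (crossover) comparison of treatments A and B.
both_respond = 20          # concordant: respond to both - uninformative
neither_respond = 15       # concordant: respond to neither - uninformative
a_only = 12                # discordant: respond to A but not B
b_only = 4                 # discordant: respond to B but not A

# Only the discordant pairs carry information about which drug is better.
paired_odds_ratio = a_only / b_only
total_pairs = both_respond + neither_respond + a_only + b_only
print(f"odds of responding on A rather than B = {paired_odds_ratio:.1f}")   # 3.0
print(f"informative pairs: {a_only + b_only} of {total_pairs}")
```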

For example, a registrar in the gastroenterological unit of a large hospital in an industrial city sees a considerable number of patients with severe recurrent aphthous ulcer of the mouth. Claims have been made that a recently introduced preparation stops the pain of these ulcers and promotes quicker healing than existing preparations. Over a period of 6 months, the registrar selected every patient with this disorder and paired them off, as far as possible, by reference to age, sex, and frequency of ulceration.

Finally she had 108 patients, in 54 pairs. To one member of each pair, chosen by the toss of a coin, she gave treatment A, which she and her colleagues in the unit had hitherto regarded as the best; to the other member she gave the new treatment, B. Both forms of treatment are local applications, and they cannot be made to look alike.

Consequently, to avoid bias in the assessment of the results, a colleague recorded the results of treatment without knowing which patient in each pair had which treatment.

The results are shown in Table 3. Thus the odds of responding on A rather than on B are less than 1.

Summary: choice of summary statistics for binary data from a non-matched study
The choices are summarized in Table 3.

Common questions
When should I quote an odds ratio and when should I quote a relative risk? The odds ratio is difficult to understand and most people think of it as a relative risk anyway. Thus, for prospective studies, the relative risk should be easy to derive and should be quoted, and not the odds ratio.

For case-control studies, one has no option but to quote the odds ratio. For cross-sectional studies, one has a choice, and if it is not clear which variables are causal and which are outcome, then the odds ratio has the advantage of being symmetric, in that it gives the same answer if the causal and outcome variables are swapped.

A major reason for quoting odds ratios is that they are the output from logistic regression, an advanced technique discussed in Statistics at Square Two. In this situation it is important to label the odds ratios correctly, and to consider situations in which they may not be good approximations to relative risks.

Formula appreciation
Equation 3. In equation 3. Is it reasonable to assume that the odds ratio is a good approximation to a relative risk?

If you are trying to evaluate a therapy, does the absolute level of risk given in the paper correspond to what you might expect in your own patients?

Exercises
3. (c) The difference in risks is 0. It showed that the combined risk of cardiovascular death, stroke, and myocardial infarction was Select ONE option only.

References
1. Effect of isoniazid prophylaxis on mortality and incidence of tuberculosis in children with HIV: randomised controlled trial. BMJ ;—9.
2. The number needed to treat: a clinically useful measure of treatment effect. BMJ ;—4.
3. New York: Churchill Livingstone.
4. International randomised controlled trial of acetazolamide and furosemide in post-haemorrhagic ventricular dilatation in infancy. Lancet ;—.
5. Incidence and prognosis of asthma and wheezing from early childhood to age 33 in a national British cohort.
6. Statistics notes: the odds ratio. BMJ ;.
7. Campbell MJ. Statistics at Square Two, 2nd ed.
8. Sex differences in speed of emergency and quality of recovery after anaesthetic: cohort study. BMJ ;—.



