Understanding and Evaluating Veterinary Clinical Research
Results from investigations conducted in clinical settings contribute greatly to determining how veterinarians practice medicine. It is important for the practitioner to understand how clinical information is collected, analyzed, and communicated in journals and presentations at conferences. Clinical research is either retrospective in observational studies, looking at historical medical records as the source of data, or prospective in both experimental and observational studies, where the study is designed before any patients are included. Prospective, experimental studies provide the most reliable results, although they form a minority of published reports. Randomized, controlled trials are the most reliable format, and attempts should be made to use this design more often in veterinary medicine. Care must be taken in the conduct of clinical research to reduce sources of bias that can yield false findings, particularly in small, retrospective studies. Statistical analysis is the key to data interpretation, but it must be applied appropriately to avoid wrong assumptions and misconceptions. Regardless of how studies are conducted, it is important for the practitioner to be an astute reader of the clinical literature. An understanding of clinical research methods will result in better standard-of-care recommendations and practice.
Introduction
All veterinarians use treatments in medical practice reflecting their belief in the current standard of care. Such beliefs were cultivated in school and matured through readings and presentations at meetings. But how do such treatments ultimately become accepted, replacing earlier practices? In most instances, standard treatments have evolved over decades and are based upon medical investigations called clinical research. Over time, the science of clinical research has improved and become increasingly quantitative. As a result, more is known about which treatments are efficacious and which are not. In some instances, therapies once thought to be effective have become contraindicated. Such improvement in scientific rigor has shifted the conduct of clinical practice toward evidence-based medicine, which refers in part to the use of medical therapies founded upon scientifically sound studies conducted in clinical settings.1,2
While the veterinary medical literature continues to expand, there is considerable variability in the scientific strength of published reports. How, then, is the consumer of medical literature to judge the validity of published studies and modify clinical practice accordingly? This paper discusses how the practitioner can make the most of reading and understanding the medical literature.
To help the reader, key terms applying to clinical research are shown in italics throughout the text (on the first use only). These terms are defined and referenced in the glossary (Table 1).
Categories of Clinical Investigations
It is important for the reader of the medical literature to understand the type of clinical investigation being reported because some types of studies are more rigorous (i.e., more likely to yield valid or unbiased results) than others. With many methods of study design classification to draw from, the following definitions represent one approach to revealing their underlying structure. Figure 1 shows the most common categories of investigations and their relative strength of evidence.



[Figure 1: Common categories of clinical investigations and their relative strength of evidence. Citation: Journal of the American Animal Hospital Association 48, 5; 10.5326/JAAHA-MS-5803]
Timing
Clinical research is categorized based upon whether it is retrospective or prospective. Although there is no universal agreement on the meaning of these terms, throughout this report, retrospective investigations are defined as those undertaken after a diagnosis is made and therapy applied; therefore, clinical care and observation of results have already occurred when the study began. A retrospective analysis examines a patient’s medical history using data from medical, laboratory, and other records. By definition, all retrospective studies are nonexperimental. In contrast, prospective investigations are usually methodologically stronger than retrospective ones because, depending on the design, either patient treatment/exposure commences or patient outcome occurs only after the initiation of the study. In clinical trials, for example, there is a prior hypothesis (the medical question to be answered) and a prespecified protocol is written before patient enrollment. A prospective study is designed and initiated before treatment and observation begin; therefore, plans can be made regarding how future patients will be selected for treatment, how treatment is applied, and how data will be collected and analyzed.
Investigators conducting retrospective studies may also choose to distinguish their studies by the order in which treatment or exposure and the clinical outcome are measured. This can have important implications for inferring causation if the clinical outcome can subsequently affect the exposure of interest. For example, suppose an investigator wishes to assess whether spending time in an outdoor environment can influence the risk of feline hyperthyroidism and asks owners about their ill cat’s (relative) time spent indoors and outdoors. One might find that hyperthyroid cats are more likely to spend time indoors than outdoors. However, it is plausible that when the cat became ill, but before it was diagnosed, the illness affected the cat’s indoor/outdoor preference. That is, a cat that earlier preferred being outdoors might, when it began to experience its illness, have instead preferred being indoors. This example illustrates how the outcome (hyperthyroidism) can influence the value of the potential risk factor.
Experimental and Nonexperimental Investigations
Clinical research can also be distinguished by whether it is experimental, where the investigator is able to manipulate treatments or exposures in study enrollees, or nonexperimental (also called observational), where study enrollees (or their owners) have self-selected their treatments or exposures. Examples of experimental studies are randomized clinical trials and randomized crossover trials. Examples of nonexperimental studies are cross-sectional, cohort, and case-control studies. Those designs are distinguished from each other by how individuals are sampled for inclusion into the study population. Experimental studies are almost always preferable to nonexperimental studies because of the investigator’s ability to control for sources of bias. However, experimental studies are not always possible due to either ethical impediments or pragmatic considerations (e.g., study duration, costs, and rarity of study subjects).
Longitudinal and Cross-sectional Studies
Clinical research can also be categorized as longitudinal (studies in which the information is collected over a prescribed period of time, allowing for the determination of causes that precede effects) or cross-sectional (where information is collected at either a single or brief period of time, often precluding the determination of a temporal relation between variables under study). To illustrate, consider the clinical course of a disease as analogous to water flowing in a pipe. It is possible to study the water flow by following a portion of the water down the pipe over time and observing its outflow (i.e., the longitudinal study). Alternatively, it is possible to examine one location on the pipe at one point in time (i.e., the cross-sectional study). Either type of study can form the basis of a valid clinical study if properly applied and interpreted; however, therapeutic efficacy is best studied in a longitudinal design because it is far easier to temporally distinguish cause (treatment) from putative effect (clinical outcome). Only in longitudinal studies can the incidence of some clinical endpoint (or comparative measures of incidence) be measured. In cross-sectional studies, the only direct measure available is prevalence.
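To make the distinction concrete, the following sketch (with hypothetical counts invented for illustration, not drawn from any study cited here) contrasts incidence, which requires longitudinal follow-up, with prevalence, which a cross-sectional survey measures directly:

```python
# Hypothetical longitudinal study: follow 100 at-risk dogs for one year
# and count the new cases of disease that develop during follow-up.
at_risk, new_cases = 100, 12
incidence_risk = new_cases / at_risk        # 0.12 -> 12% one-year risk

# Hypothetical cross-sectional survey: examine 200 dogs on a single day
# and count how many currently have the disease (new and old cases alike).
surveyed, existing_cases = 200, 30
prevalence = existing_cases / surveyed      # 0.15 -> 15% point prevalence

# Only the longitudinal design measures how often new disease arises;
# the cross-sectional design measures only how much disease exists now.
```

Note that a chronic, long-lasting disease can have a high prevalence even when its incidence is low, which is one reason the two measures cannot be used interchangeably.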
Examples of Common Retrospective Clinical Research Designs
A common type of report in the veterinary medical literature is the longitudinal retrospective case series, a look back at a single, sequential group of patients with a similar diagnosis or treatment that was usually made or provided at either one or several institutions. For example, an investigator might report on a group of dogs with pancreatitis that were treated with a new medical therapy over the past several years. Using medical records as the source of data, diagnostic characteristics, treatment, and clinical outcomes are extracted and reported using descriptive statistics, such as mean and median canine pancreatic lipase (cPL) values before and after therapy or overall rates of survival. Longitudinal retrospective case series are resource-sparing (inexpensive), are performed quickly, and can include a large number of patients if cases are common. Because they are not planned in advance, however, they often suffer from the lack of a uniform treatment protocol, which means that patients will have been managed clinically on a case-by-case basis. In addition, patients selected for the new medical therapy may not be representative of all dogs with pancreatitis (for example). An extreme example of this would be when pancreatitis patients with the worst prognosis were preferentially given the new therapy, which could make even superior treatments appear ineffective. Finally, the absence of a control group in the case series study precludes definitive conclusions. Thus, the question, “what would the cPL values and survival have been with conventional versus new therapy?” remains unanswered.
In an effort to solve the problem of no control group, additional types of retrospective studies are used. One of these is the retrospective cohort study. Here, study subjects are animals that have been exposed to a risk factor (e.g., living strictly outdoors) and those not exposed (e.g., living strictly indoors). The groups are then compared for the incidence (measured as risk or rate) of disease between the two groups (e.g., contracting a communicable infectious disease). All individuals enrolled in a cohort study are, by definition, at risk for developing the clinical outcome of interest. That is, they have not experienced the outcome at the time they began to be followed under study.
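Because every cohort member starts out at risk, incidence can be compared directly between the exposed and unexposed groups. A minimal sketch, using hypothetical counts invented for illustration:

```python
# Hypothetical retrospective cohort: strictly outdoor (exposed) vs
# strictly indoor (unexposed) cats, followed for an infectious disease.
exposed_cases, exposed_total = 30, 100
unexposed_cases, unexposed_total = 10, 100

risk_exposed = exposed_cases / exposed_total          # 0.30
risk_unexposed = unexposed_cases / unexposed_total    # 0.10

# Relative risk: outdoor cats were about 3 times as likely to
# contract the disease as indoor cats.
relative_risk = risk_exposed / risk_unexposed         # ~3.0
```

A relative risk of 1.0 would indicate no association between the exposure and the disease; values above or below 1.0 indicate increased or decreased risk, respectively.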
A related nonexperimental controlled design is the retrospective case-control study. In that example, animals are identified that already have a certain disease (e.g., chronic bronchitis). Next, a control group without bronchitis is identified. The analysis then examines the prior presence of risk factor(s) that might have led to the disease (e.g., having owners who smoke). Study subjects are sampled specifically based on their clinical outcome status, which means they have either experienced the outcome (cases) or remain at risk for experiencing the outcome (controls). Such sampling precludes investigators from directly measuring risk among subgroups in case-control studies.
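Because a case-control study fixes the numbers of cases and controls by design, risk cannot be computed; the odds ratio is the comparative measure that can. A minimal sketch, again using hypothetical counts invented for illustration:

```python
# Hypothetical case-control study: dogs with chronic bronchitis (cases)
# vs dogs without (controls); exposure = living with owners who smoke.
cases_exposed, cases_unexposed = 40, 60
controls_exposed, controls_unexposed = 20, 80

# Odds of exposure among cases divided by odds of exposure among controls.
odds_ratio = (cases_exposed / cases_unexposed) / (controls_exposed / controls_unexposed)
# (40/60) / (20/80) ~= 2.67: exposure is more common among cases.

# Note: 40/100 is NOT the risk of bronchitis among exposed dogs. The 100
# cases were sampled because they were cases, so the sampling scheme
# fixes the outcome proportions and no risk can be measured directly.
```

When the disease is rare, the odds ratio approximates the relative risk that a cohort study would have measured, which is one reason case-control studies remain useful for studying uncommon conditions.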
Although retrospective cohort and retrospective case-control studies are methodologically stronger than a retrospective case series because of the presence of a control group, the compared groups may differ in characteristics that are unknown, not measurable, or not entered into the records, causing bias and incorrect conclusions. Moreover, in all retrospective studies of treatment, the reasons why clinicians elect to administer certain therapies to some patients but not to others can lead to groups that are inherently noncomparable (i.e., confounded). For example, if clinicians use a drug for patients with a poor prognosis, but not for patients with a more favorable outlook, any differences in therapeutic outcome could not distinguish drug effects from outcomes affected by underlying disease severity. For those reasons, case-control and cohort studies are better used in epidemiologic research to study determinants of health outcomes, rather than treatment effects.
Examples of Common Prospective Clinical Research Designs
Observational Studies
Cohort and case-control studies can also be prospective. These nonexperimental designs are used when patients cannot realistically be assigned by the investigator to a treatment or exposure group. For example, in the indoor/outdoor example above it is not practical to prospectively assign one group of pets to live in a particular setting. In such cases, a prospective cohort study can be designed in which a group of pets already living indoors and a group of pets already living outdoors are identified. These two groups can then be prospectively followed for risks or rates of communicable diseases. A prospective case-control study can sequentially enroll new patients (cases that develop a communicable disease and controls that have not developed one) following the commencement of the study. Then, information about potentially causal risk factors present prior to study inclusion can be obtained.
Experimental Studies
The simplest experimental design is the longitudinal, uncontrolled, prospective case series in which the investigator decides prior to the onset of a study to include either a set number of consecutive, future cases or an unspecified number of future cases in a finite time period using predefined criteria for study inclusion and monitoring. For example, a new surgical procedure for repairing a patent ductus arteriosus might be evaluated, but it is not thought ethical to deny surgery to some patients. All patients who meet the enrollment criteria are treated and followed to a study endpoint (e.g., repair successful, repair failure, death) so that the success and mortality proportions can be calculated. Because this type of study is planned in advance, data collection and follow-up can be consistent among enrolled cases. The design, however, is unsuitable for testing hypotheses because of the absence of a control group; therefore, it should be reserved only for those situations where a controlled study is not feasible.
The best between-group comparative study designs in medical settings are experimental prospective parallel trials. Patients are assigned to one of two or more treatment groups as they sequentially enter the study. Patients in all groups are managed identically in parallel, and factors that naturally influence outcomes over time but are unrelated to treatment assignment (e.g., concurrent diseases, concomitant therapies, spontaneous improvement, measurement errors) will tend to equalize between the groups as group size increases. Ultimately, their effects “balance out.”
Randomized clinical trials are considered the most scientifically sound type of parallel study because of the use of a control group of patients similar to the treated group. To enhance the likelihood that the response rates in both groups would be identical if both were treated identically, the group assignment is made by a process of randomization. For example, dogs with atopic dermatitis confirmed by allergen skin testing would be randomly assigned (using a technique so that each dog has an equal probability of being assigned to either group) to treatment with either a new oral immunosuppressant drug or to a placebo. All dogs would be followed, clinically monitored, and have histopathologic confirmation of skin lesions over time. The overall incidence of allergic dermatitis during a defined follow-up period would be compared between the groups.
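The mechanics of random assignment can be sketched as follows. This is one simple, shuffle-based allocation scheme, shown only as an illustration; the arm labels and the trial of 20 dogs are hypothetical:

```python
import random

def randomize(patient_ids, seed=None):
    """Randomly allocate patients to two equal-sized parallel arms.

    A simple shuffle-based scheme: every patient has an equal
    probability of ending up in either arm, and the two arms come out
    the same size.
    """
    rng = random.Random(seed)   # seed only to make the example repeatable
    ids = list(patient_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"immunosuppressant": ids[:half], "placebo": ids[half:]}

# Hypothetical trial enrolling 20 atopic dogs, identified by ID number.
arms = randomize(range(20), seed=42)
```

In practice, trials often use blocked or stratified randomization to guarantee balance within sites or disease-severity strata, but the underlying coin-flip idea is the same: the assignment mechanism, not the clinician, determines who receives which treatment.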
Another commonly used experimental study design in a clinical setting is a crossover trial. Instead of comparing groups composed of different individuals, this study compares different treatments within the same individual. A formidable advantage of such a design is the intrinsic control of within-individual confounders, such as underlying physiologic characteristics, disease severity, or genetic predisposition. For example, patients have their arterial blood gases evaluated under anesthesia A, are allowed to recover from and completely eliminate anesthesia A, and are then evaluated again under anesthesia B. Upon entering such a trial, patients are randomized to one of the orders of anesthesia administration: A then B or B then A. This study design requires two strong assumptions. First, that there are no underlying trends in clinical outcome over time that could mask the effects of the treatments, and second, that there is no carryover effect from the first to the second treatment.
Looking for Bias in Clinical Reports
Bias is the unintended influence of the investigator, the participants (or their characteristics), the study procedures, or the analysis that compromises the validity and accuracy of the results. Bias is not dishonesty, and most bias is unknowingly introduced in the design and/or conduct of a study. Unfortunately, bias has an insidious and detrimental effect on study interpretations, which can lead to conflicting published medical reports. Bias can occur in any type of study, but it is more common in retrospective, observational, and uncontrolled studies.
In a retrospective case series, biased conclusions may arise when there is inadvertent or purposeful selection of the medical records of only certain study patients instead of all consecutive cases (i.e., selection bias or “cherry picking”). In one egregious example, patients who responded to treatment are better remembered by the investigator and therefore included in the study, but treatment failures are not, leading to a wholly nonrepresentative sample of the patient population. This is why a case series report must count all of the patients available for analysis, and the reader should question if this was correctly done.
Follow-up (censoring) bias can arise in any longitudinal study when individuals already enrolled elect to leave the study population or are lost to follow-up. When the reasons for exiting a study are related to the factors under study, such as a treatment, then incomplete ascertainment of individuals occurs and a comparison of treatments can lead to a misleading conclusion. Suppose, for example, that a group of patients is being treated with a drug known to have serious side effects. Patients with the worst prognosis are those most likely to experience the side effects and are more likely to drop out of the study. The members of the treatment group remaining in the study will have better survival than those that dropped out, potentially underestimating the mortality rate compared with the control group, which did not have the high drop-out rate.
Measurement bias, occurring in retrospective, prospective, experimental, and observational studies, arises in the course of measuring either explanatory or outcome variables. Measurement bias can occur when a measurement instrument (which could be a machine, an interviewer, a survey, etc.) is miscalibrated or if the investigator fails to correctly measure the study’s endpoint. Such bias can be differential, meaning that the measurement error of one variable differs across categories of another variable, or nondifferential, when the measurement error of one variable is the same across levels of a different variable. For example, if the investigator assesses the outcomes more carefully in patients in one treatment group compared with patients in another group, this would be differential misclassification of outcome. Conversely, if data abstracted from a medical record systematically led to an underestimation of overall treatment administered, but this error applied equally to patients with all possible outcomes (e.g., survival or death), then this would be nondifferential misclassification of exposure.
One of the best methods to reduce the opportunity for measurement bias in clinical research is blinding. In a single-blind study, the owner or person directly caring for the animal does not know what treatment their patient is receiving, meaning she/he is blinded. In a double-blind study, neither the owner/caretaker nor the practitioner/investigator knows the treatment, reducing both the placebo effect and investigator bias. If there is no blinding and everyone involved knows who receives what treatment, the study is called nonblinded or open-label.
Although very advantageous, blinding is sometimes difficult to achieve, such as when treatments being compared are surgical versus medical therapy. One way to reduce bias in such a study is to have independent experts, unaware of the treatment of each patient, perform the key study measurements. For example, either having an independent, blinded cardiologist perform echocardiograms or engaging an independent radiologist to evaluate radiographs will reduce bias. Some studies reduce bias by having objective, rather than subjective, study outcomes. For instance, laboratory measures, body weight, and death are all objective endpoints and less prone to be affected by bias.
Other sources of measurement bias are the application of improper study standards, such as utilizing miscalibrated equipment, having untrained individuals interpret laboratory tests, and utilizing laboratory cut-offs based on tests with imperfect sensitivities and specificities to distinguish normal from diseased animals (i.e., declaring healthy animals to be outside normal limits [false positives] or diseased animals to be within normal limits [false negatives]).
Confounding bias occurs when the treatment assignment for a patient is not random, leading to treatment groups that would have had different clinical outcomes even if they were treated identically. For example, the investigator may unknowingly refrain from assigning the more ill patients to the control arm of a nonrandomized study. This would tend to make a treatment look less efficacious than it actually is. A valid study should therefore state how patients were assigned to treatment groups. Randomization provides the highest likelihood that the parallel groups will be comparable, or balanced at the study start, with respect to determinants of the clinical outcome besides treatment (i.e., confounders). The larger the number of patients randomized, the better chance there is of the treatment groups being balanced and comparable.
Confounding bias is an even greater concern in observational studies because treatment or exposure are either self-selected or are chosen for reasons that have nothing to do with the goals of a study. One of the most common examples of this in a clinical setting is “confounding by indication.” In this example, a clinician selects a treatment for a patient based on clinical severity, prognosis, owner preference, side effects, or other reasons. The effects of these reasons for the selected treatment are themselves predictors of clinical outcomes. Statistical analysis is the most common method of correcting for confounding in observational studies, although there remains the risk of confounding by either unknown or unmeasurable variables, including unrecorded reasons for a clinician selecting one treatment over another.
A study that is free from sources of bias is said to be internally valid or accurate because its findings accurately estimate effects in the source (reference) population that gave rise to the study population. In contrast, faulty interpretations may occur when the patients do not represent the actual (target) population that one would like to generalize results to, meaning that the study is not necessarily externally valid. This error arises when the source (reference) population is not a random sample of the total (target) patient population that could potentially be impacted by the research findings. For example, study patients may be selected specifically because they have the most severe form of the disease. Although the study findings are internally valid with respect to that particular subset of patients (assuming there are no other sources of bias), the results may not then be generalized to all patients with that diagnosis. For this reason it is important that the enrollment criteria, the rules for which patients enter the study, are well thought out and well-explained in the study report. This underscores the critically important point that random selection and randomization are the keys to valid study inference.
Another type of bias with a more global influence is publication bias. Medical journals may favor “positive” over “negative” studies, so reports showing a positive effect of treatment may be submitted and accepted for publication more frequently. As a result, studies with negative results may not be made public as often. This weight of published evidence could lead readers to believe that therapies are more effective than they really are. Journals should be willing to publish well conducted studies with negative as well as positive outcomes.
Bias is also introduced when there is conflict of interest on the part of the investigator. Factors such as source of funding, prestige, corporate associations, academic standing, pressure to publish, and similar problems may cause the investigator to judge data differently than if such factors were not present. Double-blinding helps to reduce this type of bias. Many journals now require authors to state potential conflicts of interest in published reports.
Statistical Testing
Clinical research relies upon a broad array of statistical tests too numerous and complex to discuss in this review paper. Instead, the reader is referred to one of numerous texts on statistics in medicine.3–6 However, the key statistical concepts important to evaluating clinical research are reviewed here.
For making comparisons between groups, the P value is the most ubiquitous statistical concept for the reader of the literature to understand. In most circumstances, the P value can be approximately defined as the probability of finding differences or effects at least as large as those obtained in a study when no differences or effects really exist (the latter is referred to as the null hypothesis). The P value is always expressed as a number bounded by 0 and 1. It is vitally important to understand that the P value does not provide the probability that the null hypothesis is true or false. Instead, testing is performed under the assumption that the null hypothesis is true; the P value describes how surprising the observed results would be under that assumption (not whether the assumption itself is true). The opposite of the null hypothesis is the alternative hypothesis.
Medical researchers and clinicians often seek a dividing line to help them decide whether study evidence is compelling enough to make some conclusion or decision about observed effects. It has become convention to set this value at 0.05. This means that, when the null hypothesis is true, study authors will erroneously claim that a treatment has some effect no more than one out of 20 times (5%). This is referred to as the level of significance or α. If the calculated P value is ≤α, then the results are referred to as “statistically significant” (Table 2). The smaller the P value, the less compatible the data are with the null hypothesis.
Table 2 note: Falsely rejecting the null hypothesis (H0) is a type I error, while failing to reject H0 when it is, in reality, false is a type II error. If the level of significance (α) is set at 0.05, the likelihood of a type I error is approximately 5% under H0. As the power of the study increases, the likelihood of a type II error decreases.
The level of significance used to conclude if a result is significant is always determined before a study is carried out. Commonly, the significance level is set at 0.05 (5%); however, it need not be, and other values for statistical significance can be used. For example, a level of significance of 0.01 is also commonly used, and is a more rigorous test with less likelihood of erroneously rejecting the null hypothesis.
Conventional hypothesis testing is inherently designed to disprove, rather than prove. It is often overlooked that the null hypothesis, whether correct or not, is always assumed to be correct for testing purposes. The data arising in the course of a study is then contrasted with the null hypothesis under its assumption of veracity. This is tantamount to positing the question, “Is it likely to observe study results at least as large (or small) when the null hypothesis is correct?” If the findings are compatible with what one would expect under the null hypothesis then the latter will not be rejected. Conversely, if the findings are unlikely to have arisen when the null hypothesis holds then the latter will be rejected (Table 2). Hypothesis testing is an attempt to interject objectivity into this decision-making process.
To further illustrate the inter-relationship of P values and levels of significance, consider a study to determine if there is a significant association (with a level of significance of 5%) between treatment (versus placebo) and a clinical response (Table 3). Suppose a small study is done with five treated patients (two of which had a positive clinical response [40%]) and five patients receiving a placebo (one of which had a positive clinical response [20%]). The relative risk of a positive response in this study would be 2.0 (i.e., 0.40/0.20), and a test to see if this is significantly different from the null value (relative risk of 1.0 [i.e., no association]) yields a P value of 1.0. Because the P value is higher than the level of significance, the null hypothesis would not be rejected, and the study authors would conclude that the data from the study are not at all unexpected (and are, indeed, very likely) when the null hypothesis of no association is correct.
Table 3 note: P values were calculated using the Fisher exact test. All three experiments are identical in all respects except for the number of treated patients; note the effect of sample size on the P value and the likely conclusions.
Now suppose the study was 10-fold larger. In that case, 20/50 (40%) treated dogs developed a positive response and 10/50 (20%) dogs receiving a placebo developed a positive response. An identical test of the null hypothesis of no association would now yield a P value of 0.049. Because the P value is less than the level of significance, the null hypothesis is rejected and the study authors would conclude that the relative risk of 2.0 from the study was significantly different from 1.0. In other words, the larger study’s data were unlikely to occur when no real relationship between treatment and a positive clinical response actually exists, prompting the authors to conclude that the null hypothesis should be rejected.
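Both worked examples can be reproduced with a short, self-contained implementation of the two-sided Fisher exact test (a library routine such as scipy.stats.fisher_exact returns the same values; the sketch below uses only the standard library):

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows: treated / placebo; columns: responders / nonresponders.
    The P value sums the probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, row1)

    def p_table(x):  # hypergeometric probability of x responders in row 1
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))  # tolerance for float ties

p_small = fisher_two_sided(2, 3, 1, 4)      # 2/5 vs 1/5 responders -> 1.0
p_large = fisher_two_sided(20, 30, 10, 40)  # 20/50 vs 10/50 -> ~0.049
```

Running both tables through the same test makes the sample-size effect explicit: the relative risk is 2.0 in each case, but only the larger study yields a P value below the 0.05 level of significance.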
Many clinical investigations look for significant differences, or superiority, of one treatment group to another, as in the example above. However, investigators’ goals may be different. They may wish to show that two treatments (or effects) are essentially medically indistinguishable from each other (i.e., equivalent). The failure to reject the null hypothesis of no treatment difference in a conventional test does not imply that the treatments are equivalent, but only that there is insufficient evidence to confidently demonstrate superiority. Studies specifically designed to test if two treatments are equivalent are called equivalence trials. In such studies, the investigator must postulate prior to the onset of the study, based upon clinical judgment, an equivalence range of treatment differences that is deemed medically not meaningful. The null hypothesis in an equivalence study is that the groups (the test therapy and the positive control) are not equivalent outside this range. A study that empirically demonstrates (with high confidence) that the plausible range of treatment effects falls entirely within the range specified prior to the study onset supports a declaration of treatment equivalence.
For example, an investigator may believe that two different treatments for blood pressure are equivalent if the difference between them falls within ±10 mm Hg. If the study cannot rule out with high confidence that one treatment is >10 mm Hg greater (or lower) than the other, then the null hypothesis fails to be rejected, and the conclusion is that equivalence has not been demonstrated. Conversely, if the study shows with high confidence that the treatments differ by <10 mm Hg in either direction, then the null hypothesis is rejected, with the conclusion that the drugs are equivalent.
Related to the concept of equivalence testing is noninferiority testing. Instead of trying to show that two treatments are equivalent in efficacy (i.e., one treatment is no different than the other), an investigator may be satisfied with the less restrictive goal of showing that a novel treatment’s measured success is at least no worse than (i.e., not inferior to) the success achieved using a conventional treatment. In such studies, one begins by assuming a different kind of null hypothesis than what is usually seen in statistical testing: that the novel treatment is inferior to the conventional treatment (or, equivalently, that the conventional treatment is superior to the novel treatment) by a specified amount judged medically important enough to influence the choice of treatments. Rejection of the null hypothesis indicates that, within the amount of tolerable (not medically important) difference, the novel treatment is not significantly worse than the conventional one. For example, consider the case of a conventional treatment that achieves a cure 80% of the time and a new treatment that achieves a cure 75% of the time. The difference between these two cure rates is 5 percentage points, and if an investigator is willing to tolerate a difference of ≤5 percentage points as clinically acceptable, then the null hypothesis that the novel treatment is inferior to the conventional treatment by ≥5 percentage points can be rejected, leading to the conclusion that the novel treatment is not inferior based on the study results.
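The logic of equivalence and noninferiority testing is commonly implemented as one or two one-sided tests against the prespecified margin. The sketch below is an illustration only, not an analysis from the article: it assumes a normally distributed estimate of the treatment difference with a known standard error, and the blood pressure numbers are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def equivalence_p(diff, se, margin):
    """Two one-sided tests (TOST). H0: |true difference| >= margin.
    Equivalence is declared only if BOTH one-sided tests reject,
    so the reported P value is the larger of the two."""
    p_not_below = 1 - norm_cdf((diff + margin) / se)  # tests diff <= -margin
    p_not_above = norm_cdf((diff - margin) / se)      # tests diff >= +margin
    return max(p_not_below, p_not_above)

def noninferiority_p(diff, se, margin):
    """One-sided test. H0: the new treatment is worse by at least
    `margin` (diff = new minus standard; negative favors the standard)."""
    return 1 - norm_cdf((diff + margin) / se)

# Hypothetical example: observed difference 2 mm Hg, equivalence
# margin +/-10 mm Hg, at two different levels of precision
print(equivalence_p(2.0, 3.0, 10.0))   # small P: equivalence demonstrated
print(equivalence_p(2.0, 8.0, 10.0))   # large P: equivalence NOT shown
```

Note that noninferiority simply drops the upper one-sided test, which is why it is the less restrictive of the two goals.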
Study Precision and Power
The P value is influenced by the study’s sample size. In turn, sample size affects the statistical power of a study, which is the ability of an investigator to reject the null hypothesis when it is false (Table 2). Imagine randomizing dogs to receive either a drug or a placebo and observing response rates of 40% and 20%, respectively. Using Table 3 again, the hypothetical results of three such experiments with identical outcomes are shown. As the number of dogs (the sample size) increases, the P value declines, meaning the statistical power for the investigator to declare significant the empirically different drug and placebo effects is greater with more patients. Only with 50 dogs/group was the power adequate to reasonably conclude that the drug worked. If only 5 dogs/group or 25 dogs/group were studied (i.e., a study of inadequate statistical power), the conclusion may (erroneously) have been that the treatment was not effective.
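The effect of sample size on the P value can be checked with a short calculation. The sketch below (a Python illustration, not the authors’ actual analysis) applies the Yates continuity-corrected chi-square test to the same 40% versus 20% split at each of the three sample sizes; the exact figures in Table 3 may come from a slightly different test, but the pattern of shrinking P values is the same.

```python
import math

def yates_chi2_p(a, b, c, d):
    """Two-sided P value (1 df) for a 2x2 table using the Yates
    continuity-corrected chi-square statistic.
    Rows: treated (a responders, b nonresponders) and
    placebo (c responders, d nonresponders)."""
    n = a + b + c + d
    chi2 = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# 40% response on drug vs. 20% on placebo, at three sample sizes
for n_per_group in (5, 25, 50):
    resp_rx = int(0.4 * n_per_group)
    resp_pl = int(0.2 * n_per_group)
    p = yates_chi2_p(resp_rx, n_per_group - resp_rx,
                     resp_pl, n_per_group - resp_pl)
    print(f"{n_per_group}/group: P = {p:.3f}")
```

Only the 50 dogs/group experiment yields P < 0.05 (matching the 0.049 reported in the text to within rounding), even though the observed proportions are identical in all three experiments.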
Unfortunately, in much of the clinical literature, statistical power is insufficient to meet the objectives of the study. It is important for the reader to be aware that nonsignificant results could still mean that differences or effects, potentially important ones, may yet exist. It therefore becomes incumbent on a consumer of the medical literature not merely to accept findings of nonsignificance as tantamount to an absence of differences/effects, but also to focus on (1) the study’s sample size, (2) the magnitude of the differences/effects regardless of their significance, and (3) the precision of the statistics, as quantified by standard deviations, standard errors, confidence intervals, etc. (defined below). Meta-analyses are one method of attempting to reconcile such equivocal information across different studies.
The statistical power of a study is also influenced by the magnitude of the treatment effect to be measured. The larger the treatment effect, the greater the power of the study to detect it (if all other factors are constant). However, a corollary to this is that the finding of statistically significant differences (or effects) does not necessarily imply that such differences (or effects) are large or even medically meaningful. In a large enough study, even those effects that are small can be statistically significantly different.
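The interplay between sample size, effect size, and power can be made concrete with the standard normal-approximation power formula for comparing two proportions. This is an illustrative Python sketch; the response rates below are hypothetical and the formula is an approximation, not the authors’ method.

```python
import math

def power_two_proportions(p1, p2, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided z test comparing two
    independent proportions, with n_per_group patients per arm."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group +
                   p2 * (1 - p2) / n_per_group)
    z = abs(p1 - p2) / se - z_crit
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

# Power rises with sample size (40% vs. 20% response)...
for n in (5, 25, 50, 100):
    print(n, round(power_two_proportions(0.40, 0.20, n), 2))

# ...and with the size of the treatment effect (25 patients/group)
for p1 in (0.30, 0.40, 0.60):
    print(p1, round(power_two_proportions(p1, 0.20, 25), 2))
```

The same formula also shows the corollary in the text: a trivially small effect becomes “significant” if n is made large enough.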
Another factor that affects the power of a study is the amount of variation between patients or between measurements within a treatment group, summarized by variance, standard error, and standard deviation. As these measures of variability increase, the P value also increases. Because of the mathematical inter-relationships between P values, variances, magnitude of treatment effects, and sample sizes, less variability (“tighter data”) translates into increased statistical power. With too much variability in measurements, an underpowered study can (erroneously) make an effective treatment appear worthless, leading to a false negative result (a type II error).
The variability in the data can also be combined with summary point estimates (means, proportions, odds ratios, etc.) and represented as a confidence interval, a range of values within which one can reasonably state with a particular confidence that a true difference or effect lies. The most commonly used confidence interval is the 95% interval. It is becoming more common in clinical study reporting to state both the P value and the confidence interval, because the latter provides more intuitively understandable information about a plausible range of the true magnitude of differences or effects. The narrower the confidence interval, the greater the precision of the estimate.
Statistics Used to Describe Data
Descriptive statistics are used to report the characteristics of a single group of patients. Once data are collected, mathematical formulas are used to produce summaries of the data, such as the mean, median, range, rates, ratios, and proportions. Descriptive statistics are commonly used when summarizing a case series, describing the baseline characteristics of study groups, or reporting results in individual treatment groups in a parallel study, but not for comparing groups.
Sometimes, the outcome of a study is expressed as a risk or a rate, which are distinctly different measures of incidence; however, under certain conditions (e.g., disease rarity) they can approximate each other. Risk (or, more correctly, average risk) is the proportion of new events that occur in a population in a defined period of time. Risks are bounded by 0 (0%) and 1.0 (100%) and are dimensionless (i.e., have no units). However, risk must always be presented in a time context. For example, the risk of death of puppies is likely to be very low in the first year of life, but close to 1.0 (100%) after the first two decades of life. A rate, in contrast, is analogous to a speed or velocity, and is used to quantify the occurrence of new events/unit of time. Hence, rates are always expressed as events/unit of time. For example, if one wanted to provide a measure of how often panleukopenia occurred in a cattery, the total number of cases that occurred in the population divided by the total number of days at risk in the population would be the incidence rate of panleukopenia (e.g., 2.5 new cases of panleukopenia for every 100 cat-days at risk). Risks and rates are used for different purposes in describing incidence. For example, in a clinical trial comparing the efficacy of two antibiotics to lead to a cure for cystitis, both drugs may lead to a 100% cure after a therapeutic period, but the preferred drug would be the one that led to a cure sooner than the other (the one with the higher rate of cure).
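The rate arithmetic in the cattery example can be sketched as a small calculation. This is a Python illustration only; the individual follow-up times below are invented so that the total works out to the figure quoted in the text.

```python
# Days each cat in a hypothetical cattery remained at risk of
# panleukopenia (follow-up ends at infection, death, or study end)
days_at_risk = [30, 25, 10, 15]   # hypothetical follow-up times
new_cases = 2                     # infections observed during follow-up

total_cat_days = sum(days_at_risk)   # 80 cat-days at risk
rate = new_cases / total_cat_days    # incidence rate, cases per cat-day
print(f"{rate * 100:.1f} cases per 100 cat-days at risk")
```

Unlike a risk, this quantity carries units (cases per cat-day), which is why rates, not risks, answer “how fast” questions such as which antibiotic produces a cure sooner.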
Results are also sometimes presented as a ratio of the average risk in a treatment group to the average risk in a control group, an example of which is the relative risk (also referred to as the risk ratio, cumulative incidence ratio, or incidence proportion ratio), and the closely related odds ratio (the ratio of the odds of disease in the treatment and control groups, where the odds are defined as the risk divided by one minus the risk). A relative risk of 3.0 indicates a threefold greater risk in one study group versus a comparison group. Similarly, an odds ratio of 3.0 indicates a threefold greater odds in one group versus the other. Conversely, a relative risk of 0.33 indicates an average risk in one group that is one-third the average risk of a comparison group. These ratios are common in clinical reports. When such ratios are used, they should always be accompanied by a confidence interval. For example, if the relative risk is 3.0 (95% confidence interval, 1.8–4.9), this indicates 95% confidence that the true ratio lies between 1.8 and 4.9. If the confidence interval includes the null value of 1.0, we cannot be 95% or more confident that the groups differ. Relative risks are typically reported in randomized clinical trials and cohort studies, while odds ratios are statistics primarily estimated in case-control studies. The odds ratio is itself a difficult statistic to understand, but when the incidence of the outcome is rare (i.e., ≤5%), it can be interpreted as being roughly equivalent to the relative risk, which is much more intuitive.
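The relationship between the relative risk and the odds ratio, including the rare-outcome approximation, can be checked directly. The counts below are illustrative Python examples, not study data.

```python
def relative_risk(a, n1, c, n2):
    """Risk ratio: (a/n1) / (c/n2), where a of n1 exposed and
    c of n2 unexposed subjects developed the outcome."""
    return (a / n1) / (c / n2)

def odds_ratio(a, n1, c, n2):
    """Odds ratio: [p1/(1-p1)] / [p2/(1-p2)], where each odds is
    the risk divided by one minus the risk."""
    p1, p2 = a / n1, c / n2
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Common outcome (40% vs. 20%): the OR overstates the RR
print(relative_risk(20, 50, 10, 50), odds_ratio(20, 50, 10, 50))

# Rare outcome (2% vs. 1%): the OR closely approximates the RR
print(relative_risk(20, 1000, 10, 1000), odds_ratio(20, 1000, 10, 1000))
```

In the common-outcome case the relative risk is 2.0 but the odds ratio is about 2.67, which is why an odds ratio from a study of a frequent outcome should not be read as a relative risk.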
A common type of statistical analysis used in published reports is survival analysis or time-to-event analysis (events can be various endpoints, such as remission, recovery, hospital discharge, death, etc.). This statistical method estimates the cumulative frequency (or rate) of an event as a function of time. Probably the most common estimation method used in survival analysis is the Kaplan-Meier (i.e., the product-limit) method, as shown in Figure 2. In Kaplan-Meier plots, the Y axis is the estimated probability of the study population remaining free of the event, all plot lines have the same y intercept (100% at time 0), and the X axis is time. If a patient either discontinues the study or is lost to data collection before an event occurs, the patient’s data up to that time are used in the analysis and the patient is censored thereafter. Survival analyses also lead to statistics comparing the rate of an outcome, such as recovery, death, or time to remission, between two or more groups. Those relative rates are better known as incidence rate ratios or hazard rate ratios, which measure how much proportionately greater (or lesser) the incidence rate of an outcome is in one group compared with another.
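A bare-bones version of the Kaplan-Meier product-limit estimate, including the handling of censored patients, can be sketched as follows. This is an illustrative Python implementation with invented follow-up data, not the method or data of the study in Figure 2.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.
    times:  follow-up time for each patient
    events: 1 if the event (e.g., death) occurred at that time,
            0 if the patient was censored (lost or withdrawn)
    Returns a list of (time, survival probability) step points."""
    survival = 1.0
    curve = [(0, 1.0)]
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for tt in times if tt >= t)  # still being followed
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        survival *= 1 - deaths / at_risk             # product-limit step
        curve.append((t, survival))
    return curve

# Six hypothetical patients; the patient censored at time 2
# contributes person-time up to that point and then drops out
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Each vertical drop in the printed curve corresponds to an event, while censored patients simply shrink the at-risk denominator from that time forward, which is why the tail of a Kaplan-Meier curve is less reliable than its beginning.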



Citation: Journal of the American Animal Hospital Association 48, 5; 10.5326/JAAHA-MS-5803
Considerations when Reading Published Reports
Clinicians can be astute readers of the literature by better understanding study design, bias, and statistics as discussed above. But what about actually reading the literature? When critically reviewing an article reporting a clinical investigation, at a minimum the elements shown in Table 4 should be found in the introduction, materials and methods, or results sections of the report. Some of those items may not be relevant for every study, but the reader should satisfy themselves that this is the case.
The reader should also ask what the prior hypothesis is. Every study has a medical question that is being asked. This is the reason for conducting the study, and the prior hypothesis is derived from this question. It is advisable that there be only one primary prior hypothesis; however, there may also be secondary hypotheses. The stated hypothesis should be specific. In the pancreatitis example used earlier, the investigator should decide whether cPL levels or survival is the primary hypothesis, and the study should be designed accordingly. This is important because the duration of the study will need to be longer if the prior hypothesis is that the new medical therapy improves survival, whereas cPL levels could be measured over a shorter period. Clearly, the prior hypothesis drives many aspects of the investigation.
What is the overall study design of the clinical investigation? Is it appropriate for the primary hypothesis? Referring to Figure 1, the investigators should have used the investigative design that will yield the strongest possible evidence, but one that is still practical to conduct. Small, retrospective case series studies are at particular risk of having low accuracy and the reader should be wary of this design. In many such cases, a retrospective case-control or cohort study could have been performed and a control group would then be available. In the event of a brand new surgical therapy, the initial study should be a prospective, rather than a retrospective, case series. After all, the investigator knows a new therapy will be tested and can accordingly plan in advance instead of just waiting for n number of cases to accumulate. The prospective study utilizes the same number of patients, but provides considerably better evidence.
How were patients chosen for inclusion in the investigation? The enrollment criteria should be clearly defined in the paper. The reader should ask if the study population that resulted from the chosen enrollment criteria will best answer the clinical question, and if the results will be generalizable to patients in the real world (i.e., external validity). It may have been unsafe to enroll a particular subset of patients into a trial because they were too clinically fragile, but the reasoning should be stated. If the study was nonblinded, was there selection bias during the enrollment process? Would blinding have helped, and if the study was not blinded, why not? Was there a scientific (clinical) reason, or was it too expensive, too time consuming, etc.? The reader should expect due diligence in the conduct of clinical investigations.
What was the method of assigning patients to study groups in an experimental study? The benefits of randomization have been discussed above, but it is also important to note when randomization was conducted relative to study entry. Randomization should always be undertaken as close to study enrollment as possible.
What was the method and extent of blinding? Were only the owners unaware of the treatment? How was this assured? If the study was double-blind, how was this achieved? Often, there is a third party (e.g., a pharmacist) who holds the randomization schedule and actually assigns the treatment. In a double-blind randomized clinical trial, both groups must receive “treatment” that is identical in appearance, taste, smell, etc. A new drug that tastes bitter may be rejected by animals whereas a placebo is not, and this could lead to important bias when the drug group is underdosed. Or, if the investigator knows the treatment assignment, they might more readily terminate participation of patients in the control group for fear of doing harm. This type of inequality of patient management causes follow-up bias.
Is the duration of the study appropriate for the prior hypothesis to be adequately tested? Survival takes longer to assess than a laboratory measure. Or, in a case-control study, does the investigation look back far enough in time to include past risk factors? How were study drop-out patients managed? Study drop-outs reduce patient numbers and decrease statistical power when too many patients have been lost. Were extra patients enrolled to account for this loss? A type of analysis called intention to treat statistically evaluates all patients who entered the study, not just the subjects who completed it. An intention-to-treat approach partially corrects for the effect of drop-outs. In general, the more discontinuations there are, the weaker the study is.
Every published report of a clinical investigation should include a statement of the clinical outcome measures used. This is partly, but not entirely, driven by the prior hypothesis. In the pancreatitis example, the prior hypothesis may be that the new therapy improves the outcome of the illness, but the primary clinical outcome measure needs to be either cPL or survival, not both. Multiple outcome measures encourage multiple statistical comparisons, which weaken a study and can lead to false conclusions.
This is because the more statistical tests that are performed, the greater the likelihood that at least one test result will be spurious, and the reader should be on the lookout for such multiple testing. For example, with 10 independent statistical tests, when the null hypothesis is correct, and with the probability of each test being falsely positive set at 5% (P value of 0.05), the probability of having at least one false positive comparison out of 10 (i.e., falsely rejecting the null hypothesis) is approximately 40%. (The probability of at least one false positive out of 10 tests is 1 − (0.95)^10, or approximately 0.4.) Because of this, statistically sound clinical investigations will have a limited number of prestated statistical comparisons (usually two or three). If there are multiple comparisons, the study authors should adjust for them. This adjustment may be made by setting the cut-off for establishing statistical significance at a lower level (e.g., 0.01). The reader should be cautious when too many unadjusted statistical tests are present in one study.
The same problem of observing false positives arises when indiscriminately using complete blood count plus chemistry panels to screen for abnormalities in a survey type study. Both panels together contain as many as 30 different measurements; therefore, the probability that at least one measurement will fall outside the normal range by chance alone (defined as a range that captures 95% of normal individuals) is approximately 79% (1 − (0.95)^30 ≈ 0.79).
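The familywise error arithmetic above, and the effect of a Bonferroni-style adjustment, can be verified directly. This is an illustrative Python calculation of the probabilities quoted in the text, not analysis code from the article.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive among k independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** k

print(round(familywise_error(10), 2))   # 10 tests: ~0.40, as in the text
print(round(familywise_error(30), 2))   # 30-analyte panel: ~0.79

# Bonferroni adjustment: divide alpha by the number of tests,
# which pulls the familywise error back below the nominal 5%
print(round(familywise_error(10, alpha=0.05 / 10), 3))
```

The Bonferroni correction is exactly the “lower cut-off” adjustment mentioned above; other, less conservative adjustments exist, but the principle is the same.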
The problem of multiple statistical comparisons often occurs when a study is analyzed. For example, the primary clinical finding could have a P value >0.05, but the investigator wishes to preserve the nominal level of significance (e.g., 5%) to make the trial a “success”. In that situation, an investigator may undertake a post hoc analysis of the data, looking for other differences of secondary importance between treatment groups (i.e., data dredging). One or more post hoc comparisons may then yield a P value lower than the level of significance. It is tempting to report that because the trial was properly randomized, blinded, and controlled, any differences noted with a P value <0.05 are significant; however, this ignores the reality that as more statistical tests are run, the greater the likelihood of spuriously rejecting the null hypothesis, thus effectively creating a level of significance well above 0.05 or the nominally selected value. To preserve nominal levels of significance, and to avoid multiple false positive errors, statistical adjustments for multiple comparisons must be performed.
Multiple statistical comparisons are also a problem in case-control studies that use medical records to look for associations between hypothesized risk factors and a particular disease. Numerous comparisons are made and a P value potentially calculated for each comparison. It is then concluded that comparisons with a P<0.05 are significant associations. Again, because of the rules applying to multiple statistical tests, this use of the P value is ill-advised. Instead, reporting confidence intervals is preferred in observational studies examining multiple risk factors.
Investigators should always strive to present confidence intervals together with their statistics because they convey considerably more information than a P value alone. Using the previous example in Table 3, a P value <0.05 merely tells us that the study findings are unlikely when the null hypothesis is correct, and that the lower end of a 95% confidence interval would remain above the null value of 1.0. In contrast, a 95% confidence interval would be far more informative because it would reveal that whatever the true relative risk was in the population, the investigators could be 95% confident that the true relative risk lies between 1.04 and 3.83.
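The 1.04–3.83 interval quoted above can be reproduced with the standard large-sample (Wald) confidence interval for a risk ratio, computed on the log scale. The sketch below is a Python illustration applied to the earlier 20/50 versus 10/50 example; the article does not state which interval method its authors used, but this common one matches their figures.

```python
import math

def risk_ratio_ci(a, n1, c, n2, z=1.96):
    """Wald 95% CI for the risk ratio, computed on the log scale.
    a of n1 treated patients and c of n2 controls had the outcome."""
    rr = (a / n1) / (c / n2)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

rr, lo, hi = risk_ratio_ci(20, 50, 10, 50)
print(f"RR = {rr:.1f}, 95% CI {lo:.2f}-{hi:.2f}")  # RR = 2.0, 95% CI 1.04-3.83
```

Because the entire interval sits above 1.0, it carries the same yes/no message as P < 0.05 while also showing how imprecise the estimate of the treatment effect really is.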
The reader should be sure the published report contains a statement of the statistical power of the study. It should be clear that the power was selected prior to determining the target sample size. It is common in the veterinary literature to see investigations that are underpowered to detect a difference or a treatment effect because of inadequate numbers of patients. This tends to result in a type II error, making an effective therapy appear useless.
The question must always be asked, does the paper report on the final results of the investigation, or on preliminary results? Sometimes, results of an investigation are divulged too early, and presentation of interim analysis from an ongoing study is fraught with danger. Poster sessions at conferences are notorious for the presentation of preliminary results that also may not have undergone prior peer review. A study may legitimately be stopped if an interim analysis demonstrates a safety problem, such as excess deaths. If an interim analysis is conducted and reveals no difference in efficacy or safety, the study should continue to completion and no interim report need be made.
It is incumbent upon the reader to determine if the published report gives a correct and balanced clinical interpretation. One type of error with regard to interpretation is to confuse statistical significance with clinical significance. A large superiority study with high power may be able to detect a 10% difference between two treatments with a convincingly low P value. Although the authors would rightly conclude and report a “significant” treatment effect, the decision to modify medical practice should be based upon more than this finding. A 10% treatment difference may be therapeutically inconsequential, or the incidence of adverse events could be greater under the nominally superior treatment, making it contraindicated. In this setting, the P value can be misleading if strictly used for decision-making. Further, medical management choices should be based on all consequences and costs of replacing a standard treatment with one of marginal medical superiority.
Another common error of interpretation is the concept of association versus causation (i.e., the rooster who believes his crow causes the sun to rise). Finding that A and B are statistically associated in any study is not proof that A causes B. It is only an indication of association, or relatedness, and may be explained by other factors related both to A and B. For example, although use of cat litter is associated with longer lifespan in cats, the explanation for this association is almost certainly because indoor cats, which preferentially use cat litter, have longer life expectancy than cats that live outdoors. Guidelines exist for weighing evidence from experimental and nonexperimental studies, including the classic references by Sir Austin Bradford Hill (1965) and Alfred S. Evans (1976) related to causation.7,8
Conclusion
The science of clinical research has evolved into a specialized field over the last several decades. Although complex, the conduct of scientifically sound investigations in clinical settings in veterinary medicine is very achievable if attention is paid to study design; management of bias by randomization and blinding; and appropriate statistical methods. It is recommended that, wherever practical, veterinary assessments of treatment efficacy be based on prospective, rather than retrospective, studies and that small observational studies with limited numbers of patients be viewed with caution. The reader needs to be an astute critic of published studies to determine which therapies are effective and warrant adoption and which do not.

Major categories of clinical investigations and the relative strength of evidence of each. The base of the triangle is broader because retrospective case series tend to be more common than randomized controlled trials (for example).

Example of a Kaplan-Meier plot from a retrospective study of cutaneous mast cell tumors in dogs.9 Three groups of dogs were compared based on their mitotic index categories. Each point in the curve where the line drops vertically is an event (i.e., death). The curve for the group with the highest mitotic index drops quickly, reflecting poor patient survival (median survival, 2 mo). Survival analysis curves are more accurate at earlier time points when there are more patients represented than at later time points after patients are censored from the analysis.