This appendix explains some important technical aspects of appraising scientific research, which is inevitably the trickiest aspect of evidence-based practice for non-researchers. As we note in this guide, most people professionals won’t need to become researchers themselves, but a sensible aim is to become ‘savvy consumers’ of research.
To support this, below we explain four aspects of appraising scientific research:
- The three conditions that show causal relationships.
- Common study designs.
- Assessing methodological appropriateness.
- Interpreting research findings (in particular effect sizes).
We hope this helps you develop enough understanding to ask probing questions and apply research insights.
Three conditions to show causal relationships
In HR, people management and related fields, we are often concerned with questions about ‘what works’ or what’s effective in practice. To answer these questions, we need to get as close as possible to establishing cause-and-effect relationships.
Many will have heard the phrase ‘correlation is not causality’ or ‘correlation does not imply causation’. It means that a statistical association between two measures or observed events is not enough to show that one characteristic or action leads to (or affects, or increases the chances of) a particular outcome. One reason is that statistical relationships can be spurious, meaning two things appear to be directly related, but are not.
For example, there is a statistically solid correlation between the amount of ice-cream consumed and the number of people who drown on a given day. But it does not follow that eating ice-cream makes you more likely to drown. The better explanation is that you’re more likely to both eat ice-cream and go swimming (raising your chances of drowning) on sunny days.
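To see how a confounder can create such a pattern, here is a small illustrative simulation in Python (our own made-up numbers, not from any cited study): hot weather drives both ice-cream sales and drowning risk, so the two correlate even though neither causes the other.

```python
# Illustrative only: a hidden 'sunny day' factor creates a spurious correlation.
import numpy as np

rng = np.random.default_rng(42)
n_days = 365

temperature = rng.normal(18, 8, n_days)                                 # the confounder
ice_cream = 50 + 3 * temperature + rng.normal(0, 10, n_days)            # sales rise with heat
drowning_risk = 0.2 + 0.05 * temperature + rng.normal(0, 0.5, n_days)   # so does swimming (and risk)

# The raw correlation looks 'statistically solid' despite no causal link:
print(np.corrcoef(ice_cream, drowning_risk)[0, 1])

# Comparing only days of similar temperature, the association shrinks markedly:
hot_days = temperature > np.median(temperature)
print(np.corrcoef(ice_cream[hot_days], drowning_risk[hot_days])[0, 1])
```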
So what evidence is enough to show causality? Three key criteria are needed:3
- Association: A statistical relationship (such as a correlation) between reliable measures of an intervention or characteristic and an important outcome.
- Temporality or prediction: The intervention or characteristic comes before the outcome, rather than the other way round. We establish this with before-and-after measures that show change over time.
- Other factors (apart from the intervention or influence of interest) don't explain the relationship: We establish this in various ways: studying a control group alongside the treatment group to see what would have happened without the intervention (the counterfactual); randomising the allocation of people to the intervention and control groups to avoid selection bias; and controlling for other relevant factors in the statistical analysis (for example, age, gender or occupation). The sketch below shows how these criteria fit together in a simple randomised study.
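To make these criteria concrete, here is a minimal sketch in Python using simulated, made-up numbers (none of the values come from this guide or any study): people are randomly allocated to an intervention or control group, the outcome is measured before and after, and the control group's change estimates the counterfactual.

```python
# A sketch of the three criteria in a simple randomised study (simulated data).
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Randomisation: equal chance of receiving the intervention, so pre-existing
# differences should balance out across the two groups.
treated = rng.permutation(np.repeat([True, False], n // 2))

baseline = rng.normal(60, 10, n)                      # outcome measured before
assumed_effect = 5.0                                  # effect we build into the simulation
follow_up = baseline + rng.normal(2, 5, n) + assumed_effect * treated

# Temporality: the change score uses measures taken before and after.
change = follow_up - baseline

# Association, net of what would have happened anyway (the counterfactual,
# estimated from the control group's change):
estimated_effect = change[treated].mean() - change[~treated].mean()
print(f"Estimated effect of the intervention: {estimated_effect:.1f}")
```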
Common study designs
Different study designs do better or worse jobs at explaining causal relationships.
Single studies
- Randomised controlled trials (RCTs): Conducted well, these meet all three criteria for causality and are often referred to as the ‘gold standard’ of impact studies.
- Quasi-experimental designs: These are a broad group of studies that go some way towards meeting the criteria. While weaker than RCTs, they are often much more practical or ethical to conduct, and can provide good evidence for cause and effect. One example is single-group before-and-after studies. Because these don’t include control groups, we don’t know whether any improvement observed would have happened anyway, but by virtue of being longitudinal they at least show that one thing happens following another.
- Parallel cohort studies: These compare changes in outcomes over time for two groups who are similar in many ways but treated differently in a way that is of interest. Because people are not randomly allocated to the two groups, there is a risk of ‘confounders’ – that is, factors that explain both the treatment and outcomes, and interfere with the analysis. But these studies are still useful as they show change over time for intervention and control groups.
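As a hypothetical illustration of the logic behind these designs, the short calculation below compares the change over time in a group receiving an intervention with the change in a similar comparison group (a 'difference in differences'; all numbers are invented).

```python
# Hypothetical difference-in-differences calculation (made-up absence figures).
absence_before = {"intervention": 8.0, "comparison": 8.5}   # days absent per year
absence_after = {"intervention": 6.0, "comparison": 8.0}

change_intervention = absence_after["intervention"] - absence_before["intervention"]  # -2.0
change_comparison = absence_after["comparison"] - absence_before["comparison"]        # -0.5

# The comparison group's change estimates what would have happened anyway;
# the remaining difference is (cautiously) attributed to the intervention.
estimated_impact = change_intervention - change_comparison                            # -1.5 days
print(f"Estimated impact: {estimated_impact:.1f} days of absence per year")
```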
These research designs go much further to show cause-and-effect or prediction than cross-sectional surveys, which only observe variables at one point in time. In survey analysis, statistical relationships could be spurious or the direction of causality could even be the opposite to what you might suppose. For example, a simple correlation between ‘employee engagement’ and performance could exist because engagement contributes to performance, or because being rated as high-performing makes people feel better.
Other single study designs include controlled before-after studies (also called 'non-randomized controlled trials' or 'controlled longitudinal studies'), controlled studies with post-test only, and case studies. Case studies often use qualitative methods, such as interviews, focus groups, documentary analysis, narrative analysis, and ethnography or participant observation. Qualitative research is often exploratory, in that it is used to gain an understanding of underlying reasons or opinions and generate new theories. These can then be tested as hypotheses in appropriate quantitative studies.
Systematic reviews and meta-analyses
Systematic reviews and meta-analyses are central to evidence-based practice. Their strength is that they look across the body of research, allowing us to understand the best available evidence on a topic overall. In contrast, even well-conducted single studies can give different results on the same topic, due to differences in context or the research approaches used.
Characteristics of these are as follows:
- Systematic reviews: These summarise the body of studies on a given topic. They use consistent search terms across scientific databases, ideally appraise the quality of the studies, and are explicit about the methods used. The CIPD conducts evidence reviews based on rapid evidence assessments (REAs), a shortened form of systematic review that follows the same principles.
- Meta-analysis: This is often based on a systematic review. It is a study that uses statistical analysis to combine the results of individual studies to get a more accurate estimate of an effect. It can also be used to analyse what conditions make an effect larger or smaller.
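As a simplified illustration of the pooling idea, the sketch below combines hypothetical effect sizes using inverse-variance weighting (a basic fixed-effect model). Real meta-analyses typically use more sophisticated random-effects models and dedicated software; the numbers here are invented.

```python
# Minimal fixed-effect meta-analysis: more precise studies get more weight.
import numpy as np

# Hypothetical effect sizes (e.g. standardised mean differences) and standard errors
effects = np.array([0.30, 0.55, 0.10, 0.42])
std_errors = np.array([0.15, 0.20, 0.10, 0.25])

weights = 1.0 / std_errors**2                        # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled effect: {pooled:.2f} (standard error {pooled_se:.2f})")
```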
More information on research designs can be found in CEBMa resources.
Assessing methodological appropriateness
When conducting an evidence review, we need to determine which research evidence is ‘best’ (that is, most trustworthy) for the question in hand, so we can prioritise it in our recommendations. At the same time, we assess the quality of research evidence to establish how certain we can be of our recommendations: well-established topics often have a strong body of research, but the evidence on new or emerging topics is often far less than ideal.
This involves appraising the study designs or research methods used. For questions about intervention effectiveness or cause-and-effect, we use a table such as the one below to inform a rating of evidence quality. Based on established scientific standards, we can also estimate the trustworthiness of a study. Hypothetically, if you were deciding whether to use a particular intervention based on evidence that was only 50% trustworthy, you would have the same 50/50 chance of success as tossing a coin, so the evidence would be useless. On the other hand, using evidence that was 100% trustworthy would guarantee success. Of course, in reality nothing is 100% certain, but highly trustworthy research can give us a high degree of confidence that, in a given context, an intervention has a positive or negative impact on the outcomes that were measured.
Table 1: Methodological appropriateness of effect studies and impact evaluations
| Research design | Level | Trustworthiness |
| --- | --- | --- |
| Systematic review or meta-analysis of randomized controlled studies | AA: Very high | 95% |
| Systematic review or meta-analysis of non-randomized controlled and/or before-after studies | A: High | 90% |
| Randomized controlled study | A: High | 90% |
| Systematic review or meta-analysis of controlled studies without a pre-test or uncontrolled study with a pre-test | B: Moderate | 80% |
| Non-randomized controlled before-after study | B: Moderate | 80% |
| Interrupted time series | B: Moderate | 80% |
| Systematic review or meta-analysis of cross-sectional studies | C: Limited | 70% |
| Controlled study without a pre-test or uncontrolled study with a pre-test | C: Limited | 70% |
| Cross-sectional survey | D: Low | 60% |
| Case studies, case reports, traditional literature reviews, theoretical papers | E: Very low | 55% |
Notes: Trustworthiness takes into consideration not only which study design was used but also how well it was applied. Table reproduced from CEBMa (2017), based on the classification system of Shadish, Cook and Campbell (2002)4 and Petticrew and Roberts (2006)5.
There are two important points to note about using such hierarchies of evidence. First, as we discuss in this guide, evidence-based practice involves prioritising the best available evidence. A good mantra here is ‘the perfect is the enemy of the good’: if studies with very robust (highly methodologically appropriate) designs are not available on your topic of interest, look at others. For example, if systematic reviews or randomized controlled studies are not available on your question, you will do well to look at other types of studies, such as those with quasi-experimental designs.
Second, although many questions for managers and people and HR professionals relate to effectiveness or causality, this is by no means always the case. Broadly, types of research question include the following:
Table 2: Types of research question
| Type of question | Example questions |
| --- | --- |
| Effect, impact | Does A have an effect/impact on B? What are the critical success factors for A? What are the factors that affect B? |
| Prediction | Does A precede B? Does A predict B over time? |
| Association | Is A related to B? Does A often occur with B? Do A and B co-vary? |
| Difference | Is there a difference between A and B? |
| Prevalence or frequency | How often does A occur? |
| Attitudes and opinion | What is people's attitude toward A? Are people satisfied with A? How many people prefer A over B? Do people agree with A? |
| Experience, perceptions, feelings, needs | What are people's experiences, feelings or perceptions regarding A? What do people need to do/use A? |
| Exploration and theory building | Why does A occur? How does A impact/affect B? Why is A different from B? |
Different methods are suited to different types of questions. For example, a cross-sectional survey is a highly appropriate or trustworthy design for questions about association, difference, prevalence, frequency and attitudes. And qualitative research is highly appropriate for questions about experience, perceptions, feelings, needs and exploration and theory building. For more discussion of this, see Petticrew and Roberts (2003).
Effect sizes and interpreting research findings
Even if practitioners wanting to be evidence-based can search for and find relevant research, they are left with another challenge: how to interpret it. Unfortunately, academic research in human resource management is often highly technical, written in inaccessible language and not closely linked to practice. One recent analysis of 324 peer-reviewed articles found that half of them dedicated less than 2% of the text to practical implications, and where implications were discussed, they were often obscure and implicit.
Even where published research does include a good discussion of practical implications, practitioners wishing to draw on it still need to understand the findings themselves. This can be tricky, as findings are typically reported as fairly technical statistical information.
Statistical significance
There’s an obvious need to simplify the technical findings of quantitative studies. The typical way to try to do this is to focus on statistical significance, or p-values. Reading through a research paper, this may seem intuitive, as the level of significance is flagged with asterisks: typically, * denotes a statistically significant result (conventionally p < 0.05), while ** and *** denote higher levels of significance. However, there is a lot of confusion about what the p-value is – even quantitative scientists struggle to translate it into something meaningful and easy to understand – and a growing number of scientists are arguing that it should be abandoned. What’s more, statistical significance does nothing to help a practitioner who wants to know if a technique or approach is likely to have a meaningful impact – that is, it does not answer the most important practical question of how much difference an intervention makes.
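The simulated example below (assuming the scipy library is available; all numbers are invented) illustrates the point: with a very large sample, a trivially small difference between two groups still produces a 'highly significant' p-value.

```python
# Why a p-value alone says little about practical impact (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000                                    # very large sample per group

control = rng.normal(50.0, 10.0, n)            # e.g. scores on a 0-100 scale
treated = rng.normal(50.3, 10.0, n)            # true difference of only 0.3 points

result = stats.ttest_ind(treated, control)
difference = treated.mean() - control.mean()

print(f"p-value: {result.pvalue:.1e}")         # tiny p-value: 'highly significant'
print(f"Difference in means: {difference:.2f} points on a 100-point scale")
```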
Effect sizes
The good news is that effect sizes do give this information. The information is still technical and can be hard to understand, as studies often use different statistics for effect sizes. Fortunately, however, we can translate effect sizes into everyday language. A useful tool is 'Cohen’s Rule of Thumb', which matches different statistical measures to small/medium/large categories.6
According to Cohen:
- a ‘small’ effect is one that is visible only through careful examination – so may not be practically relevant
- a ‘medium’ effect is one that is ‘visible to the naked eye of the careful observer’
- a ‘large’ effect is one that anybody can easily see because it is substantial. An example of a large effect size is the relationship between sex and height: if you walked into a large room full of people in which all the men were on one side and all the women on the other side, you would instantly see a general difference in height.
The rule of thumb has since been extended to account for very small, very large and huge results.7
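As an illustration, the sketch below computes Cohen’s d (a common effect size: the standardised difference between two group means) for simulated height data, echoing the example above, and applies the commonly cited rule-of-thumb cut-offs of roughly 0.2 (small), 0.5 (medium) and 0.8 (large). The data and function names are ours, for demonstration only.

```python
# Computing Cohen's d and applying rule-of-thumb labels (simulated heights).
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
                  (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

def label(d: float) -> str:
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

rng = np.random.default_rng(0)
men = rng.normal(175.0, 7.0, 500)      # hypothetical heights in cm
women = rng.normal(162.0, 6.5, 500)

d = cohens_d(men, women)
print(f"d = {d:.2f} ({label(d)})")     # a 'large' effect, visible at a glance
```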
Effect sizes need to be contextualised. For example, a small effect is of huge importance if the outcome is the number of fatalities or, indeed, sales revenue. By comparison, if the outcome is work motivation (which is likely to affect sales revenue but is certainly not the same thing), even a large effect will be less important. This shows the limits of scientific studies and brings us back to evidence from practitioners and stakeholders, who are well placed to say which outcomes matter most.