A complete walkthrough of every stage — from defining your PICO question and registering your protocol to searching 9 databases, dual-reviewer screening, meta-analysis, GRADE, and PRISMA 2020 reporting.
Every systematic review begins with a clearly defined, structured research question. The most widely used framework is PICO — Population, Intervention, Comparison, and Outcome. Getting this right before you search is non-negotiable: your PICO determines your search strategy, your inclusion criteria, and ultimately the scope of your entire review.
| PICO element | What it defines | Example (hypertension review) |
|---|---|---|
| Population (P) | Who the evidence applies to — age, condition, setting | Adults aged 40+ with essential hypertension |
| Intervention (I) | The exposure, treatment, or test under investigation | ACE inhibitor monotherapy |
| Comparison (C) | The control condition or alternative treatment | Calcium channel blocker or placebo |
| Outcome (O) | The measured endpoint — primary and secondary | Systolic blood pressure reduction, cardiovascular events |
For diagnostic accuracy reviews, use PICO-D (adding the Diagnosis element). For qualitative or scoping reviews, the PCC framework (Population, Concept, Context) may be more appropriate.
A PICO like "any intervention for any cardiovascular outcome in any adult" will return tens of thousands of records. Be specific enough that your review is answerable. You can always run separate reviews for different subpopulations or interventions.
Before searching a single database, write a full protocol — including your PICO, eligibility criteria, search strategy, data extraction plan, risk of bias approach, and statistical analysis plan. Then register it publicly.
Why register? Pre-registration prevents outcome reporting bias, protects you from accusations of post-hoc analysis changes, and is increasingly required by high-impact journals. PRISMA 2020 explicitly requires you to report your registration number.
A comprehensive search strategy is what separates a systematic review from a narrative review. You need to capture every eligible study — including unpublished studies, grey literature, and studies in languages other than English — to minimise publication bias.
PRISMA 2020 recommends searching all relevant databases for your field. For clinical and health research, the minimum is:
Research consistently shows that no single database contains all relevant literature. Searching PubMed alone is sufficient only for very narrow, PubMed-indexed topics — and even then, you risk missing trials published in non-indexed journals or available only as grey literature.
Your search string translates the PICO into database-specific syntax using Medical Subject Headings (MeSH) for PubMed, Emtree for Embase, and free-text synonyms for all databases. A well-constructed search combines:
Truncation: cardiovasc* captures cardiovascular, cardiovasculaire, etc.
After running your searches across all databases, you will have a pool of records that contains significant overlap: the same study indexed in five databases counts as five records. Deduplication is the process of removing these duplicate records before screening begins.
In a typical multi-database search of 6–9 sources, 30–60% of records are duplicates. Accurate deduplication is essential — missed duplicates inflate your screening workload; over-aggressive deduplication removes legitimate unique records.
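The trade-off between missed and over-aggressive matching can be illustrated with a minimal sketch. It assumes each record is a plain dictionary with hypothetical `title`, `year`, `doi`, and `source` fields, and it uses a deliberately conservative rule: match on DOI when one exists, otherwise on normalised title plus year. The DOIs and titles are made up for illustration. Real reference managers use more elaborate matching.

```python
import re
import unicodedata

def normalise_title(title: str) -> str:
    """Lower-case, strip accents and punctuation so trivial variants match."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def record_key(record: dict) -> tuple:
    """Prefer the DOI as identity; fall back to normalised title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalise_title(record.get("title", "")), record.get("year"))

def deduplicate(records: list[dict]) -> tuple[list[dict], int]:
    """Keep the first record seen for each key; return (unique, n_removed)."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique, len(records) - len(unique)

# Hypothetical export from four databases (field names are assumptions).
records = [
    {"title": "ACE inhibitors in hypertension", "year": 2019, "doi": "10.1000/abc", "source": "PubMed"},
    {"title": "ACE Inhibitors in Hypertension.", "year": 2019, "doi": "10.1000/ABC", "source": "Embase"},
    {"title": "Calcium channel blockers: a trial", "year": 2020, "doi": None, "source": "CENTRAL"},
    {"title": "Calcium channel blockers - a trial", "year": 2020, "doi": None, "source": "Scopus"},
]
unique, removed = deduplicate(records)
print(len(unique), removed)
```

One limitation of this rule: a record that has a DOI in one database and none in another will not be matched, so the number removed is a lower bound and still needs a manual check.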
The major systematic review guidelines (Cochrane Handbook, PRISMA 2020, JBI Manual) expect independent dual-reviewer screening at both the title/abstract and full-text stages. Two reviewers screen each record independently, without seeing each other's decisions, to reduce selection bias and catch individual errors. Disagreements are resolved by discussion or by a third reviewer.
| Screening approach | Bias risk | Inter-rater agreement | Accepted by journals |
|---|---|---|---|
| Single reviewer | High | N/A | No |
| Dual reviewer, not blinded | Moderate | Moderate | Sometimes |
| Dual reviewer, blind (independent) | Low | Cohen's κ reported | Yes ✓ |
Inter-rater agreement — typically measured using Cohen's kappa (κ) — should be calculated and reported. A κ of 0.61–0.80 is considered substantial agreement; above 0.80 is almost perfect. Low κ indicates your inclusion criteria need clarification.
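As a rough illustration of how κ is computed, the sketch below compares observed agreement with the agreement expected by chance from each reviewer's own include/exclude rates. The function name and the ten example decisions are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Proportion of records on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical title/abstract decisions for ten records.
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 6 + ["include"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))  # 0.58
```

Here the reviewers agree on 8 of 10 records, but chance alone would produce 52% agreement, so κ is about 0.58, below the "substantial" band.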
Studies that pass title/abstract screening move to full-text assessment. You retrieve the full paper for each record and apply your eligibility criteria rigorously. Every exclusion at this stage must be documented with a specific reason, as PRISMA 2020 requires you to report the number of excluded full-texts and reasons for exclusion.
For every included study, two reviewers independently extract the data you will use in your synthesis. Data extraction forms should be piloted on 2–3 studies before the main extraction begins, to ensure all reviewers interpret fields consistently.
Many papers report standard error (SE) rather than standard deviation (SD), or present medians and interquartile ranges rather than means. For meta-analysis, you need SD. Convert SE to SD using SD = SE × √n. For IQR, you can estimate SD using Wan et al. (2014) or Luo et al. (2018) methods.
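The SE conversion above is a one-liner, sketched here with hypothetical function names. For the IQR case, the sketch uses only the simple large-sample normal approximation (IQR ≈ 1.35 × SD). This is an assumption swapped in for brevity and is not the sample-size-adjusted estimator of Wan et al. or Luo et al., which should be preferred for small or skewed samples.

```python
import math

def sd_from_se(se: float, n: int) -> float:
    """Convert a standard error of the mean to a standard deviation: SD = SE * sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return se * math.sqrt(n)

def sd_from_iqr_large_sample(q1: float, q3: float) -> float:
    """Rough SD from an interquartile range, assuming approximately normal data
    and a large sample (IQR is about 1.35 standard deviations)."""
    if q3 < q1:
        raise ValueError("q3 must not be smaller than q1")
    return (q3 - q1) / 1.35

sd_a = sd_from_se(se=1.5, n=64)                # 1.5 * 8
sd_b = sd_from_iqr_large_sample(10.0, 23.5)    # 13.5 / 1.35
print(sd_a, round(sd_b, 2))
```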
Risk of bias (RoB) assessment evaluates the methodological quality of each included study — specifically, whether design or conduct limitations could have introduced systematic error into the results. The instrument you use depends entirely on the study design.
| Tool | Study design | Domains | Rating scale |
|---|---|---|---|
| RoB 2 | Randomised controlled trials | 5 domains | Low / Some concerns / High |
| ROBINS-I | Non-randomised intervention studies | 7 domains | Low / Moderate / Serious / Critical |
| Newcastle-Ottawa Scale | Cohort and case-control studies | 8 items across 3 domains | Stars (0–9) |
| QUADAS-2 | Diagnostic test accuracy studies | 4 domains | Low / High / Unclear |
| AXIS | Cross-sectional / prevalence studies | 20 items | Yes / No / Unsure |
Risk of bias judgements should be made at the outcome level, not just the study level — a study can have low RoB for its primary outcome but high RoB for secondary outcomes if measurement methods differed.
If sufficient studies report compatible outcomes, you can combine their effect estimates statistically in a meta-analysis. This produces a pooled estimate with a confidence interval that incorporates between-study variance — giving you more statistical power than any individual study.
Fixed-effect model: assumes all studies estimate the same true effect. Appropriate only when studies are near-identical replicates. Produces narrower confidence intervals, misleadingly so when heterogeneity exists. Rarely appropriate in clinical research.
Random-effects model: assumes the true effect varies across studies. The default for most clinical meta-analyses. Produces wider, more honest confidence intervals. Requires estimation of the between-study variance τ². DerSimonian-Laird and Paule-Mandel estimators are supported.
DerSimonian-Laird: the most widely used τ² estimator. Computationally simple and well understood. Can underestimate the between-study variance when there are fewer than 5 studies (k < 5). The historical standard.
Paule-Mandel: an iterative method-of-moments estimator. Preferred when k < 10 studies because it is less biased. Produces more conservative (wider) confidence intervals. Increasingly recommended in recent methodological literature.
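To make the random-effects calculation concrete, here is a minimal sketch of inverse-variance pooling with the DerSimonian-Laird τ² estimate. It assumes each study contributes an effect estimate and its variance on the pooling scale. The function name, the returned dictionary keys, and the three example studies are hypothetical, and the 95% interval uses a plain normal approximation.

```python
import math

def dersimonian_laird(effects: list[float], variances: list[float]) -> dict:
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("Need at least two studies with matching variances")
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Method-of-moments estimate, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to every study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "se": se,
        "tau2": tau2,
        "q": q,
        "k": k,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
    }

# Three hypothetical studies with equal precision.
result = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(round(result["tau2"], 3), round(result["pooled"], 3))
```

With so few studies the τ² estimate is itself very uncertain, which is the situation where the text above suggests Paule-Mandel instead.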
Heterogeneity is the variation in effect estimates across studies that exceeds chance. It is one of the most misunderstood concepts in meta-analysis.
I² is the percentage of variance attributable to between-study differences rather than sampling error. An I² of 80% does not tell you the effect estimates are dramatically inconsistent — it tells you 80% of variance is between-study. With large samples, even trivial between-study variance produces high I². Report τ² (the absolute variance) and the prediction interval alongside I².
The 95% prediction interval is the most clinically useful heterogeneity statistic: it describes the range within which the true effect in a new similar study would be expected to fall 95% of the time. A wide prediction interval crossing the null line means the intervention may be beneficial in some settings and harmful in others — a fundamentally different clinical message than a narrow confidence interval.
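Both statistics follow from quantities a random-effects analysis already produces. The sketch below assumes you have Cochran's Q, the number of studies k, the pooled estimate, its standard error, and τ². The function names and example numbers are hypothetical. The critical value `t_crit` is passed in by hand because the Python standard library has no t-distribution quantile function. It should be the 97.5th percentile of a t distribution with k − 2 degrees of freedom.

```python
import math

def i_squared(q: float, k: int) -> float:
    """I^2 as a percentage: share of total variation beyond what chance explains."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def prediction_interval(pooled: float, se: float, tau2: float, t_crit: float) -> tuple:
    """Approximate 95% prediction interval for the true effect in a new study.

    Combines uncertainty in the pooled mean (se^2) with between-study variance (tau2).
    """
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return pooled - half_width, pooled + half_width

# Hypothetical meta-analysis of k = 6 studies, so k - 2 = 4 degrees of freedom.
i2 = i_squared(q=20.0, k=6)
low, high = prediction_interval(pooled=-0.40, se=0.10, tau2=0.03, t_crit=2.776)
print(i2, round(low, 3), round(high, 3))
```

In this example the 95% confidence interval (−0.40 ± 1.96 × 0.10) excludes zero, yet the prediction interval runs from roughly −0.96 to 0.16 and crosses it, which is the contrast the paragraph above describes.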
| Outcome type | Effect measure | Pooled on | Back-transformed |
|---|---|---|---|
| Continuous (same scale) | Mean Difference (MD) | Raw scale | N/A |
| Continuous (different scales) | Standardised Mean Difference (SMD) | SD units | N/A |
| Dichotomous | Odds Ratio (OR) or Risk Ratio (RR) | Log scale | Exponentiated for display |
| Dichotomous | Risk Difference (RD) | Raw proportion | N/A |
| Time-to-event | Hazard Ratio (HR) | Log scale | Exponentiated for display |
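The "log scale, then exponentiate" rows can be sketched for a single study's odds ratio as follows. The function names and the example counts are hypothetical, and adding 0.5 to every cell when a zero cell occurs is one common convention assumed here rather than a universal rule.

```python
import math

def log_odds_ratio(events_t: int, total_t: int, events_c: int, total_c: int) -> tuple:
    """Log odds ratio and its standard error from a 2x2 table."""
    if not (0 <= events_t <= total_t and 0 <= events_c <= total_c):
        raise ValueError("Events must lie between 0 and the group total")
    a, b = events_t, total_t - events_t   # treatment: events, non-events
    c, d = events_c, total_c - events_c   # control: events, non-events
    if 0 in (a, b, c, d):
        # Assumed convention: 0.5 continuity correction for zero cells.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

def back_transform(log_estimate: float, se: float, z: float = 1.96) -> tuple:
    """Exponentiate a log-scale estimate and its confidence limits for display."""
    return (
        math.exp(log_estimate),
        math.exp(log_estimate - z * se),
        math.exp(log_estimate + z * se),
    )

# Hypothetical trial: 10/100 events on treatment vs 20/100 on control.
log_or, se = log_odds_ratio(10, 100, 20, 100)
odds_ratio, ci_low, ci_high = back_transform(log_or, se)
print(round(odds_ratio, 2))  # 0.44
```

Pooling happens on `log_or` and `se`. Only the final pooled value and its limits are exponentiated, which is why confidence intervals for ratios look asymmetric on the display scale.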
GRADE (Grading of Recommendations Assessment, Development and Evaluation) is the international standard for rating how confident you are in your pooled estimate. GRADE produces a certainty rating — High, Moderate, Low, or Very Low — for each outcome, based on systematic evaluation of five downgrade domains and three upgrade criteria.
| Domain | Direction | When to apply |
|---|---|---|
| Risk of bias | ↓ Downgrade | Included studies have serious or critical methodological limitations |
| Inconsistency | ↓ Downgrade | Unexplained heterogeneity — wide prediction interval, high I² |
| Indirectness | ↓ Downgrade | Evidence does not directly answer the PICO question |
| Imprecision | ↓ Downgrade | Wide CI crossing the minimal important difference threshold |
| Publication bias | ↓ Downgrade | Funnel plot asymmetry, Egger regression significant (p < 0.1) |
| Large effect size | ↑ Upgrade | RR > 2 or < 0.5, and RoB is low |
| Dose-response gradient | ↑ Upgrade | Clear relationship between dose and effect magnitude |
| Plausible confounding | ↑ Upgrade | Confounding would reduce the observed effect, so true effect is larger |
RCT evidence starts at High certainty. Observational evidence starts at Low certainty. From there, you downgrade or upgrade based on the above domains, arriving at your final certainty rating for each outcome.
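The bookkeeping behind that rule can be sketched in a few lines. GRADE is a structured judgement rather than arithmetic, so this hypothetical helper only tracks the starting level and the net number of levels moved. Deciding whether a domain warrants a downgrade remains the reviewers' call.

```python
LEVELS = ["Very Low", "Low", "Moderate", "High"]

def grade_certainty(randomised: bool, downgrades: int = 0, upgrades: int = 0) -> str:
    """Return the certainty label after applying level changes.

    randomised: True if the body of evidence comes from RCTs (starts at High),
                False for observational evidence (starts at Low).
    downgrades / upgrades: total number of levels moved down / up.
    """
    if downgrades < 0 or upgrades < 0:
        raise ValueError("Level changes must be non-negative")
    start = LEVELS.index("High") if randomised else LEVELS.index("Low")
    # The rating cannot go above High or below Very Low.
    index = min(len(LEVELS) - 1, max(0, start - downgrades + upgrades))
    return LEVELS[index]

# RCT evidence downgraded for risk of bias and imprecision.
print(grade_certainty(randomised=True, downgrades=2))   # Low
# Observational evidence upgraded once for a large effect.
print(grade_certainty(randomised=False, upgrades=1))    # Moderate
```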
PRISMA 2020 is the reporting standard that most journals require for systematic reviews. The 2020 update by Page et al. (BMJ, 2021) revised the flow diagram and kept the checklist at 27 items, while splitting several items into sub-items and adding more detailed guidance.
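The most visible PRISMA deliverable is the flow diagram, a visual record of how many records were identified, screened, assessed for eligibility, and ultimately included. PRISMA 2020 revised this to include two separate identification streams: one for records identified via databases and registers, and one for records identified via other methods such as websites, organisations, and citation searching.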
Verflux pulls the exact record counts from your search and screening modules and builds a PRISMA 2020-compliant flow diagram automatically. Counts are editable, and the final diagram exports as a 300 DPI PNG ready for journal submission — without touching PowerPoint or drawing software.