How long does a systematic review take?

A traditional systematic review takes 6 to 18 months. With modern tools like Verflux that automate database searching, deduplication, and meta-analysis, this can be reduced to 2 to 6 weeks.

How many databases should I search for a systematic review?

PRISMA 2020 is a reporting guideline and does not set a minimum number of databases, but methodological guidance generally expects at least 2 to 3 major databases relevant to your field. For most clinical reviews, you should search PubMed/MEDLINE, Embase, and at least one additional database such as Cochrane CENTRAL, Scopus, or Web of Science.

What is the PICO framework?

PICO stands for Population, Intervention, Comparison, and Outcome. It is the standard framework for structuring the research question in a systematic review, ensuring the search strategy and inclusion criteria are clearly defined before searching begins.

Do I need two reviewers for a systematic review?

Yes. Dual-reviewer screening at both title/abstract and full-text stages is a core requirement of systematic review methodology. Each reviewer screens independently, and disagreements are resolved by consensus or a third reviewer.

What is PRISMA 2020?

PRISMA 2020 (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is the current international standard for reporting systematic reviews, published by Page et al. in BMJ in 2021. It includes a 27-item checklist and a revised flow diagram with dual identification streams.

How to Conduct a Systematic Review in 2026: Step-by-Step

A complete walkthrough of every stage — from defining your PICO question and registering your protocol to searching 9 databases, dual-reviewer screening, meta-analysis, GRADE, and PRISMA 2020 reporting.

Define your research question with PICO

Every systematic review begins with a clearly defined, structured research question. The most widely used framework is PICO — Population, Intervention, Comparison, and Outcome. Getting this right before you search is non-negotiable: your PICO determines your search strategy, your inclusion criteria, and ultimately the scope of your entire review.

PICO element	What it defines	Example (hypertension review)
Population (P)	Who the evidence applies to — age, condition, setting	Adults aged 40+ with essential hypertension
Intervention (I)	The exposure, treatment, or test under investigation	ACE inhibitor monotherapy
Comparison (C)	The control condition or alternative treatment	Calcium channel blocker or placebo
Outcome (O)	The measured endpoint — primary and secondary	Systolic blood pressure reduction, cardiovascular events

For diagnostic accuracy reviews, use PICO-D (adding the Diagnosis element). For qualitative or scoping reviews, the PCC framework (Population, Concept, Context) may be more appropriate.

💡

Common mistake: scoping too broadly

A PICO like "any intervention for any cardiovascular outcome in any adult" will return tens of thousands of records. Be specific enough that your review is answerable. You can always run separate reviews for different subpopulations or interventions.

Write and register your protocol

Before searching a single database, write a full protocol — including your PICO, eligibility criteria, search strategy, data extraction plan, risk of bias approach, and statistical analysis plan. Then register it publicly.

Why register? Pre-registration prevents outcome reporting bias, protects you from accusations of post-hoc analysis changes, and is increasingly required by high-impact journals. PRISMA 2020 explicitly requires you to report your registration number.

✅

Where to register

PROSPERO — the international prospective register for systematic reviews (most widely recognised)
OSF (Open Science Framework) — broader scope, accepts all review types
Cochrane — if conducting a Cochrane systematic review

Build your search strategy

A comprehensive search strategy is what separates a systematic review from a narrative review. You need to capture every eligible study — including unpublished studies, grey literature, and studies in languages other than English — to minimise publication bias.

Which databases to search

Search every database relevant to your field, and report each one you searched, as PRISMA 2020 requires. For clinical and health research, a typical set of sources is:

PubMed / MEDLINE (Essential)

The primary database for biomedical literature. Free via NCBI Entrez API. Over 36 million citations.

Scopus (Essential)

Elsevier's multidisciplinary database. Broader than PubMed, covering engineering, social sciences, and arts.

Europe PMC

Covers life sciences literature including preprints. Particularly strong for funded UK and EU research.

OpenAlex

Open-source alternative to Web of Science. Over 250 million works, fully open API, no subscription required.

CrossRef

The DOI registration agency. Excellent for finding grey literature and preprints across all disciplines.

Cochrane CENTRAL

Specialised trials register. Best source for randomised controlled trial records, especially older ones.

⚠️

Single-database searches miss 15–40% of eligible studies

Research consistently shows that no single database contains all relevant literature. Searching PubMed alone is sufficient only for very narrow, PubMed-indexed topics — and even then, you risk missing trials published in non-indexed journals or available only as grey literature.

How to build the search string

Your search string translates the PICO into database-specific syntax using Medical Subject Headings (MeSH) for PubMed, Emtree for Embase, and free-text synonyms for all databases. A well-constructed search combines:

Subject headings (controlled vocabulary) — standardised terms assigned by indexers (e.g. MeSH term "Hypertension")
Free-text synonyms — all variant spellings, brand names, abbreviations (e.g. "high blood pressure", "HTN")
Boolean operators — AND to combine PICO elements, OR to combine synonyms within an element
Truncation and wildcards — e.g. cardiovasc* captures cardiovascular, cardiovasculaire, etc.
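The combination rules above (OR within a PICO element, AND between elements) can be sketched in a few lines of Python. This is a minimal illustration, not a validated search strategy: the helper names `or_block` and `build_query` are hypothetical, the term lists are examples only, and the `[MeSH Terms]` field tag follows PubMed syntax, so other databases would need their own translation.

```python
def or_block(terms):
    """Join the synonyms for one PICO element with OR.

    Multi-word free-text phrases are quoted; terms that already carry a
    field tag such as [MeSH Terms] are passed through unchanged.
    """
    quoted = [f'"{t}"' if " " in t and "[" not in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*elements):
    """Combine the OR-blocks of several PICO elements with AND."""
    return " AND ".join(or_block(terms) for terms in elements if terms)


# Illustrative term lists only (assumed, not exhaustive).
population = ["Hypertension[MeSH Terms]", "high blood pressure", "HTN"]
intervention = [
    "Angiotensin-Converting Enzyme Inhibitors[MeSH Terms]",
    "ACE inhibitor",
    "ramipril",
]

query = build_query(population, intervention)
print(query)
```

A real strategy would add the comparison and outcome blocks where appropriate, and should be reviewed by an information specialist before it is run.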

Deduplicate and screen: title/abstract stage

After running your searches across all databases, you will have a pool of records that contains significant overlap — the same study indexed in five databases counts as five records. Deduplication is the process of removing these duplicate records before screening begins.

In a typical multi-database search of 6–9 sources, 30–60% of records are duplicates. Accurate deduplication is essential — missed duplicates inflate your screening workload; over-aggressive deduplication removes legitimate unique records.
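As a rough illustration of how deduplication can work, the sketch below matches records on DOI where one exists and otherwise on a normalised title plus publication year. The record fields (`doi`, `title`, `year`, `source`) and the DOIs themselves are made up for the example, and real tools usually add fuzzy matching on authors, journal, and page numbers.

```python
import re


def normalise_title(title):
    """Lowercase and collapse punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedup_key(record):
    """Prefer the DOI; fall back to normalised title plus publication year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalise_title(record.get("title", "")), record.get("year"))


def deduplicate(records):
    """Keep the first record seen for each key; return (unique, removed_count)."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique, len(records) - len(unique)


# Hypothetical records from four sources.
records = [
    {"source": "PubMed", "doi": "10.1000/abc123", "title": "ACE inhibitors in hypertension", "year": 2019},
    {"source": "Scopus", "doi": "10.1000/ABC123", "title": "ACE Inhibitors in Hypertension.", "year": 2019},
    {"source": "OpenAlex", "doi": None, "title": "Calcium channel blockers: a trial", "year": 2020},
    {"source": "CrossRef", "doi": "", "title": "Calcium Channel Blockers - A Trial", "year": 2020},
]

unique, removed = deduplicate(records)
print(len(unique), "unique records,", removed, "duplicates removed")
```

One known weakness of this simple keying is that a record with a DOI and a copy of the same study without one will not be matched, which is the kind of missed duplicate that inflates screening workload.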

Dual-reviewer screening: why it is mandatory

Major methodological guidance, including the Cochrane Handbook and the JBI Manual, calls for independent dual-reviewer screening at both the title/abstract and full-text stages, and PRISMA 2020 requires you to report how many reviewers screened each record and whether they worked independently. Two reviewers screen each record independently, without seeing each other's decisions, to reduce selection bias. Disagreements are resolved by discussion or a third reviewer.

Screening approach	Bias risk	Inter-rater agreement	Accepted by journals
Single reviewer	High	N/A	No
Dual reviewer, not blinded	Moderate	Moderate	Sometimes
Dual reviewer, blind (independent)	Low	Cohen's κ reported	Yes ✓

Inter-rater agreement — typically measured using Cohen's kappa (κ) — should be calculated and reported. A κ of 0.61–0.80 is considered substantial agreement; above 0.80 is almost perfect. Low κ indicates your inclusion criteria need clarification.
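For readers who want to see the calculation, Cohen's kappa compares the observed agreement between two reviewers with the agreement expected by chance from each reviewer's own include/exclude rates. The sketch below is a plain-Python illustration with invented screening decisions; the function name and data are assumptions for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Proportion of records on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented decisions for ten records.
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 6 + ["include"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))
```

Here the reviewers agree on 8 of 10 records, but because chance agreement is 0.52, kappa comes out at about 0.58, just under the 0.61 threshold for substantial agreement.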

Full-text screening and PRISMA flow diagram

Studies that pass title/abstract screening move to full-text assessment. You retrieve the full paper for each record and apply your eligibility criteria rigorously. Every exclusion at this stage must be documented with a specific reason, as PRISMA 2020 requires you to report the number of excluded full-texts and reasons for exclusion.

📊

PRISMA 2020 flow diagram: the three sections

Identification — records identified from databases and registers, plus other sources (citation searching, grey literature, hand searching); duplicates removed before screening
Screening — records screened at title/abstract and excluded; reports sought for retrieval; full-text reports assessed for eligibility and excluded with reasons
Included — studies included in the review
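Because each box in the flow diagram is derived from the one before it, the counts should reconcile arithmetically. The sketch below checks that, using hypothetical field names and invented numbers; it is a simplification that ignores reports sought but not retrieved and treats each study as a single report.

```python
def check_flow(counts):
    """Return a list of mismatches between adjacent boxes of the flow diagram."""
    problems = []

    expected_screened = counts["identified"] - counts["duplicates_removed"]
    if counts["screened"] != expected_screened:
        problems.append(f"screened should be {expected_screened}")

    expected_full_text = counts["screened"] - counts["excluded_title_abstract"]
    if counts["full_text_assessed"] != expected_full_text:
        problems.append(f"full_text_assessed should be {expected_full_text}")

    # Every full-text exclusion is recorded under a specific reason.
    excluded_full_text = sum(counts["full_text_exclusions"].values())
    expected_included = counts["full_text_assessed"] - excluded_full_text
    if counts["included"] != expected_included:
        problems.append(f"included should be {expected_included}")

    return problems


# Invented counts for illustration.
counts = {
    "identified": 1200,
    "duplicates_removed": 450,
    "screened": 750,
    "excluded_title_abstract": 690,
    "full_text_assessed": 60,
    "full_text_exclusions": {"wrong population": 20, "wrong outcome": 15, "not an RCT": 7},
    "included": 18,
}

problems = check_flow(counts)
print(problems or "flow diagram counts reconcile")
```

Running a check like this before submission catches the common case where a late exclusion was recorded in the screening log but not carried through to the diagram.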

Data extraction

For every included study, two reviewers independently extract the data you will use in your synthesis. Data extraction forms should be piloted on 2–3 studies before the main extraction begins, to ensure all reviewers interpret fields consistently.

What to extract

Study characteristics — design, country, setting, funding source, registration number
Population details — sample size, age, sex distribution, baseline characteristics, inclusion/exclusion criteria
Intervention details — dose, duration, route, comparator description
Outcome data — for continuous outcomes: mean, SD, n per arm; for dichotomous: event counts and totals per arm; for time-to-event: hazard ratios and confidence intervals
Follow-up — duration and completeness

⚠️

The SD problem: when papers report SE or IQR

Many papers report standard error (SE) rather than standard deviation (SD), or present medians and interquartile ranges rather than means. For meta-analysis, you need the mean and SD. Convert SE to SD using SD = SE × √n, where n is the sample size of that arm. For medians and IQRs, you can estimate the mean and SD using the Wan et al. (2014) or Luo et al. (2018) methods.
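The two conversions can be sketched as follows. The SE conversion is exact. The quartile-based estimate follows the formula from Wan et al. (2014) for the case where the median, quartiles, and sample size are reported, as best it can be summarised here. It assumes roughly normal data and should be checked against the paper before use. The function names and input values are illustrative.

```python
import math
from statistics import NormalDist


def sd_from_se(se, n):
    """Convert a standard error of the mean to a standard deviation."""
    return se * math.sqrt(n)


def sd_from_iqr_wan(q1, q3, n):
    """Estimate SD from the quartiles and sample size (Wan et al. 2014 form).

    Assumes approximately normal data; for large n the divisor approaches
    the familiar value of about 1.35.
    """
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return (q3 - q1) / (2 * z)


# Invented example: systolic blood pressure in one trial arm of 100 patients.
sd_a = sd_from_se(se=1.5, n=100)
sd_b = sd_from_iqr_wan(q1=110, q3=130, n=100)
print(round(sd_a, 2), round(sd_b, 2))
```

Whichever method you use, record in the extraction form that the SD was derived rather than reported, so that a sensitivity analysis can exclude converted values.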

Risk of bias assessment

Risk of bias (RoB) assessment evaluates the methodological quality of each included study — specifically, whether design or conduct limitations could have introduced systematic error into the results. The instrument you use depends entirely on the study design.

Tool	Study design	Domains	Rating scale
RoB 2	Randomised controlled trials	5 domains	Low / Some concerns / High
ROBINS-I	Non-randomised intervention studies	7 domains	Low / Moderate / Serious / Critical
Newcastle-Ottawa Scale	Cohort and case-control studies	8 items across 3 domains	Stars (0–9)
QUADAS-2	Diagnostic test accuracy studies	4 domains	Low / High / Unclear
AXIS	Cross-sectional / prevalence studies	20 items	Yes / No / Don't know

Risk of bias judgements should be made at the outcome level, not just the study level — a study can have low RoB for its primary outcome but high RoB for secondary outcomes if measurement methods differed.

Meta-analysis: pooling effects

If sufficient studies report compatible outcomes, you can combine their effect estimates statistically in a meta-analysis. This produces a pooled estimate with a confidence interval (which, under a random-effects model, also reflects between-study variance) and gives you more precision than any individual study.

Choosing the right statistical model

Fixed-effect vs Random-effects: the essential decision

Fixed-effect model

Assumes all studies estimate the same true effect. Appropriate only when studies are near-identical replicates. Produces narrower confidence intervals — misleadingly so when heterogeneity exists. Rarely appropriate in clinical research.

Random-effects model

Assumes the true effect varies across studies. The default for most clinical meta-analyses. Produces wider, more honest confidence intervals. Requires estimation of between-study variance τ², typically with the DerSimonian-Laird or Paule-Mandel estimator.

DerSimonian-Laird (DL)

The most widely used τ² estimator. Computationally simple and well-understood. Can underestimate variance when k < 5 studies. The historical standard.

Paule-Mandel (PM)

Iterative method-of-moments estimator. Preferred when k < 10 studies because it is less biased. Produces more conservative (wider) confidence intervals. Increasingly recommended in recent methodological literature.
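To make the difference between the two models concrete, here is a minimal sketch of inverse-variance pooling under a fixed-effect model and under a random-effects model with the DerSimonian-Laird τ² estimator. Paule-Mandel is not implemented here. The function name and the three study results are invented, and a production analysis would normally rely on an established package rather than hand-rolled code.

```python
import math


def pool(effects, variances):
    """Pool study effects under fixed-effect and DL random-effects models."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with one variance each")

    # Fixed-effect: weight each study by the inverse of its variance.
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the DerSimonian-Laird method-of-moments tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects: add tau^2 to every study's variance before weighting.
    w_re = [1 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    random_effect = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum_w_re

    return {
        "fixed": fixed,
        "fixed_se": math.sqrt(1 / sum_w),
        "random": random_effect,
        "random_se": math.sqrt(1 / sum_w_re),
        "q": q,
        "tau2": tau2,
    }


# Invented mean differences in systolic blood pressure (mmHg) and their variances.
result = pool(effects=[-8.0, -4.0, -12.0], variances=[4.0, 4.0, 4.0])
print(result)
```

With these numbers both models give the same point estimate, but the random-effects standard error is twice the fixed-effect one, which is the widening of the confidence interval described above.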

Understanding heterogeneity: I², Q, and τ²

Heterogeneity is the variation in effect estimates across studies that exceeds chance. It is one of the most misunderstood concepts in meta-analysis.

⚠️

I² does not measure the amount of heterogeneity — it measures its proportion

I² is the percentage of variance attributable to between-study differences rather than sampling error. An I² of 80% does not tell you the effect estimates are dramatically inconsistent — it tells you 80% of variance is between-study. With large samples, even trivial between-study variance produces high I². Report τ² (the absolute variance) and the prediction interval alongside I².

The 95% prediction interval is the most clinically useful heterogeneity statistic: it describes the range within which the true effect in a new similar study would be expected to fall 95% of the time. A wide prediction interval crossing the null line means the intervention may be beneficial in some settings and harmful in others — a fundamentally different clinical message than a narrow confidence interval.
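The sketch below shows how I² and an approximate 95% prediction interval can be computed from quantities a meta-analysis already produces. The prediction interval uses the commonly cited form, pooled estimate ± t × √(τ² + SE²) with k − 2 degrees of freedom. The critical t value is passed in by the caller because the Python standard library has no t distribution. All input numbers are invented.

```python
import math


def i_squared(q, k):
    """I² as a percentage: the share of Q in excess of its degrees of freedom."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q) * 100


def prediction_interval(pooled, pooled_se, tau2, t_crit):
    """Approximate prediction interval for the true effect in a new study.

    t_crit is the two-sided critical value of the t distribution with
    k - 2 degrees of freedom, supplied by the caller.
    """
    half_width = t_crit * math.sqrt(tau2 + pooled_se ** 2)
    return pooled - half_width, pooled + half_width


# Invented meta-analysis of k = 10 studies.
k = 10
i2 = i_squared(q=36.0, k=k)
ci = (-8.0 - 1.96 * 1.0, -8.0 + 1.96 * 1.0)
# 2.306 is the 97.5th percentile of t with 8 degrees of freedom.
pi = prediction_interval(pooled=-8.0, pooled_se=1.0, tau2=9.0, t_crit=2.306)

print(f"I2 = {i2:.0f}%")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

In this example the confidence interval is narrow while the prediction interval is several times wider, which illustrates why the two should be reported together.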

Effect measures by outcome type

Outcome type	Effect measure	Pooled on	Back-transformed
Continuous (same scale)	Mean Difference (MD)	Raw scale	N/A
Continuous (different scales)	Standardised Mean Difference (SMD)	SD units	N/A
Dichotomous	Odds Ratio (OR) or Risk Ratio (RR)	Log scale	Exponentiated for display
Dichotomous	Risk Difference (RD)	Raw proportion	N/A
Time-to-event	Hazard Ratio (HR)	Log scale	Exponentiated for display
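As an example of pooling on the log scale, the sketch below computes a log risk ratio and its standard error from one study's event counts, then exponentiates the estimate and confidence limits for display. The function names and counts are illustrative, and the code assumes no arm has zero events, since a continuity correction would otherwise be needed.

```python
import math


def log_risk_ratio(events_t, total_t, events_c, total_c):
    """Return (log RR, SE of log RR) for a two-arm study with non-zero events."""
    log_rr = math.log((events_t / total_t) / (events_c / total_c))
    se = math.sqrt(1 / events_t - 1 / total_t + 1 / events_c - 1 / total_c)
    return log_rr, se


def back_transform(log_estimate, se, z=1.96):
    """Exponentiate a log-scale estimate and its 95% confidence limits."""
    return (
        math.exp(log_estimate),
        math.exp(log_estimate - z * se),
        math.exp(log_estimate + z * se),
    )


# Invented trial: 10/100 events on treatment versus 20/100 on control.
log_rr, se = log_risk_ratio(10, 100, 20, 100)
rr, lower, upper = back_transform(log_rr, se)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The interval is symmetric around the estimate on the log scale but asymmetric after exponentiation, which is why pooling is done before back-transformation.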

GRADE: rating certainty of evidence

GRADE (Grading of Recommendations Assessment, Development and Evaluation) is the international standard for rating how confident you are in your pooled estimate. GRADE produces a certainty rating — High, Moderate, Low, or Very Low — for each outcome, based on systematic evaluation of five downgrade domains and three upgrade criteria.

Domain	Direction	When to apply
Risk of bias	↓ Downgrade	Included studies have serious or critical methodological limitations
Inconsistency	↓ Downgrade	Unexplained heterogeneity — wide prediction interval, high I²
Indirectness	↓ Downgrade	Evidence does not directly answer the PICO question
Imprecision	↓ Downgrade	Wide CI crossing the minimal important difference threshold
Publication bias	↓ Downgrade	Funnel plot asymmetry, Egger regression significant (p < 0.1)
Large effect size	↑ Upgrade	RR > 2 or < 0.5, and RoB is low
Dose-response gradient	↑ Upgrade	Clear relationship between dose and effect magnitude
Plausible confounding	↑ Upgrade	Confounding would reduce the observed effect, so true effect is larger

RCT evidence starts at High certainty. Observational evidence starts at Low certainty. From there, you downgrade or upgrade based on the above domains, arriving at your final certainty rating for each outcome.
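The starting-point-plus-adjustments logic can be expressed as a small function. This is only a bookkeeping sketch with hypothetical names: in practice GRADE ratings are structured judgements that need written justification for each domain, not a mechanical tally.

```python
LEVELS = ["Very Low", "Low", "Moderate", "High"]


def grade_certainty(design, downgrades=0, upgrades=0):
    """Return a GRADE certainty label from a starting level and net adjustments.

    design: "rct" starts at High, "observational" starts at Low.
    downgrades / upgrades: total levels moved across all domains.
    """
    start = {"rct": 3, "observational": 1}[design]
    # Clamp so the rating never goes below Very Low or above High.
    score = min(3, max(0, start - downgrades + upgrades))
    return LEVELS[score]


print(grade_certainty("rct", downgrades=1))            # Moderate
print(grade_certainty("observational", upgrades=1))    # Moderate
print(grade_certainty("rct", downgrades=5))            # Very Low
```

For example, RCT evidence downgraded one level for inconsistency ends at Moderate, and the summary of findings table should state which domain drove that decision.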

PRISMA 2020 reporting

PRISMA 2020 is the reporting standard for systematic reviews and is widely required by journals. The 2020 update by Page et al. (BMJ, 2021) introduced significant revisions to the flow diagram and kept the checklist at 27 items, while splitting several items into sub-items and adding more detailed guidance.

The most visible PRISMA deliverable is the flow diagram — a visual record of how many records were identified, screened, assessed for eligibility, and ultimately included. PRISMA 2020 revised this to include two separate identification streams:

Stream 1 — database and register searching
Stream 2 — other sources (citation searching, grey literature, hand-searching, author contact)

⚡

Verflux auto-generates your PRISMA 2020 diagram

Verflux pulls the exact record counts from your search and screening modules and builds a PRISMA 2020-compliant flow diagram automatically. Counts are editable, and the final diagram exports as a 300 DPI PNG ready for journal submission — without touching PowerPoint or drawing software.