The skill of reading a medical paper is taught implicitly — some students absorb it from journal club, others never do. The undertrained version reads the abstract, glances at the conclusion, and accepts whatever the authors claim. The trained version goes to the methods first, asks five questions, and finishes the paper either trusting the result or knowing exactly where it falls apart. This piece walks through the structure and the questions.
The standard paper structure (and why you read it out of order)
Most papers follow IMRaD: Introduction, Methods, Results, and Discussion. People read them in that order, which is the wrong order for appraisal. The introduction is where authors set up the narrative they want you to believe. The discussion is where they spin the results. If you start there, you will be biased before you reach the methods. The correct order for appraisal is:
- Methods first. What did they actually do?
- Results second. What did they find, irrespective of the spin?
- Introduction third. Does the question they asked match the question the methods could answer?
- Discussion last. Is their interpretation supported by the results, or are they overreaching?
Five questions for the methods section
1. What was the study design and is it the right one for the question?
An RCT can answer "does this intervention cause this outcome." A cohort study can answer "is this exposure associated with this outcome." A case-control study can answer "which exposures are associated with this rare outcome." A cross-sectional study can answer "how common is this." If the authors used a case series, which has no comparison group, to claim causation, the design does not support the conclusion. This is one of the most common appraisal failures in undergraduate journal clubs.
2. Who was in the study and how were they selected?
Read the inclusion and exclusion criteria. Three patterns to flag: (a) heavy exclusions that make the sample non-representative of the patients you would actually treat ("excluded if BMI > 35, age > 70, eGFR < 60, on warfarin" — that excludes most of your real patients); (b) recruitment from a single tertiary center for a question about general population effects; (c) volunteers responding to advertisements, which selects for unusually motivated patients.
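To see how quickly stacked exclusions shrink a sample, you can run the quoted criteria against a list of your own patients. The sketch below does that for a small roster. The `Patient` fields and every value in `roster` are invented for illustration; only the four thresholds come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    bmi: float
    age: int
    egfr: float          # mL/min/1.73 m^2
    on_warfarin: bool


def is_eligible(p: Patient) -> bool:
    """Apply the four quoted exclusions; any single hit excludes the patient."""
    excluded = p.bmi > 35 or p.age > 70 or p.egfr < 60 or p.on_warfarin
    return not excluded


def eligible_fraction(patients: list[Patient]) -> float:
    """Share of a roster that would have passed the trial's exclusions."""
    return sum(is_eligible(p) for p in patients) / len(patients)


# Hypothetical clinic roster, invented for illustration only.
roster = [
    Patient(bmi=28, age=55, egfr=85, on_warfarin=False),  # eligible
    Patient(bmi=37, age=62, egfr=70, on_warfarin=False),  # excluded: BMI
    Patient(bmi=31, age=78, egfr=52, on_warfarin=True),   # excluded: age, eGFR, warfarin
    Patient(bmi=26, age=66, egfr=48, on_warfarin=False),  # excluded: eGFR
    Patient(bmi=33, age=59, egfr=90, on_warfarin=True),   # excluded: warfarin
    Patient(bmi=24, age=45, egfr=95, on_warfarin=False),  # eligible
]

print(f"{eligible_fraction(roster):.0%} of this roster would have been eligible")
```

If only a small share of your own patients would have qualified, the trial's result may not transfer to them, however clean the methods are otherwise.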
3. Was the comparison fair?
For RCTs, was the allocation sequence truly random, and was it concealed from the people enrolling patients? Block randomization with a small, fixed block size can become predictable once earlier assignments are known, which lets recruiters steer patients into a preferred arm. Open-label trials are vulnerable to performance bias. For observational studies, were the comparison groups similar at baseline (Table 1) and were confounders adjusted for? An unadjusted comparison between sicker treated patients and healthier untreated ones can make a helpful treatment look harmful.
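A small worked example shows how an unadjusted comparison can point the wrong way. The counts below are invented for illustration: sicker patients are more likely to be treated, and the treatment lowers risk within each severity stratum. The adjustment shown is a Mantel-Haenszel pooled risk ratio, which is one standard way to adjust for a single stratifying variable, not the only one.

```python
# Hypothetical counts, invented to illustrate confounding by severity.
# Each stratum: (events_treated, n_treated, events_untreated, n_untreated)
strata = {
    "severe": (24, 80, 8, 20),   # most severe patients were treated
    "mild":   (1, 20, 8, 80),    # most mild patients were not
}


def risk_ratio(events_t, n_t, events_u, n_u):
    """Risk in treated divided by risk in untreated."""
    return (events_t / n_t) / (events_u / n_u)


def crude_risk_ratio(strata):
    """Pool all strata first, ignoring severity (the unadjusted comparison)."""
    totals = [sum(s[i] for s in strata.values()) for i in range(4)]
    return risk_ratio(*totals)


def mantel_haenszel_risk_ratio(strata):
    """Severity-adjusted risk ratio, weighting each stratum by its size."""
    num = sum(et * nu / (nt + nu) for et, nt, eu, nu in strata.values())
    den = sum(eu * nt / (nt + nu) for et, nt, eu, nu in strata.values())
    return num / den


print("crude RR:", round(crude_risk_ratio(strata), 2))
for name, counts in strata.items():
    print(f"{name} RR:", round(risk_ratio(*counts), 2))
print("adjusted (Mantel-Haenszel) RR:", round(mantel_haenszel_risk_ratio(strata), 2))
```

With these numbers the crude ratio is above 1 (apparent harm) while both stratum-specific ratios and the adjusted ratio are below 1 (benefit). This is the pattern to suspect when Table 1 shows the treated group was sicker at baseline.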
4. Was the outcome measured the same way in everyone?
If the primary outcome required a subjective assessment (e.g., pain on a 0–10 scale, functional independence), were the assessors blinded to group assignment? Unblinded assessment of subjective outcomes tends to inflate effect sizes, by an estimated 30 to 50 percent on average. If imaging was used, were the readers blinded?
5. What happened to people who dropped out?
Find the flow diagram (CONSORT for trials; STROBE recommends one for cohort studies). A trial that enrolled 400 patients and analyzed 280 has lost 30 percent of its participants. Unless patients were analyzed in the groups they were randomized to (intention-to-treat) and missing outcomes were handled with appropriate imputation, the effect estimate may be biased, particularly if dropout differed between arms or was related to the outcome. The same applies to cohort studies with heavy loss to follow-up.
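One rough way to judge how much a 30 percent loss could matter is a best-case/worst-case bound: assume every missing patient in one arm had the bad outcome and every missing patient in the other did not, then reverse the assumption. The sketch below does this for a risk difference. The per-arm event counts are invented for illustration; only the 400 enrolled and 280 analyzed come from the example above, and the even split between arms is an assumption.

```python
def attrition_fraction(enrolled: int, analyzed: int) -> float:
    """Share of enrolled participants missing from the analysis."""
    return (enrolled - analyzed) / enrolled


def risk_difference_bounds(events_t, analyzed_t, enrolled_t,
                           events_c, analyzed_c, enrolled_c):
    """Risk difference (treatment minus control) for a bad outcome,
    as observed and under two extreme assumptions about the missing."""
    missing_t = enrolled_t - analyzed_t
    missing_c = enrolled_c - analyzed_c
    observed = events_t / analyzed_t - events_c / analyzed_c
    # Best case for treatment: no missing treated patient had the event,
    # every missing control patient did.
    best = events_t / enrolled_t - (events_c + missing_c) / enrolled_c
    # Worst case for treatment: the reverse.
    worst = (events_t + missing_t) / enrolled_t - events_c / enrolled_c
    return {"observed": observed, "best_case": best, "worst_case": worst}


print("attrition:", attrition_fraction(400, 280))

# Hypothetical arms: 200 enrolled and 140 analyzed in each,
# with 14 events on treatment and 28 on control (invented numbers).
bounds = risk_difference_bounds(14, 140, 200, 28, 140, 200)
for label, value in bounds.items():
    print(f"{label}: {value:+.2f}")
```

With these numbers the observed difference favors treatment, but the worst-case bound has the opposite sign. These extremes are deliberately pessimistic; they show what the data cannot rule out, not what probably happened.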
Reading the results section
Two practical points. First, look at the absolute numbers, not just the relative risk. A "50 percent reduction" sounds impressive, but if it is from 2 percent to 1 percent in absolute terms, the absolute risk reduction is 1 percentage point and the number needed to treat is 100. Second, confidence intervals matter more than p-values. A p-value of 0.04 tells you only that the estimate is statistically distinguishable from no effect (a difference of zero, or a ratio of 1). The CI tells you the range of plausible effect sizes. A CI of [0.2, 0.99] for a hazard ratio includes effects ranging from large benefit to almost none. That is much less certain than a CI of [0.4, 0.6], even though both exclude 1 and both would be reported as statistically significant.
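Both points reduce to a few lines of arithmetic. The sketch below computes the absolute risk reduction, relative risk reduction, and NNT from two event risks, and approximates the two-sided p-value implied by a 95% CI for a ratio measure. The function names are made up for this example, and the p-value step assumes the interval is symmetric on the log scale (a normal approximation), so treat its output as a rough check rather than the paper's exact figure.

```python
import math


def absolute_effect(control_risk: float, treated_risk: float) -> dict:
    """ARR, RRR, and NNT from two event risks given as proportions."""
    arr = control_risk - treated_risk          # absolute risk reduction
    rrr = arr / control_risk                   # relative risk reduction
    nnt = math.inf if arr == 0 else 1 / arr    # negative means net harm
    return {"arr": arr, "rrr": rrr, "nnt": nnt}


def p_from_ratio_ci(lower: float, upper: float) -> float:
    """Approximate two-sided p-value from a 95% CI for a ratio (HR, RR, OR).

    Assumes the CI is symmetric around the estimate on the log scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    log_estimate = (math.log(upper) + math.log(lower)) / 2
    z = abs(log_estimate / se)
    return math.erfc(z / math.sqrt(2))         # equals 2 * (1 - Phi(z))


effect = absolute_effect(control_risk=0.02, treated_risk=0.01)
print(f"ARR {effect['arr']:.3f}, RRR {effect['rrr']:.0%}, NNT {effect['nnt']:.0f}")

for ci in [(0.2, 0.99), (0.4, 0.6)]:
    print(ci, "approximate p =", f"{p_from_ratio_ci(*ci):.2g}")
```

Running the two intervals through the second function is a quick way to see how much information a bare "p < 0.05" throws away.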
How to use p-values without misreading them
A p-value is the probability of observing data at least this extreme if the null hypothesis is true. It is not the probability that the null hypothesis is true. It is not the probability that the treatment works. A p-value of 0.04 does not mean "96 percent likely to be real."
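Multiple-testing inflation: if you test 20 independent outcomes at p < 0.05 and none of the effects is real, you expect one to come out significant by chance, and the probability of at least one false positive is about 64 percent (1 - 0.95^20). Look at the pre-registered primary outcome and treat secondary outcomes as hypothesis-generating, even if a few are p < 0.05.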
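The inflation is easy to compute. The sketch below assumes the tests are independent and that every null hypothesis is true, which is the simplest case; correlated outcomes inflate less. The Bonferroni threshold is included as one common, conservative correction, not as something a given paper necessarily used.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of chance 'significant' results if all nulls are true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for n in (1, 5, 20):
    print(
        f"{n:>2} tests: expect {expected_false_positives(n):.2f} false positives, "
        f"P(at least one) = {familywise_error_rate(n):.2f}, "
        f"Bonferroni threshold = {bonferroni_threshold(n):.4f}"
    )
```

When a paper reports many secondary outcomes, comparing their p-values to the corrected threshold is a quick sanity check on which ones deserve attention.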
Red flags to watch for
- Financial ties between authors and sponsor. An industry-sponsored trial whose primary author is a paid consultant of the sponsor warrants closer scrutiny: not automatic dismissal, but skepticism.
- Post-hoc primary outcome change. If the registered protocol said outcome X and the published paper reports outcome Y as primary, the authors moved the goalposts. Check the protocol vs. the publication.
- Subgroup analyses presented as primary findings. A trial that failed to show an overall effect but reports a "significant effect in patients with high baseline LDL" is fishing, unless that subgroup was pre-specified in the protocol.
- Heroic adjustment. An observational study that adjusts for 25 confounders to make an effect appear is more likely to have overfit than to have uncovered truth.
- Surrogate-only outcomes. A trial showing biomarker improvement but no patient-centered benefit is preliminary, not practice-changing.
A short appraisal worksheet
Before journal club, fill in:
- Design: ___ (RCT, cohort, case-control, etc.)
- Population: ___ (who actually got studied)
- Intervention vs comparator: ___
- Primary outcome (pre-specified): ___
- Sample size and dropout: ___ enrolled, ___ analyzed
- Blinding: ___ (participants / assessors / analysts)
- Effect estimate with 95% CI: ___
- Absolute risk reduction and NNT: ___
- Three biggest threats to validity: ___
- Does the discussion overstate? Y / N
If you can fill this in for any paper without re-reading the abstract for help, you can appraise.
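If you keep appraisals electronically, the worksheet maps onto a small record type. The sketch below is one possible layout, not a standard instrument: the class name, field names, and helper methods are all invented for this example, and the sample values reuse the 400 enrolled and 280 analyzed figures from earlier in the piece.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AppraisalWorksheet:
    design: Optional[str] = None                  # RCT, cohort, case-control, ...
    population: Optional[str] = None              # who actually got studied
    intervention_vs_comparator: Optional[str] = None
    primary_outcome: Optional[str] = None         # as pre-specified
    enrolled: Optional[int] = None
    analyzed: Optional[int] = None
    blinding: Optional[str] = None                # participants / assessors / analysts
    effect_estimate_ci: Optional[str] = None      # estimate with 95% CI
    arr_and_nnt: Optional[str] = None
    threats_to_validity: Optional[list[str]] = None
    discussion_overstates: Optional[bool] = None

    def missing_fields(self) -> list[str]:
        """Names of worksheet items that are still blank."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def dropout_fraction(self) -> Optional[float]:
        """Share of enrolled participants not analyzed, if both counts are known."""
        if self.enrolled is None or self.analyzed is None:
            return None
        return (self.enrolled - self.analyzed) / self.enrolled


sheet = AppraisalWorksheet(design="RCT", enrolled=400, analyzed=280)
print("still blank:", sheet.missing_fields())
print("dropout:", sheet.dropout_fraction())
```

An empty `missing_fields()` list is the electronic version of a completed worksheet; it says nothing about whether the entries are right.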
Once you can critically appraise individual papers, the natural next step is synthesizing across papers — which is what a systematic review does formally.