AAOS Bulletin - April, 2005

Levels of Evidence and Grades of Recommendations

An evaluation of literature leads to a common evaluation system

By James G. Wright, MD, MPH, FRCS

In response to concerns expressed by the fellowship about the need for a standardized, uniform system for classifying evidence, the American Academy of Orthopaedic Surgeons (AAOS) has adopted an evaluation system for categorizing research.

My student Trishna Smuth conducted an extensive review of the literature and Web sites to identify all systems for rating evidence. A task force of representatives from the AAOS Evidence-Based Practice Committee, the Journal of Bone and Joint Surgery (JBJS) and Clinical Orthopaedics and Related Research then determined that the JBJS table, with some revision, would be the most appropriate system to use.

The task force subsequently developed the Levels of Evidence table and Grades of Recommendation. These will be implemented in the development of future evidence-based guidelines at AAOS. JBJS also has adopted the revised Levels of Evidence and published them on its Web site at www.ejbjs.org. The AAOS is committed to ensuring that guidelines used in the care of orthopaedic patients are based on the best research evidence available.

The goal of EBP

The aim of evidence-based practice (EBP) is to provide the best answer to questions about appropriate interventions in a timely fashion. One way to get evidence is to search the literature to identify relevant articles and evaluate (or critically appraise) the quality of those articles. But this can be a challenging task. A recently published article indicated that a total of 26,945 papers were published between 1991 and 2000 in the top seven peer-reviewed medical journals alone.¹

Obviously, no one individual can be aware of all that is being published. Levels of Evidence are one way to sift through all this literature.

The principle behind Levels of Evidence is that all articles constitute evidence, but some are more persuasive by virtue of their study design. However, what physicians really need is someone to sort through and summarize all this literature for them. This is one of many purposes of the Levels of Evidence standards.

Summarizing the literature is an important strategy to address clinical questions. One approach to summarizing the literature is evidence-based analysis, which systematically reviews the literature and summarizes the evidence to provide answers to clinical questions.

In deciding whether to accept the evidence and adopt it as guidelines, physicians must first consider the make-up of the group that did the analysis. These workgroups typically consist of experts and clinical peers and have some authority or mandate, such as from a specialty society. Second, surgeons must consider the process to ensure that it includes a comprehensive, systematic and critical appraisal of the literature and/or use of Levels of Evidence.

Researchers may also have summarized the literature and published their results. Literature reviews can be narrative or systematic. Narrative reviews are opinion-based and usually suffer from all the limitations of expert opinion. Systematic reviews are evidence-based and use a comprehensive review of the literature. When the results of multiple studies are combined, the review is called a meta-analysis.

Grades of recommendation

The deliberations of evidence-based analysis are usually summarized as recommendations, graded by the number and quality of studies. Multiple grading schemes are available, but most are based on Levels of Evidence, as outlined in the accompanying chart.

Grades of recommendation provide a simple summary of the literature on treatment effectiveness (see below). Summary recommendations may support, and in some cases not support, the use of particular treatments. Grades of recommendation are determined by looking at all the literature and are then applied to a specific procedure or intervention.

There are four grades of recommendation: A, B, C and I. Treatments that receive an A are supported by good evidence (Level I studies with consistent findings) for or against recommending intervention. Treatments that receive a B are supported by fair evidence (Level II or Level III studies with consistent findings) for or against recommending intervention. C-graded treatments have conflicting or poor quality evidence (Level IV or Level V studies) not allowing a recommendation for or against intervention. Treatments that receive an I do not have sufficient evidence to make a recommendation.
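The four grades can be read as a small decision rule. The Python sketch below is one hypothetical reading of that rule, not the task force's procedure. It assumes each study has already been assigned a Level of Evidence (coded 1 to 5 for Levels I to V), that the reviewer supplies a judgment on whether the findings are consistent, and that the strongest level present drives the grade. The function name and its parameters are illustrative only.

```python
def grade_of_recommendation(levels, consistent):
    """Return 'A', 'B', 'C', or 'I' for a body of evidence.

    levels: list of ints 1-5 (Levels I-V), one per study (assumed input format).
    consistent: True if the reviewer judges the study findings to be consistent.
    """
    if not levels:
        return "I"  # no studies at all: insufficient evidence
    if any(level not in range(1, 6) for level in levels):
        raise ValueError("levels must be integers from 1 (Level I) to 5 (Level V)")

    best = min(levels)  # assumption: the strongest available design drives the grade
    if not consistent or best >= 4:
        return "C"  # conflicting or poor-quality (Level IV or V) evidence
    if best == 1:
        return "A"  # good evidence: consistent Level I studies
    return "B"      # fair evidence: consistent Level II or III studies


# Example: two consistent Level II studies and one Level III study
print(grade_of_recommendation([2, 2, 3], consistent=True))
```

In practice a workgroup weighs the number and quality of studies as well, so a rule like this is at most a starting point for discussion.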

Surgeons can expect to see references to Levels of Evidence more frequently in the future. For example, Levels of Evidence will become part of the process of abstract submission for the AAOS Annual Meeting in 2006 (JBJS 2005:87:161). Grades of Recommendation will also be introduced gradually into orthopaedics. JBJS will be using Grades of Recommendation for Current Concepts Reviews beginning in July 2005.


1. Rahman M, Fukui T. A decline in the U.S. share of research articles. N Engl J Med 2002;347:1211-2.

James G. Wright, MD, MPH, FRCS, is a member of the AAOS Evidence-Based Practice Committee. He can be reached at jim.wright@sickkids.ca.

Levels of Evidence for Primary Research Question¹

Types of Studies

Therapeutic Studies: investigating the results of treatment
Prognostic Studies: investigating the effect of a patient characteristic on the outcome of disease
Diagnostic Studies: investigating a diagnostic test
Economic and Decision Analyses: developing an economic or decision model

Level I

Therapeutic Studies
• High quality randomized clinical trial (RCT) with statistically significant difference or no statistically significant difference but narrow confidence intervals
• Systematic review² of Level I RCTs (and study results were homogenous³)

Prognostic Studies
• High quality prospective study⁴ (all patients were enrolled at the same point in their disease with ≥ 80% follow-up of enrolled patients)
• Systematic review² of Level I studies

Diagnostic Studies
• Testing of previously developed diagnostic criteria on consecutive patients (with universally applied reference “gold” standard)
• Systematic review² of Level I studies

Economic and Decision Analyses
• Sensible costs and alternatives; values obtained from many studies; with multiway sensitivity analyses
• Systematic review² of Level I studies

Level II

Therapeutic Studies
• Lesser quality RCT (e.g. < 80% follow-up, no blinding, or improper randomization)
• Prospective⁴ comparative study⁵
• Systematic review² of Level II studies or Level I studies with inconsistent results

Prognostic Studies
• Retrospective⁶ study
• Untreated controls from an RCT
• Lesser quality prospective study (e.g. patients enrolled at different points in their disease or < 80% follow-up)
• Systematic review² of Level II studies

Diagnostic Studies
• Development of diagnostic criteria on consecutive patients (with universally applied reference “gold” standard)
• Systematic review² of Level II studies

Economic and Decision Analyses
• Sensible costs and alternatives; values obtained from limited studies; with multiway sensitivity analyses
• Systematic review² of Level II studies

Level III

Therapeutic Studies
• Case control study⁷
• Retrospective⁶ comparative study⁵
• Systematic review² of Level III studies

Prognostic Studies
• Case control study⁷

Diagnostic Studies
• Study of non-consecutive patients; without consistently applied reference “gold” standard
• Systematic review² of Level III studies

Economic and Decision Analyses
• Analyses based on limited alternatives and costs; and poor estimates
• Systematic review² of Level III studies

Level IV

Therapeutic Studies
• Case series⁸

Prognostic Studies
• Case series

Diagnostic Studies
• Case-control study
• Poor reference standard

Economic and Decision Analyses
• Analyses with no sensitivity analyses

Level V

Therapeutic Studies
• Expert opinion

Prognostic Studies
• Expert opinion

Diagnostic Studies
• Expert opinion

Economic and Decision Analyses
• Expert opinion

1. A complete assessment of quality of individual studies requires critical appraisal of all aspects of the study design.

2. A combination of results from two or more prior studies.

3. Studies provided consistent results.

4. Study was started before the first patient enrolled.

5. Patients treated one way (e.g. cemented hip arthroplasty) compared with a group of patients treated in another way (e.g. uncemented hip arthroplasty) at the same institution.

6. The study was started after the first patient enrolled.

7. Patients identified for the study based on their outcome, called “cases” (e.g. failed total arthroplasty), are compared to those who did not have that outcome, called “controls” (e.g. successful total hip arthroplasty).

8. Patients treated one way with no comparison group of patients treated in another way.
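
As an illustration of how the therapeutic-study column of the Levels of Evidence table might be applied, here is a minimal Python sketch. It is a simplification under stated assumptions, not an official AAOS or JBJS tool: the class, field names and design labels are hypothetical, systematic reviews are left out, and a high quality RCT that shows no significant difference and has wide confidence intervals is assumed to fall to Level II, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class TherapeuticStudy:
    # Hypothetical design labels: "rct", "prospective_comparative",
    # "retrospective_comparative", "case_control", "case_series", "expert_opinion"
    design: str
    follow_up: float = 1.0          # fraction of enrolled patients followed up
    blinded: bool = True
    properly_randomized: bool = True
    significant: bool = True        # statistically significant difference found
    narrow_ci: bool = False         # narrow confidence intervals


# Non-RCT designs map directly to a level in the therapeutic column.
NON_RCT_LEVELS = {
    "prospective_comparative": "II",
    "retrospective_comparative": "III",
    "case_control": "III",
    "case_series": "IV",
    "expert_opinion": "V",
}


def therapeutic_level(study):
    """Return the Level of Evidence ('I' to 'V') for a single therapeutic study."""
    if study.design == "rct":
        high_quality = (
            study.follow_up >= 0.80
            and study.blinded
            and study.properly_randomized
        )
        # Level I needs a significant difference, or none but narrow intervals.
        conclusive = study.significant or study.narrow_ci
        # Assumption: any RCT that misses either test is treated as Level II.
        return "I" if high_quality and conclusive else "II"
    if study.design not in NON_RCT_LEVELS:
        raise ValueError(f"unknown study design: {study.design!r}")
    return NON_RCT_LEVELS[study.design]


# Example: an RCT with only 70% follow-up is a lesser quality RCT
print(therapeutic_level(TherapeuticStudy("rct", follow_up=0.70)))
```

As footnote 1 notes, a complete assessment of an individual study requires critical appraisal of all aspects of its design, so a lookup of this kind can only suggest a starting level.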
