How does a Large Language Model (LLM) Artificial Intelligence (AI) Fare with Adult Reconstruction Knowledge?

Volume VIII, Number 1 | Spring 2024

¹Lum Z, ¹Collins D, ¹Guntupalli L, ¹Dennison S, ²Choudhary S, ²Saiz A, ²Randall R
¹Nova Southeastern University, Fort Lauderdale, FL, United States; ²University of California Davis, Sacramento, CA, United States

Introduction
As the capabilities of artificial intelligence (AI) continue to advance, it is important to regularly evaluate its competency in order to maintain high standards and prevent errors or biases that could spread misinformation and harm patients. A new AI model built on a large language model (LLM) and not restricted to a specific domain has recently gained attention for its novel way of processing information. We tested its ability to correctly answer hip & knee arthroplasty questions compared with other subject types and across taxonomy question types (recall, interpretation, knowledge application).

Methods
We asked ChatGPT 3173 questions based on the Orthopaedic In-Training Exam (OITE) and 757 questions from the real OITE. Questions were categorized by subject type and by taxonomy type. They were then entered into the AI chatbot, and the score was recorded. Multivariate logistic regression analysis was performed comparing hip & knee arthroplasty questions with other question types, and comparing performance across taxonomy types.

Results
After exclusions, ChatGPT answered 960/1871 (51%) of total questions correctly and 77/194 (40%) of hip & knee arthroplasty questions correctly, placing arthroplasty among the lower-performing subject types. Performance on reconstruction questions was poorer than on Basic Science (p<0.001, OR 3.23 [2.26-4.65]), Pathology/Oncology (p<0.001, OR 2.82 [1.72-4.65]), Knee & Sports Medicine (p<0.001, OR 1.99 [1.34-2.96]) and Pediatrics (p=0.011, OR 1.70 [1.12-2.56]). In the sub-group taxonomy analysis, univariate logistic regression demonstrated the AI's lower performance in taxonomy type 3 compared to type 1 (50% vs 41%, p=0.049).
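The percentages above follow directly from the reported counts, and the same counts allow a rough, unadjusted comparison of arthroplasty questions against all other questions pooled together. The sketch below is illustrative only: the pooled "other" group, the function names, and the Wald confidence interval are assumptions introduced here, not the study's method, and the resulting odds ratio is not one of the reported multivariate results.

```python
import math

def proportion(correct, total):
    """Fraction of questions answered correctly."""
    return correct / total

def unadjusted_odds_ratio(a_correct, a_total, b_correct, b_total, z=1.96):
    """Odds ratio of a correct answer in group A vs group B,
    with a Wald 95% confidence interval (hypothetical helper)."""
    a_wrong = a_total - a_correct
    b_wrong = b_total - b_correct
    odds_ratio = (a_correct / a_wrong) / (b_correct / b_wrong)
    # Standard error of log(OR) from the four cell counts
    se = math.sqrt(1 / a_correct + 1 / a_wrong + 1 / b_correct + 1 / b_wrong)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Counts reported in the Results
TOTAL_CORRECT, TOTAL = 960, 1871
ARTHRO_CORRECT, ARTHRO_TOTAL = 77, 194

# Derived by subtraction (assumption: every non-arthroplasty question pooled)
OTHER_CORRECT = TOTAL_CORRECT - ARTHRO_CORRECT
OTHER_TOTAL = TOTAL - ARTHRO_TOTAL

overall_pct = round(100 * proportion(TOTAL_CORRECT, TOTAL))
arthro_pct = round(100 * proportion(ARTHRO_CORRECT, ARTHRO_TOTAL))
pooled_or, ci_low, ci_high = unadjusted_odds_ratio(
    OTHER_CORRECT, OTHER_TOTAL, ARTHRO_CORRECT, ARTHRO_TOTAL
)

print(f"Overall: {overall_pct}%  Arthroplasty: {arthro_pct}%")
print(f"Pooled other vs arthroplasty OR: {pooled_or:.2f} [{ci_low:.2f}-{ci_high:.2f}]")
```

Because this pools every other subject into one group and adjusts for nothing, it should be read only as a sanity check on the direction of the effect, not as a substitute for the per-subject odds ratios reported above.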

Discussion
The AI LLM may not be effective in answering orthopaedic questions related to hip and knee arthroplasty. Furthermore, the study’s taxonomy analysis highlights the importance of considering the question structure when evaluating AI performance. Ultimately, as AI continues to evolve and advance, it will be important to consider its limitations and potential biases to ensure its responsible and ethical use.

The Journal of the American Osteopathic Academy of Orthopedics

Steven J. Heithoff, DO, FAOAO
Editor-in-Chief

© AOAO. All copyrights of published material within the JAOAO are reserved. No part of this publication can be reproduced or transmitted in any way without the permission in writing from the JAOAO and AOAO. Permission can be requested by contacting Joye Stewart at [email protected].