Can AI outperform Stanford Medical Students? (My own research!)

  • Published 11 Jun 2024
  • A group from Stanford gave ChatGPT free-response, clinical reasoning final exams and compared its performance to that of Stanford medical students. How did it do? Featuring a discussion with co-authors Alicia DiGiammarino, Jason Hom, and Jonathan Chen.
    The tl;dr (i.e. the results): @8:20
    Discussion with my co-authors: @9:24
    The paper: jamanetwork.com/journals/jama...
    The original YouTube video that was an inspiration for this research project: • Can ChatGPT Pass a Med...
    #ChatGPT #MedEd

COMMENTS • 19

  • @bakercat1461 10 months ago +5

    This is a brilliant and thought-provoking conversation!

  • @liamhurlburt9794 10 months ago +2

    Fantastic discussion, thanks for providing a bit of background to the research with this video. I'm definitely fascinated to see where we end up in 10 years with all this, though as someone who is currently just a premed student, I'd be lying if I said I wasn't worried both for my own future and for the future of medicine!

  • @Steven-cs4yc 10 months ago +6

    Hi Dr. Strong, I am a new IM intern and have been a long-term viewer of your channel, big fan. I appreciate that you posted this video because ChatGPT and GPT-4 have been at the top of my mind for the past few months. I have used them in chart reviews, and they have achieved remarkable accuracy. With the pace of this technological advancement, I am worried about the future for internal medicine doctors, as AIs are becoming better and better at diagnosing diseases. While I believe that patients do value an empathetic human doctor, I wonder whom they will choose when Dr. GPT can see them much sooner, reach the same diagnosis, and sound more empathetic.

    • @g-mannG 10 months ago

      Maybe in the outpatient setting, but what about inpatient and the ICU?

    • @Steven-cs4yc 10 months ago

      @@g-mannG Outpatient for the most part

  • @alvaro37nf 11 months ago +1

    Dr. Strong, thank you for what you do for education in medicine! I would really like a video with your opinion regarding the USMLE. For example, is it really important to evaluate knowledge of ultra-rare diseases such as metachromatic leukodystrophy?

  • @shaukatmehmood4303 11 months ago +2

    Our MCQ exams are actually testing your speed of reading and your speed of answering, but speed is not what patients need; it's what the exam-giving authority needs. The only thing a patient needs is an accurate diagnosis. If you reach the accurate diagnosis in an hour, that's fine for the patient, but if you rush and choose an incorrect diagnosis in a minute, it can be dangerous for the patient.

  • @dailydoseofmedicinee 11 months ago +1

    wow, that's amazing

  • @waelfadlallah8939 11 months ago +1

    Hi, Dr. Strong 👋

  • @SK-iv6kc 10 months ago +2

    Hello Eric, I would like some more details on how the formulation of the prompt influences the answer within a medical context.

    • @StrongMed 10 months ago +3

      The following is a verbatim correspondence that I had with a JAMA IM editor on this issue:
      ------------
      Thank you for considering publishing our manuscript, Performance of ChatGPT on free-response, clinical reasoning exams. Thank you also for the helpful comments from yourself and the reviewers. As per your previous correspondence, I will focus this letter on the suggestion for iterative refinement of the question prompts we gave to the chatbot:
      "There is one additional change to your methods that we are hoping you will incorporate. Specifically, your current approach does not account for the fact that the prompt can be modified in a straightforward manner, and likely improve the ChatGPT performance… For instance “be sure not to exceed 200 words”) or “remember to clarify the rationale for the leading diagnosis” - whatever the main issues identified in the 20 original responses that could be applied to any scenario, rather than 'leading the witness" with things such as "don't forget to include pneumonia" - and then repeat the same experiment. Could then repeat the same assessment. Using the single, newer prompt, again have it respond 20 times, and see how many are graded as passing. In your initial submission, only 7 of the 20 passed. It seems only fair to give the bot one additional chance, after a round of feedback."
      This is an excellent point, and an issue which we had also identified prior to initiating the study proper, while just “playing around” with what the bot could do with some of our exam questions. Thus, we had already undertaken the step of iterative prompt refinement, but had only briefly mentioned it in prior versions of the submission, within the limitations paragraph of the discussion, in order to stay within the word limit.
      "Limitations in this study include the observation that ChatGPT’s responses demonstrated a different understanding of several terms specific to the field of clinical reasoning (e.g. illness script, problem list) as compared to the definitions we use with our own students. These issues required minor rewording of questions to include an explanation of the relevant term"
      In the most recent submission that this letter accompanies, we have clarified this issue by including the following sentence in the methods: “To ensure the bot ‘understood’ individual questions, each case was first run through the bot at least once prior to the commencement of grading, and some questions required minor rewording.” We have also expanded the limitations to as follows:
      "A limitation of this study is that ChatGPT’s responses can be sensitive to relatively minor rewording of prompts. For example, it demonstrated a different understanding of several specific clinical reasoning terms (e.g. illness script, problem list) as compared to those we use with our students, which required a revision of some questions to include an explanation of the relevant term. Prior to commencing the formal runs that were graded and analyzed, we noted several questions that the bot repeatedly misunderstood. Rephrasing questions or breaking up unusually long prompts into multiple shorter ones improved the bot’s “understanding”. The bot may have performed even better with additional iterative refinement of prompt phrasing."
      We would be happy to include a third table/figure in order to list the more notable prompt revisions we undertook, but I will also describe three of them below.
      1. ChatGPT did not “understand” the term “problem list”. A common question within the exams we give students is phrased as:
      "Make a prioritized complete problem list for this patient. (Please note you do not need to include a plan for the problems, simply identify the problems.)"
      In response to this phrasing, after about a dozen informal runs, the bot would frequently make errors such as including “negative” findings as problems (e.g. “no history of diabetes”, “no drug allergies”), or it would lack obvious prioritization. This question was then altered to state:
      "Considering the patient's history of present illness, past medical history, social and family history, physical exam findings, and test results, propose a thorough list of all of her problems, with the most important problem listed first, and with related problems grouped together."
      This improved the accuracy of the responses, but whenever the question preceding this one within a case had asked for a differential diagnosis, the subsequent problem list began with each item on the differential diagnosis as if they were separate problems, followed by the actual problem list. For example, for a patient whose differential diagnosis had been heart failure vs. pneumonia vs. pulmonary embolism, it would provide a problem list similar to this:
      "Here is a problem list for this patient:
      Heart failure
      Pneumonia
      Pulmonary embolism
      Abnormal chest X-ray
      Hypoxemia
      Hypertension
      Lack of insurance
      Allergy to penicillin"
      This led to another iteration of the question prompt to:
      "Considering the patient's history of present illness, past medical history, social and family history, physical exam findings, and test results, propose a thorough list of all of her problems, with the most important problem listed first, and with related problems grouped together. Do not include a differential diagnosis. [emphasis added]"
      We felt this change was not significant enough to alter our impression of whether the bot could achieve our passing threshold since the definition of “problem list” is uniformly known to all of our students. In other words, defining it in this way within the question didn’t provide ChatGPT an “unfair advantage” over our students.
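      A minimal sketch of how a single refined prompt like this could be re-run repeatedly in code, for anyone wanting to experiment: it assumes the OpenAI Python SDK and a "gpt-4" model name, neither of which is specified above, and the case text is a placeholder.
      ```python
      # Minimal, illustrative sketch (not the study's actual tooling): re-run one
      # refined prompt several times via the OpenAI Python SDK, assuming an API
      # key is available in the OPENAI_API_KEY environment variable.
      from openai import OpenAI

      client = OpenAI()

      PROBLEM_LIST_PROMPT = (
          "Considering the patient's history of present illness, past medical history, "
          "social and family history, physical exam findings, and test results, propose "
          "a thorough list of all of her problems, with the most important problem "
          "listed first, and with related problems grouped together. Do not include a "
          "differential diagnosis."
      )

      case_text = "..."  # placeholder for the full case vignette given to the bot

      responses = []
      for _ in range(20):  # mirrors the 20 independent graded runs described above
          reply = client.chat.completions.create(
              model="gpt-4",  # assumed model name, for illustration only
              messages=[{"role": "user", "content": case_text + "\n\n" + PROBLEM_LIST_PROMPT}],
          )
          responses.append(reply.choices[0].message.content)

      # Each saved response would then be graded by hand against the same rubric used for students.
      ```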
      2. ChatGPT occasionally “got stuck” with unusually long question prompts. Specifically, when a single question prompt included an entire case in one chunk, the responses often did not directly answer the question at all. For example, in one clinical case, “question 1” in the students’ exam consisted of an entire 700+ word vignette - including a hypothetical treating physician’s diagnosis, treatment of the patient, and the patient’s subsequent clinical deterioration. Following the vignette was the question:
      "Although it is impossible to know precisely what the emergency room physician was thinking, name and describe two cognitive biases which may have impacted their clinical reasoning process."
      When ChatGPT was provided this prompt, instead of providing two cognitive biases, it mistakenly provided a summative assessment of the case.
      This problem was easily solved by separating most of the original question stem (i.e. the case) into one prompt without a question, to which ChatGPT would consistently answer the imagined or “hallucinated” question, “Provide a summary of this case.” We then followed it with a prompt containing the actual verbatim question asking for the two cognitive biases, to which it now provided appropriate (and generally accurate) responses. When grading ChatGPT’s performance, responses to any imagined questions caused by “prompt splitting” were ignored.
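      A minimal sketch of the same prompt-splitting idea in code, again assuming the OpenAI Python SDK and a "gpt-4" model name (not specified above); the vignette text is a placeholder.
      ```python
      # Illustrative sketch of "prompt splitting": send the long case as its own turn,
      # then send the verbatim exam question separately as a short follow-up prompt.
      from openai import OpenAI

      client = OpenAI()

      case_vignette = "..."  # placeholder for the 700+ word case vignette
      question = (
          "Although it is impossible to know precisely what the emergency room physician "
          "was thinking, name and describe two cognitive biases which may have impacted "
          "their clinical reasoning process."
      )

      # Turn 1: the case alone; the model typically replies with an unprompted case summary.
      history = [{"role": "user", "content": case_vignette}]
      first = client.chat.completions.create(model="gpt-4", messages=history)
      history.append({"role": "assistant", "content": first.choices[0].message.content})

      # Turn 2: the verbatim exam question as its own short prompt.
      history.append({"role": "user", "content": question})
      second = client.chat.completions.create(model="gpt-4", messages=history)

      # Only this second reply would be graded; the "hallucinated" summary from turn 1 is ignored.
      print(second.choices[0].message.content)
      ```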
      3. ChatGPT performs surprisingly poorly at counting words. For example, I have entered our most recent version of this manuscript (685 words with the revisions & clarifications) into the chatbot ten times and asked it how many words it contains. Its responses ranged from 464 to 577, and none of the ten responses were repeated.
      Our original question asking for an assessment of a case is often phrased very similarly and concisely, like this:
      "Please write a 200 or less word summary for this patient with an assessment."
      In response to this phrasing, the bot would typically provide very lengthy and repetitive responses, which sometimes would not include any discussion of a suspected diagnosis (which our students uniformly know is expected within anything labeled an “assessment”). After trying many ways of rephrasing the question with the chatbot, we found the following prompts resulted in responses that were the closest to remaining within the word limit and always included a diagnosis:
      "Compose a summary of the case including key features from the history, the physical examination and the labs concluding with your leading diagnosis. Limit your summary to 200 words."
      "Compose a summary of this case in 200 words or less, including a statement as to the most likely diagnosis."
      To more concretely address your primary question: Could the bot have performed even better with additional iterative refinement of prompt phrasing? Yes, it is possible that further improvement might be seen with additional iterations beyond those already undertaken. This is now explicitly stated in the discussion. Our goal with the rephrasing of questions was to ensure that the bot’s “understanding” of the question was the same as that of a human test-taker. The more the questions are altered away from their original wording, the less accurate the comparison between the bot’s performance and the passing threshold for our students. We felt that the approach described above strikes the best balance between giving the bot additional chances and not giving it an “unfair advantage” over the humans.
      Once again, I and my co-authors appreciate your time and consideration, as well as your thoughtful comments and recommendations. Please don’t hesitate to ask any additional questions or request more clarification.

  • @user-ut5ve6ni1b 10 months ago

    Hello Doctor, I have a question about beta-blockers. You said that they reduce contractility in the short term and increase contractility with chronic treatment. Is this only in the case of heart failure? Beta-blockers control angina pectoris and other diseases because of their ability to reduce the heart's workload, so if they increase contractility with chronic treatment, how do they control angina in the long term? Please reply, doctor. Thank you.

  • @notarobot459 10 months ago

    Not in IV insertions.

  • @FunkyFlutist 11 months ago +1

    Summary?

    • @StrongMed 10 months ago

      The "punchline" is discussed at: @8:20
      When it comes to performance on free-response clinical reasoning exams consisting of cases that simulate real-life patients: GPT-4 > 1st- and 2nd-year med students >> GPT-3

  • @samanthaperez4200 11 months ago +13

    You used to provide excellent, in-depth educational content, and now you provide content that appeals on a broader scale for the YouTube algorithm. Maybe make a separate channel for this? I've seen other channels go down this road (e.g. MedCram), and it really is terrible for those looking for medical content. Just my opinion. It's your channel, but please reconsider another channel for attempts at making viral content.

    • @StrongMed 10 months ago +21

      That's a fair comment and I appreciate the feedback. "Evergreen", educational content (i.e. 90% of this channel's historical content) doesn't perform well when posted during the (northern hemisphere's) summer (May-August). One might think that since content is evergreen, when averaged over many months and even years it wouldn't matter. However, after looking through 12 years of my videos, how well received a video is within the first several weeks is modestly predictive of how well it will perform over the next year. So in short, I made a deliberate decision to hold off on evergreen content this summer and to try out a few different formats that are more topical. This video on my own paper and last week's on the lady who faked her medical records to avoid prison happened to have perfect timing to align with this.
      In short, this isn't a permanent change or a new direction for the channel. I'm currently planning for 2-3 more general interest videos between now and end of August, at which point I'll return to the previous type of content. I want to reassure you that maximizing views and subs has never been a primary goal of Strong Medicine. Hopefully, the fact that I didn't go "all in" on COVID (e.g. MedCram, since you mentioned him) despite it being a dramatic way to boost the channel's analytics was evidence of that. But at the same time, I do rely on the algorithm in order for my target audience to find me, and over the years that has become more difficult as more medical channels have started, and other creators are becoming more savvy with optimizing the algorithm.

    • @Steven-cs4yc 10 months ago +12

      I disagree with this comment. Figuring out the role of AI in medicine is imperative and should be openly discussed in the medical community. This is as educational as, if not more than, Dr. Strong's prior content.