A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

It also includes outtakes from the ‘reasoning’ models.

  • FireWire400@lemmy.world · 12 hours ago

    Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it’s better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

    Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

    • Saterz@lemmy.world · 10 hours ago

      Well, it is a 9B model after all. Self-hosted models only become minimally “intelligent” at around 16B parameters. For context, the models running on Google’s servers are close to 300B parameters.
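
      Back-of-the-envelope on why those sizes matter for self-hosting; a sketch counting only weight storage (2 bytes per weight at fp16, 0.5 at 4-bit quantization) and ignoring context and activation overhead:

```python
def weights_gb(params_billion, bytes_per_weight):
    """Approximate weight-storage footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_weight  # 1e9 params * bytes / 1e9

for params in (9, 16, 300):
    print(f"{params}B params: "
          f"{weights_gb(params, 2.0):.1f} GB fp16, "
          f"{weights_gb(params, 0.5):.1f} GB 4-bit")
```

      A 9B model fits on a consumer GPU once quantized; a ~300B model needs server-class hardware no matter what.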

  • CetaceanNeeded@lemmy.world · 23 hours ago

    I asked my locally hosted Qwen3 14B; it thought for 5 minutes and then gave the correct answer for the correct reason (it also mentioned efficiency).

    Hilariously one of the suggested follow ups in Open Web UI was “What if I don’t have a car - can I still wash it?”

  • elbiter@lemmy.world · 1 day ago

    I just tried it on Brave’s AI

    The obvious choice, said the motherfucker 😆

  • WraithGear@lemmy.world · 1 day ago

    And what is going to happen is that some engineer will band-aid the issue, all the AI-crazed people will shout “see! it’s learnding!”, and the AI snake-oil salesmen will use that as justification for all the waste and demand even more from these systems.

    Just like what they did with the full-glass-of-wine test. And no, AI did not fundamentally improve; the issue is fundamental to its design, not an issue with the dataset.

    • mycodesucks@lemmy.world · 12 hours ago

      Yes, but it’s going to repeat that way FOREVER, the same way the average person got slow-walked, hand in hand with a mobile operating system, into corporate social media and app hell, taking the entire internet with them.

    • turmacar@lemmy.world · 1 day ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.

      Frankly, I would like to see some error bars on the “human polling”. How many of the people Rapidata polled just hit the top or bottom answer?
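
      Rough numbers on those error bars, as a sketch assuming a simple binomial model (1.96 is the usual 95% normal-approximation factor):

```python
import math

def margin_95(p, n):
    """Half-width of a 95% normal-approximation confidence interval."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Rapidata's poll: 71.5% "drive" out of 10,000 respondents
print(round(margin_95(0.715, 10_000), 3))  # about 0.009, i.e. under 1 point

# A model graded on 10 runs at 7/10 correct
print(round(margin_95(0.7, 10), 2))        # about 0.28, i.e. 28 points
```

      So the human number is tight, but a 10-run model score swings by nearly 30 points either way.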

  • Bluewing@lemmy.world · 1 day ago

    I just asked Google Gemini 3 “The car is 50 miles away. Should I walk or drive?”

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Methinks I detect notes of sarcasm…

    • XeroxCool@lemmy.world · 1 day ago

      I feel like we’re the only ones who expect “all-knowing information sources” to write more seriously than these edgelord-level rizzy chatbots do, and yet here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

  • MojoMcJojo@lemmy.world · 1 day ago

    AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at our text and books, not reading them.

  • Slashme@lemmy.world · 2 days ago

    The most common pushback on the car wash test: “Humans would fail this too.”

    Fair point. We didn’t have data either way. So we partnered with Rapidata to find out. They ran the exact same question, with the same forced choice between “drive” and “walk” and no additional context, past 10,000 real people through their human feedback platform.

    71.5% said drive.

    So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

    • T156@lemmy.world · 2 days ago

      It is an online poll. You also have to consider that some people don’t care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

    • JcbAzPx@lemmy.world · 1 day ago

      At least some of that is people answering wrong on purpose, to be funny, to be contrarian, or just to try to hurt the study.

  • pimpampoom@lemmy.zip · 1 day ago

    They didn’t take into account the “thinking mode” most models pass through when thinking is activated.

  • TrackinDaKraken@lemmy.world · 2 days ago

    I think it’s worse when they get it right only some of the time. It’s not a matter of opinion; it should not change its “mind”.

    The fucking things are useless for that reason: they’re all just guessing, literally.

    • Hazzard@lemmy.zip · 1 day ago

      They also polled 10,000 people to compare against a human baseline:

      Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher “drive” rate.

      That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.
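
      For what it’s worth, 7/10 really is statistically indistinguishable from 71.5%; a quick exact-binomial sketch:

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Chance of scoring 7/10 or worse if the true success rate were 0.715
p_le_7 = sum(binom_pmf(k, 10, 0.715) for k in range(8))
print(round(p_le_7, 2))  # about 0.58: entirely unremarkable
```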

      • myfunnyaccountname@lemmy.zip · 10 hours ago

        Can they do two samplings for that? One in a city with a decent-to-good education system, the other in the backwoods in the middle of nowhere, where family trees are sticks.

      • Modern_medicine_isnt@lemmy.world · 1 day ago

        This here is the point most people fail to grasp. The AI was taught by people, and people are wrong a lot of the time, so the AI is more like us than we think it should be. Right down to getting the right answer for all the wrong reasons. We should call it human AI. Lol.

        • NewNewAugustEast@lemmy.zip · 1 day ago

          Like I said to the person above, there is no wrong answer. It’s all about assumptions. It is a stupid trick question that no one would ask.

            • NewNewAugustEast@lemmy.zip · 1 day ago

              LOL! That is a great answer.

              I have a Microsoft story. I know someone who was hired to stop them from continuing an open source project. Microsoft gave them a good salary, stock options, and an office with a fully stocked bar, and said do whatever you want; the company figured it would get a good developer and kill the open source competition (back in the Ballmer days).

              Sadly, given money and no real ambition to create closed-source software, they mostly spent their days in their office and basically drank themselves to death.

              Microsoft just kills everything it touches.

    • NewNewAugustEast@lemmy.zip · 1 day ago

      What is the wrong answer, though? It is a stupid question. I would look at you sideways if you asked me this, because the obvious answer is “walk, silly, the car is already at the car wash”. Otherwise why would you ask?

      Which is telling, because when asked to review the answer, the AIs I have seen said: you asked me how you were going to get to the car wash, so I assumed the car was already there.

    • eronth@lemmy.world · 1 day ago

      Yeah I straight up misread the question, so I would have gotten it wrong.

  • vane@lemmy.world · 2 days ago

    I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

  • BanMe@lemmy.world · 2 days ago

    In school we were taught to look for hidden meaning in word problems, Chekhov’s gun basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but expect tricks.

    If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

    Normally I’d ask “why are we comparing AI to the human mind when they’re not the same thing at all,” but I feel like we’re presupposing they are similar already with this test, so I am curious about the answer to this one.
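
    A minimal sketch of what such a pre-prompt might look like; the wording and message shape are my own assumptions, following the common OpenAI-style chat format:

```python
# Hypothetical trick-aware system prompt; nothing here comes from the article.
TRICK_AWARE_PROMPT = (
    "Before answering, restate each sentence of the question and decide "
    "whether it constrains the answer or is a distractor. Word problems "
    "may contain tricks; treat every detail as potentially load-bearing."
)

def build_messages(question: str) -> list[dict]:
    """Assemble a chat payload with the trick-aware system prompt first."""
    return [
        {"role": "system", "content": TRICK_AWARE_PROMPT},
        {"role": "user", "content": question},
    ]

msgs = build_messages(
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)
print(msgs[0]["role"])  # system
```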

    • punkibas@lemmy.zip · 2 days ago

      At the end of the article they talk about how to overcome this problem for LLMs by doing something akin to what you wrote.

  • tover153@lemmy.world · 2 days ago

    After getting it wrong, the LLM I use most:

    Me: You can’t wash your car if it isn’t there.

    Ah. Yes. That is an excellent and devastatingly practical correction.

    In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

    This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

    So:

    Start engine.

    Travel 50 meters.

    Avoid eye contact with pedestrians.

    Commit fully.

    You are not lazy. You are complying with system requirements.

  • TankovayaDiviziya@lemmy.world · 1 day ago

    We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and to possess contextual knowledge. The current model of LLM needs a lot more input and instruction to do what you specifically want it to do, like a child.

    Edit: I know Lemmy scoffs at LLMs, but people probably also scoffed at Verbiest’s steam machine, saying it would never amount to anything. Give it time and it will improve. I’m not endorsing AI, by the way; I am on the fence about its long-term consequences, but whether people like it or not, AI will impact human lives.

    • Rob T Firefly@lemmy.world · 1 day ago

      LLMs are not children. Children can have experiences, learn things, know things, and grow. Spicy autocomplete will never actually do any of these things.

        • herrvogel@lemmy.world · 1 day ago

          LLMs can’t learn. It’s one of their inherent properties that they are literally incapable of learning. You can train a new model, but you can’t teach new things to an already trained one. All you can do is adjust its behavior a little bit. That creates an extremely expensive cycle where you just have to spend insane amounts of energy to keep training better models over and over and over again. And the wall of diminishing returns on that has already been smashed into. That, and the fact that they simply don’t have concepts like logic and reasoning and knowing, puts a rather hard limit on their potential. It’s gonna take several sizeable breakthroughs to make LLMs noticeably better than they are now.

          There might be another kind of AI that solves those problems inherent to LLMs, but at present that is pure sci-fi.

        • Rob T Firefly@lemmy.world · 1 day ago

          Our microorganism ancestors also did all those things, and they were far beyond anything an LLM can do. Turning a given list of words into numbers, doing a string of math to those numbers, and turning the resulting numbers back into words is not consciousness or wisdom and never will be.

          • TankovayaDiviziya@lemmy.world · 1 day ago

            You think microorganisms can reason? Wow, AI haters are grasping at straws.

            Honestly, I don’t understand Lemmy scoffing at AI and thinking the current iteration is all it will ever be. I’m sure some people thought automobile technology would not go anywhere simply because the first model ran at 3 mph. These things always take time.

            To be clear, I’m not endorsing AI, but I think there is huge potential in the years to come, for better or worse. And it is especially important never to underestimate something, especially for the AI haters, given the destructive potential AI has.

    • kshade@lemmy.world · 1 day ago

      We have already thrown just about all of the Internet and then some at them. It shows that LLMs cannot think or reason, which isn’t surprising: they weren’t meant to.

      • eronth@lemmy.world · 1 day ago

        Or at least they can’t reason the way we do about our physical world.

        • Nalivai@lemmy.world · 1 day ago

          You’re falling into the same trap. When the letters on the screen tell you something, it’s not necessarily the truth. When “I’m reasoning” is written in a chatbot window, it doesn’t mean there is something that’s reasoning.

      • Nalivai@lemmy.world · 1 day ago

        By now it’s getting clear that this is fundamentally the best version of the thing we’re going to get. This is its prime time.
        For some time there was a legitimate question of “if we give it enough data, will there be a qualitative jump?”, and as far as we can see, we’re already past that jump: a predictive algorithm can form grammatically correct sentences that are related to the context. That’s it; that’s the jump.
        Now a bunch of salespeople are trying to convince us that if there was one jump there will necessarily be others, while there is no real indication of that.