AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 6 months ago

lemmy_outta_here@lemmy.world · 6 months ago

Rookie numbers! Let’s pump them up!

To match their tech bro hypers, they should be wrong at least 90% of the time.

Frenezul0_o@lemmy.world · 6 months ago

I notice that the research didn’t include DeepSeek. It would have been nice to see how it compares.

gargle@lemmy.world · 6 months ago

I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a perform statement, and without any self-referential redefines. It’s a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.
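For anyone who hasn't met a quine before, here is a minimal sketch of the self-reference trick in Python rather than COBOL (so it says nothing about BS2000 specifics). The names `template`, `quine_source`, and `run_and_capture` are made up for this illustration. The idea is that the program stores a template of itself and prints the template filled in with its own representation.

```python
import contextlib
import io

# Template that contains a placeholder (%r) for its own repr.
template = 's = %r\nprint(s %% s)'

# Filling the template with itself yields the complete two-line quine.
quine_source = template % template


def run_and_capture(source: str) -> str:
    """Execute source in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


# Show the quine, then check that running it reproduces its own text.
print(quine_source)
print(run_and_capture(quine_source) == quine_source + "\n")
```

The last line should print `True`: the program's output is its own source plus the trailing newline that `print` adds. A COBOL version needs the same two ingredients (data that describes the code, and code that prints the data twice), which is roughly what the redefines are for.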

Colour me unimpressed. I dread the day when they force the use of ‘AI’ on us at work.

Melvin_Ferd@lemmy.world · 6 months ago

How often do tech journalists get things wrong?

FenderStratocaster@lemmy.world · 6 months ago

I tried to order food at Taco Bell drive through the other day and they had an AI thing taking your order. I was so frustrated that I couldn’t order something that was on the menu I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I’m not going to use your services or products unless I’m forced to. Looking at you Xfinity.

NarrativeBear@lemmy.world · 6 months ago

The ones being implemented into emergency call centers are better though? Right?

TeddE@lemmy.world · 6 months ago

Yes! We’ve gotten them up to 94% wrong at the behest of insurance agencies.

Tollana1234567@lemmy.today · 6 months ago

I wonder how the evil Palantir uses its AI.

sircac@lemmy.world · 6 months ago

Why would they be right beyond word sequence frequencies?

HertzDentalBar@lemmy.blahaj.zone · 6 months ago

So no different than answers from middle management I guess?

TankovayaDiviziya@lemmy.world · 6 months ago

At least AI won’t fire you.

Corkyskog@sh.itjust.works · 6 months ago

It kinda does when you ask it something it doesn’t like.

HertzDentalBar@lemmy.blahaj.zone · 6 months ago

Idk, the new iterations might just. Shit, Amazon already uses automated systems to fire people.

zbyte64@awful.systems · 6 months ago

DOGE has entered the chat

suburban_hillbilly@lemmy.ml · 6 months ago

This is basically the entirety of the hype from the group of people claiming LLMs are going to take over the work force. Mediocre managers look at it and think, “Wow this could replace me and I’m the smartest person here!”

Sure, Jan.

sheogorath@lemmy.world · 6 months ago

I won’t tolerate Jan slander here. I know he’s just a builder, but his life path has the most probability of having a great person out of it!

Cavemanfreak@programming.dev · 6 months ago

I’d say Jan Botanist is also up there as being a pretty great person.

sheogorath@lemmy.world · 6 months ago

Jan Refiner is up there for me.

Cavemanfreak@programming.dev · 6 months ago

I just arrived at act 2, and he wasn’t one of the four I’ve unlocked…

atticus88th@lemmy.world · 6 months ago

this study was written with the assistance of an AI agent.

Katana314@lemmy.world · 6 months ago

I’m in a workplace that has tried not to be overbearing about AI, but has encouraged us to use them for coding.

I’ve tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that’s both wrong and doesn’t verify anything.

I’m aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it’s not even saving time. I would do this with a human in the hopes that they would continue to retain the knowledge, but I don’t even have hopes for AI to apply those lessons in new contexts. In a way, it’s been a sigh of relief to realize just like Dotcom, just like 3D TVs, just like home smart assistants, it is a bubble.
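For context, the kind of constructor-only test described above can be very small. This is a hedged sketch using Python's standard `unittest`; the `Account` class and its behavior are hypothetical stand-ins, not the actual code from my workplace. The point is that each test pins down one piece of what the constructor currently does.

```python
import unittest


class Account:
    """Hypothetical class under test; stands in for the real one."""

    def __init__(self, owner: str, balance: float = 0.0):
        if balance < 0:
            raise ValueError("balance must be non-negative")
        self.owner = owner
        self.balance = balance


class AccountConstructorTest(unittest.TestCase):
    """Pins down what the constructor currently does, nothing more."""

    def test_defaults(self):
        account = Account("alice")
        self.assertEqual(account.owner, "alice")
        self.assertEqual(account.balance, 0.0)

    def test_explicit_balance(self):
        account = Account("bob", balance=12.5)
        self.assertEqual(account.balance, 12.5)

    def test_rejects_negative_balance(self):
        with self.assertRaises(ValueError):
            Account("carol", balance=-1.0)
```

Saved in a file, this runs with `python -m unittest`. Every assertion checks an observable result of the constructor, which is exactly what the generated tests failed to do.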

MangoCats@feddit.it · 6 months ago

The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn’t show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

SocialMediaRefugee@lemmy.world · 6 months ago

I’ve had good results being very specific, like “Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place.”

MangoCats@feddit.it · 6 months ago

I have been more successful with baby steps like: “Write a python 3 program that converts X to Y.” Tweak prompt until that’s working as desired, then: “make it work recursively through all subdirectories” - and again tweak with specifics like converting the files in place, etc. Always very specific, also - force it to fix its own bugs so you can move forward with a clean example as you add complexity. Complexity seems to cap out at a couple of pages of code, at which point “Ooops, something went wrong.”
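To make the end state of those baby steps concrete, here is a rough sketch of what the final script tends to look like. Since "X to Y" is left unspecified, the conversion is passed in as a function; `convert_tree` and its parameters are names invented for this example, and it assumes the files are UTF-8 text.

```python
from pathlib import Path
from typing import Callable


def convert_tree(root: Path, pattern: str, convert: Callable[[str], str]) -> int:
    """Apply convert to every matching file under root, in place.

    Returns the number of files whose contents actually changed.
    """
    changed = 0
    # rglob walks root and all of its subdirectories.
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        converted = convert(original)
        # Only rewrite files that need it, so untouched files keep their mtime.
        if converted != original:
            path.write_text(converted, encoding="utf-8")
            changed += 1
    return changed
```

For example, `convert_tree(Path("."), "*.txt", str.upper)` would upper-case every `.txt` file below the current directory. Step one of the prompting is the `convert` function on its own, step two is the `rglob` loop, and step three is the write-back.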

vivendi@programming.dev · 6 months ago

Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

I’m not joking, it really works

For example:

Instead of “You are an intelligent coding assistant…”

“You are an absolute fucking idiot who can barely code…”

rozodru@lemmy.world · 6 months ago

“You are an absolute fucking idiot who can barely code…”

Honestly, that’s what you have to do. It’s the only way I can get through using Claude.ai. I treat it like it’s an absolute moron, I insult it, I “yell” at it, I threaten it and guess what? the solutions have gotten better. not great but a hell of a lot better than what they used to be. It really works. it forces it to really think through the problem, research solutions, cite sources, etc. I have even told it i’ll cancel my subscription to it if it gets it wrong.

no more “do this and this and then this but do this first and then do this” after calling it a “fucking moron” and what have you it will provide an answer and just say “done.”

DragonTypeWyvern@midwest.social · 6 months ago

This guy is the moral lesson at the start of the apocalypse movie

MangoCats@feddit.it · 6 months ago

He’s developing a toxic relationship with his AI agent. I don’t think it’s the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it’s the only method he is capable of getting results with.

MangoCats@feddit.it · 6 months ago

I frequently find myself prompting it: “now show me the whole program with all the errors corrected.” Sometimes I have to ask that two or three times, different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I’ll just write "address: " and copy-paste the error message in - frequently the text of the AI response will apologize, less frequently it will actually fix the error.

jj4211@lemmy.world · 6 months ago

I’ve found that as an ambient code completion facility it’s… interesting, but I don’t know if it’s useful or not…

So on average, it’s totally wrong about 80% of the time, 19% of the time the first line or two is useful (either correct or close enough to fix), and 1% of the time it seems to actually fill in a substantial portion in a roughly acceptable way.

It’s exceedingly frustrating and annoying, but not sure I can call it a net loss in time.

So reviewing the proposal for relevance, cutting it off, and editing it adds time to my workflow. Let’s say that on average, for a given suggestion, I will spend 5% more time deciding whether to trash it, use it, or amend it versus not having a suggestion to evaluate in the first place. If the 20% useful time is 500% faster for those scenarios, then I come out ahead overall, though I’m annoyed 80% of the time. My guess as to whether the suggestion is even worth looking at also improves: if I’m filling in a pretty boilerplate thing (e.g. taking some variables and starting to write out argument parsing), then it has a high chance of a substantial match. If I’m doing something even vaguely esoteric, I just ignore the suggestions popping up.
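The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch under stated assumptions: "500% faster" is read as a 5x speed-up on the useful 20%, the 5% review overhead is paid on every suggestion, and the function names are invented for this example.

```python
def relative_time(hit_rate: float, review_overhead: float, speedup: float) -> float:
    """Expected time per task with suggestions, relative to typing it all by hand (1.0)."""
    # Useless suggestion: full manual effort plus the time spent reviewing it.
    miss = (1 - hit_rate) * (1 + review_overhead)
    # Useful suggestion: sped-up effort plus the same review time.
    hit = hit_rate * (1 / speedup + review_overhead)
    return miss + hit


def break_even_speedup(hit_rate: float, review_overhead: float) -> float:
    """Speed-up on useful suggestions at which the net effect is exactly zero.

    Solves relative_time(...) == 1 for speedup; only meaningful when
    hit_rate > review_overhead.
    """
    return hit_rate / (hit_rate - review_overhead)


print(round(relative_time(0.20, 0.05, 5.0), 2))    # 0.89
print(round(break_even_speedup(0.20, 0.05), 2))    # 1.33
```

Under those assumptions the expected time is 0.8 × 1.05 + 0.2 × 0.25 = 0.89 of the no-suggestion baseline, so roughly an 11% saving, and the useful suggestions only need to be about 1.33x faster to break even. None of this accounts for the cost of accepting a wrong suggestion without noticing.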

However, the 20% is a problem still since I’m maybe too lazy and complacent and spending the 100 milliseconds glancing at one word that looks right in review will sometimes fail me compared to spending 2-3 seconds having to type that same word out by hand.

That 20% success rate allowing for me to fix it up and dispose of most of it works for code completion, but prompt driven tasks seem to be so much worse for me that it is hard to imagine it to be better than the trouble it brings.

RamenJunkie@midwest.social · edit-2 6 months ago

I find it’s good at making simple Python scripts.

But also, as I evolve them, it starts randomly omitting previous functions. So it helps to know what you are doing at least a bit to catch that.

Affidavit@lemmy.world · 6 months ago

“…for multi-step tasks”

loonsun@sh.itjust.works · 6 months ago

It’s about agents, which implies multi-step, as those are meant to execute a series of tasks, as opposed to studies looking at base LLM performance.

RamenJunkie@midwest.social · 6 months ago

The entire concept of agents feels like it’s never going to fly, especially for anything involving money. I am not going to tell an AI I want to bake a cake and trust that it will find the correct ingredients at the right price and then DoorDash them to me.

iopq@lemmy.world · 6 months ago

Now I’m curious, what’s the average score for humans?

kinsnik@lemmy.world · 6 months ago

I haven’t used AI agents yet, but my job is kinda pushing for them. but i have used the google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. i fed it some of my own writing and created the podcast. it was fun, it was an audio overview of what i wrote. about 80% was cool analysis, but 20% was straight out of nowhere bullshit (which i know because I wrote the original texts that the audio was talking about). i can’t believe that people are using this for subjects that they have no knowledge of. it is a fun toy for a few minutes (which is not worth the cost to the environment anyway)