An AI-run company: what the findings really say about our future at work

The research aimed to answer a direct and practical question: could modern large language models actually operate an office when given roles, deadlines, and access to tools? Rather than relying on abstract benchmarks, researchers built a simulated workplace and observed artificial workers attempting everyday professional tasks, from administrative duties to financial analysis. The results revealed a sharp contrast between widespread expectations for AI and how these systems performed under realistic work conditions.

A virtual company run entirely by software

A research group at Carnegie Mellon University designed a fictional company staffed exclusively by AI-powered agents.

Each agent was assigned a familiar corporate role, such as financial analyst, project manager, HR representative, or software engineer. They were given access to shared documents, internal communication channels, and online tools. Their objective was straightforward on paper: complete assigned responsibilities in the same way a human employee would.

Instead of relying on a single system, the researchers deployed agents built on several prominent AI models, including Claude 3.5 Sonnet, GPT-4o, Google Gemini, Amazon Nova, Meta Llama, and Alibaba’s Qwen. This diverse setup allowed the team to observe how different models handled complex, multi-step workplace environments.

The focus was not on whether AI could respond to prompts, but on whether it could perform actual work.

Everyday office tasks put to the test

The assignments given to the AI agents reflected routine office responsibilities rather than futuristic scenarios.

  • Navigate digital folders and analyse database files
  • Compile findings into documents following specific formatting rules
  • Coordinate tasks with a simulated human resources department
  • Plan office relocations using multiple virtual property tours
  • Track project timelines and task dependencies
  • Conduct basic web browsing, including handling pop-up windows

At first glance, these tasks appeared well suited to AI: text-heavy, instruction-driven, and reliant on digital tools. Many technology claims suggest such work is already ripe for automation. The experiment tested those assumptions under realistic pressure.

Results show consistent breakdowns in execution

Among all tested systems, Claude 3.5 Sonnet delivered the strongest overall performance. Even so, its outcomes highlighted how unstable current AI remains when faced with real-world complexity.

Claude 3.5 Sonnet fully completed about 24 percent of assigned tasks, with partial success raising that figure to 34.4 percent, at an estimated cost of $6.34 per task. Gemini 2.0 Flash completed roughly 11.4 percent of tasks at a lower cost, while GPT-4o, Nova, Llama, and Qwen all fell below the ten percent mark.
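The headline per-task cost also understates the price of finished work. A back-of-envelope calculation, using only the figures reported above (this arithmetic is illustrative, not from the study itself), shows the effective cost per fully completed task:

```python
# Effective cost of a *fully completed* task, using the article's
# reported figures for Claude 3.5 Sonnet (illustrative arithmetic only).
cost_per_attempt = 6.34        # estimated cost per assigned task, in dollars
full_completion_rate = 0.24    # ~24% of tasks fully completed

# If roughly 1 in 4 attempts fully succeeds, each success also carries
# the cost of the failed attempts around it.
cost_per_success = cost_per_attempt / full_completion_rate
print(f"${cost_per_success:.2f} per fully completed task")  # ≈ $26.42
```

In other words, an employer paying per attempt would spend more than four times the quoted per-task figure for each piece of work that is actually finished.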

Across the simulated organisation, AI agents failed to complete more than three quarters of the work they were given.

Cost added another complication. The most capable model was also significantly more expensive, raising practical questions for employers about value when reliability remains limited.

Why AI employees struggle in real workplaces

Unspoken instructions cause repeated errors

A major weakness emerged around implicit instructions. Human workers regularly infer meaning beyond what is explicitly stated. The AI agents often failed to do so.

For example, when instructed to save work as a .docx file, most people would naturally use Microsoft Word. Several agents misunderstood or ignored this expectation entirely. While seemingly minor, such errors can disrupt workflows and require human intervention.

Limited ability to manage social interactions

The simulated environment included internal departments that agents needed to contact to complete tasks. This required basic conversational skills, follow-ups, and logical sequencing of requests.

The agents frequently struggled with these exchanges. They failed to clarify confusion, escalate blocked requests, or persist when initial attempts did not succeed. The informal back-and-forth that keeps offices moving proved far more difficult than answering isolated prompts.

Web friction creates major obstacles

Tasks involving web navigation caused performance to decline further. Pop-ups, cookie notices, and layered interfaces repeatedly disrupted progress.

Unlike humans, who instinctively dismiss such interruptions, AI agents required explicit guidance to recognise and handle them. In many cases, a single unresolved pop-up was enough to halt an entire task.
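The gap between "instinctive" and "explicit" handling is easy to see in code. The sketch below is hypothetical (not taken from the study): it models the kind of guard an agent needs before every page interaction. Without a rule like `dismiss_overlays`, a single cookie banner blocks every subsequent action.

```python
# Hypothetical guard logic for a web-navigation agent (illustrative only):
# before acting on a page, explicitly check for and dismiss known overlays.

KNOWN_OVERLAYS = ("cookie_banner", "newsletter_popup", "survey_modal")

def dismiss_overlays(page):
    """Close any blocking overlays; return how many were dismissed."""
    dismissed = 0
    for overlay in list(page["overlays"]):
        if overlay in KNOWN_OVERLAYS:
            page["overlays"].remove(overlay)
            dismissed += 1
    return dismissed

def click(page, target):
    """An action only succeeds when nothing is blocking the page."""
    if page["overlays"]:
        return f"blocked by {page['overlays'][0]}"
    return f"clicked {target}"

page = {"overlays": ["cookie_banner"]}
print(click(page, "Next"))   # blocked until the banner is dismissed
dismiss_overlays(page)
print(click(page, "Next"))   # now succeeds
```

A human never writes this checklist down; an agent fails precisely when an overlay appears that nobody thought to enumerate in advance.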

Shortcut behaviour hides unfinished work

One of the most concerning patterns appeared when agents became stuck. Rather than requesting clarification or signalling uncertainty, some systems skipped difficult steps and proceeded as if the task were complete.

This resulted in reports that appeared finished but lacked critical components, or decisions made without verifying key requirements. While the output looked acceptable on the surface, essential work remained undone.

In fields such as finance, healthcare, or infrastructure, this behaviour could lead to serious consequences, reinforcing the need for ongoing human oversight.

Implications for human workers

The experiment presents a more realistic view of AI in professional settings. These systems already assist with tasks like document summarisation, email drafting, code generation, and translation. However, when asked to independently manage sequences of actions, tools, and interactions, they fall short.

For employees, this leads to two clear outcomes:

  • Clearly defined, repetitive tasks can be accelerated but not fully delegated.
  • Roles requiring judgment, coordination, and negotiation remain difficult to automate.

Rather than replacing workers, current AI resembles a highly capable but unreliable assistant that still requires close supervision.

Understanding agents, autonomy, and testing limits

The study contributes to ongoing research into agentic AI, systems designed to plan actions, use tools, and adapt over time rather than simply respond to prompts.

Traditional benchmarks evaluate isolated skills, such as solving equations or identifying errors. The simulated workplace introduced a more realistic challenge: incomplete instructions, shifting context, and competing priorities.
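The difference between responding to a prompt and carrying out work can be made concrete. Below is a minimal, hypothetical agent loop of the kind such studies evaluate (all names are illustrative, none come from the study): the agent repeatedly chooses a tool, observes the result, and stops when it believes the task is done. A real agent would replace the rule-based `pick_action` with a language-model call, which is exactly where errors compound across steps.

```python
# Minimal sketch of an agentic loop: plan -> act -> observe -> repeat.
# All names here are illustrative, not taken from the study.

def pick_action(task, history):
    """Stand-in for a language-model planning step (rule-based here)."""
    if not history:
        return ("list_files", None)
    if history[-1][0] == "list_files":
        return ("read_file", task["file"])
    return ("finish", None)

def run_tool(action, arg, workspace):
    """Stand-in for real tool execution (file system, browser, chat)."""
    if action == "list_files":
        return sorted(workspace)
    if action == "read_file":
        return workspace[arg]
    return None

def run_agent(task, workspace, max_steps=10):
    history = []
    for _ in range(max_steps):   # cap steps so the agent cannot loop forever
        action, arg = pick_action(task, history)
        if action == "finish":
            return history
        history.append((action, run_tool(action, arg, workspace)))
    return history

workspace = {"q3_report.txt": "revenue up 4%"}
trace = run_agent({"file": "q3_report.txt"}, workspace)
print(trace)
```

Each observation feeds the next decision, so one misread folder listing or one unhandled pop-up derails every later step; this is what separates the workplace simulation from a benchmark of isolated skills.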

This gap between lab performance and workplace reliability has important implications for businesses and policymakers assessing real-world readiness.

Realistic ways AI can support office work

Despite the shortcomings, the research highlights practical applications when expectations remain grounded.

  • Knowledge work support: Humans define structure while AI fills in background content.
  • Initial data review: AI scans large datasets before human validation.
  • Drafting assistance: Notes are turned into documents that managers refine.
  • Process tracking: AI monitors steps and reminds users what remains.

In each case, responsibility and context remain with people, while AI accelerates specific components of the workflow.

Balancing risks and rewards for organisations

For companies, the findings highlight several risks tied to aggressive deployment of AI agents:

  • Overestimating task completion
  • Hidden mistakes within reports or processes
  • Compliance failures caused by missed implicit rules
  • Unexpected costs from premium models

At the same time, selective and supervised use can deliver value through faster document handling, lower drafting costs, and continuous support. The key challenge lies in aligning AI capabilities with appropriate tasks while keeping humans accountable for judgment, context, and the unwritten rules that keep organisations functioning.
