
Researchers are warning that the use of large artificial intelligence models in scientific laboratories could lead to dangerous experiments, increasing the risk of fires, explosions or toxic exposure. Tests of 19 leading AI systems show that none consistently recognise laboratory hazards, with some performing barely better than random guessing.
Why laboratory safety is at risk
Scientific laboratories are inherently hazardous environments, even for experienced researchers. While serious accidents are uncommon, history shows the consequences can be severe. Past incidents include fatal chemical exposures, permanent eye damage and explosions that caused life-altering injuries.
As AI tools become more widely adopted across industries, they are increasingly being used in research laboratories to help design experiments and procedures. Although specialised AI systems have proven useful in areas such as biology, meteorology and mathematics, general-purpose AI models are more problematic. These systems can generate confident responses even when they lack the data or expertise needed to ensure safety.
This tendency may be harmless when planning travel or cooking meals, but it becomes dangerous when applied to experimental chemistry or other high-risk laboratory work.
Testing AI models for laboratory hazards
To better understand these risks, Xiangliang Zhang and colleagues at the University of Notre Dame developed a benchmarking system called LabSafety Bench. The test evaluates whether AI models can correctly identify potential hazards and harmful outcomes in laboratory settings.
LabSafety Bench consists of 765 multiple-choice questions and 404 image-based scenarios depicting laboratory environments that may contain safety issues.
How AI models performed
Results varied widely across models and testing formats:
- Some models, including Vicuna, scored close to random guessing in multiple-choice questions.
- GPT-4o achieved the highest multiple-choice accuracy at 86.55 per cent.
- DeepSeek-R1 followed closely with 84.49 per cent accuracy.
- Image-based tests proved more challenging, with models such as InstructBlip-7B scoring below 30 per cent.
When results were combined, none of the 19 tested large language or vision-language models exceeded 70 per cent overall accuracy, indicating that every system missed significant safety issues.
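To put "close to random guessing" in perspective, the short sketch below computes the accuracy expected from uniform random guessing and how far a reported score sits above it. It is only an illustration: the number of answer options per question is not stated here, so `NUM_OPTIONS = 4` is an assumption, and the helper names `chance_accuracy` and `margin_over_chance` are hypothetical.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy (0 to 1) of uniform random guessing on single-answer questions."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def margin_over_chance(accuracy_pct: float, num_options: int) -> float:
    """Percentage points by which a reported accuracy exceeds the chance baseline."""
    return accuracy_pct - 100.0 * chance_accuracy(num_options)


# Assumption: four answer options per question (not stated in the article).
NUM_OPTIONS = 4

# Multiple-choice accuracies reported in the article, in per cent.
reported = {"GPT-4o": 86.55, "DeepSeek-R1": 84.49}

for model, accuracy_pct in reported.items():
    margin = margin_over_chance(accuracy_pct, NUM_OPTIONS)
    print(f"{model}: {margin:.2f} points above chance")
```

Under that assumption the chance baseline is 25 per cent, so a model scoring near that level is contributing almost no usable safety knowledge on the multiple-choice questions.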
Why AI is not ready to design experiments
Zhang believes AI has long-term potential in scientific research, including in automated or “self-driving” laboratories. However, she stresses that current models are not ready to design experiments independently.
According to Zhang, most large AI systems are trained for general tasks such as summarising papers, rewriting text or editing documents. While they perform well in these areas, they lack the specialised domain knowledge required to reliably recognise laboratory hazards.
Industry and expert responses
An OpenAI spokesperson welcomed research aimed at improving the safety and reliability of AI in high-stakes laboratory environments. The company noted that its latest science-focused model was not included in the study and emphasised that AI tools are designed to support researchers, with humans remaining responsible for safety-critical decisions.
Other major AI developers, including Google, DeepSeek, Meta, Mistral and Anthropic, did not respond to requests for comment.
The importance of keeping humans involved
Allan Tucker of Brunel University London says AI can be valuable when assisting humans with experimental design, but warns against over-reliance. He argues that the behaviour of large language models is not well understood scientifically and that people often trust them too readily.
Tucker notes there is evidence that users may disengage mentally, allowing AI systems to handle complex tasks without sufficient human scrutiny.
Examples of safety misunderstandings
Craig Merlic from the University of California, Los Angeles highlights how AI safety errors can arise from misapplied reasoning. As a long-running test, he has asked AI models what to do if sulphuric acid spills on a person. The correct response is to rinse with water, but earlier models consistently warned against this, wrongly applying the rule against adding water to acid, which concerns mixing chemicals rather than treating a spill on skin.
Merlic notes that more recent AI models have started to provide the correct guidance, suggesting improvements over time.
Are AI models worse than humans?
Merlic also questions whether AI should be judged in isolation. He points out that human researchers vary widely in their safety awareness and that AI models may already outperform some inexperienced students or even certain seasoned scientists.
He adds that AI systems are improving rapidly, meaning that current benchmark results could become outdated within months. Despite this progress, researchers broadly agree that strong safety training and human oversight remain essential in laboratories where AI tools are used.