AI agent safety benchmark BeSafe-Bench tested 13 production-grade agents and found none could complete 40% of tasks while ...