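Author | Nexus AI Team; Editor | Kitty. The rapid development of large language models (LLMs) has given rise to a new generation of autonomous coding agents that can understand requirements, navigate codebases, and implement features with minimal human intervention. AI programming tools such as Cursor, Claude Code, and Codex have already achieved impressive results on existing benchmarks. However, existing evaluation benchmarks (such as SWE-Bench ...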
New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between ...
As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...
A new study finds that vibe coding improves when humans give the instructions but declines when AI does; the best hybrid setup keeps humans in the lead, with AI acting as an arbiter or judge. New research ...
Large language models (LLMs) like ChatGPT and Claude are best known for their writing abilities, drafting ad copy, summarizing reports, and helping brainstorm blog content. However, most marketers ...
AI coding agents from OpenAI, Anthropic, and Google can now work on software projects for hours at a time, writing complete apps, running tests, and fixing bugs with human supervision. But these tools ...
A new report today from code quality testing startup SonarSource SA warns that while the latest large language models may be getting better at passing coding benchmarks, they are ...
A marriage of formal methods and LLMs seeks to harness the strengths of both.
LLMs tend to lose prior skills when fine-tuned for new tasks. A new self-distillation approach aims to reduce regression and ...
The convergence of cloud computing and generative AI marks a defining turning point for enterprise security. Global spending ...