Coding LLMs - Search News

Xiaohongshu Releases SWE-Bench Mobile: When AI Agents Face the Codebase of an App with Hundreds of Millions of Users, the Highest ...

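Author | Nexus AI Team; Editor | Kitty. The rapid development of large language models (LLMs) has given rise to a new generation of autonomous coding agents that can understand requirements, navigate codebases, and implement features with minimal human intervention. AI programming tools such as Cursor, Claude Code, and Codex have already achieved impressive results on existing benchmarks. However, existing evaluation benchmarks (such as SWE-Bench ...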

2 days ago

Quesma Releases OTelBench: Independent Benchmark Reveals Frontier LLMs Struggle with Real ...

New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between ...

VentureBeat

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

Unite.AI

Vibe Coding Suffers When AI’s Role Expands

A new study finds vibe coding improves when humans give the instructions, but declines when AI does, with the best hybrid setup keeping humans foremost, with AI as an arbiter or judge. New research ...

Searchenginejournal.com

LLMs That Code: Why Marketers Should Care (Even If You’ve Never Touched An IDE)

Large language models (LLMs) like ChatGPT and Claude are best known for their writing abilities, drafting ad copy, summarizing reports, and helping brainstorm blog content. However, most marketers ...

Ars Technica

How AI coding agents work—and what to remember if you use them

AI coding agents from OpenAI, Anthropic, and Google can now work on software projects for hours at a time, writing complete apps, running tests, and fixing bugs with human supervision. But these tools ...

SiliconANGLE

Study finds newer LLMs introduce more severe coding bugs despite higher benchmark scores

A new report today from code quality testing startup SonarSource SA is warning that while the latest large language models may be getting better at passing coding benchmarks, at the same time they are ...

Communications of the ACM

Formal Reasoning Meets LLMs: Toward AI for Mathematics and Verification

A marriage of formal methods and LLMs seeks to harness the strengths of both.

InfoWorld

Researchers propose a self-distillation fix for ‘catastrophic forgetting’ in LLMs

LLMs tend to lose prior skills when fine-tuned for new tasks. A new self-distillation approach aims to reduce regression and ...

20 hours ago

Securing The Intelligent Cloud: How AI And LLMs Are Redefining Cyber Defense

The convergence of cloud computing and generative AI marks a defining turning point for enterprise security. Global spending ...
