22 links tagged with large-language-models
Links
Deep Think with Confidence (DeepConf) is introduced as a method to improve reasoning efficiency and performance in large language models by using internal confidence signals to filter out low-quality reasoning traces. It requires no additional training or tuning and can be easily integrated into existing systems. Evaluations show significant accuracy improvements and a reduction in generated tokens on various reasoning tasks.
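A minimal sketch of the filtering idea, using mean token log-probability as a stand-in for DeepConf's internal confidence signal and a simple keep-top-fraction rule (both are assumptions; the paper's grouping and thresholds differ):

```python
from collections import Counter

def filter_and_vote(traces, keep_fraction=0.5):
    """traces: list of (answer, token_logprobs) pairs from parallel sampling.

    Scores each reasoning trace by its mean token log-probability (a stand-in
    for DeepConf's confidence signal), drops the least confident traces, and
    majority-votes over the answers that remain.
    """
    scored = [(sum(lps) / len(lps), ans) for ans, lps in traces if lps]
    scored.sort(reverse=True)                       # most confident first
    kept = scored[: max(1, int(len(scored) * keep_fraction))]
    votes = Counter(ans for _, ans in kept)
    return votes.most_common(1)[0][0]

# Toy usage: three confident traces answering "42", one low-confidence "41".
traces = [
    ("42", [-0.1, -0.2, -0.1]),
    ("42", [-0.3, -0.2, -0.4]),
    ("41", [-2.5, -3.0, -2.8]),
    ("42", [-0.2, -0.1, -0.3]),
]
print(filter_and_vote(traces))  # -> "42"
```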
The Context Window Architecture (CWA) is proposed as a disciplined framework for structuring prompts in large language models (LLMs), addressing limitations such as statelessness and cognitive fallibility. By organizing context into 11 distinct layers, CWA aims to make prompt engineering more systematic, leading to more reliable and maintainable AI interactions. Feedback and collaboration on the concept are encouraged to refine its implementation in real-world scenarios.
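A rough illustration of the layered-prompt idea (the layer names below are invented placeholders, not CWA's actual 11 layers):

```python
# Illustrative only: assemble a prompt from ordered context layers.
# The layer names are hypothetical, not CWA's actual layer set.
CONTEXT_LAYERS = [
    "system_instructions",
    "persona",
    "retrieved_knowledge",
    "long_term_memory",
    "conversation_history",
    "task_state",
    "user_query",
]

def build_prompt(layers: dict) -> str:
    """Render only the layers that are present, in a fixed, explicit order."""
    sections = []
    for name in CONTEXT_LAYERS:
        content = layers.get(name)
        if content:
            sections.append(f"## {name}\n{content}")
    return "\n\n".join(sections)

print(build_prompt({
    "system_instructions": "You are a concise assistant.",
    "user_query": "Summarize the last meeting.",
}))
```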
Sleep-time compute is introduced as a method to enhance the efficiency of large language models by allowing them to anticipate user queries and pre-compute relevant data, significantly reducing test-time compute requirements. The study shows that this approach can lower compute needs by approximately 5x and improve accuracy by up to 18% on specific reasoning tasks. Additionally, a Multi-Query extension is proposed to further optimize compute costs across related queries.
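A hedged sketch of the offline/online split; the `llm` callable and the prompts are placeholders, and the paper's actual pipeline is more involved:

```python
# Sketch of the idea: spend compute offline on the standing context, so the
# online answer can be cheap. `llm` is a placeholder for any completion call.
def llm(prompt: str, max_tokens: int) -> str:
    raise NotImplementedError("plug in your model client here")

def sleep_time_precompute(context: str) -> str:
    """Offline ('sleep-time') pass: distill the context into notes that
    anticipate likely questions, using a generous token budget."""
    return llm(
        "Read the following context and write notes answering the questions "
        f"a user is most likely to ask later.\n\n{context}",
        max_tokens=2048,
    )

def answer(precomputed_notes: str, query: str) -> str:
    """Online pass: answer from the cached notes with a small budget."""
    return llm(
        f"Notes:\n{precomputed_notes}\n\nQuestion: {query}\nAnswer:",
        max_tokens=256,
    )
```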
The document provides a factual overview of the sizes and training data of large language models (LLMs) from GPT-2 to Llama 4, tracing the growth in parameter counts and the challenges of training at that scale. It highlights the shift from pure text-continuation engines to models built for specific roles, such as AI chatbots, and discusses what this trend implies for the intelligence and capabilities of LLMs. It also notes the increasing complexity of, and ethical concerns around, the datasets used for training these models.
Continued scaling of large language models (LLMs) may not yield diminishing returns as previously thought; even small improvements in accuracy can lead to significant advancements in long-horizon task execution. The study reveals that LLMs struggle with longer tasks not due to reasoning limitations, but execution errors that compound over time, highlighting the importance of model size and strategic thinking in improving performance.
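The compounding argument can be made concrete: if each step of a task succeeds independently with probability p, an n-step task succeeds with probability p^n, so small per-step gains buy much longer reachable horizons.

```python
import math

def horizon_at(p: float, target: float = 0.5) -> float:
    """Longest task length n with p**n >= target,
    assuming independent per-step success probability p."""
    return math.log(target) / math.log(p)

for p in (0.99, 0.999):
    print(f"per-step accuracy {p}: ~{horizon_at(p):.0f} steps at 50% task success")
# 0.99 -> ~69 steps; 0.999 -> ~693 steps: a ~10x longer horizon from
# under one percentage point of per-step improvement.
```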
The repository serves as a comprehensive resource for the survey paper "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey," detailing various reinforcement learning methods and their applications to large language models (LLMs). It includes tables summarizing methodologies, objectives, and key mechanisms, alongside links to relevant papers and resources in the field of AI.
Cluster-driven Expert Pruning (C-Prune) is a novel framework designed to enhance the efficiency of Mixture-of-Experts (MoE) large language models by addressing issues of expert redundancy within and across layers. By implementing layer-wise expert clustering followed by global cluster pruning, C-Prune effectively reduces model size and improves performance compared to existing pruning methods. Extensive experiments validate its effectiveness on various MoE models and benchmarks.
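A toy sketch of the two stages as described, clustering experts within each layer and then pruning whole clusters by a global score; the clustering features and the usage-based importance score are stand-ins, not the paper's actual criteria:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_prune(expert_weights, usage, n_clusters=4, prune_ratio=0.25):
    """expert_weights: {layer: array [n_experts, d]} flattened expert params.
    usage: {layer: array [n_experts]}, e.g. routing frequency per expert.

    Stage 1: cluster experts within each layer by weight similarity.
    Stage 2: score every (layer, cluster) globally and drop the lowest ones.
    Returns {layer: set of expert indices to keep}."""
    clusters, scores = {}, []
    for layer, W in expert_weights.items():
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(W)
        clusters[layer] = labels
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            scores.append((usage[layer][members].sum(), layer, c))  # toy importance

    scores.sort()                                    # least important first
    to_drop = {(l, c) for _, l, c in scores[: int(len(scores) * prune_ratio)]}
    return {
        layer: {i for i, c in enumerate(labels) if (layer, c) not in to_drop}
        for layer, labels in clusters.items()
    }

# Toy usage: 2 layers x 8 experts with random weights and usage counts.
rng = np.random.default_rng(0)
weights = {l: rng.normal(size=(8, 16)) for l in range(2)}
usage = {l: rng.integers(1, 100, size=8).astype(float) for l in range(2)}
print(cluster_then_prune(weights, usage))
```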
Large Language Models (LLMs) are vulnerable to data poisoning attacks that require only a small, fixed number of malicious documents, regardless of the model's size or training data volume. This counterintuitive finding challenges existing assumptions about AI security and highlights significant risks for organizations deploying LLMs, calling for urgent development of robust defenses against such vulnerabilities.
Managing unstructured data at scale presents significant challenges for organizations, especially as the demand for its integration with Generative AI grows. The article discusses the Medallion Architecture framework and its evolution to accommodate unstructured data, emphasizing the importance of a unified data management strategy that leverages large language models for improved data processing and analysis.
The paper explores the enhancement of reward modeling in reinforcement learning for large language models, focusing on inference-time scalability. It introduces Self-Principled Critique Tuning (SPCT) to improve generative reward modeling and proposes a meta reward model to optimize performance during inference. Empirical results demonstrate that SPCT significantly enhances the quality and scalability of reward models compared to existing methods.
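A rough sketch of the inference-time scaling idea, sampling several judgments from a generative reward model and letting a meta reward model filter them before aggregation; both model calls are placeholders, and the actual SPCT training and aggregation are more involved:

```python
from statistics import mean

def sample_judgment(prompt: str, response: str) -> dict:
    """Placeholder: one sampled critique + scalar score from a generative RM."""
    raise NotImplementedError

def meta_rm_score(judgment: dict) -> float:
    """Placeholder: meta reward model rates the quality of a judgment."""
    raise NotImplementedError

def scaled_reward(prompt: str, response: str, k: int = 8, keep: int = 4) -> float:
    """Sample k judgments, keep the ones the meta RM trusts most,
    and average their scores."""
    judgments = [sample_judgment(prompt, response) for _ in range(k)]
    judgments.sort(key=meta_rm_score, reverse=True)
    return mean(j["score"] for j in judgments[:keep])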
C3PO introduces a novel approach for optimizing expert pathways in Mixture-of-Experts (MoE) Large Language Models at test time, significantly improving accuracy by 7-15% through collaborative re-weighting of core experts in critical layers. By utilizing surrogate objectives based on successful neighboring samples, C3PO enhances efficiency, enabling models with fewer parameters to outperform larger counterparts. The method demonstrates superior performance over existing test-time learning techniques across various benchmarks.
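A toy numpy sketch of the test-time re-weighting idea: mixture weights over a few experts are tuned by gradient descent on a surrogate loss over neighboring samples with known-good outputs, then reused for the test input. Everything below (linear experts, the loss, the update) is a simplified stand-in for the real method:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # toy linear experts

def moe(x, w):
    """Weighted combination of expert outputs for input x."""
    return sum(wi * (E @ x) for wi, E in zip(w, experts))

def tune_weights(w0, neighbors, steps=50, lr=0.01):
    """Gradient descent on a surrogate loss: squared error of the re-weighted
    mixture on neighboring samples whose target outputs are known to be good."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(w)
        for x, y in neighbors:
            err = moe(x, w) - y
            grad += np.array([2 * err @ (E @ x) for E in experts])
        w -= lr * grad / len(neighbors)
        w = np.clip(w, 0, None)
        w /= w.sum()                       # keep a valid mixture
    return w

# Toy neighbors: inputs whose "correct" outputs come from expert 0.
neighbors = [(x, experts[0] @ x) for x in rng.normal(size=(5, d))]
w = tune_weights([0.25] * 4, neighbors)
print(np.round(w, 2))                       # weights should shift toward expert 0
```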
The research investigates how Large Language Models (LLMs) internalize new knowledge through a framework called Knowledge Circuits Evolution, identifying the computational subgraphs that store and process knowledge. Key findings highlight the influence of the new knowledge's relevance, a phase shift in circuit evolution during training, and a deep-to-shallow evolution pattern, insights that could improve continual pre-training strategies for LLMs.
TextQuests introduces a benchmark to evaluate the performance of Large Language Models (LLMs) in classic text-based video games, focusing on their ability to engage in long-context reasoning and learning through exploration. The evaluation involves assessing agents' progress and ethical behavior across various interactive fiction games, revealing challenges such as hallucination and inefficiency in dynamic thinking. The aim is to help researchers better understand LLM capabilities in complex, exploratory environments.
ByteDance has released Seed-OSS-36B, an open-source large language model with a 512K-token context window, longer than many competitors offer. The release includes three variants aimed at balancing performance and research flexibility, enabling extensive applications without licensing fees.
Reinforcement learning (RL) is becoming essential in developing large language models (LLMs), particularly for aligning them with human preferences and enhancing their capabilities through multi-turn interactions. This article reviews various open-source RL libraries, analyzing their designs and trade-offs to assist researchers in selecting the appropriate tools for specific applications. Key libraries discussed include TRL, Verl, OpenRLHF, and several others, each catering to different RL needs and architectures.
JudgeLRM introduces a novel approach to using Large Language Models (LLMs) as evaluators, particularly in complex reasoning tasks. By employing reinforcement learning with judge-wise rewards, JudgeLRM models significantly outperform traditional Supervised Fine-Tuning methods and current leading models, demonstrating superior performance in tasks that require deep reasoning.
LLM4Ranking is a unified framework designed to facilitate the utilization of large language models (LLMs) for document reranking in various applications, such as search engines. It offers a simple and extensible interface, along with evaluation and fine-tuning scripts, allowing users to experiment with different ranking methods and models on popular datasets. The framework aims to enhance the performance and efficiency of LLMs in document reranking tasks and is available as open-source code.
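For orientation, a generic pointwise reranking loop looks roughly like this; it is not LLM4Ranking's actual interface, and `llm_relevance` is a placeholder:

```python
def llm_relevance(query: str, document: str) -> float:
    """Placeholder: ask an LLM how relevant `document` is to `query`,
    e.g. by prompting for a 0-10 score and parsing the reply."""
    raise NotImplementedError

def rerank(query: str, documents: list[str], top_k: int = 10) -> list[str]:
    """Pointwise reranking: score each candidate independently, sort descending."""
    scored = [(llm_relevance(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```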
The article explores the different coding personalities exhibited by leading large language models (LLMs) and how these traits influence their performance and usefulness in software development. It delves into the unique characteristics and behaviors of various LLMs, highlighting how understanding these coding styles can enhance human-LLM collaboration in programming tasks.
The article discusses the future of software engineering in 2025 with the integration of large language models (LLMs). It explores the potential impacts on coding practices, collaboration, and the skill sets required for engineers as AI becomes more prevalent in the software development process. Key considerations include the balance between automation and human oversight in programming tasks.
Reinforcement learning (RL) has become essential for training large language models (LLMs), but the field lacks established methodologies for scaling it. This study presents a framework for analyzing RL scaling, showing through extensive experimentation that certain design choices improve compute efficiency while maintaining performance. The authors distill their findings into a best-practice recipe, ScaleRL, and show that its validation performance at a large compute budget can be predicted from smaller-scale runs.
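A small sketch of the extrapolation idea: fit a saturating compute-performance curve to early training points and read off the prediction at a larger budget. The functional form and the synthetic numbers are assumptions for illustration, not the paper's fitted values:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(c, ceiling, midpoint, slope):
    """Sigmoid-in-log-compute curve: performance rises toward a ceiling."""
    return ceiling / (1.0 + (midpoint / c) ** slope)

# Synthetic (compute, validation score) points from the early part of a run.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4])
score = np.array([0.18, 0.27, 0.38, 0.47, 0.53])

params, _ = curve_fit(saturating, compute, score, p0=[0.7, 1e3, 0.5], maxfev=10000)
print("fitted ceiling/midpoint/slope:", np.round(params, 3))
print("predicted score at 10x more compute:", round(saturating(1e5, *params), 3))
```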
The article discusses an automated workflow for tabular data validation using large language models (LLMs). It outlines the benefits of leveraging LLMs to enhance accuracy and efficiency in data validation processes, while also addressing challenges and potential strategies for implementation.
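A minimal sketch of such a workflow, assuming a placeholder `llm` call and a simple JSON pass/fail reply; the article's pipeline is more elaborate:

```python
import csv, json

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

RULES = """- age must be an integer between 0 and 120
- email must look like a valid address
- signup_date must not be in the future"""

def validate_rows(path: str):
    """Ask the model to check each row against the rules and return findings."""
    findings = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            reply = llm(
                f"Validation rules:\n{RULES}\n\nRow {i}: {json.dumps(row)}\n"
                'Reply with JSON: {"valid": true/false, "issues": [..]}'
            )
            findings.append((i, json.loads(reply)))
    return findings
```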
SINQ is a fast and model-agnostic quantization technique that enables the deployment of large language models on GPUs with limited memory while maintaining accuracy. It significantly reduces memory requirements and quantization time, offering improved model quality compared to existing methods. The technique introduces dual scaling to enhance quantization stability, allowing users to quantize models quickly and efficiently.
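A toy illustration of the dual-scaling idea: per-row and per-column scales are balanced with a few Sinkhorn-style passes before uniform rounding, so no single outlier row or column forces a coarse quantization step. The objective and iteration here are simplified stand-ins for the actual SINQ algorithm:

```python
import numpy as np

def dual_scale_quantize(W, bits=4, iters=10):
    """Quantize W ~= diag(row_scale) @ Q @ diag(col_scale) with Q in int range."""
    row = np.ones(W.shape[0])
    col = np.ones(W.shape[1])
    for _ in range(iters):                         # Sinkhorn-style balancing
        M = W / np.outer(row, col)
        row *= np.sqrt(np.abs(M).max(axis=1))      # even out row dynamic range
        M = W / np.outer(row, col)
        col *= np.sqrt(np.abs(M).max(axis=0))      # even out column dynamic range
    M = W / np.outer(row, col)
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for signed int4
    step = np.abs(M).max() / qmax
    Q = np.clip(np.round(M / step), -qmax - 1, qmax)
    return Q.astype(np.int8), row * step, col      # reconstruct as outer(r, c) * Q

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * np.outer(10 ** rng.uniform(-1, 1, 8), np.ones(8))
Q, r, c = dual_scale_quantize(W)
err = np.abs(W - np.outer(r, c) * Q).mean() / np.abs(W).mean()
print(f"relative reconstruction error: {err:.3f}")
```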