Keyur
India
Joined 19 May 2015
Engineer
Qwen2.5 Technical Report
Abstract: In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.
Paper: arxiv.org/abs/2412.15115
This podcast is generated using NotebookLM for research purposes.
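As the abstract notes, base and instruction-tuned Qwen2.5 checkpoints are released with open weights. Below is a minimal sketch of querying one of them through the standard Hugging Face transformers chat API; the repository ID Qwen/Qwen2.5-7B-Instruct and the generation settings are illustrative assumptions, not details taken from the report.

# Minimal sketch (not from the report): chatting with an open-weight Qwen2.5 Instruct
# checkpoint through the standard Hugging Face transformers generation API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed repo ID; the 72B flagship follows the same pattern

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Qwen2.5 Technical Report in two sentences."},
]

# Render the chat template, generate, and strip the prompt tokens from the output.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))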
Views: 1
Videos
Alignment faking in large language models
4 views · 2 hours ago
Alignment faking in large language models Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its pri...
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
26 views · 4 hours ago
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment cost...
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
5 views · 7 hours ago
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities Abstract: In this work, we introduce ChatQA 2, a Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities ar...
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
12 views · 9 hours ago
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation Abstract: Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine...
ChatQA: Surpassing GPT-4 on Conversational QA and RAG
12 views · 12 hours ago
ChatQA: Surpassing GPT-4 on Conversational QA and RAG Abstract: … retriever optimized for conversational QA, which yields results comparable to the alternative state-of-the-art query rewriting models, while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations,...
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
3 views · 14 hours ago
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics...
Phi 4 Technical Report
26 views · 16 hours ago
Phi 4 Technical Report Abstract: We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Ph...
The BrowserGym Ecosystem for Web Agent Research
24 views · 19 hours ago
The BrowserGym Ecosystem for Web Agent Research Abstract: The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve re...
Sora System Card
6 views · 21 hours ago
Sora System Card This system card details OpenAI's new video generation model, Sora. Sora generates videos from text, images, and existing videos, utilizing a diffusion model and transformer architecture. Extensive safety measures, including pre-training filtering, multi-modal moderation classifiers, and external red teaming, were implemented to mitigate risks like the creation of harmful or mi...
Densing Law of LLMs
9 views · 1 day ago
Densing Law of LLMs Abstract: Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. However, this scaling brings great challenges to training and inference efficiency, particularly for deploying LLMs in resource-constrained environments, and the scaling trend is becoming increasingly unsustainable. Thi...
ProcessBench: Identifying Process Errors in Mathematical Reasoning
4 views · 1 day ago
ProcessBench: Identifying Process Errors in Mathematical Reasoning Abstract: As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It c...
100% Hallucination Elimination Using Acurai
22 views · 1 day ago
100% Hallucination Elimination Using Acurai Abstract: The issue of hallucinations in large language models (LLMs) remains a critical barrier to the adoption of AI in enterprise and other high-stakes applications. Despite advancements in retrieval-augmented generation (RAG) systems, current state-of-the-art methods fail to achieve more than 80% accuracy in generating faithful and factually corre...
OpenAI o1 System Card
156 views · 1 day ago
OpenAI o1 System Card OpenAI's system card for the o1 large language model series details the models' development, capabilities, and safety evaluations. Extensive testing covered various aspects, including disallowed content generation, resistance to jailbreaks, hallucination rates, and bias. External red teaming by organizations like Apollo Research and Gray Swan further assessed potential ris...
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
106 views · 14 days ago
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability Abstract: Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the fi...
Zero-Indexing Internet Search Augmented Generation for Large Language Models
17 views · 14 days ago
Zero-Indexing Internet Search Augmented Generation for Large Language Models
Open-Sora Plan: Open-Source Large Video Generation Model
27 views · 14 days ago
Open-Sora Plan: Open-Source Large Video Generation Model
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
22 views · 14 days ago
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Large Language Models as Markov Chains
39 views · 14 days ago
Large Language Models as Markov Chains
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
22 views · 14 days ago
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
ROICtrl: Boosting Instance Control for Visual Generation
8 views · 14 days ago
ROICtrl: Boosting Instance Control for Visual Generation
O1 Replication Journey: A Strategic Progress Report -- Part 1
29 views · 21 days ago
O1 Replication Journey: A Strategic Progress Report Part 1
Star Attention: Efficient LLM Inference over Long Sequences
61 views · 21 days ago
Star Attention: Efficient LLM Inference over Long Sequences
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
17 views · 21 days ago
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
60 views · 21 days ago
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
AnimateAnything: Consistent and Controllable Animation for Video Generation
7 views · 21 days ago
AnimateAnything: Consistent and Controllable Animation for Video Generation
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
39 views · 28 days ago
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
12 views · 28 days ago
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models