GPT Reviews
The MMLU-Pro dataset is a more robust and challenging massive multi-task language understanding dataset, designed to benchmark large language models' capabilities more rigorously.
The Composable Interventions framework lets researchers study the effects of applying multiple interventions to a language model, showing that the order in which interventions are applied can significantly affect their effectiveness.
The MJ-Bench benchmark evaluates how well different types of multimodal judges provide feedback for text-to-image generation models, and the experiments reveal that closed-source VLMs generally provide better feedback.
The Associative Recurrent Memory Transformer (ARMT) combines transformer self-attention for local context with segment-level recurrence to store task-specific information distributed over a long context, and it sets a new performance record on the recent BABILong multi-task long-context benchmark.
Contact: sergi@earkind.com
Timestamps:
00:34 Introduction
01:32 MMLU-Pro Release on HuggingFace Datasets
03:48 Extrinsic Hallucinations in LLMs
04:53 RouteLLM
06:13 Fake sponsor
08:14 Composable Interventions for Language Models
09:45 MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
11:31 Associative Recurrent Memory Transformer
13:30 Outro