Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
- Machel Reid, Nikolay Savinov, Alexandra Chronopoulou
- 8 March 2024
Computer Science
Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Dmitry Lepikhin, HyoukJoong Lee, Z. Chen
- 30 June 2020
Computer Science
GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and demonstrates that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior translation quality from 100 languages to English compared to the prior art.
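To make the "Sparsely-Gated Mixture-of-Experts" idea concrete, here is a minimal JAX sketch of a top-2 gated MoE layer. It is illustrative only, not GShard's implementation: the expert count, sizes, and the dense-but-masked dispatch are toy choices, whereas GShard shards experts across devices automatically.

```python
# Minimal top-2 sparsely-gated MoE layer sketch (illustrative, not GShard's code).
# Sizes are toy values; real GShard shards the experts across many accelerators.
import jax
import jax.numpy as jnp

NUM_EXPERTS, D_MODEL, D_FF = 4, 8, 32  # hypothetical toy sizes

def init_params(key):
    k_gate, k_in, k_out = jax.random.split(key, 3)
    return {
        "gate": jax.random.normal(k_gate, (D_MODEL, NUM_EXPERTS)) * 0.02,
        "w_in": jax.random.normal(k_in, (NUM_EXPERTS, D_MODEL, D_FF)) * 0.02,
        "w_out": jax.random.normal(k_out, (NUM_EXPERTS, D_FF, D_MODEL)) * 0.02,
    }

def moe_layer(params, x):
    """x: [tokens, D_MODEL]. Each token is routed to its top-2 experts."""
    logits = x @ params["gate"]                      # [tokens, NUM_EXPERTS]
    top2_vals, top2_idx = jax.lax.top_k(logits, 2)   # gate scores and expert ids
    weights = jax.nn.softmax(top2_vals, axis=-1)     # renormalize over chosen experts

    # Dense-but-masked dispatch: every expert computes on every token and the
    # gate zeroes out non-selected ones. (GShard instead dispatches tokens to
    # sharded experts so the non-selected compute never happens.)
    h = jnp.einsum("td,edf->tef", x, params["w_in"])
    h = jax.nn.relu(h)
    expert_out = jnp.einsum("tef,efd->ted", h, params["w_out"])  # [tokens, E, D]

    mask = jnp.zeros((x.shape[0], NUM_EXPERTS)).at[
        jnp.arange(x.shape[0])[:, None], top2_idx
    ].set(weights)                                   # [tokens, NUM_EXPERTS]
    return jnp.einsum("te,ted->td", mask, expert_out)

x = jax.random.normal(jax.random.PRNGKey(0), (16, D_MODEL))
y = moe_layer(init_params(jax.random.PRNGKey(1)), x)
print(y.shape)  # (16, 8)
```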
LaMDA: Language Models for Dialog Applications
- R. Thoppilan, Daniel De Freitas, Quoc Le
- 20 January 2022
Computer Science
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources lead to significant improvements on the two key challenges of safety and factual grounding.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- Jiahui Yu, Yuanzhong Xu, Yonghui Wu
- 22 June 2022
Computer Science
Trans. Mach. Learn. Res.
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the work also explores and highlights limitations of the model.
PaLM 2 Technical Report
- Rohan Anil, Andrew M. Dai, Yonghui Wu
- 17 May 2023
Computer Science, Linguistics
PaLM 2 is a new state-of-the-art language model with better multilingual and reasoning capabilities than its predecessor PaLM; it is more compute-efficient and enables inference-time control over toxicity without additional overhead or impact on other capabilities.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Nan Du, Yanping Huang, Claire Cui
- 13 December 2021
Computer Science
This paper proposes and develops a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
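A back-of-the-envelope illustration of why sparse activation decouples total capacity from per-token training cost: only the top-k experts' feed-forward weights are used for any given token. The numbers below are made up for the example and are not GLaM's actual configuration.

```python
# Toy parameter-count illustration of sparse activation (made-up sizes, not
# GLaM's real configuration). With E experts but top-2 routing, total capacity
# grows with E while per-token FFN compute stays at two experts' worth.
d_model, d_ff, num_experts, top_k = 4096, 16384, 64, 2

ffn_params_per_expert = 2 * d_model * d_ff            # up- and down-projection
total_ffn_params = num_experts * ffn_params_per_expert
activated_ffn_params = top_k * ffn_params_per_expert

print(f"total FFN params per MoE layer: {total_ffn_params / 1e9:.2f}B")
print(f"activated FFN params per token: {activated_ffn_params / 1e9:.2f}B")
print(f"capacity / per-token compute:   {num_experts // top_k}x")
```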
Vector-quantized Image Modeling with Improved VQGAN
- Jiahui Yu, Xin Li, Yonghui Wu
- 9 October 2021
Computer Science
This work introduces a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively, and proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
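A minimal sketch of the second stage the abstract describes: a learned codebook turns an image into a grid of discrete tokens, the grid is flattened in raster order, and the sequence is modeled with a next-token prediction loss. The tokenizer output, the toy predictor, and all sizes below are hypothetical stand-ins, not the paper's improved ViT-VQGAN or its Transformer.

```python
# Sketch of stage-2 autoregressive image-token modeling (illustrative only).
import jax
import jax.numpy as jnp

VOCAB, GRID, D = 1024, 8, 64   # toy codebook size, token-grid side, embed dim

def next_token_loss(params, token_grid):
    """token_grid: [GRID, GRID] discrete codes produced by an image tokenizer."""
    tokens = token_grid.reshape(-1)                   # flatten in raster order
    inputs, targets = tokens[:-1], tokens[1:]         # predict token t+1 from <= t

    # Toy "decoder": embedding + causal prefix average + linear head. A real
    # model would be a causal Transformer over the same flattened sequence.
    h = params["embed"][inputs]                       # [T-1, D]
    causal = jnp.tril(jnp.ones((h.shape[0], h.shape[0])))
    h = (causal @ h) / causal.sum(-1, keepdims=True)  # prefix average
    logits = h @ params["head"]                       # [T-1, VOCAB]

    logp = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(logp, targets[:, None], axis=-1))

k_embed, k_head, k_grid = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "embed": jax.random.normal(k_embed, (VOCAB, D)) * 0.02,
    "head": jax.random.normal(k_head, (D, VOCAB)) * 0.02,
}
grid = jax.random.randint(k_grid, (GRID, GRID), 0, VOCAB)  # stand-in tokenizer output
print(next_token_loss(params, grid))
```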
GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Yuanzhong Xu, HyoukJoong Lee, Zhifeng Chen
- 10 May 2021
Computer Science
GSPMD allows users to write programs in the same way as for a single device and then provide hints, through a few annotations, on how tensors should be distributed; based on these annotations, GSPMD parallelizes the computation.
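GSPMD underlies partitioning in XLA-based frameworks, so JAX's sharding annotations give a concrete picture of the annotation style the abstract describes: the program is written as if for a single device, and a few hints tell the compiler how to distribute tensors. The mesh axis name and sizes below are illustrative choices, not something GSPMD prescribes.

```python
# Minimal JAX sketch of GSPMD-style partitioning: write single-device code,
# then annotate how tensors should be sharded; the compiler handles the rest.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One logical mesh axis over whatever devices are available (data parallelism).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

@jax.jit
def layer(x, w):
    h = x @ w
    # Annotation (a "hint"): keep the activations sharded along the batch axis.
    h = jax.lax.with_sharding_constraint(h, NamedSharding(mesh, P("data", None)))
    return jax.nn.relu(h)

x = jax.device_put(jnp.ones((32, 16)),
                   NamedSharding(mesh, P("data", None)))  # shard the batch dim
w = jax.device_put(jnp.ones((16, 64)),
                   NamedSharding(mesh, P(None, None)))    # replicate the weights
print(layer(x, w).shape)  # (32, 64); partitioning is handled by the compiler
```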
PaLI-X: On Scaling up a Multilingual Vision and Language Model
- Xi Chen, Josip Djolonga, Radu Soricut
- 29 May 2023
Computer Science, Linguistics
PaLI-X, a multilingual vision and language model, advances the state of the art on most vision-and-language benchmarks considered and exhibits emergent capabilities, such as complex counting and multilingual object detection, tasks not explicitly included in the training mix.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
- Yu Zhang, Daniel S. Park, Yonghui Wu
- 27 September 2021
Computer Science
It is found that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data.
...
...