Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
- Machel Reid, Nikolay Savinov, Alexandra Chronopoulou
- 8 March 2024
Computer Science
Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Dmitry Lepikhin, HyoukJoong Lee, Z. Chen
- 30 June 2020
Computer Science
GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and demonstrates that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior translation quality from 100 languages to English compared to the prior art.
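To make the "Sparsely-Gated Mixture-of-Experts" idea concrete, here is a minimal JAX sketch of a top-2 gated MoE layer. It is illustrative only, not GShard's implementation: the expert count, sizes, and the dense-but-masked dispatch are toy choices, whereas GShard shards experts across devices automatically.

```python
# Minimal top-2 sparsely-gated MoE layer sketch (illustrative, not GShard's code).
# Sizes are toy values; real GShard shards the experts across many accelerators.
import jax
import jax.numpy as jnp

NUM_EXPERTS, D_MODEL, D_FF = 4, 8, 32  # hypothetical toy sizes

def init_params(key):
    k_gate, k_in, k_out = jax.random.split(key, 3)
    return {
        "gate": jax.random.normal(k_gate, (D_MODEL, NUM_EXPERTS)) * 0.02,
        "w_in": jax.random.normal(k_in, (NUM_EXPERTS, D_MODEL, D_FF)) * 0.02,
        "w_out": jax.random.normal(k_out, (NUM_EXPERTS, D_FF, D_MODEL)) * 0.02,
    }

def moe_layer(params, x):
    """x: [tokens, D_MODEL]. Each token is routed to its top-2 experts."""
    logits = x @ params["gate"]                      # [tokens, NUM_EXPERTS]
    top2_vals, top2_idx = jax.lax.top_k(logits, 2)   # gate scores and expert ids
    weights = jax.nn.softmax(top2_vals, axis=-1)     # renormalize over chosen experts

    # Dense-but-masked dispatch: every expert computes on every token and the
    # gate zeroes out non-selected ones. (GShard instead dispatches tokens to
    # sharded experts so the non-selected compute never happens.)
    h = jnp.einsum("td,edf->tef", x, params["w_in"])
    h = jax.nn.relu(h)
    expert_out = jnp.einsum("tef,efd->ted", h, params["w_out"])  # [tokens, E, D]

    mask = jnp.zeros((x.shape[0], NUM_EXPERTS)).at[
        jnp.arange(x.shape[0])[:, None], top2_idx
    ].set(weights)                                   # [tokens, NUM_EXPERTS]
    return jnp.einsum("te,ted->td", mask, expert_out)

x = jax.random.normal(jax.random.PRNGKey(0), (16, D_MODEL))
y = moe_layer(init_params(jax.random.PRNGKey(1)), x)
print(y.shape)  # (16, 8)
```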
LaMDA: Language Models for Dialog Applications
- R. Thoppilan, Daniel De Freitas, Quoc Le
- 20 January 2022
Computer Science
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources lead to significant improvements on the two key challenges of safety and factual grounding.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- Jiahui Yu, Yuanzhong Xu, Yonghui Wu
- 22 June 2022
Computer Science
Trans. Mach. Learn. Res.
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the work also explores and highlights limitations of the model.
PaLM 2 Technical Report
- Rohan Anil, Andrew M. Dai, Yonghui Wu
- 17 May 2023
Computer Science, Linguistics
PaLM 2 is a new state-of-the-art language model with better multilingual and reasoning capabilities than its predecessor PaLM; it is more compute-efficient and enables inference-time control over toxicity without additional overhead or impact on other capabilities.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Nan Du, Yanping Huang, Claire Cui
- 13 December 2021
Computer Science
This paper proposes and develops a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
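A back-of-the-envelope illustration of why sparse activation decouples total capacity from per-token training cost: only the top-k experts' feed-forward weights are used for any given token. The numbers below are made up for the example and are not GLaM's actual configuration.

```python
# Toy parameter-count illustration of sparse activation (made-up sizes, not
# GLaM's real configuration). With E experts but top-2 routing, total capacity
# grows with E while per-token FFN compute stays at two experts' worth.
d_model, d_ff, num_experts, top_k = 4096, 16384, 64, 2

ffn_params_per_expert = 2 * d_model * d_ff            # up- and down-projection
total_ffn_params = num_experts * ffn_params_per_expert
activated_ffn_params = top_k * ffn_params_per_expert

print(f"total FFN params per MoE layer: {total_ffn_params / 1e9:.2f}B")
print(f"activated FFN params per token: {activated_ffn_params / 1e9:.2f}B")
print(f"capacity / per-token compute:   {num_experts // top_k}x")
```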
Vector-quantized Image Modeling with Improved VQGAN
- Jiahui Yu, Xin Li, Yonghui Wu
- 9 October 2021
Computer Science
This work introduces a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively, and proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
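A minimal sketch of the second stage the abstract describes: a learned codebook turns an image into a grid of discrete tokens, the grid is flattened in raster order, and the sequence is modeled with a next-token prediction loss. The tokenizer output, the toy predictor, and all sizes below are hypothetical stand-ins, not the paper's improved ViT-VQGAN or its Transformer.

```python
# Sketch of stage-2 autoregressive image-token modeling (illustrative only).
import jax
import jax.numpy as jnp

VOCAB, GRID, D = 1024, 8, 64   # toy codebook size, token-grid side, embed dim

def next_token_loss(params, token_grid):
    """token_grid: [GRID, GRID] discrete codes produced by an image tokenizer."""
    tokens = token_grid.reshape(-1)                   # flatten in raster order
    inputs, targets = tokens[:-1], tokens[1:]         # predict token t+1 from <= t

    # Toy "decoder": embedding + causal prefix average + linear head. A real
    # model would be a causal Transformer over the same flattened sequence.
    h = params["embed"][inputs]                       # [T-1, D]
    causal = jnp.tril(jnp.ones((h.shape[0], h.shape[0])))
    h = (causal @ h) / causal.sum(-1, keepdims=True)  # prefix average
    logits = h @ params["head"]                       # [T-1, VOCAB]

    logp = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(logp, targets[:, None], axis=-1))

k_embed, k_head, k_grid = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "embed": jax.random.normal(k_embed, (VOCAB, D)) * 0.02,
    "head": jax.random.normal(k_head, (D, VOCAB)) * 0.02,
}
grid = jax.random.randint(k_grid, (GRID, GRID), 0, VOCAB)  # stand-in tokenizer output
print(next_token_loss(params, grid))
```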
GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Yuanzhong Xu, HyoukJoong Lee, Zhifeng Chen
- 10 May 2021
Computer Science
GSPMD allows users to write programs in the same way as for a single device and then provide hints, through a few annotations, on how tensors should be distributed; based on these annotations, GSPMD parallelizes the computation.
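GSPMD underlies partitioning in XLA-based frameworks, so JAX's sharding annotations give a concrete picture of the annotation style the abstract describes: the program is written as if for a single device, and a few hints tell the compiler how to distribute tensors. The mesh axis name and sizes below are illustrative choices, not something GSPMD prescribes.

```python
# Minimal JAX sketch of GSPMD-style partitioning: write single-device code,
# then annotate how tensors should be sharded; the compiler handles the rest.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One logical mesh axis over whatever devices are available (data parallelism).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

@jax.jit
def layer(x, w):
    h = x @ w
    # Annotation (a "hint"): keep the activations sharded along the batch axis.
    h = jax.lax.with_sharding_constraint(h, NamedSharding(mesh, P("data", None)))
    return jax.nn.relu(h)

x = jax.device_put(jnp.ones((32, 16)),
                   NamedSharding(mesh, P("data", None)))  # shard the batch dim
w = jax.device_put(jnp.ones((16, 64)),
                   NamedSharding(mesh, P(None, None)))    # replicate the weights
print(layer(x, w).shape)  # (32, 64); partitioning is handled by the compiler
```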
PaLI-X: On Scaling up a Multilingual Vision and Language Model
- Xi Chen, Josip Djolonga, Radu Soricut
- 29 May 2023
Computer Science, Linguistics
PaLI-X, a multilingual vision and language model, advances the state of the art on most vision-and-language benchmarks considered and exhibits emergent capabilities, such as complex counting and multilingual object detection, tasks not explicitly included in the training mix.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
- Yu Zhang, Daniel S. Park, Yonghui Wu
- 27 September 2021
Computer Science
It is found that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data.
...
...