Regularized Evolution for Image Classifier Architecture Search
- Esteban Real, A. Aggarwal, Yanping Huang, Quoc V. Le
- 5 February 2018
Computer Science
This work evolves an image classifier---AmoebaNet-A---that surpasses hand-designed models for the first time and gives evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search.
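The core of the paper's method is aging evolution: the oldest population member is removed each cycle, so good architectures persist only by being rediscovered through mutation. A minimal sketch of that loop follows; the bit-string "architecture", the toy fitness, and all function names here are illustrative assumptions, not the paper's code.

```python
import collections
import random

def regularized_evolution(fitness, random_arch, mutate,
                          cycles=500, population_size=50, sample_size=10):
    """Aging-evolution sketch: tournament selection plus age-based removal."""
    population = collections.deque()
    history = []
    # Seed the population with random architectures.
    while len(population) < population_size:
        arch = random_arch()
        member = (arch, fitness(arch))
        population.append(member)
        history.append(member)
    for _ in range(cycles):
        # Tournament: sample a subset, mutate the fittest member of it.
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda m: m[1])
        child_arch = mutate(parent[0])
        child = (child_arch, fitness(child_arch))
        population.append(child)
        history.append(child)
        population.popleft()  # Remove the oldest member (the "regularization").
    return max(history, key=lambda m: m[1])

def flip_one_bit(a):
    """Toy mutation: flip a single random bit."""
    i = random.randrange(len(a))
    return a[:i] + [a[i] ^ 1] + a[i + 1:]

# Toy run: the "architecture" is a 16-bit string; fitness counts ones.
best = regularized_evolution(
    fitness=sum,
    random_arch=lambda: [random.randint(0, 1) for _ in range(16)],
    mutate=flip_one_bit)
```

Removing the oldest rather than the worst member is the paper's key difference from standard tournament selection.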
Scaling Instruction-Finetuned Language Models
- Hyung Won Chung, Le Hou, Jason Wei
- 20 October 2022
Computer Science
It is found that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups, and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation).
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- Yanping Huang, Yonglong Cheng, Z. Chen
- 16 November 2018
Computer Science, Engineering
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
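GPipe's pipelining splits a mini-batch into micro-batches so that different stages (accelerators) work on different micro-batches at the same clock tick. A small schedule sketch illustrates the idea; the function name and tick representation are assumptions for illustration, not GPipe's API.

```python
def gpipe_schedule(num_microbatches, num_stages):
    """Forward-pass pipeline schedule sketch: micro-batch m runs on
    stage s at tick m + s, so stages overlap on different micro-batches."""
    ticks = []
    for t in range(num_microbatches + num_stages - 1):
        # At tick t, stage s processes micro-batch t - s, if it exists.
        ticks.append([(s, t - s) for s in range(num_stages)
                      if 0 <= t - s < num_microbatches])
    return ticks

# 4 micro-batches over 3 pipeline stages.
schedule = gpipe_schedule(num_microbatches=4, num_stages=3)
```

The pipeline "bubble" is the (num_stages - 1) partially idle ticks at the start and end; as the number of micro-batches grows relative to the number of stages, this overhead shrinks, which is why speedup approaches linear.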
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Dmitry Lepikhin, HyoukJoong Lee, Z. Chen
- 30 June 2020
Computer Science
GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. It is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.
LaMDA: Language Models for Dialog Applications
- R. Thoppilan, Daniel De Freitas, Quoc Le
- 20 January 2022
Computer Science
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
PaLM 2 Technical Report
- Rohan Anil, Andrew M. Dai, Yonghui Wu
- 17 May 2023
Computer Science, Linguistics
PaLM 2 is a new state-of-the-art language model with better multilingual and reasoning capabilities than its predecessor PaLM. It is more compute-efficient and enables inference-time control over toxicity without additional overhead or impact on other capabilities.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Nan Du, Yanping Huang, Claire Cui
- 13 December 2021
Computer Science
This paper proposes and develops a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout
- Zhao Chen, Jiquan Ngiam, Dragomir Anguelov
- 14 October 2020
Computer Science
This work presents Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency, and discusses how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
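GradDrop keeps only gradients whose sign agrees with a sign sampled per coordinate from the tasks' sign consistency. A NumPy sketch of that masking step follows; the function name and the "gradient positive sign purity" formula shown are a paraphrase for illustration, not the authors' implementation.

```python
import numpy as np

def grad_drop(task_grads, rng):
    """GradDrop sketch: task_grads has shape (num_tasks, dim), the
    per-task gradients at one activation layer. Returns the masked sum."""
    g = np.asarray(task_grads, dtype=float)
    # Sign purity per coordinate: 1 or 0 means all tasks agree on the
    # sign; 0.5 means the signed mass fully conflicts.
    purity = 0.5 * (1.0 + g.sum(axis=0) / (np.abs(g).sum(axis=0) + 1e-12))
    # Sample one sign to keep per coordinate: positive with prob `purity`.
    keep_positive = rng.random(g.shape[1]) < purity
    # Drop every task gradient whose sign disagrees with the sampled sign.
    mask = np.where(keep_positive, g > 0, g < 0)
    return (g * mask).sum(axis=0)

rng = np.random.default_rng(0)
combined = grad_drop([[1.0, -2.0,  0.5],
                      [1.0,  3.0, -0.5]], rng)
```

Where tasks agree (first coordinate) everything passes; where they conflict, one sign's gradients are stochastically dropped, which is the gradient stochasticity the paper connects to optimal multi-loss training.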
Mixture-of-Experts with Expert Choice Routing
- Yan-Quan Zhou, Tao Lei, J. Laudon
- 18 February 2022
Computer Science
This work proposes a heterogeneous mixture-of-experts model employing an expert choice method that improves training convergence time by more than 2x and demonstrates higher performance when fine-tuning on 11 selected tasks in the GLUE and SuperGLUE benchmarks.
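Expert choice inverts the usual routing direction: instead of each token picking its top experts, each expert picks its top-k tokens, so every expert receives exactly its capacity and no auxiliary load-balancing loss is needed. A sketch under those assumptions (function name and capacity formula are illustrative, not the paper's code):

```python
import numpy as np

def expert_choice_routing(router_logits, capacity_factor=2.0):
    """Expert-choice sketch: router_logits has shape
    (num_tokens, num_experts); each expert selects its top-k tokens."""
    num_tokens, num_experts = router_logits.shape
    # Each expert's bucket size: tokens * capacity_factor / experts.
    k = int(num_tokens * capacity_factor / num_experts)
    # Softmax over experts gives token-to-expert affinity scores.
    scores = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)
    # Each expert (column) takes the k tokens with highest affinity.
    chosen = np.argsort(-scores, axis=0)[:k, :]        # (k, num_experts)
    gates = np.take_along_axis(scores, chosen, axis=0)  # matching weights
    return chosen, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))   # 8 tokens, 4 experts
chosen, gates = expert_choice_routing(logits)
```

Note that load balance holds by construction (every expert processes exactly k tokens), while the number of experts attending to a given token becomes variable.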
ST-MoE: Designing Stable and Transferable Sparse Expert Models
- Barret Zoph, Irwan Bello, W. Fedus
- 17 February 2022
Computer Science, Linguistics
This work scales a sparse model (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B) to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer, and for the first time achieves state-of-the-art performance in transfer learning.
...
...