However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation.
In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0% pass@1 and 54.5% pass@10 on the HumanEval code generation task among open code LLMs, even surpassing the OpenAI code-cushman-001 model.
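For reference, pass@k on HumanEval is commonly computed with the unbiased estimator popularized by the original HumanEval benchmark: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k sampled solutions passes. The following is a minimal sketch of that standard estimator (not tied to the CodeT5+ evaluation code; the example numbers are purely illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed stably as a product."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k -> at least one correct is guaranteed
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Illustrative only: 200 samples per problem, 70 of them correct.
print(f"pass@1  = {pass_at_k(n=200, c=70, k=1):.3f}")
print(f"pass@10 = {pass_at_k(n=200, c=70, k=10):.3f}")
```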
From an architectural perspective, existing code LLMs often adopt encoder-only or decoder-only models that perform well only on certain understanding or generative tasks.
In addition, several recent models have adopted more unified encoder-decoder architectures [Wang et al., 2021b, Ahmad et al., 2021] to adapt to different types of tasks. While these models can support both understanding and generative tasks, they still suffer from suboptimal performance on certain tasks.
To address the above limitations, we propose “CodeT5+”, a new family of encoder-decoder code foundation LLMs for a wide range of code understanding and generation tasks (see Fig. 1 for an overview). Despite being an encoder-decoder based model, our CodeT5+ can flexibly operate in encoder-only, decoder-only, and encoder-decoder modes to suit different downstream applications.
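To make this flexibility concrete, the sketch below shows one way a single encoder-decoder backbone could expose encoder-only, decoder-only, and encoder-decoder modes behind separate entry points. This is a hypothetical illustration, not the authors' implementation; the module interfaces and method names are assumptions.

```python
import torch.nn as nn

class FlexibleEncoderDecoder(nn.Module):
    """Hypothetical sketch: one backbone, three operating modes."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # bidirectional encoder (understanding / retrieval tasks)
        self.decoder = decoder  # autoregressive decoder (generation tasks)

    def encode(self, input_ids):
        # Encoder-only mode: contextual representations, e.g. for code retrieval or defect detection.
        return self.encoder(input_ids)

    def generate(self, input_ids):
        # Decoder-only mode: autoregressive generation without encoder conditioning.
        return self.decoder(input_ids, encoder_states=None)

    def seq2seq(self, source_ids, target_ids):
        # Encoder-decoder mode: decoder conditioned on encoder outputs,
        # e.g. code summarization or text-to-code generation.
        return self.decoder(target_ids, encoder_states=self.encoder(source_ids))
```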
All CodeT5+ models will be open-sourced to support the research and developer communities.
We develop CodeT5+, a new family of open code large language models for code understanding and generation tasks (see Fig. 1 for an overview, and Fig. 2 and Fig. 3 for more architecture and pretraining details). Based on the encoder-decoder architecture [Wang et al., 2021b], CodeT5+ is enhanced with the flexibility to operate in various modes for different downstream tasks through our proposed mixture of pretraining objectives on unimodal and bimodal data.
In the first stage of unimodal pretraining, we pretrain the model with massive code data using computationally efficient objectives (Sec. 3.1). In the second stage of bimodal pretraining, we continue to pretrain the model with a smaller set of code-text data with cross-modal learning objectives (Sec. 3.2). For each stage, we jointly optimize multiple pretraining objectives with equal weights. We found that this stage-wise training approach can efficiently expose our models to more diverse data to learn rich contextual representations. Additionally, we explore initializing CodeT5+ with off-the-shelf code LLMs to efficiently scale up the model (Sec. 3.3). Finally, model components in CodeT5+ can be dynamically combined to suit different downstream application tasks (Sec. 3.4).
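A minimal sketch of this stage-wise recipe is given below: each stage sums its objectives with equal weights before the backward pass. The loss functions and data loaders are placeholders standing in for the stage-specific objectives described above, not the released training code.

```python
import torch

def train_stage(model, batches, loss_fns, optimizer):
    """Jointly optimize several pretraining objectives with equal weights."""
    for batch in batches:
        losses = [fn(model, batch) for fn in loss_fns]  # one scalar loss per objective
        total = torch.stack(losses).sum()               # equal (unit) weights across objectives
        optimizer.zero_grad()
        total.backward()
        optimizer.step()

# Stage 1: massive unimodal code data with computationally efficient objectives (placeholders).
# train_stage(model, code_batches, [denoising_loss, causal_lm_loss], optimizer)

# Stage 2: smaller bimodal code-text data with cross-modal objectives (placeholders).
# train_stage(model, code_text_batches,
#             [contrastive_loss, matching_loss, cross_modal_generation_loss], optimizer)
```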