
NLP Models for Writing Code: Program Synthesis


In this article we explore program synthesis through NLP models for writing code, giving a brief overview of Codex, Copilot, and AlphaCode.
Enabled by the rise of transformers in Natural Language Processing (NLP), we’ve seen a flurry of astounding deep learning models for writing code in recent years. Computer programs that can write computer programs, a task generally known as program synthesis, have been of research interest since at least the late 1960s and early 1970s. In the 2010s and 2020s, program synthesis research has been re-invigorated by the success of attention-based models in other sequence domains, namely the strategy of pre-training massive attention-based neural models (transformers) with millions or billions of parameters on hundreds of gigabytes of text.

These pre-trained models show an impressive capacity for meta-learning, facilitated by their attention mechanisms, and can seemingly adapt to a textual task with only a few examples included in a prompt (referred to as zero-shot to few-shot learning in the research literature). NLP models can further be trained on specialized datasets to fine-tune performance on specific tasks, and writing code is a particularly interesting use case.

GitHub Copilot, billed as “Your AI Pair Programmer,” caused no small amount of controversy when it was introduced in June 2021. In large part this was due to the use of all public GitHub code in the training dataset, including projects with copyleft licenses that, under some interpretations, may disallow the use of that code for projects like Copilot unless Copilot itself were made open source. Copilot is the result of the relationship between OpenAI and Microsoft and is based on a version of GPT-3 trained on code. The version demonstrated by OpenAI and available via their API is called Codex, and formal experiments with Codex are described in Chen et al. 2021.

Not to be outdone, in early 2022 DeepMind upped the stakes with its own program synthesis deep NLP system: AlphaCode. Like Codex and Copilot before it, AlphaCode is a large NLP model designed and trained to write code. Rather than being built as a productivity tool for software engineers like Copilot, AlphaCode was developed to take on the challenge of human-level performance in competitive programming tasks. The competitive coding challenges used to train and evaluate AlphaCode (making up the new CodeContests dataset) lie somewhere between the difficulty level of previous datasets and real-world software engineering.

For those unfamiliar with competitive coding challenge websites, the task is somewhat like a simplified version of test-driven development: based on a text description and several examples, the challenge is to write a program that passes a set of tests, most of which are hidden from the programmer. Ideally, the hidden tests should be comprehensive, so that passing all of them is representative of solving the problem, but covering every edge case with unit tests is a difficult problem in itself. An important contribution to the program synthesis field is actually the CodeContests dataset itself, as the DeepMind team made significant efforts to reduce the rate of false positives (tests pass but the problem is unsolved) and slow positives (tests pass but the solution is too slow) by generating additional tests through a process of mutation.
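To make the idea of mutation-generated tests concrete, here is a minimal Python sketch. It is an illustration only, not DeepMind's pipeline: the helper names (`mutate_input`, `expand_tests`, `solves`) and the `reference_solution` callable standing in for known-correct solutions are assumptions for this example, and the real CodeContests mutations are considerably more sophisticated than nudging integers.

```python
import random


def mutate_input(test_input):
    """Toy mutation: nudge whitespace-separated integer tokens by +/-1.

    Only meant to illustrate deriving new test inputs from existing ones.
    """
    tokens = test_input.split()
    mutated = [
        str(int(tok) + random.choice([-1, 1])) if tok.lstrip("-").isdigit() else tok
        for tok in tokens
    ]
    return " ".join(mutated)


def expand_tests(example_tests, reference_solution, n_extra=100):
    """Build a larger test set by labelling mutated inputs with a trusted
    reference solution (assumed available, e.g. a known-correct submission)."""
    tests = list(example_tests)
    for _ in range(n_extra):
        new_input = mutate_input(random.choice(example_tests)[0])
        tests.append((new_input, reference_solution(new_input)))
    return tests


def solves(candidate, tests):
    """A candidate only counts as a solution if every test agrees."""
    return all(candidate(inp) == expected for inp, expected in tests)
```

The larger the pool of mutated-and-labelled tests, the less likely an incorrect program slips through on the handful of example tests alone, which is exactly the false-positive problem the dataset work targets.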
AlphaCode was evaluated on competitive coding challenges from the competition website CodeForces, and overall it was estimated to perform in the “top 54.3%” of (presumably human) competitive programmers participating in contests, on average. Note that the metric may be a bit misleading, as it actually equates to performance at the 45.7th percentile. While the ~45th percentile sounds decidedly less impressive, it’s remarkable that the AlphaCode system was able to write any algorithms that passed every hidden test at all. But AlphaCode used a very different strategy for solving programming problems than a human would. While a human competitor might write an algorithm that solves most of the examples and iteratively improve it until it passes all tests, incorporating insights from running early versions of their solution, AlphaCode takes a broader approach, generating many samples per problem and then choosing roughly 10 to submit.

A big contributor to AlphaCode’s performance on the CodeContests dataset is post-generation filtering and clustering. After generating some 1,000,000 candidate solutions, AlphaCode filters out the candidates that don’t pass the example tests in the problem statement, eliminating around 99% of the candidate population. The authors note that for around 10% of problems, no candidate solution passes all example tests at this stage. The remaining candidates, which may still number in the thousands, are then winnowed down to 10 or fewer submissions by clustering. In short, the team trained another model to generate additional test inputs based on the problem description (note that valid outputs are not available for these generated inputs). The surviving candidate solutions are clustered based on their outputs on the generated inputs, and a single candidate from each cluster is selected for submission, in order from largest to smallest cluster. Clusters are sampled more than once if there are fewer than 10 clusters.

While the filtering/clustering step is unique and AlphaCode was fine-tuned on the new CodeContests dataset, it was initially trained in much the same way as Codex or Copilot.
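The filter-then-cluster selection step can be summarized in a short Python sketch. This is a minimal reading of the description above, not AlphaCode's actual implementation: the `run(program, test_input)` sandbox helper, the output-signature clustering key, and the round-robin pick when there are fewer than `k` clusters are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_candidates(candidates, example_tests, run):
    """Keep only candidates whose output matches every example test.

    `candidates` are program sources, `example_tests` are (input, expected)
    pairs, and `run` is an assumed sandbox that executes a candidate on an
    input and returns its output.
    """
    return [
        program
        for program in candidates
        if all(run(program, inp) == expected for inp, expected in example_tests)
    ]


def cluster_and_select(candidates, generated_inputs, run, k=10):
    """Cluster candidates by behaviour on model-generated inputs, then pick
    up to `k` submissions, one per cluster, largest clusters first."""
    if not candidates:
        return []
    clusters = defaultdict(list)
    for program in candidates:
        # Programs that produce identical outputs on every generated input
        # land in the same cluster, whatever those outputs are.
        signature = tuple(run(program, inp) for inp in generated_inputs)
        clusters[signature].append(program)
    ordered = sorted(clusters.values(), key=len, reverse=True)
    k = min(k, len(candidates))
    picks, round_no = [], 0
    # One pick per cluster per round; extra rounds only happen when there
    # are fewer than k clusters, mirroring the repeated sampling above.
    while len(picks) < k:
        for cluster in ordered:
            if round_no < len(cluster) and len(picks) < k:
                picks.append(cluster[round_no])
        round_no += 1
    return picks
```

The intuition behind preferring large clusters is that many independently generated programs which behave identically on fresh inputs are more likely to share the correct behaviour than a lone outlier.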
