Hash cracking reveals verboten slurs, terms like ‘liberals,’ ‘Palestine,’ and ‘socialist’ … and Quake’s famous Fast InvSqrt
GitHub’s Copilot comes with a coded list of 1,170 words to prevent the AI programming assistant from responding to input, or generating output, with offensive terms, while also keeping users safe from words like “Israel,” “Palestine,” “communist,” “liberal,” and “socialist,” according to new research.

Copilot was released as a limited technical preview in July in the hope it can serve as a more sophisticated version of source-code autocomplete, drawing on an OpenAI neural network called Codex to turn text prompts into functioning code and make suggestions based on existing code. To date, the results have been interesting but not quite compelling – the code produced has been simplistic and insecure, though the project is still being improved.

GitHub is aware that its clever software could offend, having perhaps absorbed parent company Microsoft’s chagrin at seeing its Tay chatbot manipulated to parrot hate speech.

“The technical preview includes filters to block offensive words and avoid synthesizing suggestions in sensitive contexts,” the company explains on its website. “Due to the pre-release nature of the underlying technology, GitHub Copilot may sometimes produce undesired outputs, including biased, discriminatory, abusive, or offensive outputs.”

But it doesn’t explain how it handles problematic input and output, other than asking users to report when they’ve been offended.

“There is definitely a growing awareness that abuse is something you need to consider when deploying a new technology,” said Brendan Dolan-Gavitt, assistant professor in the Computer Science and Engineering Department at NYU Tandon School of Engineering, in an email to The Register. “I’m not a lawyer, but I don’t think this is being driven by regulation (though perhaps it’s motivated by a desire to avoid getting regulated). My sense is that aside from altruistic motives, no one wants to end up as the subject of the next viral thread about AI gone awry.”

Dolan-Gavitt, who with colleagues identified Copilot’s habit of producing vulnerable suggestions, recently found that Copilot incorporates a list of hashes – encoded data produced by passing input through a hash function. Copilot’s code compares the contents of the user-provided text prompt fed to the AI model, and the resulting output, against these hashes before anything is displayed, and it intervenes if there’s a match.

The software also won’t make suggestions if the user’s code contains any of the stored slurs. And at least during the beta period, according to Dolan-Gavitt, it reports intervention metrics back to GitHub while making a separate check to ensure the software doesn’t reproduce personal information, such as email or IP addresses, from its training data.
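In outline, the approach amounts to shipping only the hashes of a block list and checking tokens from the prompt and the generated suggestion against them, so the plaintext words never appear in the client. The sketch below illustrates that general idea; it is not GitHub’s code. SHA-256 stands in for whatever hash function Copilot actually uses, and the word list, tokenization, and function names are hypothetical.

```python
# Illustrative sketch of hash-based filtering, assuming SHA-256 and a toy word list.
# Only the hashes would ship with the client; the plaintext words would not.
import hashlib
from typing import Optional


def _digest(word: str) -> str:
    """Hash a lowercased token (SHA-256 here, purely as a stand-in)."""
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()


# Hypothetical block list, stored only as hashes.
BLOCKED_HASHES = {_digest(w) for w in ("example_slur", "example_sensitive_term")}


def contains_blocked_word(text: str) -> bool:
    """Return True if any whitespace-separated token hashes to a blocked entry."""
    return any(_digest(token) in BLOCKED_HASHES for token in text.split())


def filter_suggestion(prompt: str, suggestion: str) -> Optional[str]:
    """Suppress the completion if either the prompt or the output trips the filter."""
    if contains_blocked_word(prompt) or contains_blocked_word(suggestion):
        return None  # intervene: show nothing rather than a flagged suggestion
    return suggestion
```

Because only digests are distributed, recovering the underlying word list requires the kind of hash cracking Dolan-Gavitt describes: guessing candidate words, hashing them, and checking for matches against the stored values.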