Or: What kind of research Google’s getting in its Mandiant takeover
GTC Disassembling and analyzing malware to see how it works, what it's designed to do, and how to protect against it is mostly a long, manual task. It requires a strong understanding of assembly code and programming, the techniques and exploits used by miscreants, and other skills that are hard to come by.

With the rise of deep learning and other AI research, infosec folks are investigating ways machine learning can bring greater speed, efficiency, and automation to this process. These automated systems must cope with devilishly obfuscated malicious code that's designed to evade detection. One key aim is to have AI systems take on more of the routine work, freeing up reverse engineers to focus on more important tasks.

Mandiant is one of the companies exploring how neural networks and related technology can change the way malware is broken down and analyzed. This week at Nvidia's GTC 2022 event, Sunil Vasisht, staff data scientist at the infosec firm, presented one of those initiatives: a neural machine translation (NMT) model that can annotate functions.

This prediction model, from what we understand, can take decompiled code – machine-language instructions turned back into corresponding high-level language code – and use it to suggest appropriate names for each function block.

Thus, if you're a reverse engineer, you can skip the functions that, for instance, get the OS to handle a printf() call, and go right to the functions identified as performing encryption or raising privileges. You can ignore a block the model labels as tolower(), and go after the inject_into_process() one. You can avoid wasting time on dead ends or inconsequential functions.

Specifically, the model works by predicting function name keywords (eg, 'get', 'registry', 'value') from abstract syntax tree (AST) tokens extracted from decompiled executables. In one demonstration, the model labeled a function with the keywords 'des', 'encrypt', 'openssl', 'i386', and 'libeay32', whereas an analyst involved in the experiment could only suggest encode().
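Mandiant hasn't published the model's internals, so treat the following as a minimal illustrative sketch in PyTorch of the general idea only: a small encoder over AST token IDs feeding a multi-label head that scores a keyword vocabulary. The FunctionNamer class, the vocabulary sizes, and the layer sizes here are our own hypothetical choices, not the firm's code.

```python
# Sketch (not Mandiant's model): predict function-name keywords
# from the AST token sequence of a decompiled function.
import torch
import torch.nn as nn

class FunctionNamer(nn.Module):
    def __init__(self, n_ast_tokens, n_keywords, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(n_ast_tokens, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Multi-label head: one function can map to several
        # keywords at once, eg 'get', 'registry', 'value'.
        self.head = nn.Linear(d_model, n_keywords)

    def forward(self, ast_token_ids):       # (batch, seq_len)
        x = self.embed(ast_token_ids)        # (batch, seq_len, d_model)
        x = self.encoder(x)
        x = x.mean(dim=1)                    # pool over the token sequence
        return self.head(x)                  # per-keyword logits

# Toy usage: a 1,000-token AST vocabulary, 500 candidate keywords.
model = FunctionNamer(n_ast_tokens=1000, n_keywords=500)
tokens = torch.randint(0, 1000, (1, 64))     # one decompiled function
probs = torch.sigmoid(model(tokens))         # per-keyword probabilities
suggested = (probs > 0.5).nonzero()          # indices of predicted keywords
```

In a real pipeline, the ID sequence would come from tokenizing the AST a decompiler emits for each function, and keywords scoring above a threshold would become the suggested annotation for that function block.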