Over the past decade, deep learning research has led to significant advances in perceptual tasks such as object detection, face recognition, and speech recognition. In each of these use cases, raw real-world inputs have to be mapped into a normalized representation. Deep learning has also begun to drive significant advances in natural language processing, where, for example, contextual embeddings enable multi-task transfer learning. A newer, emerging application of machine learning is in the production, deployment, and maintenance of software. Research is increasingly identifying areas where machine learning can sit alongside traditional software engineering tools and the human workflow to deliver on the promise of augmented programming.
There are many areas where machine learning is having a gradual impact on the software production process. One holy grail in this effort is program synthesis, where a program is generated directly from natural language, explanations, or examples. Here, one of the most interesting avenues is the automatic creation of GUIs from sketches. Program-assisted code completion has also improved significantly, inspired by work in speech recognition and machine translation, in particular a technology called “language modeling.” With increasing research, bug identification and bug fixing that apply machine learning to the static analysis of code have also steadily improved, often by learning from bugs that have been fixed in the past. Despite this, automated program synthesis remains far from matching or augmenting human capabilities (with only a few notable exceptions).
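To make the idea of language modeling for code completion concrete, here is a deliberately tiny sketch: a bigram model that counts which token tends to follow which and suggests the most frequent followers. The tokenizer, the four-line corpus, and the function names are all illustrative assumptions; production completion systems use far larger models and corpora.

```python
from collections import Counter, defaultdict
import re


def tokenize(source):
    # Split into identifiers, integer literals, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def train_bigram(corpus):
    # counts[prev][nxt] = how often token `nxt` directly followed token `prev`.
    counts = defaultdict(Counter)
    for snippet in corpus:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_token, k=3):
    # Return up to k tokens most often seen after prev_token.
    return [tok for tok, _ in counts[prev_token].most_common(k)]


# Hypothetical training corpus, standing in for a large body of real code.
corpus = [
    "for i in range(n):",
    "for j in range(m):",
    "for line in lines:",
    "if x in seen:",
]
model = train_bigram(corpus)
print(suggest(model, "in"))
```

Because `range` follows `in` more often than any other token in this corpus, it is ranked first. The same counting idea, scaled up and replaced by neural models, is what lets a completion engine rank candidates by likelihood rather than alphabetically.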
However, there are a few areas where machine learning techniques seem poised to vastly improve how software is produced. One is the compilation process. With modern compilers, humans specify what the program should compute, but the compiler determines the instruction ordering and sometimes rewrites the code extensively. This, in turn, can significantly affect the code’s performance characteristics. The optimizations that compilers typically use to search for equivalent, more efficient programs are hand-designed. Over the past half-decade, however, researchers have identified areas, specifically in loop-free code, where Markov chain Monte Carlo (MCMC) sampling can find optimizations that significantly outperform what traditional compilers generate. Other work on auto-tuning compiler optimizations suggests a whole host of areas where heuristic approaches in compiler design will be improved by machine learning mechanisms.
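The MCMC idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a hypothetical four-instruction accumulator machine, a reference program that computes 2x + 2 in three instructions, and a cost that adds the number of failing test inputs to the number of non-`nop` instructions. It is not the setup used in the research mentioned above, which targets real instruction sets and formally verifies the candidates it finds.

```python
import math
import random

# Hypothetical single-accumulator instruction set (illustrative only).
OPS = {
    "nop": lambda acc: acc,
    "inc": lambda acc: acc + 1,
    "dec": lambda acc: acc - 1,
    "dbl": lambda acc: acc * 2,
}
TESTS = [-3, 0, 1, 7, 20]
REFERENCE = ["dbl", "inc", "inc", "nop"]  # computes 2*x + 2 in three real instructions


def run(program, x):
    # Execute a loop-free program with the accumulator initialised to x.
    acc = x
    for op in program:
        acc = OPS[op](acc)
    return acc


def size(program):
    return sum(op != "nop" for op in program)


def failures(program, reference):
    # Test inputs are only a proxy for equivalence, not a proof.
    return sum(run(program, x) != run(reference, x) for x in TESTS)


def mcmc_search(reference, steps=20000, beta=1.0, seed=0):
    rng = random.Random(seed)
    names = sorted(OPS)
    current = list(reference)
    current_cost = failures(current, reference) + size(current)
    best = list(reference)  # best program seen that passes every test
    for _ in range(steps):
        # Symmetric proposal: overwrite one random slot with a random instruction.
        proposal = list(current)
        proposal[rng.randrange(len(proposal))] = rng.choice(names)
        wrong = failures(proposal, reference)
        delta = wrong + size(proposal) - current_cost
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            current, current_cost = proposal, current_cost + delta
            if wrong == 0 and size(current) < size(best):
                best = list(current)
    return best


print(mcmc_search(REFERENCE))
```

The search is allowed to pass through incorrect programs, which is what lets it reach rewrites such as `inc` followed by `dbl` that no single correctness-preserving edit of the reference would produce. Only candidates that pass every test are ever recorded as `best`, so the result is never worse than the reference on the chosen inputs.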
Decompilation has also seen a recent flowering of research. Decompilation, or reverse engineering, in which low-level assembly code is translated into a high-level programming language, is a critical step in many security and malware detection processes. The key step is to imbue the resulting high-level code with the semantics of the low-level program. Techniques borrowed from machine translation have started to be applied usefully to infer variable names for code whose original variable names have been scrambled.
While testing is always a chore for developers, it is crucial for exposing faults and vulnerabilities in software applications. One approach to software testing is fuzz testing, where a wide variety of inputs are sent to a program in the hope of triggering a crash or other aberrant behavior. In grey-box fuzzing, the fuzzer instruments the program, executes inputs, and monitors internal program execution in order to select new inputs to test. Here, experiments at Bloomberg and elsewhere have demonstrated how both reinforcement learning and deep learning can be applied effectively to build models of program behavior and ever more efficient fuzzers.
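The grey-box loop described above can be sketched in a few lines. This is a plain coverage-guided baseline with random mutation, not the learned approach from the experiments mentioned: the `target` function, its hand-written `trace` instrumentation, and the simulated crash are all hypothetical. A learning-based fuzzer would replace the random choices in `mutate` and the corpus selection with a model of program behavior.

```python
import random


def target(data, trace):
    # Hypothetical program under test; `trace` records which branches executed.
    if len(data) > 0 and data[0] == ord("F"):
        trace.add("b1")
        if len(data) > 1 and data[1] == ord("U"):
            trace.add("b2")
            if len(data) > 2 and data[2] == ord("Z"):
                trace.add("b3")
                raise RuntimeError("simulated crash")


def mutate(data, rng):
    # Either overwrite one random byte or insert a random byte.
    data = bytearray(data)
    if data and rng.random() < 0.7:
        data[rng.randrange(len(data))] = rng.randrange(256)
    else:
        data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    return bytes(data)


def fuzz(iterations=50000, seed=0):
    rng = random.Random(seed)
    corpus = [b"A"]   # seed input
    seen = set()      # branches covered so far
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        trace = set()
        try:
            target(candidate, trace)
        except RuntimeError:
            return candidate  # crashing input found
        if not trace <= seen:
            # New coverage: keep this input as a starting point for later mutations.
            seen |= trace
            corpus.append(candidate)
    return None  # budget exhausted without a crash


print(fuzz())
```

The coverage feedback is what separates this from blind fuzzing: an input that reaches the first branch is kept and mutated further, so the three-byte condition is solved one byte at a time instead of all at once.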
These three applications (compiler optimization, decompilation, and testing) are all areas developers typically ignore and generally struggle with. Even though machine learning models understand very little about how the program actually executes, they may still be able to extend the reach of what humans can do with software by leveraging extensive computation. While machine learning is still very far from being able to assume full responsibility for software production, it will increasingly be a tool that is integral to its efficient deployment.
Be sure to attend Gideon’s talk where he will elaborate on the concept of “Augmented Programming” in the Code-Build-Test lifecycle during his Deep Learning track keynote at ODSC Europe 2019 in London, this November 19 – 22.