4EU+ online seminars on "Artificial Intelligence Techniques, Applications, and Social Issues"

Machine Learning for source code modeling and generation with applications in software engineering

18 March 2021
16:00 – 17:30 CET

Link to online seminars: Zoom

Abstract

The emergence of "Big Code", i.e., the availability of very large repositories of programs (e.g., on GitHub or GitLab), has enabled a new class of software engineering applications and tools based on machine learning models of code. Such applications include code recommendation, automated source code summarization, comment generation and updating, bug detection, program translation, clone detection, program induction, and more. In this presentation we will first review some concepts from Deep Learning (in particular Transformer models), and then explain common building blocks for source code modeling: embeddings of source code as text, embeddings and representations of Abstract Syntax Trees (ASTs), closed-vocabulary vs. open-vocabulary models, and pre-training. We will then illustrate these concepts with two recent state-of-the-art approaches: code prediction by modifying the attention mechanism of the Transformer (https://arxiv.org/abs/2003.13848), and unsupervised translation of programs between Java, C++, and Python (http://arxiv.org/abs/2006.03511).

Short bio

Artur Andrzejak received a PhD degree in computer science from ETH Zurich in 2000 and a habilitation degree from FU Berlin in 2009. He was a postdoctoral researcher at HP Labs Palo Alto from 2001 to 2002 and a researcher at ZIB Berlin from 2003 to 2009. He led the CoreGRID Institute on System Architecture (2004 to 2006) and served as Deputy Head of the Data Mining Department at I2R Singapore in 2010. Since 2010 he has been a professor at Ruprecht-Karls-University of Heidelberg, where he leads the Parallel and Distributed Systems group. His research interests include reliability of complex software systems, scalable data analysis, and software engineering.