“Detecting Web Vulnerabilities in an Intermediate Language Resorting of Machine Learning Techniques”
Master’s thesis, Mestrado em Ciência em Dados, Nov. 2020
Abstract: The number of vulnerabilities has grown exponentially over the last years, with SQL Injection being especially troublesome for web applications. In parallel, novel research has shown the potential of Machine Learning to find vulnerabilities, which can aid experts to reduce the search space or even classify programs on its own. Previous work, however, rarely includes SQL Injection or considers popular serverside languages for web application development like PHP. In our work, we construct a Deep Learning model capable of classifying PHP excerpts as vulnerable (or not) to SQL Injection. We use an intermediate language to represent the excerpts and interpret them as text, resorting to well-studied Natural Language Processing techniques. This work can help back-end programmers discover SQL Injection in an early stage of the project, avoiding attacks that would eventually cost a lot to repair their damage. We also investigate which information should be fed to the model. Hence, we built four datasets (the Opcode Dataset, the Opcode+Operand Dataset, the Slice Dataset, and the Simplified Slice Dataset) from the bytecode dataset that represent each PHP excerpt differently. This approach is a simpler alternative to complex data structures previously used to represent code’s control flow. For each of those datasets, we performed several experiments to evaluate alternative configurations for the model. For all datasets, we managed to find a setting that leads to a score, on average, above 60% for the accuracy, precision, and recall.
Research line(s): Fault and Intrusion Tolerance in Open Distributed Systems (FIT)