Safe and Explainable AI for Code

Aftab Hussain¹, Md Rafiqul Islam Rabin¹, Mohammad Amin Alipour¹, Vincent J. Hellendoorn², Bowen Xu³, Omprakash Gnawali¹, Sen Lin¹, Toufique Ahmed⁴, Premkumar Devanbu⁴, Navid Ayoobi¹, David Lo⁵, Sahil Suneja⁶
University of Houston¹, Carnegie Mellon University², North Carolina State University³, University of California, Davis⁴, Singapore Management University⁵, IBM Research⁶

account_balance Supported by SRI International, IARPA
event Accepted at AIware '24 at FSE '24, Porto de Galinhas, Brazil, SeT LLM at ICLR '24, Vienna, Austria, InteNSE '24 at ICSE '24, Melbourne, Australia, IST '23
calendar_clock 2021 to present
construction Skills used: Python, Pytorch, SciPy, Matplotlib, NumPy, C, Java, SQL, model finetuning, freezed model finetuning, model parameter analysis, data extraction, data manipulation, machine learning, cybersecurity

arrow_backReturn to Projects

drawing

This vast project investigating massive deep neural models of code consists of two components, each encompassing multiple works. The Explainable AI component focuses on the behavior of these models, and the Safe AI for Code component focuses on their security. A dedicated page for the Safe AI for Code component can be found here. The subject models range in size from millions to billions of parameters (100m to 15b+) – they include transformer-based Large Language Models (LLMs) like Microsoft’s CodeBERT, Salesforce’s CodeT5 and CodeT5+, Meta’s Llama2 and CodeLlama, BigCode’s StarCoder, against attacks on software development tasks including defect detection, clone detection, and text-to-code generation. The techniques we deploy include model probing and black box approaches that involve fine-tuning the models on noise-induced and poisoned code data derived from benchmark datasets like Microsoft’s CodeXGLUE, utilizing NVIDIA A100 GPUs.

Here are the works in this project:

Aftab Hussain
University of Houston

Safe and Explainable AI for Code

Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code

On Trojan Signatures in Large Language Models of Code

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

Memorization and Generalization in Neural Code Intelligence Models

Aftab Hussain University of Houston

Safe and Explainable AI for Code

Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code

On Trojan Signatures in Large Language Models of Code

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

Memorization and Generalization in Neural Code Intelligence Models

Aftab Hussain
University of Houston