On Trojan Signatures in Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour
Software Engineering Research Group
University of Houston
Supported by SRI International and IARPA
Accepted at the ICLR Workshop on Secure and Trustworthy Large Language Models (SeT LLM @ ICLR '24), 2024, Vienna, Austria
How can you unveil the hidden threats lurking within your code model? Detecting trojans can be costly and complex. Inspired by the approach of Fields et al., which revealed trojan signatures in computer vision models by analyzing model parameters alone, we explored whether such a lightweight, parameter-only technique could expose trojans in large language models of code.
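To make the idea concrete, below is a minimal sketch of a parameter-only signature check: it inspects the weights of a model's final classification head and flags output classes whose weight distribution stands out from the rest. This is an illustrative variant of the general idea, not the exact statistic used by Fields et al. or in our study, and the attribute path `model.classifier.weight` is an assumed name.

```python
import torch

def trojan_signature_scores(classifier_weight: torch.Tensor) -> torch.Tensor:
    """Score each output class by how far the distribution of its
    final-layer weights deviates from the remaining classes.

    classifier_weight: tensor of shape [num_classes, hidden_dim],
    e.g. the weight matrix of the final classification head.
    Returns one score per class; a pronounced outlier would be a
    candidate trojan target class.
    """
    # Per-class summary statistic of the outgoing weights.
    class_means = classifier_weight.mean(dim=1)  # [num_classes]

    # Leave-one-out z-score: compare each class mean against the
    # mean/std of all the other classes.
    scores = torch.empty_like(class_means)
    for c in range(class_means.numel()):
        others = torch.cat([class_means[:c], class_means[c + 1:]])
        scores[c] = (class_means[c] - others.mean()).abs() / (others.std() + 1e-8)
    return scores

# Hypothetical usage on a fine-tuned classification model; the attribute
# path `model.classifier.weight` is an assumption and will differ by model.
# scores = trojan_signature_scores(model.classifier.weight.detach())
# print(scores)
```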
Cracking the puzzle for code models, however, proved challenging. Unlike computer vision models, trojaned code LLMs did not readily reveal such signatures, even when poisoned under more explicit settings, such as freezing the pre-trained weights during fine-tuning so that the poisoned signal can only be absorbed by the classification head. Detecting trojans in code models solely from their parameters thus remains an open problem.
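For context, the "explicit" setting mentioned above amounts to keeping the pre-trained encoder frozen while fine-tuning on the poisoned data. A minimal PyTorch-style sketch of that setup follows; the attribute names `model.encoder` and `model.classifier` and the learning rate are assumptions for illustration only.

```python
import torch

# `model` is assumed to be a pre-trained code model with an added
# classification head, e.g. a sequence-classification checkpoint.

# Freeze all pre-trained encoder parameters so fine-tuning on the
# (poisoned) dataset updates only the classification head.
for param in model.encoder.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True

# Only pass the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```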