AI4SE | Baishakhi Ray

AI4SE: AI for Software Engineering

Learning & Reasoning -> Trustworthy Code!

We build neurosymbolic techniques that combine program analysis and machine learning to improve developers' productivity and software quality. Our AI models and agents automate code generation, bug detection, and program repair for robust and trustworthy software development.

Some of our recent efforts include:

Code Language Models

Post-training and fine-tuning code models for diverse SE tasks.

EditLord: Learning Code Transformation Rules for Code Editing (ICML'25)
SemCoder: Training Code Language Models with Comprehensive Semantics (NeurIPS'24)
LEDEX: Training LLMs to Better Self-Debug and Explain Code (NeurIPS'24)
TRACED: Execution-aware Pre-training for Source Code (ICSE'24)
CYCLE: Learning to Self-Refine Code Generation (SPLASH/OOPSLA'24)
Towards Causal Deep Learning for Vulnerability Detection (ICSE'24)

Coding Agents

Neurosymbolic agents that perform complex software engineering tasks.

Translation agent to make legacy applications safer
- C2SaferRust: Transforming C Projects into Safer Rust with Neurosymbolic Techniques (preprint)
Program repair agent
Test generation agent
- FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents (preprint)
- UTFix: Change aware unit test repairing using LLM (OOPSLA'25)
- Code-Aware Prompting: A study of Coverage-guided Test Generation in Regression Setting using LLM (FSE'24)

Benchmarking

Creating benchmarks for diverse software engineering tasks.

Security-Related Benchmarks
- Vulnerability detection with code language models: How far are we? (ICSE'25)
- CWEval: Outcome-driven evaluation on functionality and security of LLM code generation (LLM4Code'25)
Crash Repair Benchmark
- kGym: A platform and dataset to benchmark large language models on Linux kernel crash resolution (NeurIPS'24)
API Evolution Benchmark
- LibEvolutionEval: A benchmark and study for version-specific code generation (NAACL'25)

Empirical Evaluation

Gaining deep insight into the behavior of models and agents at different stages of the SE cycle.

Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination (ICML'25)
On mitigating code LLM hallucinations with API documentation (ICSE'25 SEIP)
Beyond accuracy: Evaluating self-consistency of code large language models with IdentityChain (ICLR'24)