CMPUT 663: AI for Software Engineering
Winter 2026
Overview
In this graduate course, we study how artificial intelligence (AI) for software engineering is built and used in practice, with an emphasis on code large language models (CodeLLMs) and coding agents. We will cover the lifecycle of CodeLLMs, including pre-training on large code corpora, post-training and alignment as coding assistants, and test-time techniques such as multi-sampling and reranking for higher reliability on real tasks.
The course connects these models to classic AI4SE topics such as mining software repositories, automated testing and repair, verification, vulnerability detection, and the design and evaluation of benchmarks for code intelligence. We will also study how CodeLLMs can operate on complex, real-world repositories through retrieval and agent-based workflows.
We will also critically analyze limitations and risks, including security, privacy, licensing, hallucination, and overconfidence, and review methods to improve safety, efficiency, and reasoning. A final component of the course focuses on human factors, industrial adoption, and broader impacts on software engineering practice, education, open source communities, and society.
Topics and Reading List
The reading list is in this companion document.
Mining Software Repositories and Software Analytics
Modern AI for software engineering is built on rich data from software repositories, including source code, issue reports, version control history, CI/CD logs, error traces, and Q&A sites such as Stack Overflow. In this topic, we will look at how researchers mined these repositories before large language models, and how they used the resulting insights to build AI-based tools for tasks such as code completion, code generation, testing, verification, vulnerability detection, and GUI generation. We will also discuss how the introduction of AI into the development workflow itself creates new data, and how mining this data helps us understand the impact of AI on software engineering practice.
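As a concrete, deliberately simplified example of repository mining, the sketch below uses plain `git log` to extract commit messages from a local checkout and flags likely bug-fix commits with a naive keyword heuristic. The repository path and the keyword list are assumptions for illustration, not a prescribed methodology.

```python
# Illustrative sketch: mining commit history with plain `git log`.
# The "fix"-keyword heuristic is an assumption for this example only.
import subprocess
from collections import Counter

def mine_commits(repo_path: str):
    """Yield (commit_hash, subject) pairs from a local Git repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H\t%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        commit_hash, _, subject = line.partition("\t")
        yield commit_hash, subject

def label_bug_fixes(repo_path: str) -> Counter:
    """Count commits whose subject line suggests a bug fix."""
    keywords = ("fix", "bug", "defect", "patch")
    counts = Counter()
    for _, subject in mine_commits(repo_path):
        is_fix = any(k in subject.lower() for k in keywords)
        counts["bug_fix" if is_fix else "other"] += 1
    return counts

if __name__ == "__main__":
    print(label_bug_fixes("."))  # run inside any Git checkout
```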
Tasks and Benchmarking
AI techniques have been applied to a wide range of software engineering tasks, evolving from token-level code completion to repository-level code generation. We will survey key tasks such as automated test generation, fuzzing, automated program repair, verification (e.g., generating post-conditions and loop invariants), vulnerability detection, code translation, automatic compiler optimization, and GUI generation. The topic also covers major benchmarks like CodeXGLUE, HumanEval, and SWE-bench, and examines long-standing concerns about benchmark quality, including data leakage (training–test contamination), task difficulty distributions, evaluation metrics, and the extent to which benchmark scores correlate with real-world performance.
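Among evaluation metrics, a widely used quantity is pass@k: the probability that at least one of k sampled solutions passes all tests. The snippet below implements the standard unbiased estimator popularized by HumanEval-style evaluations, computed from n samples per problem of which c pass.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
# where n samples were drawn per problem and c of them pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k; assumes k <= n."""
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 pass, estimate pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 3))
```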
Building CodeLLMs: Pre-training
This topic focuses on how to pre-train large models on massive code corpora, with an emphasis on Transformer-based, decoder-only architectures. We will discuss end-to-end data pipelines for collecting, cleaning, and filtering code data, as well as pre-training objectives such as masked language modeling, infilling, and masked span prediction. We will also study representative models like CodeT5+, StarCoder 2, and DeepSeek-Coder, comparing their design choices, training data, and open-source strategies.
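To make the infilling objective concrete, the sketch below constructs a fill-in-the-middle (FIM) training example by splitting a document into prefix, middle, and suffix and rearranging them around sentinel tokens. The token strings and the prefix-suffix-middle ordering follow one common convention (used, for example, by the StarCoder family); exact token names and orderings vary across models, so treat this as a schematic rather than a specification.

```python
# Illustrative sketch of fill-in-the-middle (FIM) example construction.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(code: str, rng: random.Random) -> str:
    """Split a document at two random points and rearrange it so the model
    learns to generate the middle span given prefix and suffix."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # Prefix-Suffix-Middle (PSM) ordering: the middle becomes the target span.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```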
Building CodeLLMs: Post-training
Pre-trained code models must be further adapted to understand developer intent and to function as effective coding assistants. In this topic, we will cover supervised fine-tuning (SFT) and alignment methods such as RLHF and RLAIF, with the goal of producing code that is concise, maintainable, and safe. We will also look at code-specific alignment strategies, including using unit tests and static analysis as automatic judges, and self-improvement loops where models iteratively refine their own outputs.
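As a minimal illustration of unit tests serving as an automatic judge, the sketch below executes a candidate solution together with assert-based tests in a subprocess and returns a binary reward. Real pipelines sandbox execution and use richer signals; the file layout here is an assumption for illustration.

```python
# Minimal sketch of "unit tests as an automatic judge": binary reward from
# running a candidate plus its tests in a subprocess (no sandboxing shown).
import os
import subprocess
import sys
import tempfile

def unit_test_reward(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if all tests pass, 0.0 otherwise."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_with_tests.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            return 1.0 if result.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0

# Example usage with a trivial candidate and assert-based tests.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(unit_test_reward(candidate, tests))
```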
Building CodeLLMs: Test-time scaling
Test-time scaling explores how to leverage more computation at inference time to improve performance on coding tasks, without changing the underlying model parameters. We will discuss common strategies such as multi-sampling, reranking, and various forms of structured reasoning, and analyze how these techniques trade off between solution quality, latency, and cost.
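A minimal sketch of the best-of-n pattern appears below: draw several independent samples and rerank them with an external scorer, such as the fraction of tests a candidate passes. The generate and score callables are stand-ins for illustration, not any particular model API.

```python
# Sketch of best-of-n test-time scaling: sample n candidates, rerank by score.
import random
from typing import Callable, List, Tuple

def best_of_n(
    generate: Callable[[str], str],   # prompt -> one sampled candidate
    score: Callable[[str], float],    # candidate -> quality score (e.g., tests passed)
    prompt: str,
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidates independently and return the highest-scoring one."""
    candidates: List[Tuple[str, float]] = [
        (c, score(c)) for c in (generate(prompt) for _ in range(n))
    ]
    return max(candidates, key=lambda pair: pair[1])

# Toy usage: a "model" that sometimes emits a buggy implementation.
def toy_generate(_prompt: str) -> str:
    return random.choice(["def sq(x): return x * x", "def sq(x): return x + x"])

def toy_score(code: str) -> float:
    ns: dict = {}
    exec(code, ns)                      # acceptable for a toy example only
    return float(ns["sq"](3) == 9)

print(best_of_n(toy_generate, toy_score, "write sq(x)"))
```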
Handling Complex Repositories using CodeLLMs
Real-world software repositories are large and complex, involving many files, modules, dependencies, tests, and bugs, making it impossible to fit all relevant context into a single prompt. This topic examines how to enable CodeLLMs to work effectively with complex repositories, with a particular focus on coding agents. We will compare approaches like RAG (retrieval-augmented generation) and agent-based workflows, and concentrate on how to design, implement, and evaluate coding agents operating over realistic codebases.
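To ground the RAG side of this comparison, the sketch below ranks repository files against a task description and packs the top hits into a prompt. Bag-of-words cosine similarity stands in for the learned code embeddings a production retriever would use, and the toy repository contents are invented for illustration.

```python
# Schematic RAG step over a repository: retrieve the most relevant files
# for a task and assemble them into the model's prompt context.
import math
import re
from collections import Counter
from typing import Dict, List

def bow(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[A-Za-z_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(files: Dict[str, str], query: str, k: int = 2) -> List[str]:
    """Return the paths of the k files most similar to the query."""
    q = bow(query)
    ranked = sorted(files, key=lambda p: cosine(bow(files[p]), q), reverse=True)
    return ranked[:k]

repo = {
    "auth/login.py": "def login(user, password): ...",
    "billing/invoice.py": "def create_invoice(order): ...",
    "auth/tokens.py": "def refresh_token(session): ...",
}
task = "Fix login so that the expired session is refreshed"
context = "\n\n".join(f"# {p}\n{repo[p]}" for p in retrieve(repo, task))
print(f"{context}\n\n# Task: {task}\n")
```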
Limitations of CodeLLMs
Despite their strong performance on many benchmarks, CodeLLMs still have significant limitations. They can generate insecure or privacy-violating code, introduce licensing and copyright risks, hallucinate APIs or behaviors, exhibit overconfidence, and take unsafe or unreasonable actions when given tool access. In this topic, we will analyze these limitations in depth and discuss open challenges in making CodeLLMs more reliable and trustworthy.
Improving AI4SE (secure code generation, code efficiency, reasoning ability)
Building on the identified limitations, this topic focuses on targeted techniques to improve CodeLLMs for software engineering. We will discuss methods for encouraging models to generate more secure and robust code, techniques for producing more efficient implementations, and approaches to strengthening reasoning in specific domains. We will also cover the design of safety guardrails, as well as runtime verification and monitoring frameworks for coding agents deployed in real development environments.
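As a toy example of a runtime guardrail, the sketch below screens a coding agent's shell commands against a small deny list before execution. The patterns are illustrative only; real guardrails combine policy checks, sandboxing, monitoring, and human review.

```python
# Toy runtime guardrail for a coding agent's shell tool: block obviously
# destructive commands before they run. The deny list is illustrative only.
import re
from typing import Optional

DENYLIST = [
    r"\brm\s+-rf\s+/",          # wiping the filesystem root
    r"\bgit\s+push\s+--force",  # force-pushing over shared history
    r"\bcurl\b.*\|\s*sh\b",     # piping remote scripts into a shell
]

def check_command(cmd: str) -> Optional[str]:
    """Return a refusal reason if the command matches a deny pattern, else None."""
    for pattern in DENYLIST:
        if re.search(pattern, cmd):
            return f"blocked: matches deny pattern {pattern!r}"
    return None

for cmd in ["pytest -q", "rm -rf / --no-preserve-root"]:
    verdict = check_command(cmd)
    print(cmd, "->", verdict or "allowed")
```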
Human Factors in Intelligent Coding Tools
Software development remains fundamentally human-in-the-loop, with developers and AI tools continuously interacting. This topic investigates how developers perceive and use intelligent coding tools, how much they trust AI-generated code, and what kinds of UX and interaction patterns make AI coding assistants genuinely helpful rather than disruptive (e.g., when and where to show completions). We will also compare usage patterns between novice and senior developers, and discuss implications for tool and workflow design.
Industrial Adoption and Opinions of AI4SE
This topic examines how AI4SE technologies are being adopted in industry and what practitioners think of them in practice. We will cover typical usage scenarios—such as IDE assistants, test and code review support, migration and refactoring tools, and automated bug fixing—and analyze reported gains in productivity alongside concerns about security, compliance, and cost-effectiveness. We will further discuss organizational and process changes, including new roles (e.g., prompt engineer, AI tooling owner, AI safety officer), and explore the vision of “AI teammates” participating directly in sprint planning and task assignment.
Impact of AI on SE, Education, OSS, and Society
Here we look at the broader impact of AI on software engineering, education, open-source ecosystems, and society at large. Topics include shifts in required skills and job roles for software engineers, changes in teaching and assessment practices in the “AI era,” and questions about how to fairly support and govern open-source projects whose code is heavily used for model training. We will also consider societal and policy questions around employment, privacy, intellectual property, and responsibility when AI-generated code leads to bugs, vulnerabilities, or failures.
Contact
Email: [email protected]
Office: 7-235, UCommon, University of Alberta