The other day we had an AI hackathon day in the office, basically a do-whatever day as long as it was related to AI. Fun! Part of my work as system architect at MAJORITY is to do code reviews, but it's impossible to review everything. Wouldn't it be great if an AI could help out? Let's find out! (PS. If you don't care about how to set it up, feel free to skip ahead to PR Review Feedback.)

Setup

We use Azure DevOps both to host our code and to run CI/CD pipelines. We already have an automated PR review pipeline (which runs tests, checks code coverage and does automated code cleanup with JetBrains tooling). The part I'm mainly involved in is our backend, which consists of C# microservices.

1 Setup Azure AI!

At the time I did not have an API key for OpenAI GPT-4 ready. But luckily there's a GPT-4 model available in Azure as well, under Azure OpenAI. As of writing you have to fill out a form to get access, but we had already done that earlier, so the process was just a couple of clicks in the UI.

With the OpenAI resource in place, it's super easy to create a "deployment" of a base model, which is what we need to connect to. The model management is handled in Azure OpenAI Studio and not the portal, but it's still just a couple of clicks. I had several different models to choose from, and to start with I used the gpt-4 model.
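For the curious, a deployment is basically just an HTTPS endpoint. As a rough sketch of what calling it looks like from C# (the endpoint, deployment name and key below are placeholders, and this is only to show the shape of the API, not code from the extension we will use later):

using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

public static async Task<string> ReviewWithGpt4(string diff)
{
    // Placeholder values - use your own Azure OpenAI resource endpoint, deployment name and key.
    var endpoint   = "https://my-openai-resource.openai.azure.com";
    var deployment = "gpt-4";
    var apiKey     = "<aoi-api-key>";

    using var http = new HttpClient();
    http.DefaultRequestHeaders.Add("api-key", apiKey);

    var payload = new
    {
        messages = new object[]
        {
            new { role = "system", content = "You are a code reviewer. Review the following diff." },
            new { role = "user", content = diff }
        }
    };

    // Chat completions endpoint for the deployment.
    var url = $"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version=2023-05-15";
    var response = await http.PostAsJsonAsync(url, payload);
    response.EnsureSuccessStatusCode();

    var completion = await response.Content.ReadFromJsonAsync<JsonElement>();
    return completion.GetProperty("choices")[0].GetProperty("message").GetProperty("content").GetString() ?? string.Empty;
}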

2 DevOps extension

To send the relevant code in the PR for review by our new GPT-4 deployment we need to add an extension in DevOps. In the Visual Studio Marketplace there are many to choose from, but all of them are very fresh and mostly seem to be forks/clones of each other...

My simple requirements were that it could connect to the Azure OpenAI deployment created above, and that it looked reasonably trustworthy. After looking through the Git history of a bunch of them I settled on AI Pull Request Reviewer by Blogic.
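As far as I can tell they all follow roughly the same recipe: collect the diff of the PR, send it to the model (much like the sketch above) and post the answer back as a comment thread on the PR through the Azure DevOps REST API. A minimal sketch of that last step (my guess at how these extensions do it, not code taken from AI Pull Request Reviewer) could look like this:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Threading.Tasks;

public static async Task PostReviewComment(string review)
{
    // Predefined pipeline variables; System.AccessToken has to be made available to the job.
    var orgUrl  = Environment.GetEnvironmentVariable("SYSTEM_TEAMFOUNDATIONCOLLECTIONURI");
    var project = Environment.GetEnvironmentVariable("SYSTEM_TEAMPROJECT");
    var repoId  = Environment.GetEnvironmentVariable("BUILD_REPOSITORY_ID");
    var prId    = Environment.GetEnvironmentVariable("SYSTEM_PULLREQUEST_PULLREQUESTID");
    var token   = Environment.GetEnvironmentVariable("SYSTEM_ACCESSTOKEN");

    using var http = new HttpClient();
    http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

    // Create a new comment thread on the pull request.
    var thread = new
    {
        comments = new[] { new { parentCommentId = 0, content = review, commentType = 1 } },
        status = 1 // active
    };

    var url = $"{orgUrl}{project}/_apis/git/repositories/{repoId}/pullRequests/{prId}/threads?api-version=7.0";
    var response = await http.PostAsJsonAsync(url, thread);
    response.EnsureSuccessStatusCode();
}

Posting with the job's own access token is also why the review comments further down show up as authored by the Project Collection Build Service.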

The easy part done!

3 A PR pipeline

The final step of the setup is to make a PR pipeline and trigger it on PRs (for Azure Repos the PR trigger itself is configured as a build validation branch policy on the target branch rather than in the YAML). In my case I had to be a bit careful since I didn't want to roll it out on all our code. To avoid that I forked one of our repos with a real microservice and added a .azuredevops/pr-pipeline.yml file to its master branch with this content:


jobs:
  - job: aitest
    displayName: AI Testing
    timeoutInMinutes: 5
    pool:
      name: dev-buildpipeline-agents
    steps:
      - checkout: self
        persistCredentials: true
      - task: AIPullRequestReview@1
        inputs:
          api_key: "$(aoi-api-key)"
          model: "$(aoi-model)"
          aoi_endpoint: "$(aoi-endpoint)"
        

In the pipeline I also had to set up the 3 variables (btw, the UI for adding a variable cannot have been tested by real users; I've seen others and myself forget to save so many times).

Done - let's try some code!

PR Review Feedback

To easily test this I made a new PR on my forked repo, which added the following code to a service class in one of our C# microservices.


public async Task TestAiPRReview(){
    Console.WriteLine("hellow?");
    Console.WriteLine($"Pi with 5 decimals: {CalcualtePi(5)}");
}
public double CalculatePi(int decimals){
    if (decimals == 0)
        return 3;
    else
        return 3.14;
}

In my mind this would be easy for it to review: it's a more or less useless method to calculate π, with no need for any context, no tricky concurrency issues, no unreliable 3rd parties etc. On top of that, my implementation leaves a lot of room for improvement.

Review #1

Less than a minute later I had my first PR comment:

Project Collection Build Service (MAJORITY)

  1. Code Consistency: 5 🚩
    • Misspelling in CalcualtePi (should be CalculatePi).
    • Inconsistent capitalization in Console.WriteLine("hellow?"); ('hellow' should be 'Hello').

Improvement:

Console.WriteLine("Hello?");
Console.WriteLine($"Pi with 5 decimals: {CalculatePi(5)}");
  2. Security: 1

    • Logging 'Pi with 5 decimals' may expose internals, consider if necessary.
  3. Readability: 5 🚩

    • Typo in method name affects readability and function invocation.
  4. Best Practices: 2

    • Method `CalculatePi` is too simplistic and does not accurately calculate pi.
    • Missing named constants for pi values.
Improvement:
private const double PI_APPROXIMATION = 3.14;
// ...
public double CalculatePi(int decimals){
    if (decimals == 0)
        return 3;
    else
        return PI_APPROXIMATION;
}

Additional:

Ok, let's break it down in a review of the code review:

Project Collection Build Service (MAJORITY)

  1. Code Consistency: 5 🚩
    • Misspelling in CalcualtePi (should be CalculatePi).
    • Inconsistent capitalization in Console.WriteLine("hellow?"); ('hellow' should be 'Hello').

Improvement:

Console.WriteLine("Hello?");
Console.WriteLine($"Pi with 5 decimals: {CalculatePi(5)}");

Very reasonable complaints; I didn't even add the spelling mistakes on purpose! I will apply these fixes.

  2. Security: 1
    • Logging 'Pi with 5 decimals' may expose internals, consider if necessary.

This is nonsense! I will just ignore this part.

  3. Readability: 5 🚩

Ok, I already accepted the mistake and will fix.

  4. Best Practices: 2
Improvement:
private const double PI_APPROXIMATION = 3.14;
// ...

public double CalculatePi(int decimals){
    if (decimals == 0)
        return 3;
    else
        return PI_APPROXIMATION;
}

Very true. But I'm not sure of this improvement. Let's skip for now.

Additional:

Agreed. I only added this to make debugging easier (theoretically, that is; I never even compiled my changes).

Right, a repeat of the Best Practices point. But I still ignore it for now.

Review #2

Updated code:

public async Task TestAiPRReview(){
    Console.WriteLine("Hello?");
    Console.WriteLine($"Pi with 5 decimals: {CalculatePi(5)}");
}

public double CalculatePi(int decimals){
    if (decimals == 0)
        return 3;
    else
        return 3.14;
}

And not surprisingly, it still had some comments. This time I will break it down directly:

  1. Code Consistency: 1 - The new method CalculatePi should use braces consistently as per the surrounding code.

Yeah maybe... but I will ignore this.

  2. Readability: 1 - Descriptive names should be used instead of TestAiPRReview. This name doesn't indicate the method's purpose.

I agree, naming is important. But also difficult, and I think the name is ok, so I will ignore this as well.

  3. Best Practices: 2 - Avoid magic numbers (3 and 3.14); use named constants instead.
    private const double PiToZeroDecimals = 3.0;
    private const double PiToOneDecimal = 3.1; // Assume you meant to round to one decimal place
    
    public double CalculatePi(int decimals){
        if (decimals == 0)
            return PiToZeroDecimals;
        else
            return PiToOneDecimal;
    }

I have no idea why it thinks it's better to round off to 3.1 instead of my previous approximation of 3.14. But ok, I guess I should follow its advice; it is after all an AI trained on more or less all human knowledge!

Review #3

By now I think you can guess how it goes. But surprise, surprise, this time it has a great insight. Let's not get ahead of ourselves, though. This is the updated code:

public async Task TestAiPRReview(){
    Console.WriteLine("Hello?");
    Console.WriteLine($"Pi with 5 decimals: {CalculatePi(5)}");
}

private const double PiToZeroDecimals = 3.0;
private const double PiToOneDecimal = 3.1; // Assume you meant to round to one decimal place

public double CalculatePi(int decimals){
    if (decimals == 0)
        return PiToZeroDecimals;
    else
        return PiToOneDecimal;
}

And just as before, after a short wait, it produced the following review (with my inline comments):

Best Practices: 2

I agree!

Readability: 3

Also agree. Magic numbers should be avoided, especially when they are used for no good reason.

Best Practices Example:

private const double BasePiValue = 3.0; // or use a math library to get Pi

public double CalculatePi(int decimals) {
    return Math.Round(Math.PI, decimals);
}

Nice! Math.Round and Math.PI are much better than the previous hard-coded values! I'm not really sure what the BasePiValue constant is useful for, but since Math.PI seems so smart I will trust it on the constant as well.
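(To be fair, the suggestion is spot on; rounding the built-in constant gives exactly the values my fake implementation pretended to calculate:)

Console.WriteLine(Math.Round(Math.PI, 5)); // 3.14159
Console.WriteLine(Math.Round(Math.PI, 0)); // 3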

Review #4, #5, #6, #7, #8 and #9

The same continued for the next rounds of feedback. But to save everyone's time I won't go into details and will instead skip directly to feedback round #9. Have you ever encountered such a persistent (human) code reviewer before? Even after 8 rounds of fixes it still came back with suggestions!

This is what the methods looked like after my last commit.

private const int PiPrecision = 5;

#if DEBUG
/// <summary>
/// Debugging method that writes pi to the log to help test PR Code Review flow.
/// </summary>
public void TestCalculatePi(){
    _logger.LogInformation("Hello?");
    _logger.LogInformation($"Pi with {PiPrecision} decimals: {CalculatePi(PiPrecision)}");
}
#endif

private const double MathPi = Math.PI; // Or a remark why a specific base value is needed

/// <summary>
/// Return Pi calculated to the specified number of decimals given as argument.
/// </summary>
/// <param name="decimals"></param>
/// <returns></returns>
public double CalculatePi(int decimals) {
    try {
        return Math.Round(Math.PI, decimals);
    }
    catch (Exception e) {
        _logger.LogError("An error occurred while calculating Pi: {Exception}", e);
        return MathPi;
    }
}

Now it actually started to run out of things to comment on, with no Security or Readability issues (yay!):

  1. Code Consistency: 2 - The comment // Or a remark why a specific base value is needed is against minimal comments rule unless necessary for understanding.

I agree... It's just that I was advised by a smart AI to add it there in the first place...

  2. Security: NF
  3. Readability: NF
  4. Error Handling: 3 - Using a fallback value directly returning MathPi may conceal the issue rather than properly handling it. Alternative: throw a custom exception.

Oh yes, let the caller deal with the mess.

  5. Best Practices: 4 - Use of preprocessor directive #if DEBUG is questionable. This kind of code should generally not reach production.

This was my own idea, I have to admit. Somewhere in the middle, maybe review round 5 or 6, it remarked that TestCalculatePi looked like it was only for debugging and shouldn't be put in production, so I added the #if DEBUG there... Sorry!

Evaluation - is it useful?

When I showed this to my co-workers I had this conversation in Slack:

Colleague:
> We are not gonna use it, are we? 😂
Me:
> Only on JDs code 😉

(JD is our very AI-passionate CTO. Did I mention that we already have another AI hackathon planned and are on our way to AI Hackathon #3?)

I would say there's quite a bit left before it can match the more traditional analyzers that are available; at least those give sensible suggestions most of the time. To be fair, I suspect a "professional" AI code review tool would perform better than something "only" using a custom prompt with ChatGPT. But that is for another time.

In my experience the most valuable feedback in a code review session is not the easy details, but the harder issues. A couple of typical questions from the last few code reviews I've done:

  1. Is this code safe to run on two threads/instances/pods/servers at the same time?
  2. Should this logic be in class A or B?
  3. Why didn't this new code use our existing logic?
  4. Is it backwards compatible? If not, is that ok?

Maybe a future AI can help with these issues as well, but it's not there yet, at least. And that's without even mentioning the softer benefits of code reviews, like shared ownership and knowledge sharing, which are arguably even more important.


Victor 2024-02-24