Have you ever wanted to reduce the disk space used by an Azure SQL database? You will quickly find that the general advice on the Internet is "Don't do it!". But how bad is it really? Read on for a report on how I reclaimed 100s of GB and cut 2500+ USD from our monthly Azure SQL cost with zero downsides!

Intro

For those not familiar with Azure SQL, it's a fully managed "cloud" version of the classic Microsoft SQL Server. There are several different service tiers and options, meaning it's not obvious which model is best. At work, we use an elastic pool, which means several databases share a single pool of resources. The pool itself uses the classic DTU model, where you pay for a number of DTUs that represent compute, storage and IO. Basically, given X DTUs, you get Y disk space. In our case, we use 1500 DTUs, which gives us ~1.5TB of disk.

But what if you need more disk? Well, then you just drag the slider in the UI to the next size, which is 2000 DTUs and ~2TB of disk. There's just one tiny little problem, and that is, surprise surprise, the cost. Going from 1500 DTUs to 2000 increases the monthly cost by ~2700 USD, or ~33% (on the Premium tier).

Often, the data stored in the various databases changes over time: what was once critical data is no longer needed, and after usage has grown it's no longer a good idea to keep all historical data in the main database, and so on. The same thing happened at my work at MAJORITY.

The Problem

We got close to the 1.5TB allocated disk limit. The quickest fix would be to just scale up to the next disk size. This is a great advantage of cloud hosting, especially when you're a growing startup where everyone has a ton of important stuff to work on! But as noted above, it does not come for free.

We will look at 3 databases: Reporting, Risk and User. These are the DBs in the elastic pool with the most free/unused space. Their table designs and usage are quite different, so any (negative) effects from shrinking should show up in at least one of them. The Reporting DB only contains a few tables and has data inserted into them continuously, but the big read queries happen only once per day. The User DB has lots of tables with many different types of queries. And finally the Risk DB, which is somewhat in the middle: it has only a few tables and mainly does key lookups/inserts/updates. Another important consideration is that Reporting is not as critical as the others, so some slower queries won't matter much.

Before the shrink, the used space, allocated space, and allocated-but-unused space looked like this:

DB         Used (GB)  Allocated (GB)  Unused (GB)
Reporting         24             238          214
Risk              44             116           72
User             180             353          173
Total            248             707          459

In total, 459GB, or 65% of the allocated disk space, was not actually used!
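
The numbers above come from our monitoring, but you can pull the same per-database picture straight from T-SQL. A minimal sketch (run inside each database; this is the kind of query, not the exact one I used):

  -- Allocated vs. used space for the data file(s) of the current database.
  -- Both size and FILEPROPERTY(..., 'SpaceUsed') are counted in 8KB pages.
  SELECT
      name AS file_name,
      CAST(size * 8.0 / 1024 / 1024 AS decimal(10,1)) AS allocated_gb,
      CAST(FILEPROPERTY(name, 'SpaceUsed') * 8.0 / 1024 / 1024 AS decimal(10,1)) AS used_gb,
      CAST((size - FILEPROPERTY(name, 'SpaceUsed')) * 8.0 / 1024 / 1024 AS decimal(10,1)) AS unused_gb
  FROM sys.database_files
  WHERE type_desc = 'ROWS';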

AI to the rescue!

As a modern developer, the first thing to reach for is AI support! Or... actually, that's a lie. I didn't ask the AI first; I only asked now, while writing this article, because I was curious what it would answer.

I have a big SQL Elastic cluster in azure. It has 1.4TB of allocated disk with a limit of 1.5TB. But it only uses 900GB. How can I get the allocated unused space back?

I sent the question to GPT-4o through GitHub. And the answer?

Unfortunately, there isn't a supported "shrink" operation that will return that unused allocated space to the system.

As anyone who has asked ChatGPT/Copilot for anything knows, this was of course not the full response, but it was the key part. I don't know why ChatGPT is so useless when it comes to DB questions; this is not the first time.

AUTO_SHRINK

There are several ways to shrink a database. For this experiment I tried DBCC SHRINKFILE first, but quickly switched to AUTO_SHRINK instead. DBCC SHRINKFILE did its job decently, but it was very resource-hungry while running. Like many other B2E systems, these DBs have periods of higher load and periods of lower load, but never no load, so you need to be somewhat careful with these kinds of maintenance operations. AUTO_SHRINK can be enabled like this:


  ALTER DATABASE {name of db} SET AUTO_SHRINK ON;
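
For reference, the DBCC SHRINKFILE route I tried first looks roughly like this; the file name and target size below are placeholders, not our actual values:

  -- Find the logical name(s) of the data file(s) in the current database.
  SELECT file_id, name, type_desc FROM sys.database_files;

  -- Shrink a data file towards a target size given in MB (here ~250GB).
  -- This runs in the foreground and, as noted above, is quite resource-hungry.
  DBCC SHRINKFILE (N'data_0', 256000);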

Guess what happened next!

All 3 databases behaved similarly, so it's enough to look at one of them. I picked the User DB since it's the one with the most load and also the biggest. First, let's take a look at the general database stats (disk, DTU, CPU and log usage) in this graph from Datadog:

Database stats (disk, DTU, CPU/Log/Mem) of the User DB. (Graph screenshot from Datadog)

In the above graph we can see when AUTO_SHRINK completed: that's when the big drop in allocated disk space happened. Looking at the DTU usage, the squiggly line, there is an increase. However, it is not related to the shrinking; it's the database's normal pattern. The "peak" hours start at around 15:00 and happen at the same time every day regardless of shrinking. Normally you should not run DB maintenance during peak traffic, but in this case I knew from the other two databases that the effect would be very low, and I was curious how the shrink behaved when the DB was under more pressure.
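
If you don't have a dashboard handy, you can also ask the database itself whether a shrink is still in flight. A small sketch: a manual DBCC SHRINKFILE typically shows up as a 'DbccFilesCompact' request, though I have not verified exactly how the AUTO_SHRINK background task reports itself:

  -- Look for running shrink work and its progress.
  SELECT session_id, command, percent_complete, start_time, status
  FROM sys.dm_exec_requests
  WHERE command LIKE N'Dbcc%';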

Next, let's look at the actual queries: did their performance change? To check this I measured the median and 99th percentile of all queries run against the DB:

Query statistics (median and 99th percentile) compared against 1 day earlier. (Graph screenshot from Kibana)

Finally! Here we see something! Let me first explain what this graph measures. Every time a service queries the database, it logs the time taken from the service's point of view, with a resolution of 1ms. The graph shows the median and 99th percentile, together with the same metrics for the previous day. There is a clear increase in the 99th percentile: it more or less tripled, which in one sense is a significant increase! However, in absolute numbers it's "only" 10-15ms of extra query time, and only at the 99th percentile. In our case this won't be noticed by anyone; it will be drowned out by other inefficiencies. For example, if an API call results in 10 DB calls of 1ms each and 1 external call of 200ms, and one of the DB calls becomes 10ms slower, that's only a ~5% increase in that API call's total time.

But... what about the fragmentation everyone warns about?!

I didn't check very carefully. My high-level view is that if the fragmentation does not affect observable performance, then it doesn't matter. These databases mostly use random GUIDs as primary keys / clustering keys, which leads to a natural state of fairly heavy fragmentation (with the upside that there's no need for regular REORGANIZE/REBUILD maintenance).
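
If you do want numbers rather than a gut feeling, the usual DMV works on Azure SQL as well. A sketch of the kind of check you could run (none of the conclusions here depend on it):

  -- Fragmentation per index, skipping tiny indexes where it doesn't matter.
  SELECT
      OBJECT_NAME(ips.object_id) AS table_name,
      i.name AS index_name,
      ips.avg_fragmentation_in_percent,
      ips.page_count
  FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
  JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
  WHERE ips.page_count > 1000
  ORDER BY ips.avg_fragmentation_in_percent DESC;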

I have previously run experiments on our most heavily used DB by running REBUILD on all its big tables and indexes, but at the time I couldn't detect any significant performance improvement as seen from the backend services. The same happened this time: any additional fragmentation caused by the shrink did not have any visible effect on query times, even over a longer time period (e.g. several days). Maybe that observation would be a good topic for another post?
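
For completeness, a rebuild like that boils down to something like this per table (the table name is a placeholder; ONLINE = ON keeps the table available during the rebuild and is supported on the Premium tier):

  -- Rebuild all indexes on one table without taking long blocking locks.
  ALTER INDEX ALL ON dbo.SomeBigTable REBUILD WITH (ONLINE = ON);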

Savings

OK, we have concluded that the impact on DB performance was minimal while the shrink operation ran. But did it give any disk savings?

This is the resulting space usage of the DBs after the shrink, and the savings compared to before:

DB         Used (GB)  Allocated (GB)  Unused (GB)  Saved (GB)
Reporting         24              31            7         207
Risk              44              58           14          58
User             180             240           60         113
Total            248             329           81         378

The shrink reduced the unused space from 459GB to 81GB, in total a reduction of 378GB!

Ending

With this massive reduction in disk usage we could scale down (or rather avoid scaling up to the next tier), and thereby save ~2700 USD per month in DB hosting costs.

I find it a bit strange that this useful option isn't more widely known. As shown above, on our fairly standard DBs I could only see a minor degradation of the 99th percentile while the shrink itself was running, and otherwise only the upside of 100s of GB reclaimed for very little effort.

I would still recommend keeping AUTO_SHRINK off by default. This prevents it from running at a critical moment, and from ending up in a shrink/expand/shrink cycle, for example if a table is filled each day and then emptied during the night. However, I won't hesitate to enable it the next time there are 100s of GB that could be reclaimed.
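
For completeness, switching it back off again uses the same command as before, and sys.databases shows which databases currently have it enabled:

  -- Turn AUTO_SHRINK back off once the space has been reclaimed.
  ALTER DATABASE {name of db} SET AUTO_SHRINK OFF;

  -- Check the current setting per database.
  SELECT name, is_auto_shrink_on FROM sys.databases;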

Victor 2025-03-01