A tricky query to optimize is partial match of phone numbers, this post will walk through some options.

Intro

We hit 100% DTU on user DB again yesterday, so we need to come up with a new solution for search ... We are considering both short term fixes ... But also more future-proof long term solutions like Elasticsearch, Azure Cognitive Search...

As is often common, at MAJORITY the slowest query is a generic search function in the admin tool that allows support agents to search for users. The above quote is from a recent spike in DTU usage were the engineer investigating identified this search function as the issue.

I have to admit, I'm a sucker for DB performance... And as shown by my post on Redis, I'm a fan of boring tech and using what you already have in your stack. So here's a deep dive into a pure SQL solution!

Data Driven

As always, the first thing to do is to look at the data, and, somewhat surprising this generic search function consists of 99% searches for mobile phone numbers. From now on I will refer to these as MSISDN as that is what we have and how we call them internally. That means the phone number is normalized to digits only including leading country code. Often these searches are partial, e.g. a support agent could search for 574400250 which would match the full 15744002507. To not write a whole book about text searching, lets focus on the specific case of partial matching of MSISDN only.

When searching for MSISDN the search query (~100 lines of T-SQL in a stored procedure) simplifies to a query like this:


select top 50 * 
  from [Users] 
  where Msisdn like '%ddd%' 
  order by Created desc;

where the ddd part is a variable number of digits entered by the user performing the search. This is a notoriously tricky query to optimize in most SQL databases since you cannot just add a common B-Tree index to fully solve it. See for example the classic guide on indexing, Use The Index, Luke, which warns Avoid LIKE expressions with leading wildcards (e.g., '%TERM')..

Establish a baseline

Before we try any ideas on how to solve this within SQL Server, let's start by establishing a baseline in a local DB to have something to benchmark against.

The table in question is something like this


  create table U (
	UserId uniqueidentifier primary key not null,
	Created datetime not null,
	Updated datetime not null,
	Msisdn nvarchar(20) not null,
	... 10 more columns
)

If you look carefully something already stand out as very non-optimal here, can you see what it is?

Anyway, regardless of obvious issues, lets focus and create a baseline first!

To benchmark, I copied over production like data (1.7m rows) into a local instance of SQL Server running on my laptop (a Surface Pro), and then ran the query above using SqlQueryStress. To make realistic load I used its parameter substitution function to select out some random MSISDNs, then cut a bit from the beginning/end and return:


  select search from (
    select top 200, 
        cast(substring(
            Msisdn, 
            abs(checksum(UserId)) % 7, 
            4+abs(checksum(Created)) % 16
        ) as varchar(20)) as search
        from U 
        order by checksum(UserId)
) x 
where len(search)>3

Why such a complicated query you might wonder? I want to take rows in somewhat random order, and I want to cut the beginning and end randomly. In this case the exact randomness is not very important, so I use the CHECKSUM function. This way its random, but also stable between runs. This is what it looks like in SQLQueryStress' parameter setup:

And this is the main window of the program, with a sneak peek of one of the things we will try later

The first query we try is the baseline / naive version, which looks like this when its parameterized:


select top 50 * 
  from U 
  where Msisdn like'%' + @search + '%'
  order by Created desc

And the result?

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644

The test runs 200 iteration. Each query take in average ~300ms, but spends almost 10x that on CPU, and reads the full table (61644 which is roughly the page count). In total the test ran in around 1 minute. Given I run this on my laptop, which has other unrelated processes running, dynamic frequency scaling etc the exact numbers should not be used, but used as a rough estimate when comparing tests.

Easy things to try

Alright, with the baseline out of the way we come to the fun part and can finally try some optimizations! You might already have spotted this when I showed the table definition - the column that stores the MSISDN numbers are nvarchar(20)! But a normalized MSISDN phone number is actually always just digits 0-9 maximum 15 digits. There is absolutely no need to use a data type that supports full Unicode! If anything it increases the chance that something bad sneaks in.

Test 1: Set MSISDN column data type to varchar

For the test, I just added a new column, MsisdnVarchar varchar(20) and copied over the values from the old nvarchar column. Since it only contains digits this is easy:


alter table U add MsisdnVarchar varchar(20) not null default('');
update U set MsisdnVarchar = Msisdn;

To do this change in production is a bit more involved, one way is to add a new column, copy over the data, rename the old column to something else, rename the new column to the old name and finally remove the old now unused column. Preferably not during peak load. You might want to rebuild it in the end to clean up any fragmentation as well.

With this column in place, let's rerun the test with the new column as shown below. Note that SQLQueryStress will not respect the datatype but use nvarchar for the parameter, so the query needs to cast @search to varchar. If @search would be a nvarchar SQL Server would need to run a slow conversion, and we would not get the result we want:


select top 50 * 
  from U 
  where MsisdnVarchar like 
	  cast('%' + @search + '%' as varchar)
  order by Created desc    

And the result? Very nice:

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644
Varchar Column 00:18.75 0.9894 0.0884 61644

Result: 3.6x faster

We can see that just like the baseline query it needs to visit all the pages (a full table scan), but it uses much less CPU since it's much cheaper to handle non-Unicode data which is smaller and doesn't have the complex collation rules. Honestly, wearing my professional architect hat this result is good enough. No need to optimize further given the 1-2 million rows and the usage pattern (on the order of 100s-1000s of searches per hour).

Test 2: Set MSISDN column data type to bigint

But, luckily I take off that hat when I'm at home and can follow the rabbit deeper! If we got this good speedup from a varchar column, couldn't we store the MSISDN in a numeric type and gain even more? A MSISDN is as we saw above at most 15 digits. That fits well within a bigint which can store 18 digits (or 19 depending on how you treat negative numbers). It has a downside, it can't keep track of leading 0s. But luckily for us MSISDNs does not start with 0s, it will always start with a country code.

For testing, we can add a new column MsisdnInt bigint and copy over the values:


alter table U add MsisdnInt bigint not null default(0);
update U set MsisdnInt = cast(Msisdn as bigint);

In an old production table, there might be some cases of uncleaned data, if so that will need to be cleaned up first.

With this column in place, let's rerun the test with the new column as shown below


select top 50 * 
  from U 
  where MsisdnInt like 
	  cast('%' + @search + '%' as varchar)
  order by Created desc

And the result? A bit disappointing:

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644
Varchar Column 00:18.75 0.9894 0.0884 61644
Bigint Column 00:18.94 1.0060 0.0895 61644

Result: 3.6x faster

As I mentioned earlier, the testing setup is not perfectly consistent, but I reran the tests several times to confirm that the bigint column always performed slightly slower than the varchar column. My guess is this happens because the like filter requires the type to be a string type, so SQL Server will need to convert on the fly. However, this is very quick, so the end result is more or less the same as varchar.

Test3: Index!

My first idea when I heard about this problem was that we should try and add an index on the MSISDN column regardless if the like query can use it optimally or not. If nothing else it should be easier for SQL Server to look through an index instead of the full table. But, I was on vacation at the time and did not have the chance to try. Instead, let's do it now and see if there's any improvement.


create nonclustered index IX_U_Msisdn on U (Msisdn);
create nonclustered index IX_U_MsisdnVarchar on U (MsisdnVarchar);
create nonclustered index IX_U_MsisdnInt on U (MsisdnInt);

And this is the result after rerunning the queries after index creation:

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644
Varchar Column 00:18.75 0.9894 0.0884 61644
Bigint Column 00:18.94 1.0060 0.0895 61644
Baseline Indexed 05:58.25 1.7805 1.7865 11828
Varchar Indexed 02:17.17 0.6545 0.6810 9041
Bigint Indexed 00:21.27 1.1159 0.1009 61644

Result ???

Very interesting result! No query got faster in total time! However, most of them uses significantly less CPU. Let's take a look at the query plan of the Varchar Indexed query:

And compare it to the Bigint Column query:

Ah, as we can see the Bigint Column query has a Parallelism step, while the Varchar indexed query does not. This should explain why Bigint Column uses more CPU than actual time. We also see that Varchar indexed spends most on Key lookup on the table itself. If the table had significantly more rows, I would expect the indexed queries to come out ahead also in actual time and not just CPU time, but for now lets move on.

Complex things to try

With the easy low-hanging fruits out of the way, we can now turn to more complex solutions. Let's first take a look at trigrams.

Test 3: Trigram

There's a great post by Paul White here: Trigram Wildcard String Search in SQL Server that goes into details of how to do wildcard search with trigrams. Trigrams are a variant of n-gram that splits up both the search term and the text to search into 3 character substrings. These can then be indexed and as a result give a great speedup. I won't go into detail since Paul already covered it, but below is the various SQL I used:

GenerateTrigrams implementation CREATE FUNCTION dbo.GenerateTrigrams...

    CREATE FUNCTION dbo.GenerateTrigrams (@string varchar(255))
    RETURNS table
    WITH SCHEMABINDING
    AS RETURN
        WITH
            N16 AS 
            (
                SELECT V.v 
                FROM 
                (
                    VALUES 
                        (0),(0),(0),(0),(0),(0),(0),(0),
                        (0),(0),(0),(0),(0),(0),(0),(0)
                ) AS V (v)),
            -- Numbers table (256)
            Nums AS 
            (
                SELECT n = ROW_NUMBER() OVER (ORDER BY A.v)
                FROM N16 AS A 
                CROSS JOIN N16 AS B
            ),
            Trigrams AS
            (
                -- Every 3-character substring
                SELECT TOP (CASE WHEN LEN(@string) > 2 THEN LEN(@string) - 2 ELSE 0 END)
                    trigram = SUBSTRING(@string, N.n, 3)
                FROM Nums AS N
                ORDER BY N.n
            )
        -- Remove duplicates and ensure all three characters are alphanumeric
        SELECT DISTINCT 
            T.trigram
        FROM Trigrams AS T
        WHERE
            -- Binary collation comparison so ranges work as expected
            T.trigram COLLATE Latin1_General_BIN2 NOT LIKE '%[^A-Z0-9a-z]%';
Create a table to store the trigrams and fill itCREATE TABLE dbo.Trigram...

    -- Table to store the trigrams
    CREATE TABLE dbo.Trigram
    (
        UserId UNIQUEIDENTIFIER NOT NULL,
        Trigram char(3) NOT NULL
    );

    -- Generate trigrams
    INSERT INTO dbo.Trigram WITH (TABLOCKX)
        (UserId, Trigram)
    SELECT
        U.UserId,
        GT.trigram
    FROM U
    CROSS APPLY GenerateTrigrams(U.msisdnVarchar) AS GT;

    -- clustered "covering" index in trigram, userid
    CREATE UNIQUE CLUSTERED INDEX CUQ_Trigram_Trigram_UserId ON dbo.Trigram (Trigram, UserId) WITH (ONLINE = ON);
A materialized view to know the count of each trigram CREATE VIEW dbo.TrigramCounts...

    -- view of unique trigrams counted
    CREATE VIEW dbo.TrigramCounts
    WITH SCHEMABINDING
    AS
    SELECT ET.Trigram, cnt = COUNT_BIG(*)
    FROM dbo.Trigram AS ET
    GROUP BY ET.Trigram;
    -- Materialize the view
    CREATE UNIQUE CLUSTERED INDEX CUQ_TrigramCounts_Trigram ON TrigramCounts (Trigram);
A function to get the "best" trigrams from the view CREATE FUNCTION dbo.GetBestTrigrams...

    -- Most selective trigrams for a search string
    -- Always returns a row (NULLs if no trigrams found)
    CREATE FUNCTION dbo.GetBestTrigrams (@string varchar(255))
    RETURNS table
    WITH SCHEMABINDING AS
    RETURN
        SELECT
            -- Pivot
            trigram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.trigram END),
            trigram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.trigram END),
            trigram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.trigram END)
        FROM 
        (
            -- Generate trigrams for the search string
            -- and choose the most selective three
            SELECT TOP (3)
                rn = ROW_NUMBER() OVER (
                    ORDER BY ETC.cnt ASC),
                GT.trigram
            FROM dbo.GenerateTrigrams(@string) AS GT
            JOIN dbo.TrigramCounts AS ETC
                WITH (NOEXPAND)
                ON ETC.trigram = GT.trigram
            ORDER BY
                ETC.cnt ASC
        ) AS BT;
A function to get UserIds matching a set of Trigrams CREATE FUNCTION dbo.GetTrigramMatchIDs...

    -- Returns UserIds matching all provided (non-null) trigrams
    CREATE FUNCTION dbo.GetTrigramMatchIDs
    (
        @Trigram1 char(3),
        @Trigram2 char(3),
        @Trigram3 char(3)
    )
    RETURNS @IDs table (UserId UNIQUEIDENTIFIER PRIMARY KEY)
    WITH SCHEMABINDING AS
    BEGIN
        IF  @Trigram1 IS NOT NULL
        BEGIN
            IF @Trigram2 IS NOT NULL
            BEGIN
                IF @Trigram3 IS NOT NULL
                BEGIN
                    -- 3 trigrams available
                    INSERT @IDs (UserId)
                    SELECT ET1.UserId
                    FROM dbo.Trigram AS ET1 
                    WHERE ET1.trigram = @Trigram1
                    INTERSECT
                    SELECT ET2.UserId
                    FROM dbo.Trigram AS ET2
                    WHERE ET2.trigram = @Trigram2
                    INTERSECT
                    SELECT ET3.UserId
                    FROM dbo.Trigram AS ET3
                    WHERE ET3.trigram = @Trigram3
                    OPTION (MERGE JOIN);
                END;
                ELSE
                BEGIN
                    -- 2 trigrams available
                    INSERT @IDs (UserId)
                    SELECT ET1.UserId
                    FROM dbo.Trigram AS ET1 
                    WHERE ET1.trigram = @Trigram1
                    INTERSECT
                    SELECT ET2.UserId
                    FROM dbo.Trigram AS ET2
                    WHERE ET2.trigram = @Trigram2
                    OPTION (MERGE JOIN);
                END;
            END;
            ELSE
            BEGIN
                -- 1 trigram available
                INSERT @IDs (UserId)
                SELECT ET1.UserId
                FROM dbo.Trigram AS ET1 
                WHERE ET1.trigram = @Trigram1;
            END;
        END;
    
        RETURN;
    END;
And a function to do the full search CREATE FUNCTION dbo.TrigramSearch...

    
    CREATE OR ALTER FUNCTION dbo.TrigramSearch
    (
        @Search varchar(255)
    )
    RETURNS table
    WITH SCHEMABINDING
    AS
    RETURN
        SELECT
            Result.UserId
        FROM dbo.GetBestTrigrams(@Search) AS GBT
        CROSS APPLY
        (
            -- Trigram search
            SELECT
                U.UserId,
                U.msisdn
            FROM dbo.GetTrigramMatchIDs
                (GBT.trigram1, GBT.trigram2, GBT.trigram3) AS MID
            JOIN dbo.U
                ON U.UserId = MID.UserId
            WHERE
                -- At least one trigram found 
                GBT.trigram1 IS NOT NULL
                AND U.msisdnVarchar LIKE @Search
    
            UNION ALL
    
            -- Non-trigram search
            SELECT
                U.UserId,
                U.msisdn
            FROM dbo.U
            WHERE
                -- No trigram found 
                GBT.trigram1 IS NULL
                AND U.msisdnVarchar LIKE @Search
        ) AS Result;

Phew, that was quite a bit of logic! Lets take a look at the query we need to use. We need to join in the U table to get the full row out. Sometimes this query does not use a loop to resovle the join, so I add a loop hint as well. As long as not too many rows are found its better to first get matching userIds, and then join in the full table:


select top 50 U.* 
  from TrigramSearch('%' + @search + '%') as TS 
  inner loop join U on U.UserId = TS.UserId
  order by U.created desc;	

With all the setup done, and a query to test it, lets look at the result. I have run the test both without any additional index and with the index on the varchar MSISDN column:

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644
Varchar Column 00:18.75 0.9894 0.0884 61644
Bigint Column 00:18.94 1.0060 0.0895 61644
Baseline Indexed 05:58.25 1.7805 1.7865 11828
Varchar Indexed 02:17.17 0.6545 0.6810 9041
Bigint Indexed 00:21.27 1.1159 0.1009 61644
Trigram 00:04.50 0.0119 0.0176 1623
Trigram indexed 00:03.57 0.0103 0.0135 1096

Result unindexed: 18x faster
Result indexed: 24x faster

Nice! Here we have a real speedup! We now reach performance levels were we dont need to be worried the query will overload the database, or levels that could allow the query to be exposed to a more heavily used endpoint than admin search.

Test 4: "Intgram"

Is it possible to do something even faster? Using a bigint column for the MSISDN didnt help. But maybe something similar applied to the Trigrams could? What if instead of using 3 character trigrams we used integers for an "intgram"? The n in "n-gram" could then become 16 bit integer.

I chose 16bit integer (smallint) as the n-gram size based on:

  1. When searching, there's no point to search for only 2 or 3 digits. Given a uniform distribution of numbers, 3 digits means we would find 1000 entries in 1 million rows, or 10000 entries in 10 million rows. 4 or 5 digits seems more reasonable as a minimum to search for. Note also that MSISDNs are at least 7 digits long, as detailed in this article.
  2. Available types to store integers in SQL Server are 8bit (0 to 255), 16bit ( -32,768 to 32,767) and 32bit (-2,147,483,648 to 2,147,483,647). Details here
  3. The integer type must be able to store all the digits, otherwise the exact matching wont work. 8 bit integers with its 2.5 digits are not enough. A 16bit number can do 4.5 digits, and 4 digits should be enough. 32bit has more digits than we need.

Given the above, lets use a 16bit integer instead of char(3) for the trigrams intgrams. The code for this is more or less the same as for the trigrams, except the data type is a smallint instead of a char(3). Only the generate trigram intgram function have significant changes, the rest is mainly casts. Note also that there's some optimization left, it could likely be faster to get the intgram mathematically than by substring which is how its implemented now.

GenerateSmallintgrams implementation CREATE FUNCTION dbo.GenerateSmallintgrams...

    CREATE OR ALTER FUNCTION dbo.GenerateSmallintgrams (@m bigint)
    RETURNS table
    WITH SCHEMABINDING
    AS RETURN
    WITH n AS (
        SELECT n from (VALUES 
                    (1),(2),(3),(4),(5),(6),(7),(8),(9),
                    (10),(11),(12),(13),(14),(15),(16),(17)
                ) as v (n)
    ), g as (
        SELECT TOP (CASE WHEN @m > 999 THEN cast(log10(@m)+1 as int) - 3 ELSE 0 END)
        cast(SUBSTRING(cast(@m as varchar), n.n, 4) as smallint) as intgram
        FROM n 
        ORDER BY n.n
    ) 
    SELECT 
        DISTINCT g.intgram 
        FROM g
Create a table to store the intgrams and fill itCREATE TABLE dbo.SmallIntgram...

    -- Table to store the trigrams
    CREATE TABLE SmallIntgram
    (
        UserId uniqueidentifier NOT NULL,
        Intgram smallint NOT NULL
    );

    -- Generate trigrams
    INSERT SmallIntgram WITH (TABLOCKX)
        (UserId, Intgram)
    SELECT
        u.UserId,
        GT.Intgram
    FROM u
    CROSS APPLY dbo.GenerateSmallintgrams(u.msisdnint) AS GT;

    -- clustered "covering" index in intgram, userid
    CREATE UNIQUE CLUSTERED INDEX [CUQ_SmallIntgram_Intgram_UserId] 
      ON dbo.SmallIntgram (Intgram, UserId)
A materialized view to know the count of each intgram CREATE VIEW dbo.SmallIntgramCounts...

    -- view of unique smallintgrams counted
    CREATE VIEW dbo.SmallIntgramCounts
    WITH SCHEMABINDING
    AS
    SELECT SI.Intgram, cnt = COUNT_BIG(*)
    FROM dbo.SmallIntgram AS SI
    GROUP BY SI.Intgram;
    GO
    
    -- Materialize the view
    CREATE UNIQUE CLUSTERED INDEX CUQ_SmallIntgramCounts_Intgram 
    ON SmallIntgramCounts (Intgram);
A function to get the "best" intgrams from the view CREATE FUNCTION dbo.GetBestSmallIntgrams...

    CREATE OR ALTER FUNCTION dbo.GetBestSmallIntgrams (@msisdn bigint)
    RETURNS table
    WITH SCHEMABINDING AS
    RETURN
        SELECT
            -- Pivot
            intgram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.intgram END),
            intgram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.intgram END),
            intgram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.intgram END)
        FROM 
        (
            -- Generate intgrams for the search string
            -- and choose the most selective three
            SELECT TOP (3)
                rn = ROW_NUMBER() OVER (ORDER BY ETC.cnt ASC),
                GT.intgram
            FROM dbo.GenerateSmallIntgrams(@msisdn) AS GT
            JOIN dbo.SmallIntgramCounts AS ETC
                WITH (NOEXPAND)
                ON ETC.intgram = GT.intgram
            ORDER BY
                ETC.cnt ASC
        ) AS BT;
A function to get UserIds matching a set of intgrams CREATE FUNCTION dbo.GetSmallIntgramMatchIDs...

    CREATE OR ALTER FUNCTION dbo.GetSmallIntgramMatchIDs
    (
        @intgram1 smallint,
        @intgram2 smallint,
        @intgram3 smallint
    )
    RETURNS @IDs table (UserId UNIQUEIDENTIFIER PRIMARY KEY)
    WITH SCHEMABINDING AS
    BEGIN
        IF  @intgram1 IS NOT NULL
        BEGIN
            IF @intgram2 IS NOT NULL
            BEGIN
                IF @intgram3 IS NOT NULL
                BEGIN
                    -- 3 intgrams available
                    INSERT @IDs (UserId)
                    SELECT ET1.UserId
                    FROM dbo.Smallintgram AS ET1 
                    WHERE ET1.intgram = @intgram1
                    INTERSECT
                    SELECT ET2.UserId
                    FROM dbo.Smallintgram AS ET2
                    WHERE ET2.intgram = @intgram2
                    INTERSECT
                    SELECT ET3.UserId
                    FROM dbo.Smallintgram AS ET3
                    WHERE ET3.intgram = @intgram3
                    OPTION (MERGE JOIN);
                END;
                ELSE
                BEGIN
                    -- 2 intgrams available
                    INSERT @IDs (UserId)
                    SELECT ET1.UserId
                    FROM dbo.Smallintgram AS ET1 
                    WHERE ET1.intgram = @intgram1
                    INTERSECT
                    SELECT ET2.UserId
                    FROM dbo.Smallintgram AS ET2
                    WHERE ET2.intgram = @intgram2
                    OPTION (MERGE JOIN);
                END;
            END;
            ELSE
            BEGIN
                -- 1 intgram available
                INSERT @IDs (UserId)
                SELECT ET1.UserId
                FROM dbo.Smallintgram AS ET1 
                WHERE ET1.intgram = @intgram1;
            END;
        END;

        RETURN;
    END;
And a function to do the full search CREATE FUNCTION dbo.SmallIntgramSearch...

    
    CREATE OR ALTER FUNCTION dbo.SmallIntgramSearch
    (
        @search bigint
    )
    RETURNS table
    WITH SCHEMABINDING
    AS
    RETURN
        SELECT
            Result.UserId
        FROM dbo.GetBestSmallIntgrams(@Search) AS GBT
        CROSS APPLY
        (
            -- Intgram search
            SELECT
                U.UserId,
                U.msisdn
            FROM dbo.GetSmallIntgramMatchIDs
                (GBT.intgram1, GBT.intgram2, GBT.intgram3) AS MID
            JOIN dbo.U
                ON U.UserId = MID.UserId
            WHERE
                -- At least one intgram found 
                GBT.intgram1 IS NOT NULL
                AND U.msisdnvarchar LIKE '%' + cast(@Search as varchar) + '%'
    
            UNION ALL
    
            -- Non-intgram search
            SELECT
                U.UserId,
                U.msisdn
            FROM dbo.U
            WHERE
                -- No trigram found 
                GBT.intgram1 IS NULL
                AND U.msisdnvarchar LIKE '%' + cast(@Search as varchar) + '%'
        ) AS Result;

And just like with the trigrams, the original table needs to be joined in with a Loop join hint to ensure speedy execution when we run the full query for benchmark:


select top 50 U.* 
  from SmallIntgramSearch(@search) as TS 
  inner loop JOIN U on U.UserId = TS.UserId
  order by created desc;

This is the final test, and the full results are listed in the table below, with the winners in bold:

Test Total time Avg CPU (s) Avg Actual (s) Logical Reads
Baseline 01:05.47 3.7452 0.3224 61644
Varchar Column 00:18.75 0.9894 0.0884 61644
Bigint Column 00:18.94 1.0060 0.0895 61644
Baseline Indexed 05:58.25 1.7805 1.7865 11828
Varchar Indexed 02:17.17 0.6545 0.6810 9041
Bigint Indexed 00:21.27 1.1159 0.1009 61644
Trigram 00:04.50 0.0119 0.0176 1623
Trigram indexed 00:03.57 0.0103 0.0135 1096
Intgram 00:03.40 0.0096 0.0123 2919
Intgram indexed 00:03.40 0.0072 0.0119 1598

Result unindexed: 26x faster
Result indexed: 27x faster

A nice speedup compared to baseline! If the 1.1x faster time is worth it compared to the more generic trigram is of course a tricky question. In most cases I would sadly think not. At the same time I suspect the Intgram method will scale better with higher rowcount, but I have not confirmed this. Another important consideration is the resource usage, looking at CPU intgram is 1.4x more efficient copmared with trigram.

Additional analysis

Given the small difference between Trigram and IntGram, and the modest improvement by adding indexes it can be good to also look at the sizes of the additional tables and indexes. Disk is almost always cheap, but table/index size will also affect how much can fit in RAM. This can be hard to show in a benchmark, as normally there's no other queries on the same or other tables running concurrently competing for the almost always limited RAM memory.

Table/index sizes

Table/index Page count Size (MB)
SmallIntgram 48613 380
Trigram 56386 440
IX_U_Msisdn 10809 84
IX_U_MsisdnInt 6729 53
IX_U_Msisdnvarchar 8316 65
PK_UId 60830 475

Approximate column sizes

Column Msisdn (nvarchar) MsisdnVarchar (varchar) MsisdnInt (bigint)
Size 61MB 42MB 29MB

Here we can see that the Trigram table is using more space than the Intgram table. This is expected, and I actually thought the difference would be larger given that a smallint should be significantly smaller than an char(3). But it turns out there's too much overhad and the saving is mostly because the Trigram table has 16m rows, while the Intgram table "only" has 14m rows. The space saving because of data type is only a couple of percent. We can also see that there's a significant difference in how much storage the nvarchar, varchar and bigint indexes uses. As expected the index storage matches well with approximate column sizes as shown in the table above.

The End

And that was that. In the end, at work we decided to be happy with the varchar conversion, and revisit this problem in the future. Maybe product requirements have changed by then, i.e. by requiring a more comprehensive search or changing usage patterns. As is often the case, at work there's more to consider than the pure technical challenge. But I still think its pretty neat with a 27x improvement just from some T-SQL and an index on a query that is supposedly not possible to index!

Victor 2024-08-26