sql server - Optimizing SQL query with large dataset and multiple filters - Stack Overflow

admin2025-04-18  2

I'm working on optimizing a SQL Server query that fetches filtered data from two large tables:

  • Articles (≈1 million rows for the last 3 months)
  • ArticleHistory (stores status changes, timestamps, and user changes, with a minimum of 10 rows per article)

The query needs to:

  • Fetch articles based on multiple filters (date range, status, user actions).
  • Select the latest or earliest status change date for specific statuses.
  • Join with multiple tables for additional metadata.
  • Use pagination (OFFSET ... FETCH NEXT).

Currently, the query runs too slow due to the large dataset and complex joins.

So far I have this code that is still too slow:

SELECT a.Id, a.CreatedAt, a.UpdatedAt, a.CustomsId, a.InternationalStatusId, 
      a.StatusId, a.UUID,  a.ArticleDescription
      -- i have more selected data here from other tables
FROM Articles AS a 
  -- i have left joins with other tables here
LEFT JOIN (
    SELECT 
        sh2.ArticleId, 
        sh2.StatusId, 
        CASE
            -- Special case: StatusId = 4 derived from 144, use MAX
            WHEN sh2.StatusId = 4 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 4 AND tsi.IsSpecial = 1
                ) THEN MAX(sh2.CreatedAt)

            -- Actual case: StatusId = 4, use MIN
            WHEN sh2.StatusId = 4 THEN MIN(sh2.CreatedAt)

            -- Special case: StatusId = 10 derived from 1410, use MIN
            WHEN sh2.StatusId = 10 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 10 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 10, use MAX
            WHEN sh2.StatusId = 10 THEN MAX(sh2.CreatedAt)

            -- Special case: StatusId = 14 derived from 114, use MIN
            WHEN sh2.StatusId = 14 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 14 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 14, use MAX
            WHEN sh2.StatusId = 14 THEN MAX(sh2.CreatedAt)

            ELSE MAX(sh2.CreatedAt)
        END AS CreatedAt
    FROM ArticleHistory sh2
                
    --One ArticleHisotry can have more than one row with the same ArticleId, StatusId and sometimes CreatedAt as in date
        --and only some specific ones can be used for filtering
    WHERE sh2.ChangeComment LIKE '%Status%'
        AND sh2.StatusId = 3
        AND cast(sh2.CreatedAt as date) >= '2025-01-01'
        AND cast(sh2.CreatedAt as date) <= '2025-01-31'

    GROUP BY sh2.ArticleId, sh2.StatusId
) AS sh ON a.Id = sh.ArticleId

WHERE 1 = 1
    AND sh.StatusId = 3
    AND cast(sh.CreatedAt as date) >= '2025-01-01'
    AND cast(sh.CreatedAt as date) <= '2025-01-31'
ORDER BY a.CreatedAt DESC
OFFSET 0 ROWS
FETCH NEXT 20 ROWS ONLY

See it and sample data here:

I've tried:

  • Indexing key columns (StatusId, ArticleId, StatusDate).
  • Using WITH (NOLOCK) to minimize locking impact.
  • Filtering ArticleHistory inside a LEFT JOIN to select relevant statuses and date ranges.

Despite this, performance is still not ideal. Are there best practices for optimizing queries with large history tables where filtering and aggregation (min/max dates) are required? Would partitioning or pre-aggregating status change data in a temporary table help?

I'm working on optimizing a SQL Server query that fetches filtered data from two large tables:

  • Articles (≈1 million rows for the last 3 months)
  • ArticleHistory (stores status changes, timestamps, and user changes, with a minimum of 10 rows per article)

The query needs to:

  • Fetch articles based on multiple filters (date range, status, user actions).
  • Select the latest or earliest status change date for specific statuses.
  • Join with multiple tables for additional metadata.
  • Use pagination (OFFSET ... FETCH NEXT).

Currently, the query runs too slow due to the large dataset and complex joins.

So far I have this code that is still too slow:

SELECT a.Id, a.CreatedAt, a.UpdatedAt, a.CustomsId, a.InternationalStatusId, 
      a.StatusId, a.UUID,  a.ArticleDescription
      -- i have more selected data here from other tables
FROM Articles AS a 
  -- i have left joins with other tables here
LEFT JOIN (
    SELECT 
        sh2.ArticleId, 
        sh2.StatusId, 
        CASE
            -- Special case: StatusId = 4 derived from 144, use MAX
            WHEN sh2.StatusId = 4 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 4 AND tsi.IsSpecial = 1
                ) THEN MAX(sh2.CreatedAt)

            -- Actual case: StatusId = 4, use MIN
            WHEN sh2.StatusId = 4 THEN MIN(sh2.CreatedAt)

            -- Special case: StatusId = 10 derived from 1410, use MIN
            WHEN sh2.StatusId = 10 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 10 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 10, use MAX
            WHEN sh2.StatusId = 10 THEN MAX(sh2.CreatedAt)

            -- Special case: StatusId = 14 derived from 114, use MIN
            WHEN sh2.StatusId = 14 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 14 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 14, use MAX
            WHEN sh2.StatusId = 14 THEN MAX(sh2.CreatedAt)

            ELSE MAX(sh2.CreatedAt)
        END AS CreatedAt
    FROM ArticleHistory sh2
                
    --One ArticleHisotry can have more than one row with the same ArticleId, StatusId and sometimes CreatedAt as in date
        --and only some specific ones can be used for filtering
    WHERE sh2.ChangeComment LIKE '%Status%'
        AND sh2.StatusId = 3
        AND cast(sh2.CreatedAt as date) >= '2025-01-01'
        AND cast(sh2.CreatedAt as date) <= '2025-01-31'

    GROUP BY sh2.ArticleId, sh2.StatusId
) AS sh ON a.Id = sh.ArticleId

WHERE 1 = 1
    AND sh.StatusId = 3
    AND cast(sh.CreatedAt as date) >= '2025-01-01'
    AND cast(sh.CreatedAt as date) <= '2025-01-31'
ORDER BY a.CreatedAt DESC
OFFSET 0 ROWS
FETCH NEXT 20 ROWS ONLY

See it and sample data here:

https://dbfiddle.uk/wT0_5R3N

I've tried:

  • Indexing key columns (StatusId, ArticleId, StatusDate).
  • Using WITH (NOLOCK) to minimize locking impact.
  • Filtering ArticleHistory inside a LEFT JOIN to select relevant statuses and date ranges.

Despite this, performance is still not ideal. Are there best practices for optimizing queries with large history tables where filtering and aggregation (min/max dates) are required? Would partitioning or pre-aggregating status change data in a temporary table help?

Share edited Feb 1 at 9:30 Mark Rotteveel 110k229 gold badges156 silver badges224 bronze badges asked Jan 30 at 15:44 E.SE.S 11 bronze badge 12
  • We need to see an actual SQL statement and a SMALL set of sample data. You can use dbfiddle.uk to help set this up if you want. Until we have these things, this question is not really answerable and likely to be put on hold. – Joel Coehoorn Commented Jan 30 at 15:47
  • Here’s a live SQL Fiddle where you can test it: dbfiddle.uk/jfMD3txJ – E.S Commented Jan 30 at 16:26
  • One thing I've figured out over the years separate from the question (so just a comment) is, when I have complicated nested queries like this, I will alternate keyword capitalization between nesting levels. So SELECT .. FROM (select ...) JOIN .. WHERE x = (select max(..) from ...) It helps me keep straight what code is at what level. – Joel Coehoorn Commented Jan 30 at 16:36
  • 1 Also, at first pass through, these conditional expressions are slowing you down: AND cast(sh.CreatedAt as date) >= '2025-01-01' AND cast(sh.CreatedAt as date) <= '2025-01-31' You want to instead write it in a way that avoids mutating the stored data. This not only saves pre-row processing work, but helps make better use of indexes. – Joel Coehoorn Commented Jan 30 at 16:39
  • 1 ... called away. To finish previous comment, corrected code looks like this, with a half-open range using an exclusive (<) upper bound rather than inclusive (<=): AND sh.CreatedAt >= '20250101' AND sh.CreatedAt < '20250201' – Joel Coehoorn Commented Jan 30 at 16:54
 |  Show 7 more comments

2 Answers 2

Reset to default 0

As others may have pointed out in the comments the regular expression:

WHERE sh2.ChangeComment LIKE '%Status%'

is not sargable since it looks through the entire string resulting in a table scan- so you wouldn't be able to properly utilize indexes. But when I looked at the example code in your sqlfiddle, all of the [ChangeComments] values in the [dbo].[ArticleHistory] table contained the word "Status" so it appears the regex was unnecessary. Examples: "Initial status assigned.", "Status changed to Processing.", "Status changed to Approved.", "Status changed International.", "Status changed to Completed.", "Status changed to Rejected.". If by chance this is not all of the potential status- then using a better regex is still better, ex. WHERE sh2.ChangeComment LIKE 'Status%' OR sh2.ChangeComment = 'Initial status assigned'

Getting rid of the regex, and assuming you didn't need the filter on ChangeComment after all, you could do a fully covering non-clustered index for the ArticleHistory table:

CREATE NONCLUSTERED INDEX ncx_ArticleHistory_StatusId_CreatedAt 
   ON [dbo].[ArticleHistory] (StatusId, CreatedAt)
    INCLUDE (ArticleID)

I included ArticleID since you selected it and grouped with it. The Execution Plan for the ArticleHistory table would change to an Index Seek instead of the original Clustered Index Scan (aka table scan).

You should have indexes on the id column.

Also replace

AND cast(sh.CreatedAt as date) >= '2025-01-01'
AND cast(sh.CreatedAt as date) <= '2025-01-31'

With

YEAR(sh.CreatedAt) = 2025 AND MONTH(sh.CreatedAt) = 1

You could also make an index on the year and month if still slow.

转载请注明原文地址:http://anycun.com/QandA/1744908779a89329.html