I'm working on optimizing a SQL Server query that fetches filtered data from two large tables:

Articles (≈1 million rows for the last 3 months)
ArticleHistory (stores status changes, timestamps, and user changes, with a minimum of 10 rows per article)

The query needs to:
Currently, the query runs too slowly due to the large dataset and complex joins.
So far I have this code, which is still too slow:
SELECT a.Id, a.CreatedAt, a.UpdatedAt, a.CustomsId, a.InternationalStatusId,
       a.StatusId, a.UUID, a.ArticleDescription
       -- I have more selected data here from other tables
FROM Articles AS a
-- I have left joins with other tables here
LEFT JOIN (
    SELECT
        sh2.ArticleId,
        sh2.StatusId,
        CASE
            -- Special case: StatusId = 4 derived from 144, use MAX
            WHEN sh2.StatusId = 4 AND EXISTS (
                SELECT 1
                FROM #TempStatusIds tsi
                WHERE tsi.StatusId = 4 AND tsi.IsSpecial = 1
            ) THEN MAX(sh2.CreatedAt)
            -- Actual case: StatusId = 4, use MIN
            WHEN sh2.StatusId = 4 THEN MIN(sh2.CreatedAt)
            -- Special case: StatusId = 10 derived from 1410, use MIN
            WHEN sh2.StatusId = 10 AND EXISTS (
                SELECT 1
                FROM #TempStatusIds tsi
                WHERE tsi.StatusId = 10 AND tsi.IsSpecial = 1
            ) THEN MIN(sh2.CreatedAt)
            -- Actual case: StatusId = 10, use MAX
            WHEN sh2.StatusId = 10 THEN MAX(sh2.CreatedAt)
            -- Special case: StatusId = 14 derived from 114, use MIN
            WHEN sh2.StatusId = 14 AND EXISTS (
                SELECT 1
                FROM #TempStatusIds tsi
                WHERE tsi.StatusId = 14 AND tsi.IsSpecial = 1
            ) THEN MIN(sh2.CreatedAt)
            -- Actual case: StatusId = 14, use MAX
            WHEN sh2.StatusId = 14 THEN MAX(sh2.CreatedAt)
            ELSE MAX(sh2.CreatedAt)
        END AS CreatedAt
    FROM ArticleHistory sh2
    -- ArticleHistory can have more than one row with the same ArticleId, StatusId,
    -- and sometimes the same CreatedAt date, and only some specific ones can be used for filtering
    WHERE sh2.ChangeComment LIKE '%Status%'
      AND sh2.StatusId = 3
      AND cast(sh2.CreatedAt as date) >= '2025-01-01'
      AND cast(sh2.CreatedAt as date) <= '2025-01-31'
    GROUP BY sh2.ArticleId, sh2.StatusId
) AS sh ON a.Id = sh.ArticleId
WHERE 1 = 1
  AND sh.StatusId = 3
  AND cast(sh.CreatedAt as date) >= '2025-01-01'
  AND cast(sh.CreatedAt as date) <= '2025-01-31'
ORDER BY a.CreatedAt DESC
OFFSET 0 ROWS
FETCH NEXT 20 ROWS ONLY
See it and sample data here:
https://dbfiddle.uk/wT0_5R3N
I've tried:
Despite this, performance is still not ideal. Are there best practices for optimizing queries with large history tables where filtering and aggregation (min/max dates) are required? Would partitioning or pre-aggregating status change data in a temporary table help?
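For reference, the kind of pre-aggregation I have in mind would look roughly like this (just a sketch; #ArticleStatusDates is a hypothetical temp table, and the real version would still need the special-case MIN/MAX logic from the CASE above):

-- Sketch: aggregate the status history once into a small indexed temp table,
-- then join that instead of the grouped subquery
SELECT
    sh2.ArticleId,
    sh2.StatusId,
    MIN(sh2.CreatedAt) AS MinCreatedAt,
    MAX(sh2.CreatedAt) AS MaxCreatedAt
INTO #ArticleStatusDates
FROM dbo.ArticleHistory AS sh2
WHERE sh2.StatusId = 3
GROUP BY sh2.ArticleId, sh2.StatusId;

CREATE CLUSTERED INDEX cx_ArticleStatusDates
    ON #ArticleStatusDates (ArticleId);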
As others may have pointed out in the comments, the LIKE predicate:
WHERE sh2.ChangeComment LIKE '%Status%'
is not sargable: the leading wildcard means the entire string has to be searched for every row, resulting in a table scan, so you can't properly utilize indexes. But when I looked at the sample data in your db fiddle, all of the [ChangeComment] values in the [dbo].[ArticleHistory] table contained the word "Status", so the pattern match appears to be unnecessary. Examples: "Initial status assigned.", "Status changed to Processing.", "Status changed to Approved.", "Status changed International.", "Status changed to Completed.", "Status changed to Rejected.". If by chance those are not all of the possible values, a more selective pattern is still better, e.g. WHERE sh2.ChangeComment LIKE 'Status%' OR sh2.ChangeComment = 'Initial status assigned.'
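If you want to double-check against the full table rather than the fiddle sample before dropping the filter, a quick count along these lines (just a sketch) would show whether the current pattern actually excludes anything:

-- Sketch: how many ArticleHistory rows the current LIKE filter would exclude;
-- zero means the '%Status%' filter is redundant and can simply be dropped
SELECT COUNT(*) AS ExcludedRows
FROM dbo.ArticleHistory
WHERE ChangeComment NOT LIKE '%Status%'
   OR ChangeComment IS NULL;   -- NULL comments are also excluded by the original filter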
Getting rid of that pattern match, and assuming you don't need the filter on ChangeComment after all, you could create a covering non-clustered index on the ArticleHistory table:
CREATE NONCLUSTERED INDEX ncx_ArticleHistory_StatusId_CreatedAt
ON [dbo].[ArticleHistory] (StatusId, CreatedAt)
INCLUDE (ArticleID)
I included ArticleId since you selected it and grouped by it. The execution plan for the ArticleHistory table would then change to an Index Seek instead of the original Clustered Index Scan (effectively a full table scan).
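To sketch how the history subquery would sit on that index (ChangeComment filter dropped as discussed, and the date range written directly against CreatedAt so the seek isn't blocked by the CAST), something along these lines:

-- Sketch: aggregate shaped to the (StatusId, CreatedAt) INCLUDE (ArticleId) index;
-- the special-case MIN/MAX CASE logic from the question can be layered back on top
SELECT
    sh2.ArticleId,
    sh2.StatusId,
    MIN(sh2.CreatedAt) AS MinCreatedAt,
    MAX(sh2.CreatedAt) AS MaxCreatedAt
FROM dbo.ArticleHistory AS sh2
WHERE sh2.StatusId = 3
  AND sh2.CreatedAt >= '20250101'   -- no cast on the column
  AND sh2.CreatedAt <  '20250201'   -- exclusive upper bound still covers all of Jan 31
GROUP BY sh2.ArticleId, sh2.StatusId;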
You should have indexes on the Id columns used in the joins.
Also replace
AND cast(sh.CreatedAt as date) >= '2025-01-01'
AND cast(sh.CreatedAt as date) <= '2025-01-31'
With
YEAR(sh.CreatedAt) = 2025 AND MONTH(sh.CreatedAt) = 1
You could also create an index on the year and month if it's still slow.
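For what it's worth, SQL Server can't index a YEAR()/MONTH() expression directly; one way to do it (a sketch, with made-up column and index names) is to add computed columns, index those, and filter on them:

-- Sketch: computed year/month columns that can be indexed and filtered on
ALTER TABLE dbo.ArticleHistory
    ADD CreatedYear  AS YEAR(CreatedAt),
        CreatedMonth AS MONTH(CreatedAt);

CREATE NONCLUSTERED INDEX ix_ArticleHistory_YearMonth
    ON dbo.ArticleHistory (CreatedYear, CreatedMonth, StatusId)
    INCLUDE (ArticleId, CreatedAt);

-- ...then filter on the computed columns so the index can be used:
-- WHERE CreatedYear = 2025 AND CreatedMonth = 1 AND StatusId = 3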
SELECT .. FROM (select ...) JOIN .. WHERE x = (select max(..) from ...)
It helps me keep straight what code is at what level. – Joel Coehoorn Commented Jan 30 at 16:36

AND cast(sh.CreatedAt as date) >= '2025-01-01' AND cast(sh.CreatedAt as date) <= '2025-01-31'
You want to instead write it in a way that avoids mutating the stored data. This not only saves per-row processing work, but helps make better use of indexes. – Joel Coehoorn Commented Jan 30 at 16:39

Use an exclusive (<) upper bound rather than an inclusive one (<=): AND sh.CreatedAt >= '20250101' AND sh.CreatedAt < '20250201' – Joel Coehoorn Commented Jan 30 at 16:54
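Putting those comments together with the index suggestion above, a stripped-down version of the query would read roughly like this (a sketch only; the other joins and selected columns are omitted, and MAX stands in for the CASE logic):

-- Sketch: half-open, cast-free date range in the subquery. The outer date filter
-- becomes redundant because the aggregated dates already fall inside January, and
-- an inner join is used since WHERE sh.StatusId = 3 made the LEFT JOIN behave like
-- one anyway.
SELECT a.Id, a.CreatedAt, a.UpdatedAt, a.StatusId, sh.StatusCreatedAt
FROM Articles AS a
JOIN (
    SELECT
        sh2.ArticleId,
        sh2.StatusId,
        MAX(sh2.CreatedAt) AS StatusCreatedAt
    FROM dbo.ArticleHistory AS sh2
    WHERE sh2.StatusId = 3
      AND sh2.CreatedAt >= '20250101'
      AND sh2.CreatedAt <  '20250201'
    GROUP BY sh2.ArticleId, sh2.StatusId
) AS sh ON a.Id = sh.ArticleId
ORDER BY a.CreatedAt DESC
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY;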