Simplified scenario:
I want to update one table A (10 million rows) with a value from another table B (20 mil rows). The two tables are linked with ids.
It took more than 7hrs to update the whole thing in one go. (I don't know exactly how long as I stopped the script in the middle.)
So my idea is to update table A in batches using OFFSET
and LIMIT
clauses. So far with no luck.
Wrapped up in a procedure, the code looks like below:
DECLARE
offset_number integer := 0;
batch_size integer := 1000;
BEGIN
LOOP
UPDATE TableA temp1
SET TableA_column_value_to_be_updated = (
SELECT
tableB_column_value
FROM
TableB temp2
WHERE
temp2.id = temp1.id
AND some_other_conditions_in_TableB
)
WHERE
some_other_conditions_in_Table
OFFSET offset_number
LIMIT batch_size ;
COMMIT;
offset_number := offset_number + batch_size;
EXIT WHEN NOT FOUND;
END LOOP;
END;
The engine reports an error with exception:
org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"
I have no idea what it is. Notably, it seems to work without OFFSET
and LIMIT
.
Any ideas why this would happen? Should I use other loop statement?
Simplified scenario:
I want to update one table A (10 million rows) with a value from another table B (20 mil rows). The two tables are linked with ids.
It took more than 7hrs to update the whole thing in one go. (I don't know exactly how long as I stopped the script in the middle.)
So my idea is to update table A in batches using OFFSET
and LIMIT
clauses. So far with no luck.
Wrapped up in a procedure, the code looks like below:
DECLARE
offset_number integer := 0;
batch_size integer := 1000;
BEGIN
LOOP
UPDATE TableA temp1
SET TableA_column_value_to_be_updated = (
SELECT
tableB_column_value
FROM
TableB temp2
WHERE
temp2.id = temp1.id
AND some_other_conditions_in_TableB
)
WHERE
some_other_conditions_in_Table
OFFSET offset_number
LIMIT batch_size ;
COMMIT;
offset_number := offset_number + batch_size;
EXIT WHEN NOT FOUND;
END LOOP;
END;
The engine reports an error with exception:
org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"
I have no idea what it is. Notably, it seems to work without OFFSET
and LIMIT
.
Any ideas why this would happen? Should I use other loop statement?
LIMIT
and OFFSET
are not in the syntax of an SQL UPDATE
statement. You need SELECT
for that.
Also, OFFSET
scales poorly to "paginate" a big table. Remember the upper bound from the last iteration instead.
Something like this could work:
CREATE OR REPLACE PROCEDURE upd_in_batches(_batch_size int = 1000)
LANGUAGE plpgsql AS
$proc$
DECLARE
_id_bound int = 0; -- or whatever?
BEGIN
LOOP
WITH sel AS (
SELECT a.id -- id = PK!
FROM tablea a
WHERE a.id > _id_bound
-- AND <some other conditions in Table A>
ORDER BY a.id
LIMIT _batch_size
FOR UPDATE
)
, upd AS (
UPDATE tablea a
SET target_col = b.b_source_col
FROM sel s
JOIN tableb b USING (id)
WHERE a.id = s.id
AND a.target_col IS DISTINCT FROM b.b_source_col
)
SELECT max(id) -- always returns a row
FROM sel
INTO _id_bound;
IF _id_bound IS NULL THEN
EXIT; -- no more rows found; we're done, exit loop
ELSE
COMMIT;
END IF;
END LOOP;
END
$proc$
Use a SELECT
statement instead to apply your LIMIT
. To avoid race conditions with concurrent writes, throw in a locking clause (FOR UPDATE
). You may or may not need that.
You might be able to iUPDATE
directly and just increment lower & upper bound for the filter on id
instead, which is cheaper. Depends on the details of your setup and requirements. Each has its caveats.
See:
The issue here is in using OFFSET and LIMIT within the UPDATE statement. Most SQL dialects (like PostgreSQL) do not support using OFFSET and LIMIT in an UPDATE query as they are usually used in SELECT statements.
Here’s an alternative solution:
DECLARE
offset_number INTEGER := 0;
batch_size INTEGER := 1000;
BEGIN
LOOP
-- Update rows in batches using a subquery to limit the rows processed
WITH cte AS (
SELECT temp1.id
FROM TableA temp1
WHERE some_other_conditions_in_Table
ORDER BY temp1.id
OFFSET offset_number
LIMIT batch_size
)
UPDATE TableA temp1
SET TableA_column_value_to_be_updated = (
SELECT tableB_column_value
FROM TableB temp2
WHERE temp2.id = temp1.id AND some_other_conditions_in_TableB
)
WHERE temp1.id IN (SELECT id FROM cte);
-- Commit after each batch
COMMIT;
-- Exit loop if no more rows are returned
EXIT WHEN NOT FOUND;
-- Increment the offset for the next batch
offset_number := offset_number + batch_size;
END LOOP;
END;
update
statement does not takelimit
oroffset
clauses! – Ture Pålsson Commented Jan 2 at 6:24