Specific Aggregation Market Risk System - data volume challenge - part 4
4) High volume - multi column enrichments
The staging tables consist of 100-150 columns. About 90 columns are populated during data loading, and the rest of the columns are updated using several phases of mapping. The mapping processes will consult various lookup tables. Mappings are done using stored procedures. Within each phase, various columns that can be enriched independent of one another, are updated sequentially.
The current code can be simplified as follows (please excuse the syntax error, as I am typing this straight from my head)
cursor cur_columns is
select column_name, column_value from lookup_tables where mapping_phase=5;
for i in cur_columns then
loop
v_sql := ‘update staging set ‘||i.column_name||’=’||i.column_value
||’ where data_feed_id=’||p_data_feed_id
||’ and blah ‘;
execute immediate v_sql;
commit;
end loop;
The final sql that is updated for each enriched column would look like:
update staging set industry=(inline select) where data_feed_id=12345 and exists (some subquery);
update staging set region=(inline select) where data_feed_id_12345 and exists (some subquery);
Now remember, the target table partition for this data feed
- has 1.2 million records
- is 150 column wide
- indexing all columns which will be used in subquery could be counter-productive, as there are so many different ones which are needed. Even if they are not counter-productive, data loading could be slowed down tremendously.
Now, what if …
- within the same mapping phase, we do all the updates as follows:
update staging
set column1 = (inline select),
column2 = (inline select),
column3 = (inline select),
…
column8 = (inline select)
where data_feed_id=12345;
- what if we just let Oracle scan the whole partition with the 1.2 million row, and let all the columns be updated in one sweep.
- I believe that the improvement factor could be as much as number_of_columns_to_update/2 or better.
- to improve it further, if we have implemented subpartitioning, perhaps parallel DML can be used here.
update /*+ parallel (staging,8) */ staging
set column1 = (inline select),
column2 = (inline select),
column3 = (inline select),
…
column8 = (inline select)
where data_feed_id=12345;
- another thing to consider is to have a target table, as it is much much quicker to do an insert select (to include enrichment work), rather than an update. I have tested this with the following results –updating 5 columns in 1 sweep for 1.2 million records too 16 minutes, doing an insert select as below took a mere 3 minutes.
insert into staging_2
select staging.*,
(inline select) as newcolumn1,
(inline select) as newcolumn2,
(inline select) as newcolumn3,
…
(inline select) as newcolumn8
from staging
where data_feed_id=12345;
To be continued ..