Specific Aggregation Market Risk System - data volume challenge - part 2
Following the initial overview, here is my further recommendation to the application team.
1) Change table structure
Partitioning needs to be change from partition by range only to either range-hash partitioning, or list-hash partitioning. I have picked hash as the sub partitioning method in this recommendation document. However if there is a better way to sub partition the data, that method could be chosen over hashing.
The main objective of the change is to allow parallel processing to be implemented, in this case allowing large amount of data to be distributed across multiple oracle processing to shorten processing time.
Optimal amount of sub partitions created for a single feed needs to be determined through testing. I have tested with 8 sub partitions with fairly satisfactory results. We can try to increase this to 12 or 16, however there is one thing to consider, i.e. how heavy processing on this application could potentially impact another application which resides in the same database.
Statistics Maintenance
This procedure needs to be modified to allow this to work with tables with subpartitions. For large feeds, statistics generation with ESTIMATE_PERCENT value of 1% is sufficient. For tables with sub partitions, DEGREE parameter can be specified to allow faster statistics generation.
In conjunction with the partitioning maintenance work, mainly to replace SPLIT partition with ADD partition, an additional logic is needed, i.e. if known large data is going to be loaded, then instead a standard single partition, sub partitions need to be created.
Partition maintenance
In conjunction with the partitioning maintenance work, mainly to replace SPLIT partition with ADD partition, an additional logic is needed, i.e. if known large data is going to be loaded, then instead a standard single partition, sub partitions need to be created.
AN ORA-26002 error was reported during testing against one of the staging table. I believe that this particular staging table does not need an index after subpartition is implemented. Removal of index will eliminate this error without compromising performance.
2) Data Loading via 3rd party ETL application
Most of sqlldr job is done via this 3rd party ETL application. Loading 1.2 million rows of trade data took almost 29 minutes (about 100 column wide). Currently direct path loading is already used. In conjunction with sub partitioning implementation, I would like to see the use of parallel=true argument so increase throughput further.
As a benchmark, as I can not reveal the table structure here, the csv file is about 700MB in size. This shows the amount of data which needs to be loaded.
First, I need to know how slow 29 minutes is.
So I set myself a baseline
-
created a new version of the table using range-hash partition (8 buckets)
-
perform an insert test into this table to see how fast this can go
insert /*+ append parallel (a,8) */ into staging_table a select * from source;
-
I can load 1.2 million rows in about 70 seconds. Not bad.
This exercise is useful, as this will give me a reasonable expectation of how far the sqlldr can be tuned.
Now, I think the SQLLDR can be tuned to run as fast as 5 minutes (from the original 29 minutes). Why 5 minutes? Because I am including ETL tool overhead, and network overhead.
So now I created my own CSV and prepare a controlfile to test SQLLDR from a client host, this way I am including the network overhead. The load took less than 2 minutes. Fantastic.
Now back to performing the load using ETL tool… Why oh why did it still take 15 minutes, even after we implement parallel=true? I ended up doing some tracing, and found out that one of the column is populated using a sequence (seq_name.nextval)… Bingo. This is the bottleneck.
So with this knowledge, I go back to the application team to discuss ways of populating this column another way…
To be continued …