Specific Aggregation Market Risk System - data volume challenge - part 3
3) Which method? Parallelization in oracle or distributed computing?
In past experiences, I have seen many many cases, where bringing processing into Oracle brings impressive significant performance gain. Oracle is built to do bulk processing. Why not? At the same time, we eliminate network overhead by doing everything inside the database server.
I always have to resist the temptation to enforce this mantra onto the development team, which seem to favour parallelization of processing through distributed computing.
Which way is the right way to go? I suppose, for some cases, the answer is, who is your best people? If you have a very talented java developer in your team, and just an average Oracle developer, you are likely to end up with a better implementation of parallelization via java. On the other hand, if the position is reversed, you are likely to end up with a better implementation of parallel processing within Oracle.
However for some other extreme cases, neither would suffice. An Oracle database and code can be tuned to death, and it will still not be good enough. Or distributed computing on it’s own would not be good enough if the database structure does not support it.
Remember, we tested the full size feed of 1.2 million records, and processing took 16 hours.
A similar test of 600k records took 4.5 hours.
One of the developers in the project team has taken an initiative to test a 300k record, broke it down to 5 separate feeds (60k each), and without any changes to the code, the whole processing took 30 minutes. Great stuff, we are getting there.
Now let’s analyze this further.
- I would put the 16 hr processing of 1.2 million record and the 4.5 hr processing of 600k record into the same category. Both are unacceptable. The system has reached its breaking point, and performance degraded exponentially.
- The parallelization to 5 streams shows that 60k records/stream seem to be handled well by the system. The 30 min processing time shows that we are on the right track here, and that the network overhead of the distributed computing does not seem to matter much. If this test has not been conducted by the developer, I would definitely have insisted on the breakdown to be done inside the database, and which would require more development time. However, because this is showing promising result, and allows reuse of existing code.
- Option 1) can we stay 5 streams (using 5 application servers), and process 240k records per streams, and whether it could then scale to processing time of 2 hours
- Option 2) or can we stay with 5 streams (using 5 application servers), stay with 60k record per streams and implement pipeline method and whether it could still scale to processing time of 2 hours
- Option 3) or should be increase the number of streams (potentially could mean adding more hardware to the application server domain)
- When we are flooding the database, normally the same table with loading requests, updates, etc, can the structure hold? If not what changes are needed?
- What wiring issues will arise to the splitting of feed to multiple streams? At the moment each data feed will carry a unique data_feed_id which is currently set as the partition key to the main tables. How do we disperse the single feed to multiple streams then collate everything back to a single feed again?
To be continued …