Phased release of a global risk system - gotchas!

A credit risk system, which provides limit exposure management of counterparties, is having various performance issues. This system was designed as a global system which will be implemented worldwide, however at the moment is only used to deal with London trading activities.

As expected, tables and indexes’ designs are driven by location, say LOCATION_ID.

Unfortunately because at the moment there is only 1 location using this, this is causing a lot of issues with oracle optimizer.

Table/index statistics were generated with the following option:METHOD_OPT => ‘FOR ALL INDEXED COLUMNS SIZE AUTO’. Because of this, Oracle favours the use of full table scans, rather than the nicely designated indexes. The result is disastrous.

As an outsider, with no intimate knowledge of the running of the system, there are 2 tactical solutions which I can offer here:

1) simplify the table/index statistics generation method. Do not generate histogram, in this case by default Oracle will create 2 buckets for most columns, and this will result in the designated indexes to be used in queries.

2) recrate indexes with LOCATION_ID as leading columns. Pick other columns that are much more suitable as leading columns, and if needed LOCATION_ID can be kept as third/fourth columns.

That’s it.

Friday, June 19th, 2009 at 12:28

How to handle audit trail for time series data

We are dealing with curve data over a long range time series, where this could be market data or risk factors or scenario data, or anything of similar nature.

 

From time to time, data will be updated, and there is a requirement to keep an audit trail for this.

 

The original plan was to keep a usual shadow audit table to record all the changes that have been made to a particular point in the curve at a particular date in the time series, and simply keep the latest version in the main table.

 

The weakness of this solution is that it doesn’t allow easy reconstruction of the curve for any date in the past.

 

Let me illustrate as follows:

 

Observation dates – are dates within the time series. In this example we are dealing with 01-Apr-2009 – 10 Apr 2009 data

 

Run dates – are dates where data is fed into the system. This can be a new insert, or an update (which will trigger creation of new version)

 

Steps:

  1. On day 1 (1 May 2009), new data was  loaded to the system (date 1-10 Apr, version 1 for everything)
  2. On day 2 (10 May 2009), new updates came  in, affecting 5-10 Apr data, therefore the new data should now be stored as version 2
  3. On day 3 (20 May 2009), more updates came  in, affecting 7-10 Apr data, therefore the new data should be stored as version 3.

 

RISK FACTOR INSTANCE ID

RUN DATE

OBSERVATION DATE

BUCKET ID

VALUE

VERSION

100

01-May-09

01-Apr-09

1

1.082798469

1

100

01-May-09

02-Apr-09

1

0.6882592

1

100

01-May-09

03-Apr-09

1

0.33327391

1

100

01-May-09

04-Apr-09

1

1.229654916

1

100

01-May-09

05-Apr-09

1

0.100964171

1

100

01-May-09

06-Apr-09

1

1.31731591

1

100

01-May-09

07-Apr-09

1

-0.13868598

1

100

01-May-09

08-Apr-09

1

0.470941035

1

100

01-May-09

09-Apr-09

1

0.243330874

1

100

01-May-09

10-Apr-09

1

2.045916747

1

100

10-May-09

05-Apr-09

1

0.167072293

2

100

10-May-09

06-Apr-09

1

0.832312129

2

100

10-May-09

07-Apr-09

1

0.890681594

2

100

10-May-09

08-Apr-09

1

1.91338225

2

100

10-May-09

09-Apr-09

1

0.764316071

2

100

10-May-09

10-Apr-09

1

2.51100637

2

100

20-May-09

07-Apr-09

1

0.537491042

3

100

20-May-09

08-Apr-09

1

0.91346192

3

100

20-May-09

09-Apr-09

1

0.91736769

3

100

20-May-09

10-Apr-09

1

0.27792944

3

 

Method 1 – using audit table to store the changes

 

We will have

 

Main table

RISK FACTOR INSTANCE ID

RUN DATE

OBSERVATION DATE

BUCKET ID

VALUE

VERSION

100

01-May-09

01-Apr-09

1

1.082798469

1

100

01-May-09

02-Apr-09

1

0.6882592

1

100

01-May-09

03-Apr-09

1

0.33327391

1

100

01-May-09

04-Apr-09

1

1.229654916

1

100

10-May-09

05-Apr-09

1

0.167072293

2

100

10-May-09

06-Apr-09

1

0.832312129

2

100

20-May-09

07-Apr-09

1

0.537491042

3

100

20-May-09

08-Apr-09

1

0.91346192

3

100

20-May-09

09-Apr-09

1

0.91736769

3

100

20-May-09

10-Apr-09

1

0.27792944

3

 

Audit table

RISK FACTOR INSTANCE ID

RUN DATE

OBSERVATION DATE

BUCKET ID

VALUE

VERSION

UPDATE  DATE

100

01-May-09

05-Apr-09

1

0.100964171

1

10-May-09

100

01-May-09

06-Apr-09

1

1.31731591

1

10-May-09

100

01-May-09

07-Apr-09

1

0.13868598

1

10-May-09

100

01-May-09

08-Apr-09

1

0.470941035

1

10-May-09

100

01-May-09

09-Apr-09

1

0.243330874

1

10-May-09

100

01-May-09

10-Apr-09

1

2.045916747

1

10-May-09

100

10-May-09

07-Apr-09

1

0.890681594

2

20-May-09

100

10-May-09

08-Apr-09

1

1.91338225

2

20-May-09

100

10-May-09

09-Apr-09

1

0.764316071

2

20-May-09

100

10-May-09

10-Apr-09

1

2.51100637

2

20-May-09

 

Now, a user will want to then run a report “show me my data as of 15 May 2009”.

Not easy, right?

 

Method 2 – version handling without audit table

 

In this method, I will store all versions in the same table.

 

To reconstruct data as of 15 May 2009, this is the data that we are after

 

RISK FACTOR INSTANCE ID

RUN DATE

OBSERVATION DATE

BUCKET ID

VALUE

VERSION

100

01-May-09

01-Apr-09

1

1.082798469

1

100

01-May-09

02-Apr-09

1

0.6882592

1

100

01-May-09

03-Apr-09

1

0.33327391

1

100

01-May-09

04-Apr-09

1

1.229654916

1

100

10-May-09

05-Apr-09

1

0.167072293

2

100

10-May-09

06-Apr-09

1

0.832312129

2

100

10-May-09

07-Apr-09

1

0.890681594

2

100

10-May-09

08-Apr-09

1

1.91338225

2

100

10-May-09

09-Apr-09

1

0.764316071

2

100

10-May-09

10-Apr-09

1

2.51100637

2

 

To reconstruct data as of 25 May 2009, this is the data that we are after

 

RISK FACTOR INSTANCE ID

RUN DATE

OBSERVATION DATE

BUCKET ID

VALUE

VERSION

100

01-May-09

01-Apr-09

1

1.082798469

1

100

01-May-09

02-Apr-09

1

0.6882592

1

100

01-May-09

03-Apr-09

1

0.33327391

1

100

01-May-09

04-Apr-09

1

1.229654916

1

100

10-May-09

05-Apr-09

1

0.167072293

2

100

10-May-09

06-Apr-09

1

0.832312129

2

100

20-May-09

07-Apr-09

1

0.537491042

3

100

20-May-09

08-Apr-09

1

0.91346192

3

100

20-May-09

09-Apr-09

1

0.91736769

3

100

20-May-09

10-Apr-09

1

0.27792944

3

 

Next, how do we reconstruct the risk factor data as of any point in time in the past?

This is easily done using oracle analytics (RANK).

 

select *

from (

select observation_date, bucket_id, value,

version, 

rank() over (partition by observation_date, bucket_id order by version desc) as rank from mydata where run_date <= <date of required reconstruction>)

where rank =1

 

The same model will obviously support multiple updates within a day, simply make sure we store the timestamp as well.

 

Obviously this carries some overhead, so some housekeeping criteria will have to be defined and implemented to ensure we do not indefinitely keep old versions of data.

 

 

Wednesday, June 3rd, 2009 at 13:11

Defragmenting a very large partition table

I am working on a market risk data warehouse, where we needed to rationalize the space used by a very large table.

 

The characteristics of the table were as follows:

  • Total number of rows 5,200,000,000 (5.2 billion rows)
  • Total table size 700GB
  • Data was stored in an ASSM tablespace with uniform extents of 4MB
  • Data was split into 1000 partitions (by source system), and each partition was further split to around 60 subpartitions.
  • Data distribution across sub-partitions and across partitions are also not even, causing a lot wastage taken by sub-partitions which do not hold a lot of data.

 

Having reviewed the system, each partition is normally read as a whole, and for the largest partitions, parallel query/parallel DML were used using degree of parallelism of 8 only. Therefore, there was really no good reason to maintain 60 sub-partitions within a single partition. So it was decided that we will reduce it from 60 to 8.

 

Initially we tried performing the work in-situ, by using partition merge. We tested this against the largest partition with 47 million rows, and it took 45 minutes to complete. Considering we have 1000 partitions to do, even if we run this in parallel, it would be impossible to do this work over a weekend.

 

Next we tried another strategy:

  • Create a new table with the same structure
  • Create all the partitions/sub-partitions using the new template
  • Set the table to be nologging
  • Perform insert select with APPEND and PARALLEL hints for both read and write.
  • Run 12 streams

 

We were getting 4 billion rows throughput per hour, meaning that the work was done well within 1.5 hours.

Size of final table was 450GB, a saving of 250GB.

 

This means we will have all the time in the world to add any indexes if needed, plus to regenerate the statistics on this new table.

 

I knew Oracle was not slow, but this is impressive indeed.

 

Thursday, May 28th, 2009 at 15:49

Compromising between Java and Oracle to load LOB/CLOB/XML

Problem:

We have the challenge to load feed files, each with approximately 100k rows of trade data with XML which sizes vary between 30k to 300k, into an Oracle database, across the network, as quickly as possible.

 

Trade data is processed in a grid using a Java application, which in turn produces these XMLs to be inserted into the database. The Java application produces row by row insert statements against the database, and in effort to increase throughput, multiple streams of inserts will be run.  It was reported that performance was unsatisfactory, and that the assumption that the bottleneck is the database.

 

 Constraints:

The use of external tables, sqlldr are not an option at this point, as it requires large changes to the java application which currently produces the required data.

 

Analysis:

Those who are familiar with Oracle will question the report that Oracle is struggling to load a mere 100k rows of data. On closer analysis, it is clear that the real bottleneck is network. 100k rows of data equates to about 7GB worth of data.


1)       Tried using a new listener with 32k packet sizing, as opposed to the default 2k. I expected that this would reduce the amount of return trips that have to be made to transfer al the data. Good idea, but turned out that it hardly made any difference in throughput 

2)       We partitioned the target table to 20 partitions (hash), in the hope to load  balance the concurrent loader to hit the different partitions. No improvement, and the main bottleneck is still network

3)       OK, hardcore DBAs might frown on this one, but I think this is a very neat and interesting approach. We zipped the XML at source (at the java application side) to reduce the amount of data have to be shifted across the network. We later unzipped the XML with a simple trigger on the table, prior to writing it to the target column (yes, this works row by row still). This eliminates the network bottleneck, which is a very good thing, however performance is still not good enough

4)       Altered the table and modify the lob column to use the CACHE option. This improved performance considerably, but now we were seeing a lot of “log buffer space” wait event. This was confusing me, because I had set the table to be NOLOGGING. After much digging, turned out that CACHE does not go hand in hand with NOLOGGING. If you cache, logging will be turned on automatically. The fact is, nocache nologging is much much slower than cache logging. So we will just have to deal with the “log buffer space” wait.

5)       Increased the “log_buffer” parameter from 2MB to 8MB, as well as move the redo logs to the faster device. This reduced the log buffer space wait event, but introduced new wait “buffer busy waits”. The explanation for this is relatively simple, the concurrent inserts often try to write to the same data block. There are different options to deal with this. One is to use a tablespace with smaller block size to reduce the number of rows that can be fit into a single block. This will hopefully reduce the buffer busy waits, but not necessarily eliminate it. Or we can deal with this from the application side. We choose to do this instead.

6)       We modified  the java code which control the job distribution across streams. A single stream is dynamically assigned to a unique partition. Therefore when the 40 streams are running, each one is guaranteed to be working against different blocks, therefore eliminating buffer busy waits completely.

 

And the results? 7GB worth of XML loaded within 5.5 minutes. Not bad.

 

Thursday, May 7th, 2009 at 15:40

Generalist or Specialist

Well, I have reached a point in my career (and my life) where I need to have a re-think. I have been working as a data architect covering various asset classes in the financial industries (read: a generalist), gaining respect and reconigition from senior managements and colleagues alike, something that I have found extremely enjoyable and rewarding. 

The irony is, being a generalist, I am supposed to be able to maintain the the big picture, to be able to do pattern matching, to abstract similarities between different business areas, however I feel that in reality, it is not the case. While I can be recognized to be very good, I want to be excellent, I want to be extraordinary. Something is missing. Generalists do not drive innovation and long-term results, specialists do. As an individual looking for career longetivity and personal satisfaction, I am terrified of the prospect of being complacent with being a ‘talented generalist’, knowing a bit of a lot of things, and experts at nothing. The grass is always greener on the other side, they say. Perhaps it’s true, but I am giving it a go.

So! I have decided to hang up my generalist uniform for now, and take a plunge to move to the risk management area within one of the largest banks in the world. I have set myself a 2 year goal, contribute, bring fresh perspective and imagination to the area, while setting a personal target to be a specialist in Risk technologies. With everything that is going with the current economic climate, I can’t think of a more exciting place to be than the area of Financial Risk Management.

In order to become a specialist in Risk-IT, I would need the gain the general knowledge and understanding of risk management  So will that make me a specialist in one hand, and a generalist in others? Doesn’t that beat the purpose of my career change anyway? Am I a generalist becoming a specialist, or a specialist becoming a generalist? I don’t know, and I don’t care. One thing is for sure, I can’t wait to start.

Thursday, April 16th, 2009 at 21:11

Finance - GL related systems - fundamentals

I was asked to look at a finance system, which primarily is a reconciliation system between front office and back office sub ledger systems. The amount of records they have to deal with is about 2.5 million rows/day. Sizable, but not all that big. However, the system was so slow, it was unusable.

I can simply discount lack of hardware resource as an issue, as they are running on a relatively new and powerful server.

It turned out that the system was designed on Microsoft Access, and simply ported to Oracle 10g (version 10.2.0.4). Developers have no real understanding of basic oracle fundamentals. It astonished me that this happened in a large financial institution that is not short of oracle expertise.

To cut the story short, the solutions are simple:

  • Partition the large balance tables by date. This is almost a gospel in finance-IT. You will read your data by business date, therefore partition your data by business date. Simple.
  • Do not use global indexes for your partition tables. Really, you should be able to survive with local indexes only, and this will save you a lot of time that otherwise you would need to spend maintaining the global indexes.
  • Do not use DEFAULT partition. I would rather have my system fail when it tries to write to a non-existing partition, rather than have this hidden from me, and a few days later, I end up having to deal with performance issue because my DEFAULT partition gets so big
  • Do not SPLIT partition. ADD. It’s faster.
  • Create yourself a simple but robust partition maintenance package, which will perform partition add, drop, truncate. This package can then be granted to operator accounts.
  • Enhance the package so it also caters for statistics maintenance.
  • If you rely on a lot of staging tables, make sure that statistics for these tables are not generated when the tables are empty. Generate statistics when these tables are populated. If necessary, simply do this once, and then lock the stats.
  • If your rely on a lot of inserts and updates in your code, spend some time learning about MERGE command. It is a god send for finance-IT systems.

When designing your system, make it self maintaining. Your batch should include these steps

  • roll business date
  • add all required partition for the new business date (do not take the lazy way of pre-creating 100 partitions ahead of time)
  • after populating a partition, do your stats generation there and then

OK, rant over.

Saturday, January 24th, 2009 at 15:23

Market Data Store - Data Modelling

I am currently drafting a data model for market data storage, to be used in a scenario engine project within the risk management area. This new engine will be replacing an existing system, which is deemed to be quite out of date, and does not cater for all the risk factors required by the business.

I have started reviewing the existing system to learn about a thing or two about the data model used. One thing I quickly noticed is that it has separate group of tables for each risk factor. Prior to this, I had a plan to create a centralized generic table to store all type of risk factors.

So this rings a bell. Why was it done that way in the existing system? Is there a good reason why each risk factor has its own set of tables? Is it because they are so different from each other, that no common attributes can be found across the various risk factors? Or it is simply a bad existing design?

I have decided to keep my original idea to have a generic table for all type of risk factors for now. As this will give the system flexibility to add new risk factors easily in the future. However, I need to start doing some homework to understand the storage requirement better.

So what are the facts that I know so far

FACTS

  • 100 risk factors
  • between 30-50 bucket points within a single curve
  • system will need to cater for up to 1000 scenarios
  • observation length is between 2-7 years
  • most risk factors will only need a single bucket (tenor) identifier , others will need two (tenor and strike) - question: can the two types live in a single table?
So the workflow will be as follows:
  • load market data from various sources
  • clean market data, fill in gaps, either by copying previous day’s market data, or in the case of missing time horizon (buckets), interpolate
  • cleaned market data is then disturbed/shocked in various ways to produce scenarios.

To be continued … honest

Saturday, January 24th, 2009 at 14:56

Aspiring to become more than just an Oracle DBA 2.0

It saddens me in a way to read the widely talked article by 2 oracle certified masters, titled “Performing an Oracle DBA 1.0 to DBA 2.0 Upgrade”, a nicely written article encouraging DBAs out there to equip themselves better in order to be able to maintain the quality needed to work as a good Oracle DBA in a modern environment.

 While it contains many good pointers about DBAs needing to expand their skillsets to include system and network administration, I think it lacks the ultimate need to encourage Oracle DBAs out there  to step out of their comfort zone and do one thing that most DBAs do not do, i.e. trying to understand the business background which drives the need to build  the application which uses the Oracle databases that we look after. 

I have seen so many examples where DBAs would spend a lot of effort performing various tracing against a system to solve a performance problem. And spend further time analyzing the output  to come up with an understanding of a system behavior.

 The same answer could have been obtained by simply politely asking the users or the application teams about the nature of the application or the usage. I have time and time again found that these information is more valuable and accurate compared to my trendy detective work using highly technical tracing.

The other thing is the contentment of many DBAs to simply be tasked to ensure the running of an Oracle database. While I am not disputing that this is a mighty important task, sometimes it puzzles me how an Oracle DBA can have so little interest in understanding why a particular that they look after is deemed so important to the business, why it needs to be up 24/7, why an mere extra 20 min in batch processing time causing that particular user in NY to call us in frustration.

 Humanizing DBA role will also help a DBA gain respect their users/developers better and vice versa.

 And from my personal view, it’s much more interesting to expand my view this way, rather than filling up my brain with more new oracle tips and tricks, fancy tracing commands, setting events, setting hidden parameters, basically all the things that Oracle DBAs have been forced to learn by Oracle, because they can’t make their product work as it should in the first place?

Friday, January 9th, 2009 at 22:52

The importance of raw, unfiltered collaboration

Challenges within an organisation, within reasons, should be made known to everyone within the organisation, without it being filtered to senior management’s perception of what the involved individuals need to know.

Instead of having just a group of select individuals owning a specific problem, companies should try to tap the skills and contributions from a much wider audience within the organisation. Break the exclusivity and the silos, and you should see increased productivity and quality.

Say, an issue arises, which needs immediate attention. Finance asset class needs to consolidate the old disperse legacy general ledger systems that they have in order to reduce cost of maintaining these disperse systems, as well as to have a consolidated, single truth view of their finances. Or credit crunch prompting the need for a new age of risk management, putting pressure on risk systems to be able to calculate risk in a near-time manner.

When issues arise within a department in a large organisation, it is common that each department would call on the same individuals time and time again to address the different problems. These individuals could be very experienced, highly respected in their fields, with broad knowledge. And they might even call on for external help, perhaps a trusted vendor which they have dealt with in the past, or a consultant.

When they need to engage expertise from other areas, it will be something like this.

 “We need to hire a new person with expertise implementing Oracle GL. Engage the DBA team, as we will need a new Oracle database. Engage the hardware team, as we need an application server running Java, a reporting server with WebSphere. Ask for support team to provide example of feeds from each of the old GL system that we are going to replace. Let’s engage the network team, as we will need a 10GB dedicated link between application and database server”

This is what I mean by filtered collaboration. It is following an old fashioned hierarchical line of authority. Information is delivered on “need to know” basis. Somehow, someone higher up the chain decides what piece of information is suitable for each time he/she is collaborating with.

During my various engagements as an Oracle consultant, I have often found it frustrating that information is being drip-fed, filtered, sometimes with genuine intent to save me from confusion; everything has been tailored as if I have no ability to understand anything else that is not related to a database world. In occasions this has led me to make decisions that I would not have made had I been given the bigger picture earlier. 

When companies want to tap to the brightest minds within their organisations, they need to provide them with the whole story, not tailored sub questions, which will only hold them back from contributing to their full potential.

 

Saturday, December 20th, 2008 at 22:46

A report which dynamically ignores all zero value columns

A risk reporting table contains exposure values by counterparty for the last 7 days.

 

The structure of the table is as follows:

 

COUNTERPARTY        VARCHAR2(50)

T                    NUMBER

T-1                  NUMBER

T-2                  NUMBER

T-3                  NUMBER

T-4                  NUMBER

T-5                  NUMBER

T-6                  NUMBER

 

Values for Saturdays and Sundays will always be zero.

T to T-6 will be a moving window, therefore we can not hardcode which days are Saturdays or Sundays.

 

The requirement is to load the data into another table, where the final product will be a CSV, but users do not want to include the Saturday/Sunday all zero columns.

 

How do we do this?

 

A quick way would be to write a code like this

 

 

 

 

 

declare
  v_t number;
  v_t1 number;
  v_t2 number;
  v_t3 number;
  v_t4 number;
  v_t5 number;
  v_t6 number;

  v_sql varchar2(2000);

begin
select sum(t), sum(t-1), sum(t-2), sum(t-3), sum(t-4), sum(t-5), sum(t-6)
into v_t, v_t1,v_t2,v_t3,v_t4,v_t5,v_t6
from table1 ;

v_sql := ‘insert into newtable as select ‘
if v_t <> 0 then append to v_sql
if v_t1 <> 0 then append to v_sql
if v_t2 <> 0 then append to v_sql
etc

v_sql := v_sql ||’ from table1 ;’

execute immediate v_sql;
commit;

end;

Thursday, December 18th, 2008 at 18:14