Managing Holiday Dates

Recently, I came across a holiday table which contains a staggering 320 million rows. With indexes, the space used was a cringe-worthy 80GB.

So how do we improve on this?

The requirement of the table is easy, i.e. to list out holidays by countries including all weekends, for the next 50 years.

So to do this, we need 2 tables:
1) HD_COUNTRY_WEEKENDS
Create a table listing all required countries and their respective weekend days.
For example for GB, it s Saturday an Sunday, for UAE, it is Friday and Saturday

COUNTRY_CODE WEEKEND
GB SATURDAY
GB SUNDAY

So for 150 countries, we will store only 300 rows.

2) HD_COUNTRY_HOLIDAYS
Create a table listing countries and its holidays
This will amount to less than 10 rows per country per year.
So for a period of 50 years, for around 150 countries, we will have less than 150k rows in total (~5MB only)

For example for GB for 2011

COUNTRY_CODE HOLIDAY_DATE DAY_OF_WEEK
GB 22/04/2011 FRIDAY
GB 29/04/2011 FRIDAY
GB 03/01/2011 MONDAY
GB 26/12/2011 MONDAY
GB 25/04/2011 MONDAY
GB 30/05/2011 MONDAY
GB 29/08/2011 MONDAY
GB 02/05/2011 MONDAY

To get a complete listing of holidays for GB for the period of 2011 and 2061, simply run the following query (this ran in 94msec for me)

select country_code, holiday_date, day_of_week, ‘H’ as holiday_flag
from hd_country_holidays
where country_code=’GB’ and holiday_date between ‘01-JAN-2011′ and ‘31-DEC-2061′
union
select ‘GB’, weekend_date, to_char(weekend_date,’DAY’),’W’ as holiday_flag from (
select to_date(’01-JAN-2011′, ‘DD-MON-YYYY’) + level -1 weekend_date
from dual connect by level <= 22000)
where to_char(weekend_date,’DAY’) in (select weekend from hd_country_weekends where country_code=’GB’ ) and weekend_date between ‘01-JAN-2011′ and ‘31-DEC-2061′

This runs instantly, without the baggage of having to store so much data.

Sunday, July 3rd, 2011 at 21:26

Example of sqlldr control file

This example is for my own use, as I keep forgetting how to do the followings
- call a function using EXPRESSION
- combine fields
- ignore fields
- do date functions
- vary bindsize/readsize using OPTIONS
- constant varchar
- constant date
- use BOUNDFILLER

So here is one example

options (rows=5000,bindsize=10000000,readsize=10000000)
LOAD DATA
INFILE ‘timepoint.csv’
BADFILE ‘hilda.bad’
APPEND INTO TABLE HD_STG
FIELDS TERMINATED BY “|” ENCLOSED BY ‘”‘
(
ACTIVITY,
SIMULATED_BY,
COUNTERPARTY_ACRONYM,
PRODUCT_GROUP_NAME,
RISK_DATE “to_date(:risk_date,’YYYYMMDD’)”,
CURRENCY,
MTM,
FFR,
MISMATCH_RISK,
EXPOSURE,
filler_simulation_date BOUNDFILLER,
filler_simulation_time BOUNDFILLER,
filler_profile_sequence BOUNDFILLER,
ENTITY,
BATCH_REF,
filler_extract_date BOUNDFILLER,
filler_extract_time BOUNDFILLER,
PROFILE_ID,
BUSINESS_DATE EXPRESSION “TO_DATE(’20101122′,’YYYYMMDD’)”,
SOURCE_SYSTEM CONSTANT ‘ABC’,
SIMULATION_TIMESTAMP EXPRESSION “STA_UTIL_PKG.TO_TIMESTAMP_IN_GMT(:filler_si
mulation_date || :filler_simulation_time, ‘YYYYMMDDHH24:MI:SS’)”,
LAST_UPDATE_TIMESTAMP EXPRESSION “STA_UTIL_PKG.GET_SYSTIMESTAMP_IN_GMT”,
STATUS CONSTANT ‘LOADED’,
SITE_ID CONSTANT ‘LONDON’
)

Wednesday, January 5th, 2011 at 16:01

Handling large XML trade data

In financial databases, we often face difficulties with handling large amount of trade data, which include XML. In a lot of cases I have seen XML stored as either XMLTYPE or CLOB, whilst 99.9% of the XML parsing/processing is actually done outside the database, in Java.

Not only this use a large amount of disk space – in some particular system I have seen 2 million trades equates to around 300GB worth of XML data — this also means that everytime we transfer the data to the java side for processing, we have to transfer large amount of data across the network. Wasteful.

The alternative is to either move the processing to the database (which is not covered in the scope of this article), but to try a different approach, i.e. storing the XML data as compressed.

Compressed, XML data could shrink by as much as 90% or better, which means not only the disk usage is going to shrink considerably, but also the network traffic. And if we have multiple layer of applications all using this XML, the benefit will be multiplied considerably.

To do this, here is some examples of what is needed:

1) a java procedure to zip data

CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED “ZipXml” as package com.oracle;

import java.io.*;
import java.util.zip.*;
import java.sql.*;
import oracle.sql.*;

public class ZipXml
{
public static void zip(oracle.sql.CLOB src, oracle.sql.BLOB dst[]) throws Exception {
InputStream in = null;
OutputStream out = null;
try {
in = src.getAsciiStream();
out = new GZIPOutputStream(dst[0].getBinaryOutputStream());
byte[] buffer = new byte[1024];
for (int read = 0; (read = in.read(buffer)) > 0; ) {
out.write(buffer, 0, read);
}
out.flush();
} finally {
if (in != null) {
try { in.close(); } catch(IOException e) { e.printStackTrace(); }
}
if (out != null) {
try { out.close(); } catch(IOException e) { e.printStackTrace(); }
}
}
}
};
/

2) a java procedure to unzip data

CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED “UnzipXml” as package com.oracle;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class UnzipXml
{
public static void unzipClob(oracle.sql.BLOB src, oracle.sql.CLOB dst[]) {
ByteArrayOutputStream bout = null;
InputStream in = null;
try {
dst[0].putString(1, “Pending Unzip”);
bout = new ByteArrayOutputStream ();
in = new GZIPInputStream(src.getBinaryStream());
byte[] buffer = new byte[1024];
for (int read = 0; (read = in.read(buffer)) > 0; ) {
bout.write(buffer, 0, read);
}
bout.flush();
String s = new String(bout.toByteArray());
dst[0].putString(1, s);
} catch (Exception e) {
try {  dst[0].putString(1, e.getMessage()); }
catch (Exception es) {}
e.printStackTrace();
} finally {
if (in != null) {
try { in.close(); } catch(IOException e) { e.printStackTrace(); }
}
}
}
public static void unzipBlob(oracle.sql.BLOB src, oracle.sql.BLOB dst[]) {
OutputStream out = null;
InputStream in = null;
try {
out = dst[0].getBinaryOutputStream();
in = new GZIPInputStream(src.getBinaryStream());
byte[] buffer = new byte[1024];
for (int read = 0; (read = in.read(buffer)) > 0; ) {
out.write(buffer, 0, read);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (in != null) {
try { in.close(); } catch(IOException e) { e.printStackTrace(); }
}
if (out != null) {
try { out.close(); } catch(IOException e) { e.printStackTrace(); }
}
}
}
};
/

3) a package to 2 simple functions, zip and unzip

CREATE OR REPLACE PACKAGE xml_zip_pkg AS
PROCEDURE ZIP(src IN CLOB, dst IN OUT BLOB);
PROCEDURE UNZIP(src IN BLOB, dst IN OUT CLOB);
FUNCTION ZIP(src IN CLOB) RETURN BLOB;
FUNCTION UNZIP (src IN BLOB) RETURN CLOB;
END;
/

CREATE OR REPLACE PACKAGE BODY xml_zip_pkg AS
PROCEDURE ZIP(src IN CLOB, dst IN OUT BLOB)
AS LANGUAGE JAVA
NAME ‘com.oracle.ZipXml.zip(oracle.sql.CLOB, oracle.sql.BLOB[])’;

PROCEDURE UNZIP(src IN BLOB, dst IN OUT CLOB)
AS LANGUAGE JAVA
NAME ‘com.oracle.UnzipXml.unzipClob(oracle.sql.BLOB, oracle.sql.CLOB[])’;

FUNCTION ZIP(src IN clob) RETURN blob
IS
lvResult blob;
BEGIN
IF src is not null THEN
DBMS_LOB.createtemporary(lvResult, true, DBMS_LOB.CALL);
xml_zip_pkg.ZIP(src, lvResult);
END IF;

RETURN lvResult;
END ZIP;

FUNCTION UNZIP(src IN blob) RETURN clob
IS
lvResult clob;
BEGIN
IF src is not null THEN
DBMS_LOB.createtemporary(lvResult, true, DBMS_LOB.CALL);
xml_zip_pkg.UNZIP(src, lvResult);
END IF;

RETURN lvResult;
END UNZIP;
END;
/

An example usage will be as followed:

create

table hd_trade as select * from trade where rownum < 10 ;

alter

table hd_trade add zip_trade_xml blob;

update

hd_trade set zip_trade_xml = xml_zip_pkg.zip(xmltype.getclobval(trade_xml));commit;

To perform manipulation on the fly

select

substr(extract(xmltype(xml_zip_pkg.unzip(zip_trade_xml)),‘//notional’),1) from hd_trade where rownum=1

select

dbms_lob.getlength(xmltype.getclobval(trade_xml)) original_length, dbms_lob.getlength(zip_trade_xml) zipped_length from hd_trade

Thursday, December 9th, 2010 at 13:38

Testing your housekeeping code

In many many projects I have been involved in, the importance of housekeeping is often overlooked, in the midst of the rush of delivering core functionalities for the business. Housekeeping is often resurrected only when we either run out of space, or performance started to degrade due to the unplanned large volume of data.

The one thing I would like to talk about today, is that housekeeping code is normally not well tested, due to the excuse that developers have run out of data to test. Sounds familiar? A developer developed a housekeeping code, then tested it against a particular environment. He ran it once, and it did 80% of the tasks correctly, so he needed to do a bit more tweaks. When he tried to rerun the code after, he realised, that his first run already deleted most of the data he could housekeep from his test environment. What now? Asking the DBA to refresh a full historical dataset would mean waiting for a day or two, and to handcraft dummy data manually would take too long. At the end, the half baked code is fast-tracked to production environment.

My solution to this problem is actually quite simple.

  • When creating a housekeeping package/procedure, always include a “run_mode” as an IN attribute, accepting the values of TEST run or LIVE run
  • during a TEST run, rather than performing actual truncates/deletes against target tables, simply write all the commands that will be called into a log table, so these can be reviewed again and again until the code is satisfactory
  • only during a LIVE run would the actual data be housekept for real.
Sunday, September 5th, 2010 at 13:19

Unusual amount of archive logs

A limit management system experienced unplanned downtime, due to archive log area filling up. It  turned out that within a space of 10 hours, the system was generating 100GB of archive files, and the tape backup simply could not backup the data fast enough.

To check whether this is a normal activity, we can the following query to check pattern for the last 30 days.

select to_char(trunc(first_time),’DD-MON-YYYY’) business_date, count(*) number_of_archive_logs from  gv$log_history where first_time >= sysdate-30 group by trunc(first_time) order by trunc(first_time) desc

We can even break down the rate of the generation by hour,

select count(*), cob, hour from (select to_char(trunc(first_time),’DD-MON-YYYY’) cob , to_char(first_time,’HH24′) hour from gv$log_history where first_time >= sysdate-4) group by cob, hour

And based on the result, pinpoint the problematic time period, and cross check with the application log, to see which component run is causing this unusual high amount of archives.

Thursday, August 12th, 2010 at 12:37

Business Day as Partition Key

I was reviewing a model deal coverage FACT table today to investigate a performance issue while running an FSA report. Table is partitioned by business_date and data is also queried by business_date. Daily volume is ~500k rows. So why is it running so slowly?

It turned out that BUSINESS_DATE column is defined as TIMESTAMP , rather than simply a DATE column, which means that all queries against this table will use a function such as

WHERE trunc(BUSINESS_DATE)=’30-JUL-2010′

This does not encourage straightforward partition pruning to happen, as a result for every report against  this table, Oracle was doing a full table scan, which is not a good thing considering that the table size is >8GB.

Lesson  to be learned: choose your attribute wisely.

Thursday, August 12th, 2010 at 12:06

Index monitoring

After doing the usual alter index monitoring usage, you might wonder why you can not see anything in v$object_usage. This is because this view will only show usage by current user.

To be able to see a comprehensive list, use this query instead

 select * from
(
select u.name owner
,      io.name index_name
,      t.name table_name
,      decode(bitand(i.flags, 65536), 0, ‘NO’, ‘YES’) monitoring
,      decode(bitand(ou.flags, 1), 0, ‘NO’, ‘YES’) used
,      ou.start_monitoring
,      ou.end_monitoring
from
sys.user$ u
,   sys.obj$ io
,   sys.obj$ t
,   sys.ind$ i
,   sys.object_usage ou
where
i.obj# = ou.obj#
and io.obj# = ou.obj#
and t.obj# = i.bo#
and u.user# = io.owner#
)
order by table_name;

Mind though, that this will simply tell you whether Oracle uses the index or not. If you are trying to rationalize the index usage within your database, this is not the answer of everything.

Given too many choices of indexes to use, Oracle could choose the wrong one. A used index is not a certain indication that it is a good index.

Wednesday, July 21st, 2010 at 07:48

A bit of creativity helps performance

We need to transfer trade information from one database to another using a database link on a regular basis. There are around 1 million rows to transfer. Transfer time is around 8 hours. Wow! Sounds really bad. Possibility of network issue has been ruled out. OK, what could it be?

Checking the size of the table reveals the the table contains raw XML data that is stored as CLOB. This data alone amounts to almost 50GB. No wonder it is taking ages.

At the moment, the workflow is done as follows:

  • raw trade data (including the XML)  is loaded to Database A, which is a central trade repository
  • there will be multiple subscribers reading this data from Database A, some in the form of direct copy via database link, some will extract data  via files, and reload to other databases.

Regardless of the method, the same fact exists. 50GB worth of data has to be shipped around. Not a small task. Further questioning reveals that the XML data is never manipulated inside the database. This fact has a significant importance.

So what if we store the XML in the database as compressed? Easy enough to do using a little bit of java and some PL/SQL code. See this link from Ask Tom.

After applying this compression method,  the total size of zipped XML is a miniscule 3GB (from the original 50GB). The benefit is instant.

The copy via database link is now taking a mere 4 minutes.

Storage requirement is reduced in the source database, as well as in the subscribers’ databases.

Job done… for now..

Tuesday, February 16th, 2010 at 21:09

Pilot projects - The importance of being earnest

There has been an interesting discussion recently in regards to what technology is needed to provide market risk managers with sensitivity data. The data will have to be stored at deal level, which amounts to 23 million rows/day. System will need to store 10 days worth of sensitivity data. Users will want to be able to aggregate chosen data to specific level in a reporting hierarchy within seconds.
It is interesting for me to watch the appraoch that has been taken to find the definite chosen solution for this. The higher entity in the project has provided a vision, that is, to use a particular vendor product, which provides out-of-the-box solution to perform aggregation on the fly. The approach is “straightforward”. Load all data to memory, and away we go.The complexity is as follows:

  • the software has never been used for such a large scale implementation
  • 10 days worth of data amounts to 230 million rows of data, and this translates to a total of 750GB of data/region. There are 3 main regions to deal with, and this adds up to the need to persist 2.2TB of data in memory, which in turn will result in millions of dollars of hardware cost/year, plus the software cost.

All throughout the pilot, various issues have been identified

  • During the pilot, only a subset of the data has been tested, due to constraint in sourcing the required hardware to support a full test
  • The business is expected to simplify the reporting hierarchies to cope with hardware constraint
  • There is no flexibility on providing FX rates conversion once data is loaded to memory
  • There must be enough buffer in hardware resource to accommodate for business growth, i.e. there must be enough memory to load all data, otherwise system will fail
  • and the list of unknown continues

Now it is decision time. Someone will need to look at the result of the pilot and make a go/no go decision with the choice of technology.

I sit on the outside of this project, and it interests me to see how decisions are made, and on what basis. Is the pilot genuinely used to pick the best solution which is the best fit for this particular business requirement, or is it used to justify that the initial hunch to pick a particular software vendor as a solution -  is correct. Somehow I suspect that the latter is true in most cases. It is hard to back off a particular route once that route is taken, and the goal post would suddenly switch from “finding the best solution to a problem”, to “how to make a particular solution work for a particular problem”. Two different things.

How does this relate to my oracle blog?

Well, as an Oracle consultant, I believe that an oracle solution without the bells and whistles would work just fine. With the correct data model design, we can end up with a FACT table which is lean enough to allow it to be used for aggregation on the fly. For current day’s data that is used more often, data would end up being cached in memory, and aggregation on the fly should “fly”. And for other day’s data that is used occassionally, does it matter so much if it takes a few seconds longer to aggregate? This surely would save the client a few million dollars annually. And some of the saved dollars can be spent on on a good business intelligence solution to improve user experience.

But again, what do I know..

Friday, October 2nd, 2009 at 11:22

Phased release of a global system - gotchas!

A credit risk system, which provides limit exposure management of counterparties, is having various performance issues. This system was designed as a global system which will be implemented worldwide, however at the moment is only used to deal with London trading activities.

As expected, tables and indexes’ designs are driven by location, say LOCATION_ID.

Unfortunately because at the moment there is only 1 location using this, this is causing a lot of issues with oracle optimizer.

Table/index statistics were generated with the following option:METHOD_OPT => ‘FOR ALL INDEXED COLUMNS SIZE AUTO’. Because of this, Oracle favours the use of full table scans, rather than the nicely designated indexes. The result is disastrous.

As an outsider, with no intimate knowledge of the running of the system, there are 2 tactical solutions which I can offer here:

1) simplify the table/index statistics generation method. Do not generate histogram, in this case by default Oracle will create 2 buckets for most columns, and this will result in the designated indexes to be used in queries.

2) recrate indexes with LOCATION_ID as leading columns. Pick other columns that are much more suitable as leading columns, and if needed LOCATION_ID can be kept as third/fourth columns.

That’s it.

Friday, June 19th, 2009 at 12:28