Tuesday, April 24, 2012

XSLT Hints (hard lessons)

XSLT really relies on XPath.
To learn XSLT you must learn XPath.
The most common problem with an XSLT document is an XPath issue.
The most common XPath issue is a namespace issue.
Try changing your node references to use the *: namespace wildcard (XPath 2.0) so a match doesn't silently fail on a namespaced element.
Another common problem is the XSLT engine itself.  Free engines are often limited to the bare standard--typically XSLT 1.0 with none of the extensions the commercial engines add.
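The namespace trap is easy to demonstrate outside XSLT.  Python's standard-library ElementTree (3.8+) accepts {*} as a namespace wildcard in find paths, which plays the same role as *: in XPath 2.0.  A minimal sketch with a toy namespaced document:

```python
import xml.etree.ElementTree as ET

# A document whose elements live in a namespace -- the usual reason
# a path that "should" match suddenly matches nothing.
doc = """<root xmlns="http://example.com/ns">
  <item>first</item>
</root>"""

root = ET.fromstring(doc)

# Fails: plain 'item' does not match {http://example.com/ns}item
print(root.find("item"))           # None

# Works: {*} matches the element in any namespace (Python 3.8+),
# much like the *: wildcard in XPath 2.0.
print(root.find("{*}item").text)   # first
```

The namespace URI here is made up for illustration; the point is that the "same" element name stops matching the moment a default namespace appears on the document.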

Monday, April 23, 2012

Duplicate Keys

Cleaning Data

Duplicate keys happen.  I see it most when you feed data into a database and the source data is dirty.

Source data is usually dirty, that's why they want it in a database.

The UNIQUE constraint in SQL prevents duplicate keys from ever getting into your pristine database--at least in theory.  Sometimes a plain old attribute just needs to be turned into a key for some "practical" reason.

A common data cleaning need is finding and removing duplicate keys.  Don't forget to turn on unique constraints for your newly clean keys when you are ready.  You never know when you might get hit by a drive-by data sludger.
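Once the duplicates are gone, the constraint is what keeps them gone.  A minimal SQLite sketch (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_key TEXT UNIQUE, payload TEXT)")
conn.execute("INSERT INTO my_table VALUES ('A1', 'clean row')")

# The drive-by duplicate is rejected at the door instead of
# silently dirtying the table.
try:
    conn.execute("INSERT INTO my_table VALUES ('A1', 'sludge')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```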

Finding Duplicate Keys

SELECT my_key, count(*) FROM my_table GROUP BY my_key HAVING count(*) > 1; 

Remove Duplicate Keys

DELETE FROM tableA
WHERE uniquekey NOT IN
    (SELECT MAX(duplicates.uniquekey)
     FROM tableA AS duplicates
     GROUP BY duplicates.column1, duplicates.column2, duplicates.column3);
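Both queries can be exercised end-to-end in SQLite.  In this sketch, uniquekey is a surrogate key and column1 through column3 are the columns that should have been unique; the data is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (
        uniquekey INTEGER PRIMARY KEY AUTOINCREMENT,
        column1 TEXT, column2 TEXT, column3 TEXT
    );
    INSERT INTO tableA (column1, column2, column3) VALUES
        ('Smith', 'John', '1990-01-01'),
        ('Smith', 'John', '1990-01-01'),   -- duplicate
        ('Doe',   'Jane', '1985-05-05');
""")

# Find: groups that occur more than once
dups = conn.execute("""
    SELECT column1, column2, column3, COUNT(*)
    FROM tableA
    GROUP BY column1, column2, column3
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)

# Remove: keep only the newest row (MAX surrogate key) per group
conn.execute("""
    DELETE FROM tableA
    WHERE uniquekey NOT IN (
        SELECT MAX(duplicates.uniquekey)
        FROM tableA AS duplicates
        GROUP BY duplicates.column1, duplicates.column2, duplicates.column3
    )
""")
print(conn.execute("SELECT COUNT(*) FROM tableA").fetchone()[0])  # 2
```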

For relatively small tables these queries work just fine.  On large tables they can take a long time and hog database resources (CPU, RAM, disk I/O).  In that case, create a stored procedure with a cursor and commit every 100 rows or so; batching like this also helps keep the redo logs from filling in Oracle.
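The commit-every-N-rows pattern can be sketched outside PL/SQL.  This is the shape of the loop, not an Oracle stored procedure: delete a small batch, commit, repeat until nothing is left.  The batch_size of 100 matches the rule of thumb above:

```python
import sqlite3

def delete_in_batches(conn, batch_size=100):
    """Delete duplicate rows a batch at a time, committing after each
    batch so a long-running cleanup never holds one huge transaction
    (the same idea that keeps Oracle redo logs from filling)."""
    total = 0
    while True:
        cur = conn.execute("""
            DELETE FROM tableA
            WHERE uniquekey IN (
                SELECT uniquekey FROM tableA
                WHERE uniquekey NOT IN (
                    SELECT MAX(d.uniquekey)
                    FROM tableA AS d
                    GROUP BY d.column1, d.column2, d.column3
                )
                LIMIT ?
            )
        """, (batch_size,))
        conn.commit()   # release locks, flush the log, then go again
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

The inner NOT IN subquery is the same keep-the-MAX-key rule as above; only the outer LIMIT and the commit loop are new.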




Sunday, April 22, 2012

Integrating Systems

The Debt of Dirty Data

--Hey, why can't our Insurance and Billing system just talk to our Admissions system?  We just get too many errors in patient names and numbers, and fixing all of this is becoming a nightmare.

--Oh, alright let's integrate them then.

Duplicate data, multiple databases, and the need to communicate new and updated records have some organizations in a frenzy.  The mantra "just integrate them" makes it sound like there's a technology that will make this happen like magic.

It's not just the technology, it's the data.

Even if you follow database normalization rules (arcane, mathematically based rules for designing a database, for the non-database types), that doesn't mean you have clean, tight data.  In all of the databases I've seen over the years, I have yet to see a perfectly designed one anyway.  Add to this mess that a customer record looks like different things to different people and systems, and making two systems "talk" to each other requires yet a whole 'nother system.

Let's start with clean data.  As a simple example walkthrough consider a customer address.

--Address is easy, the Post Office has been handling addresses for hundreds of years now.

Is it?  What should be the maximum length of the first address line?  If you say infinite, then how does that fit when you print an envelope?  Do you want to keep track of physical versus box addresses?  Why not have a field for PO_BOX_NO?  That way no one has to type "P.O. Box" in the first address line every time, which reduces errors.

Thanks to internet shopping, people now have no trouble justifying separate Billing and Mailing addresses.  But wouldn't it also make sense to keep previous addresses?  If your system doesn't track history, it's worth considering Previous Billing Address and Previous Mailing Address fields.  Customers move, and some day you may have to show them when you updated their address.

--Okay, but this is trivial.

Right.  What if someone gives us the wrong zip code?  Shouldn't we verify the address against the zip code to make sure the data is clean?

--Yes, all systems do that.

No.  Unfortunately, addresses change all the time, which requires updates from the Post Office MIPS database--so you need a subscription and a way to refresh your verification database.
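The verification step itself is just a lookup against a reference table; the hard part is keeping that table current.  A toy sketch with a hypothetical, hard-coded two-entry reference (in production this would be loaded from the subscription feed and refreshed on a schedule):

```python
# Hypothetical slice of a zip/city reference table.  The real
# thing comes from a paid feed and must be refreshed regularly,
# because addresses change all the time.
ZIP_REFERENCE = {
    "10001": "NEW YORK",
    "60601": "CHICAGO",
}

def zip_matches_city(zip_code, city):
    """True only when the zip exists in the reference and agrees
    with the city (case- and whitespace-insensitive)."""
    return ZIP_REFERENCE.get(zip_code) == city.strip().upper()

print(zip_matches_city("10001", "New York"))   # True
print(zip_matches_city("10001", "Chicago"))    # False
print(zip_matches_city("99999", "Nowhere"))    # False -- unknown zip
```

Skip this check, and every typo'd zip code goes straight into the table.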

--Let's skip that for now, we have a deadline.

And once you've skipped this validation, any non-trivial customer database will end up with unclean data eventually.

Lesson 1:  All data should be validated to be clean.
Lesson 2:  Validation reveals hidden costs in implementing a database.

Note: We are not even talking about a database management system (DBMS), but just a database.

And I will add the following.

Lesson 3: Not ensuring clean data now pushes the costs into the future, like charges on a credit card that will have to be paid back with interest.

So system integration requires you to pay off the dirty-data debt, with interest, on top of all of the other visible costs--just one reason system integration is not just technology.