Monday, December 10, 2012

MySQL 5.6

Replication

Note 1760
Storing MySQL user name or password information in the master.info repository is not secure and is therefore not recommended. Please see the MySQL Manual for more about this issue and possible alternatives.

With the new 5.6, while doing a change master to, I received the message
Query OK, 0 rows affected, 2 warnings (0.04 sec)

A quick show warnings revealed the note above, which is an ER_INSECURE_CHANGE_MASTER.
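For context, here is a minimal sketch of the kind of statement that trips it (the host, user, password, and log coordinates are placeholders):

CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;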

It took some digging, but yes, it's possible to store replication information in a table (mysql.slave_master_info):

set global master_info_repository = 'TABLE';

This can be set in the option file as well, so it persists between restarts.  As soon as you do this, the master.info file in the datadir is removed and mysql.slave_master_info is populated; you don't have to go clean up anything.  The same can be done for relay_log_info_repository.
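The option-file form, under the server's [mysqld] section, looks like this:

[mysqld]
master_info_repository    = TABLE
relay_log_info_repository = TABLE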

What's the best way? 

On Linux, running strings against slave_master_info.ibd will show the user, password, and server.  Since master.info is stored in the datadir and the slave_master_info.ibd table is stored unencrypted in datadir/mysql, I don't see how it's more secure.
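You don't even need filesystem access; anyone with SELECT on the mysql schema can read the credentials right out of the table (the column names below are what my 5.6 install shows, so check yours):

SELECT Host, User_name, User_password FROM mysql.slave_master_info;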

When you back up the server using mysqldump, you can carry the replication information along with the backup when using the TABLE mode.  So that might be more convenient for some folks.

If you have two instances running as slaves off the same datadir, it also may be more convenient, since you don't have to give each one a custom filename using the master-info-file option.
 
Nice feature, but not any more secure from what I can see.  Anyone?

ADDENDUM:
When you do a show slave status\G, you will see a row like the following if the master information is stored in a table.  If it is stored in a file, it will show the full path and file name instead.

 Master_Info_File: mysql.slave_master_info


Tuesday, April 24, 2012

XSLT Hints (hard lessons)

XSLT really relies on XPath.
To learn XSLT you must learn XPath.
The most common problem with an XSLT document is an XPath issue.
The most common XPath issue is a namespace issue.
Try changing your node references to include *: for the namespace.
Another common problem is the XSLT engine itself.  Free engines are often crippled relative to the standard.
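For example (a minimal sketch; the element names are made up, and the *: wildcard needs an XPath 2.0-capable engine such as Saxon):

<!-- silently selects nothing when the source declares a default namespace -->
<xsl:value-of select="/order/customer/name"/>

<!-- matches the same elements regardless of namespace -->
<xsl:value-of select="/*:order/*:customer/*:name"/>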

Monday, April 23, 2012

Duplicate Keys

Cleaning Data

Duplicate keys happen.  I see it most when you feed data into a database and the source data is dirty.

Source data is usually dirty; that's why they want it in a database.

The UNIQUE constraint clause in SQL prevents duplicate keys from ever getting into your pristine database--at least in theory.  Sometimes a plain old attribute just needs to be turned into a key for some "practical" reason.

A common data cleaning need is finding and removing duplicate keys.  Don't forget to turn on unique constraints for your newly clean keys when you are ready.  You never know when you might get hit by a drive-by data sludger.
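When that time comes, it's a one-liner (same placeholder names as the queries below; the constraint name is up to you):

ALTER TABLE my_table ADD CONSTRAINT uq_my_key UNIQUE (my_key);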

Finding Duplicate Keys

SELECT my_key, count(*) FROM my_table GROUP BY my_key HAVING count(*) > 1; 

Remove Duplicate Keys

DELETE FROM tableA
WHERE uniquekey NOT IN
    (SELECT MAX(duplicates.uniquekey)
     FROM tableA AS duplicates
     GROUP BY duplicates.column1, duplicates.column2, duplicates.column3);

For relatively small tables these queries work just fine.  When they take a long time, they can hog resources in your database (CPU, RAM, disk I/O).  In that case, you should create stored procedures and cursors and force a commit every 100 rows or so.  This can help prevent filling the redo logs in Oracle as well.
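Here is a minimal MySQL-flavored sketch of that batching idea (same placeholder table and column names as above; the extra derived table is there because MySQL won't let a DELETE select from its own target table, and in Oracle you would do the equivalent in PL/SQL with a cursor):

DELIMITER //
CREATE PROCEDURE purge_duplicate_keys()
BEGIN
    DECLARE batch INT DEFAULT 1;
    WHILE batch > 0 DO
        -- delete at most 100 duplicate rows per pass, keeping the highest uniquekey per group
        DELETE FROM tableA
        WHERE uniquekey NOT IN
            (SELECT keeper
             FROM (SELECT MAX(uniquekey) AS keeper
                   FROM tableA
                   GROUP BY column1, column2, column3) AS keepers)
        LIMIT 100;
        SET batch = ROW_COUNT();
        COMMIT;  -- flush the work so far instead of holding one giant transaction
    END WHILE;
END//
DELIMITER ;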




Sunday, April 22, 2012

Integrating Systems

The Debt of Dirty Data

--Hey, why can't our Insurance and Billing system just talk to our Admissions system?  We just get too many errors in patient names and numbers, and fixing all of this is becoming a nightmare.

--Oh, alright let's integrate them then.

Duplicate data, multiple databases, and the need to communicate new and updated records have some organizations in a frenzy.  The mantra "just integrate them" makes it seem like there's a technology that will make this happen like magic.

It's not just the technology, it's the data.

Even if you follow database normalization rules (arcane, mathematically based rules for designing a database, for the non-database types), that doesn't mean you have clean, tight data.  And in all of the databases I've seen over the years, I have yet to see a perfectly designed one anyway.  Add to this mess that a customer record looks like different things to different people and systems, and making two systems "talk" to each other requires yet a whole 'nother system.

Let's start with clean data.  As a simple example walkthrough consider a customer address.

--Address is easy, the Post Office has been handling addresses for hundreds of years now.

Is it?  What should be the maximum length of the first address line?  If you say infinite, then how does that show up when you print an envelope?   Do you want to keep track of physical versus box addresses?  Why not have a field for PO_BOX_NO?  That way we don't have to type P.O. Box in the first address line every time, which reduces errors.

Through exposure to internet shopping, people now have no trouble justifying tracking a Billing Address and a Mailing Address.   But wouldn't it make sense to keep track of previous addresses too?  If your system doesn't keep history, it's worth considering tracking a Previous Billing Address and a Previous Mailing Address.  Customers move, and some day you may have to prove to them, or at least show, when you updated their address.
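To make that concrete, here is a hypothetical sketch of what just these few decisions already imply for one table (names and lengths are made up for illustration):

CREATE TABLE customer_address (
    customer_id     INT          NOT NULL,
    address_type    VARCHAR(10)  NOT NULL,  -- e.g. 'BILLING' or 'MAILING'
    address_line_1  VARCHAR(60),            -- finite, so it still fits on a printed envelope
    po_box_no       VARCHAR(10),            -- so nobody types "P.O. Box" by hand
    city            VARCHAR(40),
    state           CHAR(2),
    zip_code        VARCHAR(10),
    effective_date  DATE         NOT NULL,  -- keeps previous addresses as history
    PRIMARY KEY (customer_id, address_type, effective_date)
);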

--Okay, but this is trivial.

Right.  What if someone gives us the wrong zip code?  Shouldn't we be able to verify the address against the zip code to make sure the data is clean?

--Yes, all systems do that.

No.  Unfortunately, addresses change all of the time, which requires updates from the Post Office MIPS database, so you need a subscription and a way to update your verification database.

--Let's skip that for now, we have a deadline.

And once you've skipped this validation, any non-trivial customer database will eventually have unclean data.

Lesson 1:  All data should be validated to be clean.
Lesson 2:  Validation reveals hidden costs in implementing a database.

Note: We are not even talking about a database management system (DBMS), but just a database.

And I will add the following.

Lesson 3: Not ensuring clean data now pushes the costs into the future, like using a credit card that will have to be paid back with interest.

So, system integration requires you to pay off the dirty-data debt with interest on top of all of the other visible costs--just one reason system integration is not just technology.