Blog moved to

Please go for updates.


Build [Dell DVD Store] sample database for your postgresql

Struggled almost 30 minutes to get through all the steps. It seems the read me was disconnected with current build.

Here is the steps I tried out:

  1. Download ds21.tar.gz and ds21_postgresql.tar.gz from
  2. Extract them out, in one directory or different directories. If in different directories, just move the [data_files] directory from what you extracted from ds21.tar.gz to the directory extracted from ds21_postgresql.tar.gz.
  3. Navigate to the data_files directory we just highlighted in previous step.
  4. Go to each sub-directory under data_files and use gcc to recompile data generator executable. No need to install ActiveState or Cgwin. GCC is enough. But you need add -lm switch when you compile the c code, e.g. gcc ds2_create_cust.c -lm -o ds2_create_cust.
  5. Under cust directory, use chmod to enable execution on the sh file which matched your size requirement.
  6. Under orders directory, run the sh file to generate orders first. Then run ds2_create_inv to generate invoices, e.g ./ds2_create_inv 100000 0 > ../prod/inv.csv. The first parameter is size of product table, 100K is for medium size. The second parameter is os type. 0 means Linux, 1 means Windows.
  7. Under prod directory, run ds_create_prod command. There are two parameters ./ds2_create_prod 100000 0 > prod.csv.
  8. Navigate to top directory and run ./ (You may need to modify the data file path in sql scripts under [load] directory. In my download, the data file reference path is three levels up so I removed one level. And there is one redundant back slash in the data file path of products load script.)

Here is complete commands for medium size build (I extracted the file to [Downloads] directory):

cd /home/ryan/Downloads/ds2/pgsqlds2/data_files/cust
gcc ds2_create_cust.c -lm -o ds2_create_cust
chmod +x
cd ../orders
gcc ds2_create_orders.c -lm -o ds2_create_orders
gcc ds2_create_inv.c -lm -o ds2_create_inv
chmod +x
./ds2_create_inv 100000 0 > ../prod/inv.csv
cd ../prod/
gcc ds2_create_prod.c -lm -o ds2_create_prod
./ds2_create_prod 100000 0 > prod.csv
cd ../../

Hope it can save your time.

Install pgpool with vPostgres

vPostgres is another distribution of PostgreSQL from VMware. There is a little bit different to install pgpool with it. You are supposed to see error messages like “libpq is not installed or libpq is old” in the configuration phase or “error libpq-fe.h no such file or directory” in the make phase.

  1. To resolve libpq problem. You need copy vpostgres-ldconfig-x86_64.conf to /etc/, and run ldconfig.
    cp /opt/vmware/vpostgres/current/share/libpq-doc/vpostgres-ldconfig-x86_64.conf /etc/
  2. Then you can run the configure script in pgpool extracted source. As vPostgres was not installed to /usr/local by default. So you need put a switch behind the script.
    ./configure –with-pgsql=/opt/vmware/vpostgres/current/
  3. Before you run the make command, confirm you have server and client development package installed. If not, install it.
    rpm -ivh VMware-vFabric-Postgres-client-devel- VMware-vFabric-Postgres-server-devel-
  4. Now you can run make.
  5. Now you can install.
    make install
  6. Before you run the pgpool, create a folder for its pid file
    mkdir /var/run/pgpool
  7. Then you need make a copy of pgpool.conf.
    cp /usr/local/etc/pgpool.conf.sample /usr/local/etc/pgpool.conf
  8. Use your preferred editor to change the configuration file.

SQL Server Automated Script: re-attach databases

Have you ever tried change the default path of your data files and log files?

One scenario you can step into this case is manually failover. You mounted the data drive and log drive, but unfortunately the log drive unable to be mounted as original drive letter. So after you start up the instance. You found all your original databases were under the recovery pending status, because SQL Server can’t find the log file for your user databases. So you need to do something to detach and attach these databases, with a log folder path.

Here is a simple script which will save your wrists and mouse from massive clicks:
:setvar newDataPath “D:\SQL Server\Data”
:setvar newLogPath “E:\SQL Server\Log”

set nocount on

declare @databaseID int
declare @databaseName sysname
declare @dbStmt nvarchar(1024)

declare curOFLDB cursor for
select d.database_id, from sys.databases d
inner join sys.database_recovery_status r
on d.database_id = r.database_id
where r.database_guid is null

open curOFLDB
fetch next from curOFLDB into @databaseID, @databaseName

while @@FETCH_STATUS = 0
set @dbStmt = ‘ALTER DATABASE ‘ + @databaseName + ‘ SET OFFLINE;’
select @dbStmt

declare @logicalName sysname
declare @physicalName nvarchar(260)
declare @type tinyint
declare @moveStmt nvarchar(1024)

declare curFiles cursor for
select, f.physical_name, type from sys.databases d
inner join sys.master_files f
on d.database_id = f.database_id
where d.database_id = @databaseID

open curFiles
fetch next from curFiles into @logicalName, @physicalName, @type

while @@FETCH_STATUS = 0
if @type = 0
set @physicalName = RIGHT(@physicalName, CHARINDEX(‘\’, REVERSE(@physicalName)) – 1)
set @moveStmt = ‘ALTER DATABASE ‘ + @databaseName + ‘ MODIFY FILE’
set @moveStmt += ‘ (‘
set @moveStmt += ‘ name = ‘ + @logicalName + ‘, ‘
set @moveStmt += ‘ filename = ”$(newDataPath)’ + @physicalName + ””
set @moveStmt += ‘ )’

if @type = 1
set @physicalName = RIGHT(@physicalName, CHARINDEX(‘\’, REVERSE(@physicalName)) – 1)
set @moveStmt = ‘ALTER DATABASE ‘ + @databaseName + ‘ MODIFY FILE’
set @moveStmt += ‘ (‘
set @moveStmt += ‘ NAME = ‘ + @logicalName + ‘, ‘
set @moveStmt += ‘ FILENAME = ”$(newLogPath)’ + @physicalName + ””
set @moveStmt += ‘ )’

select @moveStmt

fetch next from curFiles into @logicalName, @physicalName, @type

close curFiles
deallocate curFiles

set @dbStmt = ‘ALTER DATABASE ‘ + @databaseName + ‘ SET ONLINE;’
select @dbStmt

fetch next from curOFLDB into @databaseID, @databaseName

close curOFLDB
deallocate curOFLDB

Change data folder and log folder path of vPostgres

Just tried out how to change the data folder of vPostgres. We all know that separate data folder and log folder is one of the deployment best practice, for better performance. It is not difficult to do this for native PostgreSQL, just run the initdb command, or change the PostgreSQL.conf. But it’s not the same story for vPostgres, which is a distribution from VMware, part of its vFabric ecosystem.

Short version of the new storage: you must change a variable called “PG_DATA_DIR” in pgvm.conf, instead of postgresql.conf. You can find the pgvm.conf at /opt/vmware/vpostgres/current/etc/.

So if you want install vPostgres in your own machine, with the package downloaded from VMware, and you want separate the data folder and log folder. You need use following steps:

  1. Mount your disks for data folder and log folder, e.g. /pgdata/9.2/data and /pgdata/9.2/log. Be sure you added the mounted path to /etc/fstab.
  2. Install the vPostgres package with rpm command.
  3. Change owner of the folders you created in 1st step.
  4. Run initdb with postgres user, with -D and -X switch to change the data folder and log folder path. You will find the initdb command from /opt/vmware/vpostgres/current/bin/.
  5. Modify the pgvm.conf we mentioned before. Change the variable PG_DATA_DIR.
    BE CAUTIOUS, there are two sections in the con figuration file. PG_DATA_DIR is in the first section. The second section is for virtual appliance (vApp you downloaded from VMware).
  6. Restart the server and by default the vpostgres service will start up automatically. Now you can delete the /var/vmware/vpostgres/current/pgdata.

Have to say I don’t like this change, especially the official document never mentioned it. Tried several times then found this pgvm.conf.

And if you are using a virtual appliance. The steps shall be very close. Just try the value in second section, of course with

Be Unique Or Be Null

We all knew that unique index can prevent duplicated values in the table. Does a column with unique constraint allow NULL values? The answer is positive. But only one. This is what this post for.

There are chances that we might need the value in a column to be unique, or be NULL. Take an example, you have a user table for your online social application. The user can be validated with their identities (PID) or be linked with another external ID like Live ID (now calls Microsoft Account) in the real world. Otherwise he can choose to stay as not validated. To avoid duplicated registration, you might need the PID to be unique. At the same time, user can also choose to keep their privacy, and leave the PID value to be empty.

For the scenario above, you will find some difficulties to define the unique constraint. There will be impossible to save two unverified user, because their PIDs are all NULL.

To resolve this problem, you can use filtered index which was first introduced in SQL Server 2008. You can define the unique index as

T-SQL Tips: Delete duplicated rows in a table without unique key

It is always a good practice to put a unique constraint in a table, no matter an auto-incremental primary key or a composite unique index. The reason is not only it can produce some performance benefits, but also make you easier to remove duplicated rows.

Let’s look at one sample here. You have a table with log information. The table has three columns, log_date, log_source, and log_text. The problem is developer forgot to put a unique index there, although he can put one on the columns log_date and log_source. And one day the log relay-agent middleware had some problems and pushed several log entries into the table several times when it retried to save the log into the database. (Maybe you can find some more realistic scenarios.)

Now as a dba, you need to find out those duplicate rows and delete them. You can try the undocumented pseudo column %%physloc%%, or %%lockres%% in SQL Server 2005.

For SQL Server 2005
WHERE %%lockres%% NOT IN
(SELECT MIN(%%lockres%%) FROM logs GROUP BY log_date, log_source)

For SQL Server 2008 or later version
WHERE %%physloc%% NOT IN
(SELECT MIN(%%physloc%%) FROM logs GROUP BY log_date, log_source)

A simple script – find heap tables

I found I often need to find out those heap tables when I do quality control of the database schema in some projects. Usually those heap tables will be requested for justification. So here is the simple script used to make this task easier.

select ‘[‘ + + ‘].[‘ + + ‘]’ from sys.indexes i
inner join sys.tables t on i.object_id = t.object_id
inner join sys.schemas s on t.schema_id = s.schema_id
where index_id = 0

And this script can also be changed to find out those heap tables with non-clustered indexes.

select ‘[‘ + + ‘].[‘ + + ‘]’ from sys.indexes i
inner join sys.tables t on i.object_id = t.object_id
inner join sys.schemas s on t.schema_id = s.schema_id
group by,
having sum(index_id) = 0

Use TABLOCK to boost your INSERT INTO … SELECT performance

Something that we do frequently in an analytics application is loading data from one table to another. Personally I recommend SSIS for such tasks. But there are chances SSIS is not suitable, then we usually use INSERT INTO … SELECT.

I don’t recommend SELECT … INTO because you can’t control which file group the destination table goes to. And it is very important when I have to use table partition.

So if we need use T-SQL to move data from one place to another one. It is recommended to use TABLOCK option on the destination table. With a few other constraints, the query can be executed with minimal logging.

The few constraints are:

  • The database must be in bulk logged recovery model.
  • The destination table must be empty or without clustered index.
  • There is no non-clustered index on the destination table.

Maybe it is difficult to match the second and third constraint. But we have an alternative which is partition switch. We can partition the destination table. Load the data from source into an empty table with same schema except the index objects (usually I keep the clustered index). After load, we can do a partition switch.

Here is two sample queries:

INSERT INTO dbo.FactDevicePerformance_History
([TimeKey], [DeviceKey], [reads per sec], [read hits per sec],
[seq reads per sec], [seq read hits per sec], [writes per sec],
[write hits per sec], [Kbytes read per sec], [Kbytes written per sec])
SELECT h.* FROM AIR.dbo.FactDevicePerformance_History h
INNER JOIN AIR.dbo.DimTime t ON h.TimeKey = t.TimeKey
WHERE t.[Date] BETWEEN ‘2011-11-1’ AND ‘2011-11-30’

INSERT INTO dbo.FactDevicePerformance_History WITH (TABLOCK)
([TimeKey], [DeviceKey], [reads per sec], [read hits per sec],
[seq reads per sec], [seq read hits per sec], [writes per sec],
[write hits per sec], [Kbytes read per sec], [Kbytes written per sec])
SELECT h.* FROM AIR.dbo.FactDevicePerformance_History h
INNER JOIN AIR.dbo.DimTime t ON h.TimeKey = t.TimeKey
WHERE t.[Date] BETWEEN ‘2011-1-1’ AND ‘2011-1-31’

You can find the only difference is query hint “TABLOCK”. But first query runs 21 seconds, the second one runs only 8 seconds. First query increase the log file from 0.5MB to 672MB. The second one increase the log file from 0.5MB to 3.37MB. There are 2.3 million records moved in the query.

This is what called minimal logging since SQL Server 2005. Anyway you still need to compare these two approaches before you implement. Because with minimal logging SQL Server will force data pages to be flushed to disk before the transaction commits. So if the IO system for data file is not very fast, or the data pages affected by the operation are not sequentially arranged, you might see worse performance with minimal logging.

Maybe the query optimize is not so smart as you imagine

I had a database which stores huge amount of performance counter values. The table with counter values has a clustered index with column timekey in the first sequence. And there is time dimension table with column timekey as its clustered index.

One day I issued a query to find out the earliest date in the table, which looks like:
SELECT MAX(Date) FROM dbo.FactPerformance_History p
INNER JOIN dbo.DimTime t on p.TimeKey = t.TimeKey

I found the result was not returned immediately. So I checked the execution plan. I was quite surprised the execution plan showed two big tables FactPerformance_History and the DimTime were joined first then be aggregated for the max value.

It looks like query optimizer didn’t notice two tables were joined with two physical sorted columns. It decided to start the aggregation after the entire merge join was finished. Actually the query only needs the first match record with the two tables merged reversely.

Anyway, the query optimizer didn’t do it. So I can only rewrote the query:
SELECT Date FROM (SELECT MAX(TimeKey) AS TimeKey FROM Array.[FactDevicePerformance_History]) AS p
INNER JOIN ETL.DimTime t on p.TimeKey = t.TimeKey

Two execution plans here:

First query

query 1

Second query

It took 12 seconds to return with the result the first query, but only 6ms the new one.

OK. I agree semantics of the two queries are not exact the same. But if query optimizer can make the merge and stream aggregate run in asynchronously, then I believe the execution time of two plan above will be same.