
Don’t rely on the auto-statistics options in a VLDB environment

As we know, statistics in SQL Server are quite helpful for generating an efficient execution plan. But a lot of developers never take care of statistics, because SQL Server has two options, auto create statistics and auto update statistics, which we call the auto-statistics options.

In most cases, auto-statistics works properly and saves us a lot of time managing statistics. When you perform a search, aggregate, or sort operation on a table, SQL Server automatically creates statistics on the table to help generate the execution plan. If suitable statistics already exist, SQL Server may update those marked as outdated. There are only a few triggers that mark a statistic as outdated: one is a change to the table schema, and another is 20% of the data being modified (in fact, it is 20% plus 500 rows).

In SQL Server 2000, modifications were counted at the row level, so there was a single trigger for the entire table. Microsoft then realized this algorithm caused statistics on quite stable columns to be updated too frequently, so in SQL Server 2005 the detection mechanism was changed to the column level, to prevent unnecessary statistics updates when only a few columns were heavily modified.
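
If you want to check how close a table is to that 20% plus 500 threshold, the legacy rowmodctr counter is still exposed through the sys.sysindexes compatibility view (on SQL Server 2005 and later it is synthesized from the newer column-level counters, so treat it only as an approximation). A minimal sketch:

SELECT o.name AS table_name,
       i.name AS index_or_stats_name,
       i.rowcnt,
       i.rowmodctr,                                  -- modifications since the last statistics update
       i.rowcnt * 0.20 + 500 AS rough_auto_update_threshold
FROM sys.sysindexes AS i
JOIN sys.objects AS o ON o.object_id = i.id
WHERE o.is_ms_shipped = 0;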

First of all, in SQL Server 2000, SQL Server 2005, and later versions, auto update statistics may not take effect on large tables as we expect, such as fact tables. Imagine a fact table into which new data is pumped daily by ETL: after only 5 days, the daily load is no longer enough to trigger the auto statistics update. Now imagine that, as in most data warehousing scenarios, we have already archived 2 years of data in the fact table. How long will we wait for statistics to be updated automatically again? Since 20% of roughly 730 days of data amounts to about 146 days of new rows, about every 5 months!

Then in SQL Server 2008, even in OLTP environments, which means smaller tables, things get worse if you use filtered indexes. The statistics on a filtered index still use the whole column as the trigger base, so if you have a filtered index which covers only 5% of the rows in the column, the rows covered by the filtered index must be modified more than 4 times over (20% / 5% = 4) to trigger the auto-update of statistics again.

So forget the auto-statistics options in large database environments, and on filtered indexes in any environment. Do statistics maintenance yourself.
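
For example, a nightly maintenance job can refresh the statistics explicitly. A minimal sketch, where dbo.FactSales and ix_FactSales_Date are placeholder names and the sampling option should be tuned to your data volume:

UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;       -- refresh every statistic on one table
UPDATE STATISTICS dbo.FactSales ix_FactSales_Date;   -- or refresh a single statistics object
EXEC sp_updatestats;                                 -- or sweep every outdated statistic in the database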

Funny NULL in T-SQL

NULL is a funny word because it has some interesting behaviors in T-SQL.

When you search for NULL values in a table, you usually have to use the IS NULL predicate. If you try the expression “= NULL”, you will get nothing by default, unless you SET ANSI_NULLS OFF in your session. It seems T-SQL treats NULL as an object, right?
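
A quick demo of the difference, using a throwaway table variable:

DECLARE @x table (v int);
INSERT INTO @x VALUES (1);
INSERT INTO @x VALUES (NULL);

SELECT * FROM @x WHERE v = NULL;    -- 0 rows: under the default ANSI_NULLS ON, "v = NULL" evaluates to UNKNOWN
SELECT * FROM @x WHERE v IS NULL;   -- 1 row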

But when you insert records into a table, you cannot insert two NULL values into a column with a UNIQUE constraint. Also, GROUP BY puts all NULL values in the same group, and ORDER BY sorts all NULL values together. Now it seems T-SQL treats NULL as a value, right?
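
Both behaviors are easy to reproduce with a throwaway table (dbo.NullDemo is a made-up name):

CREATE TABLE dbo.NullDemo (v int UNIQUE);
INSERT INTO dbo.NullDemo VALUES (1);
INSERT INTO dbo.NullDemo VALUES (NULL);    -- succeeds
INSERT INTO dbo.NullDemo VALUES (NULL);    -- fails: the UNIQUE constraint treats the second NULL as a duplicate

SELECT v, COUNT(*) AS cnt
FROM dbo.NullDemo
GROUP BY v;                                -- all NULLs land in a single group

DROP TABLE dbo.NullDemo;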

Usually I leave the field NULLable instead of making it NOT NULL with a default value, because in real business NULL can be a meaningful state: “we don’t know”, or “it will be decided later”. A default value is not acceptable in these scenarios, because who knows whether the default value you pick will be occupied by another business meaning later, especially when you are getting data from another system. When I need to search for these NULL values, I use the IS NULL predicate. And if I am doing reporting and it is required to assign some value to the NULLs, I use the ISNULL() function.
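
For example, a reporting query can substitute a label only at presentation time (the variable and label here are made up):

DECLARE @ship_date datetime;
SET @ship_date = NULL;
SELECT ISNULL(CONVERT(varchar(10), @ship_date, 120), 'not decided yet') AS ship_date_label;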

So please take care when you are dealing with NULL.

 

Temporary table vs. table variable

Temporary tables and table variables are two common objects used in T-SQL code. Both can be used to keep a temporary dataset at runtime. They are very popular due to their convenience, but I have seen a lot of developers with misunderstandings about these two objects.

I’ve heard some developers say these two objects do the job in the same way, just with different names. In fact, they are different.

I’ve also heard some developers claim that a table variable stays in the buffer while a temporary table sits on hard disk. In fact, both of them are kept on hard disk, in tempdb. You can monitor the data file size of tempdb to confirm this. (Just ensure you put a WAITFOR command in your test script, because a table variable is released as soon as the batch finishes.)

So what is the difference between these two objects, then?

First of all, table variables and temporary tables both store their data in tempdb. You can query sys.tables in tempdb to make sure of this, and you can also query sys.database_files to check the data file size growth. Both of them create a table object in tempdb, and both of them keep their data there.

You can open one query session and execute this command:

-- Infinite loop on purpose: let it run while you watch tempdb from another session,
-- then cancel the query manually.
DECLARE @t table (ID int);
WHILE (1 = 1)
BEGIN
    INSERT INTO @t VALUES (CAST(RAND() * 1000 AS int));
END

Then open another session and execute this command:

USE tempdb;
SELECT * FROM sys.tables;                 -- the table variable shows up as a system-named table
SELECT name, (size * 8) / 1024.0 AS file_size_mb
FROM sys.database_files;                  -- watch the data file grow

You will find one table with a name like “#XXXX”, where XXXX is a system-generated suffix. And you will also see the data file of tempdb growing.

Although both of these objects write their data into tempdb, DML on a table variable generates only minimal logging, so in the previous demo you will see the data file growing while the log file remains at its original size. The reason is that a table variable doesn’t even participate in transactions. You cannot imagine resuming an operation on a variable after you break and then restore a session, right? And you would never expect to access a variable from another session, right? So a table variable has lower overhead on log IO and transaction locks.

Anyway, if you replace the table variable with a temporary table in the previous code, you will see both the data file and the log file growing. This is because a temporary table definitely participates in transactions.
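
Here is the same loop with a temporary table, again an infinite loop you cancel manually; this time the tempdb log file grows as well:

CREATE TABLE #t (ID int);
WHILE (1 = 1)
BEGIN
    INSERT INTO #t VALUES (CAST(RAND() * 1000 AS int));
END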

So from this point of view, when you need to load a large amount of data into a temporary dataset, or perform a lot of modifications on one, a table variable will have lower overhead on log IO.

On the contrary, a table variable doesn’t support indexes and statistics, but a temporary table does. So if you need to perform some heavy queries which may benefit from indexes and statistics, like searches, joins, or aggregates, a temporary table can often run faster with a defined index and statistics. You can still use a primary key or unique constraint to simulate an index on a table variable, or use a query hint to force an execution plan. But obviously, it will be very complicated to implement this in a complex environment, and you may suffer from unstable performance.
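
A side-by-side sketch of the two approaches (the table and index names are made up):

-- A temporary table accepts explicit indexes and gets real statistics:
CREATE TABLE #orders (ID int, CustomerID int);
CREATE INDEX ix_orders_customer ON #orders (CustomerID);

-- A table variable only accepts inline constraints, which are backed by indexes:
DECLARE @orders table (
    ID int PRIMARY KEY,      -- backed by a clustered index
    CustomerID int UNIQUE    -- backed by a nonclustered index
);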

There are also some other restrictions on these two objects:

  1. A temporary table cannot be used in a function, even if you just want to query some records out of it. This is due to a restriction of the T-SQL language on side effects: the “CREATE FUNCTION” topic in SQL Server Books Online states that code in a function body cannot change the state of outside objects, such as modifying a table.
  2. You cannot modify the schema of a table variable once you have declared it.
  3. You cannot pass a table variable into dynamic T-SQL. You can only declare and use a table variable inside the dynamic T-SQL code.
  4. You cannot pass a table variable to a stored procedure or function. But as a new feature of SQL Server 2008, you can declare a table type parameter to achieve a similar result, as shown in the sketch after this list.
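
A minimal sketch of that SQL Server 2008 table type parameter; the type and procedure names are made up:

CREATE TYPE dbo.IdList AS TABLE (ID int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.ProcessIds
    @ids dbo.IdList READONLY    -- table-valued parameters must be declared READONLY
AS
    SELECT ID FROM @ids;
GO
DECLARE @t dbo.IdList;
INSERT INTO @t VALUES (1);
INSERT INTO @t VALUES (2);
EXEC dbo.ProcessIds @ids = @t;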

So after all, how do we choose between a table variable and a temporary table?

Application logic requirements aside, there are some simple rules you may consider. If you have a small temporary dataset, you’d better consider a table variable first. But if you have a large temporary dataset, then consider the operations you will perform on it:

  • If you need to share the temporary dataset between sessions, a temporary table (a global one, with the ## prefix) is the only choice.
  • If you need to modify the dataset schema on the fly, a temporary table is the only choice.
  • If you need to search, aggregate, or join the dataset, you should consider a temporary table in most cases. You can still try a primary key and query hints on a table variable.
  • Otherwise, consider a table variable first.