Azure Data Factory – “Execute SQL Task”


Scenario


How do you run single SQL commands using Azure Data Factory (ADF)? Sometimes I need just that: no fancy requirements, just execute a simple UPDATE, for example.

Investigation


Unfortunately there is no equivalent of our good old friend, the Execute SQL Task from SSIS. I would like to think this is an oversight, but I'm not sure this functionality will ever come to ADF. I have looked around online and there are some workarounds, but answers to this question are scarce. Don't despair, though: I have collected a few methods for you, dear Reader, until (if ever) a shiny new “SQL” icon appears among the activities of an ADF pipeline.

Workarounds


Stored Procedure – redesigning your SQL command as a stored procedure

This is my least favourite, but the most frequently recommended one. I don't like wrapping a possibly very simple SQL command into a stored procedure: there is too much overhead in maintaining the code, the logic ends up outside ADF… honestly, too much hassle, but good to keep in mind as a last resort.

Stored Procedure with a twist

To execute a stored procedure, that stored procedure needs to exist. I could either create it myself (see the previous option) or use one that's available out of the box in a SQL Server database! The idea came to me while looking through a totally different stored procedure that puts together a SQL statement dynamically. Maybe you have already guessed it before I write it down:

sp_executesql

https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-executesql-transact-sql (it applies to all four SQL Server platforms, no matter whether you have it on-prem or in the cloud)

EXECUTE sp_executesql N' your SQL command comes here '
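
For instance, to run a simple UPDATE you could point ADF's Stored procedure activity at sp_executesql and pass the statement text as its @stmt parameter. A minimal sketch of the equivalent T-SQL, with a made-up table and column (note the doubled single quotes inside the string):

--hypothetical example; dbo.StagingOrders and LoadStatus are illustrative names only
EXECUTE sp_executesql N'UPDATE dbo.StagingOrders SET LoadStatus = ''Processed'' WHERE LoadStatus = ''New'';';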

Lookup activity

This is the third way I have found. It is not designed to simply execute SQL commands, but to execute them and return a piece of data in some format. So it's good for SELECT statements, especially if you want to use the results later in your pipeline (that is what it's designed for). But then how do you run a data manipulation command like UPDATE?
Well, there is no real requirement on what happens between issuing the command and returning a piece of data. So it's easy to run an UPDATE statement with a valid OUTPUT clause!
https://docs.microsoft.com/en-us/sql/t-sql/queries/output-clause-transact-sql

UPDATE your_table SET your_column = your_column * 15
OUTPUT Inserted.your_column
WHERE IsGST = 'True'

Dynamic SQL

This was a tricky one to solve. Again, I got quite annoyed that Azure Data Factory does not have native functionality to execute a SQL statement. I spent a lot of time trying to make the sp_executesql version work, because that system stored procedure is supported by all SQL Server platforms according to https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-executesql-transact-sql (all four have a green tick next to them in the APPLIES TO row). Despite that, it didn't behave correctly, no matter how diligently I tried to discipline the system.

So I ended up in a situation where my dynamic SQL command was nicely crafted by a stored procedure that runs very obediently in database A. I used the Lookup activity to fetch that query in the way the activity is intended to be used. So far so good: the SQL query appeared nicely in the output of that activity.

I don't want to go through all the hurdles, as that would be more confusing than useful, so let me just show how I managed to get it working:

ADF pipeline to extract and run Dynamic SQL
  1. A Lookup activity runs a stored procedure in Database A that returns a SQL query with a dummy SELECT 1 at the end, as the Lookup activity MUST return something (a sketch of such a procedure follows this list).
  2. The returned query gets assigned to a SQLString variable using the expression @activity('LookupDynamicSQL').output.firstRow.ColumnName
    (I tried using the output directly in step 3 but it didn't work for some reason.)
  3. Another Lookup activity for Database B (a separate dataset) is set to use a Query, and the query itself is @variables('SQLString')
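
For completeness, here is a minimal sketch of what the stored procedure in Database A could look like (the procedure, table and column names are made up for illustration; the important parts are the dummy SELECT 1 appended to the dynamic statement and the single-row result set that the expression in step 2 reads):

--hypothetical stored procedure in Database A that crafts the dynamic SQL
CREATE PROCEDURE dbo.GetDynamicSQL
AS
BEGIN
    DECLARE @sql NVARCHAR(MAX);

    --build the statement that the second Lookup activity will run against Database B;
    --it ends with a dummy SELECT 1 because that Lookup also has to return something
    SET @sql = N'UPDATE dbo.YourTargetTable SET LoadStatus = ''Processed''; SELECT 1 AS Dummy;';

    --return the statement as a single-row, single-column result set
    --so that @activity('LookupDynamicSQL').output.firstRow.ColumnName can pick it up
    SELECT @sql AS ColumnName;
END;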

And it runs nicely. I actually have a ForEach loop wrapped around these three steps, as I need to run them for multiple items, and they cooperate nicely. ‘Why wouldn't they?’ – you could ask, but I have seen weirder errors going against common sense, so I'm simply happy to tell you that it works that way, too.

Azure SQL Data Warehouse performance questions


Scenario


I use Azure SQL Data Warehouse with x DWU, but my queries are slow, and according to the metrics the DWU usage is not even close to the limit.

What can I do? The old approach of throwing more resources at it (i.e. upscaling to 2x DWU) simply would not work.

Investigation


Let's run a chunky stored procedure that often takes longer than expected. While it's running, let's query some DMVs:

SELECT   *
FROM     sys.dm_pdw_exec_requests
WHERE    end_time IS NULL
ORDER BY 1, 3, 4

Among other columns it gives you a session_id that you can use to kill the process by executing

KILL 'session_id'
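
As a hypothetical example, if the offending request shows a session_id of 'SID1234' (session ids in the SQL Data Warehouse DMVs have this SIDnnnn form), the command would be:

KILL 'SID1234';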

resource_class – that sounds like a good lead to follow! And indeed, if you check the official documentation (https://docs.microsoft.com/en-us/azure/sql-data-warehouse/resource-classes-for-workload-management), it tells you some important details about resource classes. The official definition:
“Resource classes are pre-determined resource limits in Azure SQL Data Warehouse that govern compute resources and concurrency for query execution.”

I haven't found what 20 actually means in staticrc20, but my guess is that it limits the memory usage to 20%. But how is it determined which resource class is used when a query is executed? These resource classes are actually database roles as well, so you need to find out which user context your query / stored procedure is running under:

SELECT  w.[wait_id],
		w.[session_id],
		w.[type] AS Wait_type,
		w.[object_type],
		w.[object_name],
		w.[request_id],
		w.[request_time],
		w.[acquire_time],
		w.[state],
		w.[priority],
		SESSION_ID() AS Current_session,
		s.[status] AS Session_status,
		s.[login_name],
		s.[query_count],
		s.[client_id],
		s.[sql_spid],
		r.[command] AS Request_command,
		r.[label],
		r.[status] AS Request_status,
		r.[submit_time],
		r.[start_time],
		r.[end_compile_time],
		r.[end_time],
		DATEDIFF(ms, r.[submit_time], r.[start_time]) AS Request_queue_time_ms,
		DATEDIFF(ms, r.[start_time], r.[end_compile_time]) AS Request_compile_time_ms,
		DATEDIFF(ms, r.[end_compile_time], r.[end_time]) AS Request_execution_time_ms,
		r.[total_elapsed_time]
FROM    sys.dm_pdw_waits w
JOIN    sys.dm_pdw_exec_sessions s ON w.[session_id] = s.[session_id]
JOIN    sys.dm_pdw_exec_requests r ON w.[request_id] = r.[request_id]
WHERE    w.[session_id] <> SESSION_ID();

The [login_name] column contains that piece of information, but the overall query can be useful anyway.

So based on all this information, my understanding is (there is not a lot of detailed information available about performance tuning Azure SQL Data Warehouse) that no matter how many DWUs you allocate, the resource class caps how much of those resources a single operation can consume. This actually makes a lot of sense, but only once you know about it and are aware of its implications. Even if you run a single query in an SSMS window and the server is otherwise idle, your DWU usage stays suboptimal, because a lot of room is deliberately left for other, concurrent queries.
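
If you want to see this for yourself, the resource_class column of the DMV used earlier shows which resource class each running request has been assigned; a minimal check could look like this:

SELECT   request_id,
         session_id,
         resource_class,
         command
FROM     sys.dm_pdw_exec_requests
WHERE    end_time IS NULL;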

The question is: how can you enforce higher resource utilisation for the monster query that produces the bottleneck?

Recommended action


Caveats and warnings

Unfortunately I have only been able to come up with an all-or-nothing solution / workaround for automated ETL execution. The account used to run the ETL processes (usually a single account assigned at runtime by the orchestration engine) needs to be made part of a different database role. The ones to choose from:

  • staticrc10 to staticrc80 (incrementing by 10)
  • dynamic resource classes: smallrc / mediumrc / largerc / xlargerc – from 3% memory limit to 70%

So all you have to do is put the ETL service account into a different resource class database role. As a first step, I'd recommend switching from staticrc20 to staticrc40. It would be great if Azure SQL Data Warehouse supported EXECUTE AS for stored procedures, but it doesn't. Feature development has been in progress since January 2019 at the latest, but you can still upvote it here: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/14602404-support-execute-as-for-stored-procedures
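
The switch itself is a one-off change. A minimal sketch, assuming the ETL service account is called SvcDataLoader (as in the test script further below) and currently sits in staticrc20:

--move the ETL service account from staticrc20 to staticrc40
EXEC sp_droprolemember 'staticrc20', 'SvcDataLoader';
EXEC sp_addrolemember 'staticrc40', 'SvcDataLoader';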

Alternatively, what you can do for testing is:

  • Create a SQL user and assign it to e.g. the xlargerc dynamic resource class
  • Connect to your Azure SQL Data Warehouse server using that user
  • Run your query and test the results.

A sample SQL script for preparing this:


--CREATE local SQL user
--in master database
CREATE LOGIN [SvcDataLoader] WITH PASSWORD = 'your strong password here';

--in your Azure SQL Data Warehouse
CREATE USER [SvcDataLoader] FOR LOGIN [SvcDataLoader];

--assign the user to a different resource class role
EXEC sp_addrolemember 'xlargerc', 'SvcDataLoader'; --up to 70%
EXEC sp_addrolemember 'db_owner', 'SvcDataLoader'; --so that it can access the required tables

--check if the user is assigned correctly
SELECT   DP1.[name] AS DatabaseRoleName,
         ISNULL(DP2.[name], 'No members') AS DatabaseUserName
FROM     sys.database_role_members AS DRM
RIGHT JOIN sys.database_principals AS DP1 ON DRM.role_principal_id = DP1.principal_id
LEFT JOIN  sys.database_principals AS DP2 ON DRM.member_principal_id = DP2.principal_id
WHERE    DP1.[type] = 'R'
AND      DP2.[name] = 'SvcDataLoader'
ORDER BY DP1.[name];

Any questions, just let me know in the comments! Thanks for reading.

Leap Year issue with MDX date navigation


A few days ago (exactly one day after the 29th of February, when testing data from last night's load) I experienced a strange issue which I couldn't explain at first. It turned out to be related to the leap year and to the fact that even though this year has a 29th of February, the previous/next year does not!

So be careful when using ParallelPeriod on 29 February in leap years!

ParallelPeriod(
    [Date].[Y-M-D Calendar].[Month], 
    -12,
    [Date].[Y-M-D Calendar].[Date].&[20120229]
)

The expression above points (would point) to the day exactly one year in the future.

Explicitly, the command says: based on 2012-02-29, look for the day that comes 12 months later!

As you can see, two levels and three members of the date dimension are involved:

  • Date level: 20120229 as the specified member, which is the basis of the date difference calculation
  • Date level: the end result (which theoretically is 29 Feb 2013 – but that does not exist!)
  • Month level: the difference is defined in months.

This becomes problematic when you want to, for example, sum up values across a date range:

Aggregate the [Sales] measure for the last 12 months from today!

That is a set, and literally it means from 2011-02-29 to 2012-02-29. But since 2011-02-29 does not exist, the expression is evaluated as if it started at the beginning of the date dimension, e.g. 1 Jan 2000. So be careful!

When writing this post I thought of trying the same with T-SQL:

SELECT DATEADD(Year, 1, '2012 February 29')
or
SELECT DATEADD(Month, 12, '2012 February 29')

The result of both statements is

2013-02-28 00:00:00.000

Though this is not 100% correct either, it is a lot better than returning data from the beginning of time…

Default Member inconsistency


Issue


Example is based on Adventure Works – SQL 2012

The user would like to browse the cube mainly for data related to the Accessories products, so the default member of that attribute is set to:

[Product].[Category].&[4]

You have to make that modification in Adventure Works in order to test this behaviour!


But as the user digs deeper, he's interested in other categories too, such as Bikes for Calendar Year 2004. With Product Category on Row Labels the result is as expected.

The query for this, captured from Profiler:

SELECT 
  NON EMPTY 
    Hierarchize
    (
      {
        DrillDownLevel({[Product].[Product Categories].[All Products]}
        ,,,INCLUDE_CALC_MEMBERS)
      }
    )
   ON COLUMNS
FROM 
(
  SELECT 
    {
      [Product].[Product Categories].[Category].&[1]
     ,[Product].[Product Categories].[Category].&[4]
    } ON COLUMNS
  FROM [Adventure Works]
)
WHERE 
  (
    [Date].[Calendar].[Calendar Year].&[2004]
   ,[Measures].[Order Quantity]
  )

So far it looks OK, because if the Product Category is not in the filter but displayed on rows/columns, the values are not summed up but displayed individually. Let's try moving it from Row Labels to Filters:


Again the query in the background:

SELECT 
FROM 
(
  SELECT 
    {
      [Product].[Product Categories].[Category].&[1]
     ,[Product].[Product Categories].[Category].&[4]
    } ON COLUMNS
  FROM [Adventure Works]
)
WHERE 
  (
    [Date].[Calendar].[Calendar Year].&[2004]
   ,[Measures].[Order Quantity]
  )

So Excel mixes it up (and not just Excel; other cube browsers may do the same, though I haven't tested them). The same problem is highlighted very clearly in this blog post; I just provide an Adventure Works example and some more explanation: http://peteadshead.wordpress.com/2011/08/27/issues-with-msas-default-members/

Solution Explanation


There is no real solution for this one, just workarounds, like providing an initial spreadsheet for the user that has the Product Category set to Accessories. The problem lies in how Excel (and, as I mentioned, possibly other query tools, too) constructs the query (you can catch it via Profiler), because it embeds filters in subqueries instead of WHERE clauses.

So here's the incorrect, Excel-created MDX once more:

--Incorrect results
SELECT 
  { [Measures].[Order Quantity] } ON COLUMNS
FROM 
(
  SELECT 
    { [Product].[Category].&[4], [Product].[Category].&[1] } ON COLUMNS
  FROM [Adventure Works]
)
WHERE 
  ( [Date].[Calendar].[Calendar Year].&[2004] )

The result is 27 560 but should be 48 713, as you can check yourself with the next query. The only thing that happens is that I move the filtering from the subquery into the WHERE clause:

--Correct results 
SELECT 
  {[Measures].[Order Quantity]} ON COLUMNS
FROM [Adventure Works]
WHERE 
  (
    [Date].[Calendar].[Calendar Year].&[2004]
   ,{ [Product].[Category].&[4], [Product].[Category].&[1] }
  );

So basically it's not (or not just) Excel's fault that the MDX query returns an incorrect value; the cube engine or MDX itself has a flaw, but it comes to light when using Excel due to its query-writing mechanism.

Some further details as a result of using team knowledge: if you duplicate the filter and put it into the WHERE part, too, then the result is OK:

SELECT 
  { [Measures].[Order Quantity] } ON COLUMNS
FROM 
(
  SELECT 
    { [Product].[Category].&[4], [Product].[Category].&[1] } ON COLUMNS
  FROM [Adventure Works]
)
WHERE 
  ( [Date].[Calendar].[Calendar Year].&[2004] 
  ,{[Product].[Category].&[4], [Product].[Category].&[1]}
  );

So it seems three things happen when executing the first, incorrect query:

  1. The subquery is evaluated with all the listed categories, overriding the default value (so Accessories and Bikes are selected by overriding the default, which would have been only Accessories)
  2. The outer query is evaluated as it should be
  3. The filtering to the default member is applied again

Steps 2 and 3 can be swapped; the end result is the same. When everything is evaluated in a single query, the default member is correctly overridden.

MDX and SSAS have something similar to the logical query processing phases of SQL, but it's a lot more complicated. According to SSAS Unleashed:

“The default members are overwritten by the members of attributes specified in the WHERE clause, and then those members are overwritten by members of attributes from each axis. When we say overwritten, we mean that the member corresponding to the attribute is changed. In reality, the process is a little more complicated…” – page 175

And it might have something to do with scopes, but this is beyond my present knowledge.

There's a TechNet article (http://technet.microsoft.com/en-us/library/bb630310.aspx) about default members, but although it gives some advice on what to be aware of when defining default members, it does not apply to this issue! (I wanted to mention it because you might find it and suggest considering those ideas.) That article refers to non-aggregatable attributes, whereas Category is aggregatable.