So this is a very interesting case that we came across in one of our AWS accounts recently. We are replicating data from a MS SQL Server source to an Aurora PostgreSQL cluster using AWS DMS. The table in question is fairly basic, just some integer and string columns. Below is its definition in MS SQL Server:
create table dbo.myfirsttable(
    col1 int not null primary key,
    col2 int,
    col3 int,
    smallcol varchar(300),
    largevalue varchar(max) not null
);
The table over in our Postgres cluster is similar, except the data type “text” is used instead of “varchar”. All kind of boring so far, but what we noticed was that on some very rare occasions the “largevalue” column was empty over in the PostgreSQL database, even though for that row it was populated in SQL Server.
This seemed odd to me. You would expect that if there was some error inserting the row on the PostgreSQL side, then since it is all done within a transaction it would either all succeed or all fail, so how could the row be partially inserted, i.e. missing this text value?
My first thought was that this must be a size issue, because we had “Limited LOB mode” set on our DMS task with the maximum LOB size set to 500 KB. But if the value was too big it should be truncated, not left null, and besides, when we looked into it further, the rows that had a null value over in the PostgreSQL database were missing string values well below 500 KB, and there were many rows that had replicated with much larger values in that column without any issues.
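For anyone not familiar with the setting, the LOB handling lives in the DMS task settings JSON under TargetMetadata. Roughly like this for our original configuration; treat the values as illustrative rather than exactly what we had:

"TargetMetadata": {
    "SupportLobs": true,
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 500
}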
Before we contacted AWS Support we said we would try switching the DMS task to “Full LOB mode”. Even though we could already see that size was not the issue it was worth a try, but as you might guess it made no difference. So we contacted AWS Support armed with what we had discovered, as well as a string value with which we were able to reproduce the issue for them. One more thing we noticed: this only began happening when we upgraded AWS DMS from 3.4.7 to 3.5.1 back in January (yes, I know, it took us this long to find the issue).
So as I mentioned, I was able to reproduce the issue for AWS Support and provide them with a string value to test with, and after about three days they came back with a proposed fix: add the following extra connection attribute to the MS SQL Server source endpoint.
ForceLobLookup = true
According to AWS Support and the docs, this forces DMS to always do a lookup on the source table rather than just reading the value from the transaction log, and to be fair it worked a treat. Issue fixed, end of story, right? Well no, because ok, that fixed the issue, but why did it happen? On pressing AWS Support a bit more for answers they said (and I’m paraphrasing here) that when you have a string value that is close to 8 KB (i.e. the point at which SQL Server would move it off-row) DMS has an issue reading this value and thus null is transferred over. This was indeed a bug that was introduced in AWS DMS 3.5.1 and it still exists in the latest version, AWS DMS 3.5.3 (which we are currently on).
I wanted to find out more about this problem, so I set about performing some tests, and below are my findings. Note – I’m not sure they make things much clearer, but they are interesting.
Test Number 1:
First I needed to determine exactly what threshold triggers this bug. A page in SQL Server, excluding the header, is 8,096 bytes, so that is the maximum row size we can have. (See my blog on SQL Server page storage if you need a refresher.) What I found was the following:
- When the row size was between 7,998 bytes and 8,065 bytes inclusive, the “largevalue” column would be migrated as null (this is without ForceLobLookup being set).
- The value went from being stored in-row to off-row LOB data at a row size of 8,066 bytes, which might explain why a row size of 8,066 bytes transferred without any issue. A sketch of how I sized the test rows is below.
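For anyone who wants to try hitting that window themselves, here is roughly how I padded the rows. This is a minimal sketch assuming the dbo.myfirsttable definition above, and the replicate length of 7,970 is just an illustrative value you will need to tune to land the total row size in the 7,998 to 8,065 byte range:

-- Pad largevalue so the total row size lands in the problem window (the length is an assumption, tune as needed).
-- The cast to varchar(max) stops REPLICATE truncating the result at 8,000 bytes.
insert into dbo.myfirsttable (col1, col2, col3, smallcol, largevalue)
values (1, 1, 1, 'small', replicate(cast('a' as varchar(max)), 7970));

-- Check the actual record sizes on the clustered index leaf level.
select max_record_size_in_bytes, avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.myfirsttable'), null, null, 'DETAILED')
where index_level = 0;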
Test Number 2:
Well, if we know that there is no issue with the replication of the row when the “largevalue” column is stored off-row, then why don’t we store it off-row from the beginning? So I created a new table with the same structure and a DMS task to replicate the data, set the table option to store all large column values off-row, like we saw in the last blog (Page Storage), and, you guessed it, all works fine; there is no data missing over in our PostgreSQL database. So moving the column off-row had the same effect as setting the extra connection attribute “ForceLobLookup” to true.
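For completeness, the table option I’m talking about is set with sp_tableoption. A minimal sketch, assuming a copy of the table called dbo.mysecondtable (the name is just for illustration):

-- Force all varchar(max) / LOB values for this table to be stored off-row from the start
-- (dbo.mysecondtable is a hypothetical copy of the original table).
exec sys.sp_tableoption
    @TableNamePattern = N'dbo.mysecondtable',
    @OptionName = 'large value types out of row',
    @OptionValue = 1;

Note that this option only applies to values inserted or updated after it is set; existing rows are not rewritten until they are touched.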
Test Number 3:
I’m not finished yet. I was wondering whether there was any way I could capture these lookups on the source table. I tried extended events but they were no use, then I had the idea to use sys.dm_db_index_usage_stats (the query I ran is sketched below), and what I found was a mixture of the interesting and the strange.
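This is roughly the check I kept re-running between inserts; a minimal sketch, assuming the DMS task is the only thing touching dbo.myfirsttable so that any change in the counters can be put down to DMS:

-- Snapshot the index usage counters for the source table before and after each insert replicates.
select i.name as index_name, s.user_seeks, s.user_scans, s.user_lookups, s.last_user_seek
from sys.dm_db_index_usage_stats as s
join sys.indexes as i
    on i.object_id = s.object_id and i.index_id = s.index_id
where s.database_id = db_id()
  and s.object_id = object_id('dbo.myfirsttable');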
- When the table is created with the default settings, i.e. the varchar(max) value only goes off-row if the row gets too big for the page, inserting data below the 7,998 byte lower threshold means all data is transferred over as expected and the “user_seeks” column in the sys.dm_db_index_usage_stats DMV does not change. If you insert a value that gives a row size of between 7,998 bytes and 8,065 bytes, then the “largevalue” column is null when the row is replicated over and “user_seeks” is incremented by one, so DMS has done a lookup on the table even though the value is stored in-row. When we hit 8,066 bytes, forcing the value off-row, all data is again replicated successfully but “user_seeks” does not increase.
- Let’s set the “ForceLobLookup = true” attribute. Now all data is replicated successfully, and for each row that DMS replicates, “user_seeks” is incremented by two.
Now, I’ll be honest, I’m not sure what these results mean. Maybe the sys.dm_db_index_usage_stats DMV is not a very reliable way to track reads on a table by DMS. I just thought it was strange that with “ForceLobLookup” set DMS does two user seeks per row, that inside the threshold window where the data is not replicated successfully it performs one, and yet when the value actually is stored off-row no user seeks are captured at all.
Just to finish on this blog: yes, it’s great that we have our DMS task working correctly again, but my question is why does DMS ever need to read from the source table? Would all of this data not be in the transaction log? I’m concerned about this because it can mean the data replicating over to our PostgreSQL database might be much slower, as a lookup now has to be performed on the source table for each of these values, and that lookup is open to blocking from other transactions. (Note we currently see this on text columns that we have in SQL Server; DMS is being blocked because text columns are always read from the source table.) I just don’t get why, if the value is in the t-log, it can’t be read from there. Maybe I’m missing something; if you have any thoughts, feel free to leave a comment. A quick way to spot this kind of blocking is sketched below.
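If you want to check whether your DMS session is the one being blocked on the source, something like the following works; a minimal sketch, with no filter on the DMS connection itself, since how your task identifies itself (program name, login) will vary:

-- List requests that are currently blocked, along with who is blocking them and the statement being run.
select r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text as current_statement
from sys.dm_exec_requests as r
cross apply sys.dm_exec_sql_text(r.sql_handle) as t
where r.blocking_session_id <> 0;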
So I hope you enjoyed today’s blog, until next time, Slán!
Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.
