A tale of a DMS bug and the curious case of the endpoint setting ForceLobLookup

So this is a very interesting case that we came across in one of our AWS accounts recently. We are replicating data from a MS SQL Server source to an Aurora PostgreSQL cluster using AWS DMS. One of the tables being replicated is fairly basic, just some integer and string columns. Below is the definition of the table in MS SQL Server:

create table dbo.myfirsttable(col1 int not null primary key,
col2 int,
col3 int,
smallcol varchar(300),
largevalue varchar(max) not null
);

The table over in our Postgres cluster is similar except for the data type “text” being used instead of “varchar”. All kind of boring so far, but what we noticed was that on some very rare occasions the “largevalue” column was empty (null) over in the PostgreSQL database even though it was populated for that row in SQL Server.

This seemed odd to me. If there was some error inserting the row on the PostgreSQL side then, since it is all done within a transaction, you would expect it to either all succeed or all fail. So how could the row be partially inserted, i.e. missing just this text value?

My 1st thought was that this must be an issue with size, because we had “Limited LOB mode” set on our DMS task with a max LOB size of 500kb. But if the value was too big it should be truncated, not left null, and besides, when we looked into it further, for the rows that had a null value over in the PostgreSQL database the missing string value was well below 500kb, and there were many rows with much larger values in that column that had replicated without any issues.

Before we contacted AWS support we said we would just try setting the DMS task to “Full LOB mode”. Even though we could already see that size was not the issue it was worth a try, but as you might guess it made no difference. So we contacted AWS Support armed with what we had discovered, as well as a string value with which we were able to reproduce the issue for them. Just one more thing that we noticed: this only began happening when we upgraded AWS DMS from 3.4.7 to 3.5.1 back in January. Yes, I know, it took us this long to find the issue.

So as I mentioned, I was able to reproduce the issue for AWS support and provide them with a string value to test with, and after about 3 days they came back with a proposed fix: add the following extra connection attribute to the MS SQL Server source endpoint.

ForceLobLookup = true

So according to AWS Support and the docs this forces DMS to always do a lookup on the table rather than just reading the value from the transaction log, and to be fair it worked a treat. Issue fixed, end of story, right? Well, no. OK, that fixed the issue, but why did it happen? On pressing AWS support a bit more for answers they said that (and I’m paraphrasing here) when you have a string value that is close to 8kb (i.e. the point in SQL Server when it would be moved off-row) DMS has an issue reading this value and thus null is transferred over. This was indeed a bug that was introduced in AWS DMS 3.5.1 and it still exists in the latest version of AWS DMS, 3.5.3 (which we are currently on).

So I wanted to find out more about this problem and set about performing some tests; below are my findings. Note – I’m not sure they make things much clearer, but they are interesting.


Test Number 1:

First I needed to determine exactly what the threshold is at which this bug kicks in. A page in SQL Server excluding the header is 8,096 bytes, so that is the max row size we can have. (See my blog on SQL Server Page Storage if you need a refresher.) What I found was the following (a repro sketch follows the findings):

  1. When the row size was between 7,998 bytes and 8,065 bytes inclusive, the “largevalue” column would be migrated as null (this is without ForceLobLookup being set).
  2. The “largevalue” column went from being stored in-row to LOB data at a row size of 8,066 bytes, which might explain why a row of that size transferred without any issue.
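For anyone who wants to try this for themselves, here is a minimal sketch of what I was doing. The padding length of 7,950 is just an illustrative guess to land the row in the suspect range, and sys.dm_db_index_physical_stats in ‘DETAILED’ mode reports the actual record size so you can confirm where you ended up.

--Rough repro: pad "largevalue" so the total row size lands in the 7,998 - 8,065 byte range
insert into dbo.myfirsttable (col1, col2, col3, smallcol, largevalue)
values (1, 1, 1, 'x', replicate(cast('a' as varchar(max)), 7950));

--Confirm the physical record size at the leaf level
select index_level, max_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.myfirsttable'), null, null, 'DETAILED')
where index_level = 0;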

Test Number 2:

Well, if we know that there is no issue with the replication of the row when the “largevalue” column is stored off-row, then why don’t we store it off-row from the beginning? So I created a new table with the same structure and a DMS task to replicate the data, set the table option to store all large value types off-row, like we saw in the last blog (Page Storage), and you guessed it, it all works fine, there is no data missing over in our PostgreSQL database. So moving the column off-row had the same impact as setting the extra connection attribute “ForceLobLookup” to true.
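For reference, that table option is set with sp_tableoption; the table name below is just a stand-in for the copy I created.

--Force large value types (varchar(max), nvarchar(max), xml, etc.) off-row for the new copy of the table
exec sp_tableoption 'dbo.mysecondtable', 'large value types out of row', 1;
--Note: existing in-row values only move off-row the next time they are updated;
--rows inserted after the option is set go straight to LOB pages.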


Test Number 3:

I’m not finished yet. I was wondering whether there was any way I could capture these lookups on the source table. I tried extended events but they were no use, then I had the idea to use sys.dm_db_index_usage_stats (the query I used is sketched after the list below), and what I found was a mixture of the interesting and the strange.

  1. When the table is created with the default settings, i.e. varchar(max) only goes off-row if the row gets too big for the page, and you insert some data that is below the 7,998 byte lower threshold, then all data is transferred over as expected and the “user_seeks” column in the sys.dm_db_index_usage_stats DMV does not change. If you insert a value that gives a row size of between 7,998 bytes and 8,065 bytes then the “largevalue” column is null when the row is replicated over and “user_seeks” is incremented by one, so DMS has done a lookup on the table, even though the row is stored in-row. When we reach 8,066 bytes, forcing the value off-row, all data is again replicated successfully but “user_seeks” is not increased.
  2. Let’s set the “ForceLobLookup = true” attribute. Now all data is replicated successfully and for each row that DMS replicates “user_seeks” is incremented by 2.
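This is roughly the query I was running against the source after each test insert to watch the seek counts, nothing fancy, just the usage stats DMV filtered to our table (note an index only shows up once it has seen some activity):

select i.name as index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.last_user_seek
from sys.dm_db_index_usage_stats s
join sys.indexes i
    on i.object_id = s.object_id
   and i.index_id = s.index_id
where s.database_id = db_id()
  and s.object_id = object_id('dbo.myfirsttable');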

Now I’ll be honest, I’m not sure what these results mean. Maybe using the DMV sys.dm_db_index_usage_stats is not a very reliable way to track reads on a table by DMS. I just thought it was strange that with “ForceLobLookup” set it is doing 2 user seeks, within that threshold where the data is not replicated successfully it performs one, and yet when the value actually is stored off-row no user_seeks are captured.

Just to finish on this blog: yes, it’s great that we have our DMS task working correctly again, but my question is why does DMS ever need to read from the source table, would all of this data not be in the transaction log? I’m concerned about this because it can mean the data replicating over to our PostgreSQL database might be much slower, as a lookup now has to be performed on the source table each time and this is open to blocking from other transactions. (Note we currently see this on text columns that we have in SQL Server, DMS is being blocked as text columns are always read from the source table.) I just don’t get why, if the value is in the t-log, it can’t be read from there. Maybe I’m missing something, if you have any thoughts, feel free to leave a comment.

So I hope you enjoyed today’s blog, until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

MS SQL Server Page Storage

So MS SQL Server stores all data in rows that fit onto 8kb pages.
There are many different types of pages in MS SQL Server but for this blog we are mainly concerned with three, these are:
1. Data Pages
2. Text/Image Pages, sometimes referred to as LOB pages.
3. Index Pages

The names are fairly self-explanatory but for completeness sake, I will list out the most common data types stored by each.

  • Data Pages:
    • All Integer types
    • All money types
    • numeric/decimal
    • bit
    • float/real
    • All date and time types
    • char(n)/nchar(n)
    • varchar(n)/nvarchar(n)
    • binary
    • varbinary(n)
  • Text/Image Pages:
    • Text/nText
    • Image
    • Xml

Now this is where things get interesting: varchar(max) and nvarchar(max), where do you think these are stored?

  1. So by default Varchar(max) columns are stored inline which means they will be stored on data pages.
  2. But that is only up to a max of 8k bytes, after which the data will be moved from data pages into text pages. The reason for the 8k limit? Well, because a page is 8k and no row can span multiple pages.
  3. If the row size exceeds 8k, then again SQL Server will look to move some columns to out-of-row pages, and it will pick the biggest column to move first, so again varchar(max) columns in this instance will have their data moved from a data page to a text page.
  4. In list item one, I said by default varchar(max) columns are stored inline, but if you want you can set these to be always stored out of row in LOB pages.
    The stored procedure sp_tableoption allows you to control the in-row behaviour of tables that have LOB data types.
E.g.
exec sp_tableoption 'dbo.TableA', 'large value types out of row',1;

But why would you ever want to store a column out of row? Let’s read on to find out.


So why does SQL Server use different types of pages?

Well it all comes down to storage efficiency. Data pages are optimized to store several rows compacted together whereas LOB data pages are optimized for storing large data types.

MS SQL Server stores pages of the same type together to achieve optimal storage and to help it to organize the pages it uses “Allocation Units”.

An Allocation Unit is nothing more than a way for SQL Server to split different types of pages into their different groups.

There are 3 types of allocation units that MS SQL Server uses:

  1. IN_ROW_DATA – These contain all of the Data and Index pages that are stored in-row.
  2. ROW_OVERFLOW_DATA – These are also made up of Data and Index pages but are used to store variable data types for columns on rows that have grown too big to fit on one page.
  3. LOB_DATA – These, as the name suggests, store LOB Data pages.

Column values that are stored in the last two allocation units (ROW_OVERFLOW_DATA & LOB_DATA) are replaced in-row with a 16 byte pointer to the data pages in these allocation units that store the actual value.
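If you want to see these allocation units for yourself, the catalog views expose them. A minimal sketch (table name assumed; joining on partition_id works because it normally equals the hobt_id):

select o.name as table_name,
       p.index_id,
       au.type_desc,   --IN_ROW_DATA, ROW_OVERFLOW_DATA or LOB_DATA
       au.total_pages,
       au.used_pages
from sys.allocation_units au
join sys.partitions p on au.container_id = p.partition_id
join sys.objects o on o.object_id = p.object_id
where o.name = 'MyFirstTable';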

Some of the advantages of splitting out the different types of pages and not storing them together are the following:
  1. By having all of the Data pages stored together, this can reduce the amount of fragmentation that might otherwise exist if they were combined with LOB pages, so the data pages can all be squeezed in together.
  2. By having all of the frequently accessed pages stored together, it allows SQL Server to take advantage of its page read-ahead technology, meaning pages are brought into memory before they are explicitly requested, thereby giving faster response times for the user.
  3. It also means that for the most part memory can be taken up with the more frequently accessed data that would be found in Data and Index pages and not wasted on LOB pages.
  4. Compression can be applied on the Data and Index pages, which are much more likely to contain values that will compress well.

An important point to note here is the following: MS SQL Server is designed to work efficiently with small data types and it assumes that these data types will be accessed more frequently. Now SQL Server is nothing if not obliging, so it will also allow you to store large data types, but it does not want this to be at the cost of storage efficiency, and so these data types are stored off-row, i.e. rarely accessed, SQL Server is hoping.


So this brings us back to that original question, why would you want to store varchar(max) out-of-row intentionally?

Let’s look at an example. Say we had a table called “MyFirstTable” and it had the following schema:

create table dbo.MyFirstTable(col1 int not null primary key, col2 int, col3 int, somewidecol varchar(max));

Now say you ran the following query:

select top 2000 col1, col2 from dbo.MyFirstTable;

Let’s assume that the column somewidecol is 7,500 bytes in size, very large but still small enough to fit on one page and so is stored in-row.

So because SQL Server uses row-based storage, even though we only want two of the four columns in that table, we get all four brought into memory, because the entire row is stored as one unit on a page and SQL Server reads pages from disk. And because we want to read the top 2000 rows, and each row takes up its own page, at a minimum we need to have 2k data pages belonging to that table in memory. This requires a lot of I/O reading the data in from disk and also takes up valuable space in memory with data that we didn’t need.

Ok, so yes, SQL Server is row based, so we will always pay some price, i.e. there is nothing we can do with the integer column col3, but what if we stored the LOB column “somewidecol” out of row? Now our rows become tiny, i.e. each row is only the size of 3 integer values plus the pointer to the LOB value, that’s a row size of approximately 28 bytes compared to the 7,512 bytes that our row had previously been. So now instead of having to read in 2k pages, we only have to read in 8 pages. That is some saving.
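If you want to verify this saving for yourself, sys.dm_db_partition_stats breaks the page counts down by in-row, row-overflow and LOB storage, so you can check the table before and after moving the column off-row (a sketch, using the table from the example above):

select object_name(object_id) as table_name,
       in_row_data_page_count,
       row_overflow_used_page_count,
       lob_used_page_count,
       row_count
from sys.dm_db_partition_stats
where object_id = object_id('dbo.MyFirstTable')
  and index_id in (0, 1);   --heap or clustered index only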

The great Jeff Moden is the master supreme when it comes to index maintenance and he is a believer that in nearly all cases LOB data types should be stored off-row. He even goes a step further: when creating LOB columns that are going to be stored off-row, give them a default 16 byte value so that when the 16 byte pointer does get added it doesn’t cause a bad page split. E.g. you have 10 rows fitting snugly on a page, you update one of the rows with an LOB value, and you think this is not going to cause any page splits for your in-row data as it’s being stored out-of-row, but wait, a bad page split occurs as you do not have room on the page for the 16 byte pointer. Bad!

So to summarize today’s blog, the key takeaway points are the following:

  1. MS SQL Server stores LOB pages very differently to Data and Index pages
  2. Use LOB data types sparingly in SQL Server
  3. In many cases it makes sense to force VARCHAR(MAX) & NVARCHAR(MAX) to be stored off-row.

So I hope you enjoyed today’s blog, until next time, Slán!

Fun with AWS Availability Zone Names and RDS Cloning Across Accounts

Well, the above is some mouthful for the title of a blog, but hey, it best describes what I’m going to talk about today.

One of the very 1st things I remember learning about when I started my cloud journey into AWS way back in 2018 was how AWS splits the world up into regions, how each region is further split into at least three and often many more Availability Zones, or AZ’s as they are better known (and how I will refer to them from here on), and how a physical data center is then located in each one of these AZ’s. I also remember reading that the names AWS gives to AZ’s, such as AZ1a and AZ1b, are just logical names, and they are mapped randomly to the actual physical AZ locations. So AZ1a in my account will possibly reference a different physical AZ than in your AWS account.

All good so far, and you the reader are probably already aware of this, and to be fair it makes sense. Let’s be honest, if for example AZ1a mapped to the same physical location for every account in AWS, the likelihood is that despite recommendations from AWS most people would still put most of their resources into AZ1a and then AZ1b, and so the AZ’s at the beginning of the alphabet would get overloaded. This is all good and no complaints on my side. Where things get a little messy, and where I have some minor complaints with AWS, is when you are doing an RDS clone across accounts. So let me give a run down on the issue we faced recently.

We have an Aurora PostgreSQL cluster in what we will call Account Prod. When we were creating this cluster AWS required us to specify a minimum of three AZ’s, which are the AZ’s that our storage is going to be replicated across, thereby increasing data redundancy. So naturally we chose AZ1a, AZ1b and AZ1c, how original of us 🙂 We coded all of this up in Terraform and created our cluster in Account Prod.

We also have another account which we will call Account Test, we ran our terraform code against this account as well to create the same cluster. So all good, we have two different clusters in two different AWS accounts and they share the same code base.

A situation arises where we need to do some load testing (we want to test a minor version upgrade of the cluster) and the size of the data that we have in our cluster on our test account just will not do. So we have two possible solutions:

  1. Take a snapshot of our production cluster and restore it to the test account
  2. Take a clone of our production cluster and restore it to the test account

We decided to go with option 2 because, one, it works out cheaper, in that we will not be paying for duplicate storage: with a database clone, the clone just shares the same storage as the original cluster until writes are made. This suits us as we will be writing very little data in our load testing. Secondly, cloning is also much faster than a restore as, already mentioned, the data does not need to be copied over.

So we destroy our cluster on our test account, do a database clone of the cluster from our production account and hey presto, we are good to start our load testing.

This is where things get a bit weird. We then modify our terraform code base to reflect the minor version upgrade of the cluster and we look to run a terraform apply on this against our test account. What we expect to see is terraform telling us it will upgrade the cluster; what we instead see is terraform telling us it is going to destroy our cluster, replacing AZ1c with AZ1d. WHAT NOW? AZ1d, where is this coming from, why is it attempting to change the AZ’s of our underlying storage?

The answer of course is that as mentioned at the start of this article, AZ names do not match to physical locations. So what has happened here is the following:

  • We have AZ1a, AZ1b & AZ1c specified in our code base, when we 1st deployed this to our test account this pointed to an underlying AZ ID, so for example we might have the following:
    • AZ1a – az4
    • AZ1b – az6
    • AZ1c – az1
  • When we deployed the same code based to our prod account, our AZ names were mapped to the following AZ IDs:
    • AZ1a – az2
    • AZ1b – az4
    • AZ1c – az6

So you are probably beginning to see the mismatch here, and this becomes a problem when you do a restore of a clone from our prod account to our test account, because what is in the clone is the AZ IDs. So when the clone is being restored it will do the following:

  • Create storage in AZ ID of az4, this maps to AZ1a in our test account, this does not cause us issues with terraform as we have 1a specified in our code base.
  • Create storage in AZ ID of az6, this maps to AZ1b in our test account, again this does not cause us issues with terraform as we have 1b specified in our code base.
  • Create storage in AZ ID az2, this maps to AZ1d in our test account. Oh no! We do not have a 1d in our code base, so the clone has basically created a cluster with storage specified in 1a, 1b and 1d, which of course is not what we have in our code base, and when we do a terraform apply, it tries to destroy the cluster and create a new one.

So this might seem obvious that this is going to happen but it also might leave you scratching your head for a few minutes before you get that light bulb moment.

If you think about it, this makes total sense: when you are creating a clone you cannot specify what AZ to put the storage on, as the whole selling point of a clone is that you share the same storage, so allowing a user to specify different storage would mean AWS potentially having to move the data into a different AZ every time you did a clone, which defeats the purpose of a clone. I get that.

The issue I have here with AWS is the following:

  1. Why ask for a minimum of three AZ’s at all when creating a cluster, why not just assign these automatically under the hood? Maybe I’m missing something here, but what is the gain for the customer in specifying these AZ’s? Remember these are the AZ’s of the storage, not the instance, so you do not pay cross-AZ charges. I suppose there might be a slight performance gain in having the writer in the same AZ as your storage, but couldn’t AWS handle this for you?
  2. Could AWS ensure that all accounts within the same organization have the same mapping? Surely that would not cause too many issues.

Yes, I know these are minor complaints and to be fair this is easily fixed: modify your code base so that for the test account AZ1d is included and for the prod account AZ1c is included. This is not difficult to implement in terraform, I just feel on the whole it is something that could be handled better by AWS.

Ok rant over, I hope you found this blog helpful and whether you agree or disagree with me, keep checking in on sqlrebel.org.

Until next time, Slán!

AWS DMS Hidden Gotcha’s Part II

Back in March 2022 I wrote a blog post, AWS DMS Hidden Gotcha’s Part I. Today’s post is a continuation of this, outlining some more quirks of DMS that I have become aware of from working with it on a daily basis.

Index creation locking system table causes DMS Task to fail:

Ok, so this is a strange one and certainly one I would not have thought of only for it happening to us on one of our Production databases, so here is the scenario.

We have a SQL Server database, which we will call “MyFirstDatabase”, and in that we have many tables. 3 of these tables (Table_A, Table_B and Table_C) we replicate over to another location using AWS DMS. We have another table in the database, Table_Z, which is not in any DMS task and has never been marked for replication. We would like to put an index on this table for performance reasons, and we are ok that this will be done offline, i.e. the table will be locked for the duration, as that table is currently not in use by our clients.

Now since the index that we are putting on Table_Z is somewhat large (it’s 200GB) we will change the recovery model of our database from “Full” to “Bulk_Logged” so that we reduce the amount of data that is written to the log, as we are conscious that DMS will have to read through all the data that gets written to the log and this could potentially slow down the replication of our data by DMS.

Ok, so all good so far. We kick off the index creation task, which we estimate should take approximately 25 minutes to complete. Things are all good to begin with, but after a few minutes our monitoring begins to pick up errors in the CloudWatch logs of our DMS task. We wonder what’s going on here and decide to run sp_whoisactive on the source server. What do we see, only the below query being blocked by the index creation task.

select count(*) from sysobjects o,sys.system_internals_partitions s
where o.id=s.object_id
and o.type=N'U'
and s.is_data_row_format=1
and s.partition_id=72057594205241344

So DMS is attempting to run this query and its wait type is “LCK_M_S”. Let’s have a closer look at what locks it’s attempting to take out.

<Locks>
<Lock request_mode="S" request_status="GRANT" request_count="1" />
</Locks>
<Objects>
<Object name="sysrowsets" schema_name="sys">
<Locks>
<Lock resource_type="KEY" index_name="clust" request_mode="S" request_status="WAIT" request_count="1" />
<Lock resource_type="OBJECT" request_mode="Sch-S" request_status="GRANT" request_count="1" />
</Locks>
</Object>
</Objects>

Ok, so it’s attempting to take a shared lock on a key in sysrowsets but it cannot be granted this, as for some reason the index creation is blocking it. In that case the next place to go is to see what kind of locks the index creation task has taken out on the system tables. But why sysrowsets, I thought it was querying sysobjects and sys.system_internals_partitions? Well, sysrowsets is an underlying table in these views.

<Object name="sysrowsets" schema_name="sys">
<Locks>
<Lock resource_type="KEY" index_name="clust" request_mode="X" request_status="GRANT" request_count="1" />
<Lock resource_type="OBJECT" request_mode="IX" request_status="GRANT" request_count="1" />
</Locks>
</Object>

So the above is one of the locks that has been taken out by the index creation task. As you can see it holds an exclusive lock on a key in sysrowsets, hence why the query that the DMS task is trying to run is being blocked.
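If you want to watch this happening live, rather than digging the XML out of sp_whoisactive, a quick sketch against sys.dm_tran_locks shows who is waiting on what:

select request_session_id,
       resource_type,                  --OBJECT, KEY, PAGE, ...
       request_mode,                   --S, X, Sch-S, Sch-M, ...
       request_status,
       resource_associated_entity_id   --object_id for OBJECT locks, hobt_id for KEY/PAGE locks
from sys.dm_tran_locks
where resource_database_id = db_id()
  and request_status = 'WAIT';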

After a while the DMS task will fail and show a status of “ERROR”.

It’s not all bad news! Once the index had finished creating we could resume the DMS tasks without any issues. But still, this is far from an ideal situation. Why does DMS need to periodically run the query where it is checking the internal partitions? I cannot find any AWS documentation on why it needs to do so, your guess is as good as mine.

So to sum up, we had been prepared for creating an index to lead to one large transaction being written to the log file and thus possibly cause our DMS task to fall behind, but we never thought it would end up failing due to a locking issue on system tables. Going forward, if adding or rebuilding an index of any considerable size, we will have to ensure that our DMS tasks are stopped in advance. Just one final note, this will happen regardless of whether it is a “Full Load and Replicate” task or a “Replication Only” task.

DMS Task Order of Cache Changes Applied:

This next issue applies when you have a “Full Load and Replicate” DMS Task.

Ok, so we have an in-house process whereby we are replicating data from one of our Aurora PostgreSQL databases down to S3 into parquet format, where it is then processed by a Glue job and upserts are done into a Hudi table. And yes, you guessed it, this is where the problems begin to arise.

Let’s go back a minute and talk about how DMS handles a “Full Load and Replicate” task. DMS initially imposes a schema modification lock on the table (a note of caution on this: if this lock request is kept waiting it will make every other request line up behind it). This lock is only taken out briefly and then the data begins to be read from the data pages that make up that table. No snapshot or table locks preventing writes are taken out on that table, but any writes that occur will be recorded in the transaction log and these will be read over once the full load has been completed.

Now imagine a scenario where we have Table_A with say 1 million rows. While the full load is being replicated, an update is applied to a row that has already been read during the full load process, changing the value in the name column from ‘John’ to ‘Bill’. This will get written to the transaction log and DMS will read it once it has finished with the full load. OK, so what’s the problem, you might ask? The core of the issue is how DMS applies timestamps to these records. The update to the row where the name value was changed from ‘John’ to ‘Bill’ will have an older timestamp than the full load that read the data, i.e. the ‘Insert’. So the “Update” is treated as older than the “Insert”. This obviously is a major issue: the current and correct value for that row is “Bill”, but instead the update will be applied 1st (the value “Bill”) and then the insert will be applied after (“John”), so now the Hudi table has the incorrect value.

Yes, I agree this is a very specific problem and one that would not be an issue if we were replicating into a relational database, but this to me seems a major flaw in DMS. Remember these are all core AWS services: DMS, S3, Glue. DMS should surely have the ability to handle this situation.

Are there ways around this? There sure are, but they all require extra work from the developer and could be tricky. Perhaps the easiest solution would be to add your own timestamp column, but then that means having to change your source database just so that you can get the data working downstream. Another alternative might be to select the “Stop before applying cached changes” option, so that your DMS task will do the full load and stop, and will not apply any of the data operations which happened during the full load. You would then have to manually change the timestamp of the insert records to some arbitrary date and then resume your DMS task; the cached changes will then get applied and they will all have a timestamp after the full load data. (h/t to my co-worker Tommy for thinking of the last solution.)

As always, if you liked this blog please follow me on Twitter to be kept up-to-date with my latest blog releases and more fun stuff.

Until next time, Slán!

Checkpoints logged in Aurora PostgreSQL V.15

One of the changes that came with PostgreSQL V.15 was that the configuration setting log_checkpoints had its default value changed from 0 to 1, meaning that now by default it was on. Ok, so what does this setting actually do?

Well, it does exactly as the name suggests, it logs all checkpoints that occur in the database.

So what is exactly logged and is this information useful?

As well as logging all checkpoints and server restarts, there is also information logged about the number of buffers that were written to disk, as well as the duration of these writes.

There had been a lot of debate in the Postgres community about enabling this configuration setting by default but the general consensus was that having it on by default was probably better as it could be very useful in a debugging scenario.

Ok, so are there any downsides to this?

Well the obvious one being that you will have increased data being written to your log file. For the most part this should not really be an issue. By default in Postgres the configuration setting checkpoint_timeout which controls the time between automatic checkpoints is set to 5 minutes, but this can be as low as 30 seconds depending on your system.

Another configuration parameter that will impact the number of checkpoints that occur is max_wal_size, which determines the size the WAL can grow to before a checkpoint will occur in between your automatic checkpoints. The default for this is 1GB. So provided these are set to reasonable values, in most cases the increased logging should not be an issue.
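If you want to check where you stand, the current values are easy to pull out (these are the stock parameter names; your parameter group may override the defaults):

SHOW log_checkpoints;      --on by default from PostgreSQL 15
SHOW checkpoint_timeout;   --default 5min
SHOW max_wal_size;         --default 1GB

--Or all three in one go from the catalog
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('log_checkpoints', 'checkpoint_timeout', 'max_wal_size');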

Ok, all sounds good so far, but in the heading you mentioned Aurora, does the same apply here? Well this is where things get a little weird.

Aurora PostgreSQL differs from Postgres in that there are two separate layers, the database engine and the storage engine, with the storage engine handling a bunch of things that in most databases would be handled by the database engine, and yes, you guessed it, one of these is checkpoints.

Without getting too deep into the architecture of Aurora PostgreSQL, that would be a blog on its own, most relational databases are page structured, which means they have a log buffer in memory that is written to; its contents are written to disk either when it becomes full, when a transaction is committed or during a checkpoint, and writing the log to disk ensures durability. A checkpoint is also responsible for writing out the dirty (modified) data pages from memory to disk, these could be 4kb or 8kb pages.

Aurora differs from these database systems in that it is log-structured, which means that it does not contain a log buffer, all log changes are written straight to disk and no dirty pages are ever written from memory to disk, i.e. no need for any checkpoints. Wait, what? So yes, you read that correctly, no checkpoints occur in the database engine. But! And this is a big but, checkpoints do occur in the storage engine. This is completely handled by AWS and again, as this is not a post about Aurora architecture, I’ll leave it at that. (On a side note, log-structured database systems can give you more consistent write performance: yes, you are writing to disk for each write, which will increase I/Os as you do not have a log buffer, but you don’t periodically pay the price of doing a checkpoint where you have to do these large writes of 4kb or 8kb pages.)

So what does this mean for the configuration setting log_checkpoints in Aurora?

Well, there is not much point in having it enabled, since no checkpoints occur in the database engine and so essentially what gets logged is useless. See the example below.

LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB

Seeing the above every couple of minutes in your log file is not particularly useful.

So I know what you are thinking, Amazon must take care of this and in Aurora this setting is off by default. Em, that would be a no! Ok, bit of a pain, but we can turn it off right, can’t we? What, we can’t?

Yes that is correct, if you look into your Parameter Groups the setting for log_checkpoints is not there, so it cannot be turned off. Not an ideal situation.

I have raised this with AWS who acknowledged it was an issue and raised it with their internal team but as of the time of writing this blog post there is no date for this being fixed.

UPDATE ** 25-10-23 The AWS team got back to me that the issue is now fixed in v 15.4, where the setting log_checkpoints is ‘OFF’ by default. Good stuff 🙂

So there you go folks, hope you found this post useful.

Until next time, Slán!

What Tech Books am I reading?

Well folks today is a less technical post as I discuss the books that I have read over the last 12 months and those that I hope to get through in the coming year.

As we approach the end of the summer here in Ireland, I always think it’s a good time to review the reading material that I have got through and put some plans in place to keep me busy for the long winter.

As with most of us, life is very busy and Time is our most precious resource so between work, family, working on my few acres it leaves very little time for professional development outside of work hours. I like to try and get through a few courses every year and of course do my daily reading of the many bloggers (post on this at a later stage) that I follow, so when it comes to books I tend to read these in my downtime, i.e. Late in the evening during those cold winter months. So go easy on me, the list is small! 🙂

Ok so first up, let’s review the books I have got through:
  • SQL Server 2019 Administration Inside Out
  • SQL in a NutShell: A Desktop Quick Reference
  • Fundamentals of Data Engineering: Plan and Build Robust Data Systems
  • PostgreSQL 14 Internals (e-book)

SQL Server 2019 Administration Inside Out – I bought this as we had planned to upgrade our prod databases from 2016 to 2019. Initially I thought I would use it mostly as a reference book, but I actually got through nearly all of it and learned a lot as well. SQL Server 2019 is such a broad topic that no one book can cover everything, so it doesn’t go into huge detail on any one subject, but it is a very good book for scanning through and picking up on some new features you had never come across (like partially contained databases).

SQL in a NutShell – This book is a revised version written by the brilliant Kevin Kline. As it states in its name, it’s more of a reference guide to be kept on your desk than a book that you read from start to finish. Some very useful nuggets of information in this and I would regularly find myself picking it up.

Fundamentals of Data Engineering – Bought this book on the recommendation of SeattleDataGuy, as a great beginner’s book into the world of Data Engineering. We all learn differently, but for me I need to understand something from its lowest level to feel comfortable, so this felt like a good book to get started on my Data Engineering path. Having read Database Internals by Alex Petrov I had a good understanding of some of the topics covered in this book already, but I would recommend it for sure before jumping into any courses or tools for Data Engineering.

PostgreSQL 14 Internals by Egor Rogov – This is a free e-book that goes into a lot, and I mean A LOT, of detail. Look, it’s > 600 pages; confession time, I have not read all 600 pages, but have dipped in and out of it. We are looking ahead already to PostgreSQL version 16, but this book is still very relevant. It’s a must-have for a PostgreSQL DBA, especially for a relative newbie like myself. Keep this one on your laptop, you will need it.

Planned reading for the year ahead:
  • Azure SQL Revealed by Bob Ward
  • AWS Fundamentals (e-book) by Tobias Schmidt & Alessandro Volpicella
  • Indexing Beyond the Basics by Tobias Petry
  • Grokking Algorithms by Aditya Y. Bhargava
  • DB Performance at Scale (A Practical Guide) by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov & Cynthia Dunlop

Azure SQL Revealed by Bob Ward – To help me on my journey to getting Azure certification I bought this book as it is aimed as a guide to Azure from the viewpoint of a SQL Server professional. Also, of course, it’s written by the brilliant Bob Ward; Bob has the rare ability of breaking down the most complicated features into simple form so that even folks like myself can understand.

AWS Fundamentals (e-book) by Tobias Schmidt & Alessandro Volpicella – I would be familiar with a lot of what Tobias writes from Twitter (X) and he is super smart. I’m a big believer in understanding the fundamentals of a technology, so although I would be very familiar with many of the services in these books and use them regularly, there is always time to brush up on the basics and here’s hoping this book delivers on this; by my initial review it should do. Another good thing about this book is you get really detailed infographics for all the major services which come in very handy for quick reference.

Indexing Beyond the Basics by Tobias Petry – Again, I would consider myself fairly strong when it comes to indexing, but this book goes into a little more detail, which I was looking for. Again, I’m sure there are a few nuggets of gold to be found in this.

Grokking Algorithms by Aditya Y. Bhargava – Throughout our day as Data Engineers/Computer Scientists we are regularly using various algorithms that we don’t ever think twice about. Therefore this year I want to get a better understanding of these everyday algorithms which help us so much, i.e. What is the big O notation, how does quicksort actually work.

DB Performance at Scale – This book is written by the team at ScyllaDB and by all accounts it’s a really practical look at the issues you face when attempting to scale performance. Along the way are many current and useful links to blog posts and white papers for further reference. The very fact that this book is written by IT experts in the field should ensure it’s very relevant, looking forward to this one. Sometimes in our line of work you can get bogged down in the minute detail, and don’t get me wrong it’s important to know this, but every now and again it’s good to take some time to have a more high-level view of databases and hopefully this book will do this.

So there are 5 books listed here, will I get them all read? Probably not, but sure we’ll do our best anyways and there’s no harm in having a goal, even if you don’t reach it.

You will notice a common theme here: 4 of the books are covering fundamental aspects of my job whereas there is only one book, ‘Azure SQL Revealed’, that is venturing into new territory. As an old football coach used to say to me, by all means learn new skills but make sure to practice the ones you currently have even more.

Until next time, Slán!

CTE’s in PostgreSQL

So today’s topic is Common Table Expressions in PostgreSQL and how they can differ from CTE’s used in MS SQL Server, depending on how you set them up.

So I’m a PostgreSQL newbie, coming over from the world of MS SQL Server, the first major version of PostgreSQL I worked on being V11. When coming to terms with the many differences between the two databases, one that stuck out was CTE’s. In MS SQL Server a CTE is a logical construct: where possible the query within the CTE will be inlined in the outer query and so for all intents and purposes it’s treated as just a sub query, allowing predicate pushdown and all that good stuff. What it does give you is a neater way to construct your query. In PostgreSQL V11 a CTE behaves very differently: the query within is ring-fenced off from the main query. In essence, the query within the CTE is executed 1st, the results are stored in memory and then execution of the outer query begins, with the results of the CTE being a “black box” to the optimizer, i.e. no query in-lining or predicate pushdown can happen. This is not necessarily a bad thing, you just need to be aware of how different it is to the behavior you see in MS SQL Server.

Anyways enough about the past, as from Postgres V12, the default behavior for a CTE is to be in-lined into the outer query.
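Just to illustrate (table name borrowed from the example further down), from V12 onwards the first form below will normally be inlined, so the filter on region can be pushed down into the scan of the underlying table, and you can also spell that intent out explicitly with NOT MATERIALIZED.

--From PostgreSQL 12 onwards this CTE will normally be inlined into the outer
--query (it is non-recursive and referenced once), so the filter on region can
--be pushed down into the scan of dbo.mysomething.
with mycte as (
   select somethingid, region
   from dbo.mysomething
)
select somethingid
from mycte
where region = 1;

--Or state the intent explicitly
with mycte as not materialized (
   select somethingid, region
   from dbo.mysomething
)
select somethingid
from mycte
where region = 1;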

So the current major version of PostgreSQL is V15, so why blog about this now? Well, we had an interesting scenario during the week whereby a function that was being executed against a read-replica instance was failing. When we investigated the issue it was due to a temp table being used in the function, and of course on a read-replica no writes can be done against the instance, and yes, this does include temporary tables. So then, why not use a materialized CTE?

So with a MATERIALIZED CTE, the query is ring-fenced off and executed 1st, the results stored in memory and then a join is done when the CTE is referenced in the outer query.

with myfirstmaterialized_cte(somethingid) as MATERIALIZED (
   select somethingid
   from dbo.mysomething
   where mysomething.region = 1)
select 
   myotherthing.myotherthingid,
   myotherthing.myothername
from 
   dbo.myotherthing
   inner join myfirstmaterialized_cte as mycte
      on mycte.somethingid = myotherthing.somethingid;

Ok cool, so then what are the differences between a temporary table and a Materialized CTE in PostgreSQL?

  • You can add indexes to a temporary table, you cannot use indexes on materialized CTE’s.
  • A temporary table exists for the life of the session whereas a materialized CTE exists only for the duration of the query that references it.
  • Temp tables can require vacuuming.
  • Temp table writes need to be logged

So if you are from a MS SQL Server background, you are probably coming to the same conclusion that I did: are these not very similar to table variables in MS SQL Server? There are for sure many similarities, with the optimizer basically looking upon them as black boxes (no statistics, no indexes, not logged and, up until MS SQL Server 2019, no idea of the cardinality estimate). A big difference between the two is the scope of their lifetime: a table variable resembles a temporary table more, living beyond a single statement for the scope of the batch it is declared in.

So to summarize today’s blog post:

  • We learned that in PostgreSQL a CTE can be automatically in-lined into the outer query (provided the conditions are right), which is the default from V 12 onwards and in most cases this is the behavior that you would want.
  • If you want the CTE executed 1st and results stored in memory independent of the outer query, you can still achieve this in newer versions on PostgreSQL simply by attaching the keyword ‘MATERIALIZED’ in the CTE definition.
  • Think about choosing a materialized CTE over a temporary table if you need to connect to a read-replica, or if it’s going to contain a small result and only needs to be referenced in one place within your batch.

Until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

LRU-K Buffer Eviction Algorithm

We know that when it comes to performance in any relational database, memory is king, that is you want to be able to fit as much of your working set into memory as possible, because no matter how fast your disks are they will never come close to the access speed of memory.

Ok then, given the above paragraph, how we evict pages from memory becomes extremely important from a performance viewpoint.

Today we will discuss the LRU-K algorithm which is used by MS SQL Server to manage what pages it evicts from the buffer pool, to be more precise SQL Server uses LRU-2.

Let’s first look at the LRU (Least Recently Used) algorithm and why this is not used.

  1. It does not take into account how frequently a page is accessed, caring only about when it was last accessed. Therefore, potentially, pages that are frequently accessed could end up getting kicked out of memory.
  2. It also leaves memory susceptible to what is called ‘Table Scan Flooding’, i.e. where a table scan is carried out on a large table that might be infrequently accessed. Using the LRU algorithm, this would flood memory with these infrequently accessed pages, kicking out frequently accessed pages. Considering that more and more database environments are heading down the hybrid approach where the database is used for both OLTP and also Reporting, this would be a big problem: one report run by a user during the day could have a very detrimental impact on the overall system.
  3. No thought is given to the type of page that is being evicted either. For example, an Index page might be much more useful to keep in memory as it is much more likely to be hit compared to a data page.

Right then, we have seen why the LRU algorithm is not a good choice for the management of page eviction from memory for relational databases, so let’s look at what SQL Server actually uses, the LRU-K, where k >= 2 algorithm.

Essentially what this means is that for each page we keep track of the last two times that it was accessed. This information is stored in the page header. From these an ‘Interarrival Time’ is calculated which essentially is the time between the two references and only pages with the shortest interval time are kept in memory.

Another point to note on this is what is meant by a reference. So, there are different types of references, for example a page that is read in from disk and referenced in a transaction: should that page be evicted before the transaction is completed, the chances might be high that the same page will be referenced again within that transaction and so we would have to read it in again. If it is referenced twice within the same transaction, does that mean it is now considered a very popular page as it has two references very close together? But again, it’s possible that once the transaction completes that page may never be referenced again.

My point here is that the LRU-K algorithm must take all of this into account and look at references inside a transaction, transaction retries, etc.

If you are interested in reading in more detail on the LRU-K algorithm, check out this paper that was written on this: LRU-K Algorithm

Until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

Partially Contained Databases in MS SQL Server

Contained Databases, or Partially Contained Databases to give them their correct title (more on this later), are a feature in SQL Server that essentially allows you to create a database where all of the components are contained within that database, even the logins. That’s right, the logins can exist only in the contained database and do not need to exist on the server!!

The upside of this? You guessed it, it makes the migration of a database onto a new server a much easier job, as you no longer have to worry about migrating over logins or about orphaned users. Ok, so enough chat on what it does, let’s get down into the weeds!

In a Partially Contained Database there are two types of authentication methods used:

  • Windows Contained Users
  • SQL Contained Users

But regardless of the authentication method, it happens at the database level and NOT the server level, this is the key thing with contained databases.

  1. First up you need to ensure contained database authentication is enabled on the server:
exec sp_configure 'contained database authentication',1;
go
reconfigure;

2. Next, you need to set the database you are creating to partially contained, or modify an existing database.

--Creating a new database
create database mycontaineddb containment = partial;

--Modifying an existing database
alter database myolddatabase set containment = partial with no_wait;

3. Let’s create a SQL user that is fully contained in our database:

USE mycontaineddb
GO
CREATE USER TestUser WITH PASSWORD = N'test*test*test1', default_schema = dbo;
GO

4. Let’s try connecting to the newly contained database:

  • Normally to connect you would specify a Server Instance Name, Authentication method and Windows or SQL Login details. For Contained database you need to do the following:
    • In SSMS, click on Options
    • Choose the ‘Connection Properties’ tab
    • In the ‘Connect to database’ dropdown choose the contained database.

Well, Partially Contained Databases just sound amazing, ok then, what are the drawbacks, there must be some right? Well, ok, there are two, not necessarily drawbacks but more limitations.

  1. Replication, CDC or Change Tracking cannot be enabled on a Partially Contained database.
  2. Security – So it’s not that they are less secure, but rather the user that owns the database, i.e. in the role db_owner, would be able to grant access to the database to other users, which of course in turn means those users can now access the database server. This may or may not be an issue, but it is something to be aware of.

Just before I bring this month’s blog to a close, there is a very useful stored procedure that Microsoft provides us to help migrate over users to a contained database, it’s called sp_migrate_user_to_contained. What this proc does is to unlink the SQL Login from the user and assign the password of the Login to that user. You should then follow up on this by removing the SQL login after.

E.g.
            exec sp_migrate_user_to_contained
                @username = 'someolduser',
                @rename = N'keep_name',
                @disablelogin = N'disable_login';

Oh, I nearly forgot, why partially contained? Because there are still some features that cross the database boundary, so you would not say that it is fully contained.

Until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.

How PostgreSQL handles out of row storage

The PostgreSQL storage engine uses a fixed page size of 8kb and it does not allow tuples to span multiple pages. So how does it handle large values? Let’s read on to find out.

So, as we mentioned, PostgreSQL does not allow a row to span multiple pages, so to be able to handle large values PostgreSQL uses a technique known as TOAST (The Oversized-Attribute Storage Technique). The basic idea behind this is that large values are compressed and stored off-row, broken up into multiple physical rows. We’ll get into the details of when TOASTing kicks in, but first let’s look at where exactly this off-row data is stored.

When you create a table that has any variable length columns immediately in the background a corresponding TOAST table is also created that is linked to the original table. This will be used to store any values that are too big to store in-row in the main table, so this is an important point, even if the TOAST table is never needed it will be created anyways.

If a column on a tuple becomes TOASTed, the physical value (the rather large data) is moved to the TOAST table and replaced with a pointer. The actual TOAST table is made up of a unique ‘chunk_id’ per value, a ‘chunk_seq’ to keep track of all of the chunks that make up the value, and then the actual chunk data.

A little more on the pointer that is now stored in-row: this is an 18 byte pointer that contains the:

  • OID of the associated TOAST table
  • OID of the Chunk_id
  • Info on the compression used if any

To allow for faster access a unique b-tree index is automatically created on ‘chunk_id’ and ‘chunk_seq’. It is important to keep this in mind later on when we are calculating the cost of reading TOASTed data and all of the extra page reads which are necessary.
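On community PostgreSQL you can actually go and have a look at this TOAST table (as noted further down, Aurora does not expose them). A rough sketch, with the table and TOAST names assumed:

--Find the TOAST table that backs a given table
select reltoastrelid::regclass as toast_table
from pg_class
where relname = 'mytable';

--Then peek at its chunks; pg_toast.pg_toast_16402 below is just a placeholder
--for whatever name the query above returned
select chunk_id, chunk_seq, octet_length(chunk_data) as chunk_bytes
from pg_toast.pg_toast_16402
order by chunk_id, chunk_seq
limit 10;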

When you create a table in Postgres, for each column you can specify where the value should be stored (i.e. should it be inline or stored in a TOAST table) with the ‘SET STORAGE’ property on the CREATE TABLE or ALTER TABLE commands (an example follows the list below). There are 4 possible values:

  • PLAIN – value is kept in-line and uncompressed. It is used for fixed length columns.
  • MAIN – value is kept in-line and the data is compressed. The value is only moved out-of-line as a last resort, if the row still cannot be made to fit on a page.
  • EXTERNAL – The data is stored in the TOAST table if the tuple size >= TOAST_TUPLE_TARGET. The data is not compressed. Setting an attribute to this is recommended for bytea columns which cannot be compressed any further, as attempting to do so is a waste of CPU resources. Also, for TEXT and bytea columns on which substring operations will be performed this increases performance, as it means only the specific chunks need to be returned rather than deTOASTing the entire value.
  • EXTENDED – (Default) The data is stored in the TOAST table if tuple size >= TOAST_TUPLE_TARGET and the data is compressed.
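Changing the storage strategy for an existing column is a one-liner; a quick sketch (table and column names assumed):

--Keep this bytea column uncompressed when it is TOASTed (no point burning CPU
--trying to compress already-compressed binary data)
ALTER TABLE dbo.mytable ALTER COLUMN somebinarycol SET STORAGE EXTERNAL;

--Check the current strategy for each column
--(p = plain, m = main, e = external, x = extended)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'dbo.mytable'::regclass
  AND attnum > 0;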

Don’t worry about what size a tuple needs to be for TOASTing to kick in, we’ll get back to that. Let’s dig a little deeper into what happens during the TOASTing process itself.

When a tuple becomes eligible for TOASTing, Postgres performs what are known as passes, these are attempts to reduce the size of the tuple. Postgres will carry out 4 passes in total.

  1. It looks at all attributes which have the ‘Extended’ property (i.e. variable length cols) and compresses the value of the attribute; if the value size is still greater than 2kb, it begins to TOAST these values. It will then look at attributes which have the ‘External’ property and begin to TOAST them if they are > 2kb in size.
  2. If the tuple size is still > 2kb, then any attributes which have ‘External’ or ‘Extended’, regardless of size, are TOASTed.
  3. If the tuple size is still > 2kb, then any attributes which have the ‘MAIN’ property are compressed.
  4. If the tuple size is still > 2kb, then attributes with the ‘MAIN’ property are moved out-of-line as a last resort.

Ok, all good so far, but didn’t I mention something about compression? Let’s look some more into that.

The compression used by Postgres is ‘PGLZ’, which is a member of the LZ family of compression algorithms. LZ, when compared to something like ‘GZIP’ which is used in MS SQL Server, gives you a much lower compression ratio, however it is also much quicker. In Postgres 14 there is a new compression option available called LZ4 (a lossless compression algorithm), which claims to provide even faster compression than ‘PGLZ’.

  • If you are on Postgres 14 (lucky you), it’s worth testing out how much quicker the LZ4 compression algorithm is, here are some useful commands to get you started.
SHOW default_toast_compression; --The default is pglz
SET default_toast_compression TO lz4; --Only available on PG14 and up
--Alternatively you can specify compression algorithm on an attribute
ALTER TABLE mytable ALTER COLUMN mycol SET COMPRESSION LZ4;

Ok, let’s summarize what we know so far.

  • When you create a table with a variable length column, Postgres will automatically create a corresponding TOAST table linked to this behind the scenes.
  • By default the storage value of each variable length column in your table will be ‘Extended’, which basically means Postgres will try to store the data in-row, but if the tuple exceeds 2kb it will first compress the data, and if it is still too big then it will look to store some columns out-of-row in the TOAST table, with the inline value replaced with an 18 byte pointer.
  • The columns that have been TOASTed will be stored in 2kb tuples in the TOAST table, whereby large values that exceed 2kb are split into multiple 2kb tuples (chunks) that could potentially span multiple pages of the TOAST table. Hence the need for a chunk_seq to be associated with a chunk_id.
  • Oh and the max amount of text that can be TOASTed, stored off-row in PostgreSQL is 1GB.

So we have seen how TOASTing works, but it can have some serious performance implications.

  1. Let’s think about the impact of reading a value once it has been TOASTed. Take a simple table called dbo.MyTable(id int, somevalue text); before TOASTing, reading the column somevalue for one particular tuple would mean reading one page. Post TOASTing, the data is now stored off-row, so you still read the page in the table that the tuple is on, you then take the TOAST pointer and traverse a b-tree index to find the corresponding page in the TOAST table. The minimum a b-tree index can be is 2 levels, so that is two extra page reads at least, and that is just the index, you must still read the page or pages where the value now resides. So we have gone from 1 page read to 4, and while this might not seem that much, imagine you are reading 1 million rows, then the difference between reading 1 page or 4 becomes huge. (A quick way to see these extra reads is sketched after this list.)
  2. I want to make a special mention here on the impact that Toasting can have on JSONB. The JSONB data type in postgres stores a binary representation of a JSONB document, which in itself is its own object and not an atomic value. If you were to reference two different keys from the same jsonb (document) column, then this would mean that the entire jsonb column would be de-TOASTED twice, not once. E.g. SELECT jsoncontent -> ‘namekey’, jsoncontent -> ‘agekey’ from dbo.someothertable, then even though you are accessing the same column, jsoncontent twice in the select statement, it will still go through the de-TOASTing process twice.
  3. You also have the additional overhead of decompressing the data. Now this could also happen in-row, but again this is an additional cost.
  4. So is TOAST all bad for performance? Well, not if you don’t reference the column that has been TOASTed; in that case the value is left untouched and is not de-TOASTed. So for example, if you had a table whereby a column in it had been TOASTed but this was a column that was not accessed frequently, then this could improve your query performance, as the original table will be much smaller and so more pages of it will fit into memory, and since we rarely access the column whose value is stored in the TOAST table we don’t have to worry about paying the cost of de-TOASTing.
  5. Database bloat can become an issue, let me explain why. We know that postgres uses MVCC, so when a row is updated it is not actually deleted but rather a new version of the row with the updated value is created. If you have a table that has a high update rate on columns that are TOASTED, you could see database bloat as these large values could cause the database to grow as multiple versions of the row could exist until such time as the row becomes obsolete (no longer visible to any active transactions) and vacuum runs to mark the space as re-usable but that space will never be returned back to the operating system. While this is not a problem directly with TOAST, it is a problem with doing high number of updates on large values in Postgres.
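Following on from point 1 above, a quick way to see those extra page reads is to compare the buffer counts reported by EXPLAIN before and after the column has been TOASTed; a sketch using the hypothetical table from that point (length() is there simply to force the value to actually be de-TOASTed during the run):

explain (analyze, buffers)
select length(somevalue)   --forces the value to be de-TOASTed
from dbo.mytable
where id = 1;
--The shared hit/read counts will include the pages of the TOAST table and its
--index once "somevalue" has been moved out of line.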

Some useful ad-hoc points also worth remembering:

  • In Aurora Postgres you cannot view the associated TOAST tables.
  • Compression will not be performed on an attribute if it is estimated that the size reduction would be < 25%, as this is deemed a waste of CPU resources, when the PGLZ compression algorithm is used; but if the LZ4 algorithm is used then, as long as the compressed value is smaller than the original, the compression will be kept.
  • During an Update operation, values of unchanged fields are preserved as is, so an UPDATE on a row that has out-of-row values will not incur the cost of updating the TOAST (again provided none of these TOASTed values were changed.).

Three important values to remember here:

  • TOAST_TUPLE_THRESHOLD – The TOASTing process is triggered only when a row value to be stored in a table is wider than this value in bytes (usually this is 2kb).
  • TOAST_TUPLE_TARGET – When the TOASTing process kicks in because a row is wider than the TOAST_TUPLE_THRESHOLD we have just covered, values will be compressed and moved off-row until the entire row is shorter than this ‘TOAST_TUPLE_TARGET’ (and again this is usually 2kb). This value can be adjusted for each table with the below command:
    • ALTER TABLE <tablename> SET (toast_tuple_target = N)
    • The allowed values for this threshold are between 128 and 8160 bytes.
    • It can be sometimes be a useful performance boost to reduce the size of this threshold, doing so pushes more data into TOAST, making our table much smaller and less I/O required to read it. But again this would only be useful if the values you were pushing into TOAST were not often read.
  • TOAST_MAX_CHUNK_SIZE – This is the size at which TOASTED values are split up into when stored out of row. The default is 2kb, allowing for 4 rows per page.
So to finish up with today’s blog, I hope you have a better understanding of how data in PostgreSQL is stored off-row. The performance impact of TOAST on jsonb is for me a major hurdle that PostgreSQL needs to overcome. JSONB is such a great data type with all of the bells and whistles that come with it, such as indexing and pre-parsing, but the cost of accessing a jsonb value once it becomes TOASTed just increases regardless of the nesting level. This is something the PostgreSQL community is very aware of and working hard at finding possible solutions for. For some additional information on where TOAST and JSONB is heading in the future, check out this video (JSONB Performance) by the great Oleg Bartunov. That’s all folks.

Oh, before we go here are some useful queries for determining TOAST table and attribute sizes.

--Get size of table and associated Toast Table
select 
    pg_size_pretty(pg_relation_size(oid)) as table_size,
    pg_size_pretty(pg_relation_size(reltoastrelid)) as toast_size
FROM
    pg_class
where 
    relname = 'mytable'

--Get the size of the column
select 
    pg_column_size(mylargecol) --No. of bytes, possibly compressed
    ,octet_length(mylargecol) --Total no. of bytes 
from
    dbo.mytable;

Until next time, Slán!

Disclaimer: Never just believe everything you read in a post, test this out for yourself, we all can make mistakes! Happy coding.