The R.A.G (Redshift Analyst Guide): Data Types and Compression

#data #analytics #sql #aws

Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analyst's out there in the world who use it.

Previously on the R.A.G....

Article No Longer Available

When building tables you determine what 'data type' the column(s) will be. You do this column by column and while it may seem silly to specify exactly what is going to go into each column, especially when Redshift can guess for you, it can be a big contributing factor in terms of speeding up performance and decreasing table size.

You can also choose to compress your columns.
Compression will allow for more data to fit inside a block, again decreasing table size. It will also improve the efficiencies of the Zone Maps, another thing that can speed up performance.

Data Types

Data Keywords	Description
SMALLINT ,INT2	This is for WHOLE numbers that only take up 2 bytes of data, range: -32768 to +32767
INTEGER, INT, INT4	Also for whole numbers that only take up 4 bytes of date, range: -2147483648 to +2147483647
BIGINT, INT8	Also for whole numbers that only take up 8 bytes of date, range: -9223372036854775808 to 9223372036854775807
DECIMAL, NUMERIC	For numbers with decimal points, up to 38 digits total. When you classify a column as decimal you must declare both the TOTAL length and then how many decimals. For example decimal(10,2) means ten numbers max with two decimal places, this equates to 8 digits on the left of the decimal, and 2 on the right.
REAL, FLOAT4	For storing smaller, rounded down, floating point numbers
DOUBLE PRECISION, FLOAT , FLOAT8	For storing larger, non rounded, floating point numbers
BOOLEAN, BOOL	Boolean is a single byte flag which is either 1 or 0, true or false. Though it can hold a null value. It can also get represented as a checkbox. You can specify that the default value is true or false, if you don't specify a default value then the default value will be null
CHAR, CHARACTER, NCHAR , BPCHAR	CHAR is a fixed length text string, ignore references to NCHAR and BPCHAR those are old functionality merged all into one. CHAR always takes up all of the space you specify, so if you specify char(100) but only put 'Hi' into the column, the remain 98 characters of space will be filled with spaces. Which can cause issues with EXACT object matching. Use when your column is always going to be a fixed length. CHAR will always use up at least 4 bytes of data, even if you specify CHAR(2)
VARCHAR, CHARACTER VARYING, NVARCHAR, TEXT	VARCHAR allows for varying character length which is good for free text fields. Unlike CHAR it will only use however much space has been entered. So a in a VARCHAR(100) the word 'Christmas' will only use 9 of that 100, saving space. VARCHAR will always use up at least 6 bytes of data, even if you specify VARCHAR(2)
DATE	Use this for dealing with time spanning as small as a day, ie. this will not use / show / handle time.
TIMESTAMP, TIMESTAMP WITHOUT TIME ZONE	Use this for dealing with time AND where your entire data warehouse is in the same timezone
TIMESTAMPZ, TIMESTAMP WITH TIMEZONE	Use this for dealing with time across various time zones

Data Type Usage Examples

Below are some examples of how to use the above data types in your code.

CREATE TABLE    <TABLE_NAME>
(
    COLUMN_NAME     SMALLINT,
    COLUMN_NAME     INT,
    COLUMN_NAME     INTEGER,
    COLUMN_NAME     INT,
    COLUMN_NAME     INT4,
    COLUMN_NAME     BIGINT,
    COLUMN_NAME     INT8,
    COLUMN_NAME     DECIMAL(20,2),
    COLUMN_NAME     NUMERIC(20,2),
    COLUMN_NAME     REAL,
    COLUMN_NAME     FLOAT4,
    COLUMN_NAME     DOUBLE PRECISION,
    COLUMN_NAME     FLOAT,
    COLUMN_NAME     FLOAT8,
    COLUMN_NAME     BOOLEAN DEFAULT FALSE,
    COLUMN_NAME     BOOL DEFAULT FALSE,
    COLUMN_NAME     CHAR(36),
    COLUMN_NAME     CHARACTER(36),
    COLUMN_NAME     NCHAR(36),
    COLUMN_NAME     BPCHAR(36),
    COLUMN_NAME     VARCHAR(36),
    COLUMN_NAME     CHARACTER VARYING(36),
    COLUMN_NAME     NVARCHAR(36),
    COLUMN_NAME     TEXT(36),
    COLUMN_NAME     DATE,
    COLUMN_NAME     TIMESTAMP,
    COLUMN_NAME     TIMESTAMP WITHOUT TIMEZONE,
    COLUMN_NAME     TIMESTAMPZ,
    COLUMN_NAME     TIMESTAMP WITH TIMEZONE,    
);

Which Data Type to use?

So there's a lot of data types to pick, and plenty of overlap, so why not just use VARCHAR(999) for everything and go about your day? Reasons, that's why! The below will help.

Dealing with numbers

Firstly, numbers are WAY more performant than text so you should never use CHAR or VARCHAR when you could be using INT, DECIMAL, or DATE.
INTEGERS don't have decimal places, so don't use them when your customer needs to go down to the decimal level, i.e Currency.
INT2 is obviously more performant than INT4/INT8 however it will cap out at 32767 so only use it for small numbers.
When dealing with decimals, be smart about how many decimal points, it helps performance. If you only need two decimal places then just specify 2

Dealing with Flags

Bool is what you should use for flags,
Don't use CHAR(1) or VARCHAR(1) when you could use BOOL as CHAR(1) uses up 4 bytes of data and VARCHAR(1) uses 6 bytes when BOOL uses 1 byte.
Try and set a default flag as it will save you in the long run

Dealing with Dates

When you are just dealing with the DATE, i.e the day, month, or year then use DATE
For Time just use TIMESTAMP
Only use TIMESTAMPZ when dealing with multiple time zones, good data warehouses use the same timezone across all data
Never put a date into a VARCHAR

Dealing with Text

Don't use CHAR if you don't know how long the text is going to be, you will hit size errors, i.e. this text is too big for the field, or you will be forced to set the limit too high wasting space (char uses up all the space by filling it in with spaces).
Conversely don't use VARCHAR if you know the length of all your values. For example a GUID, which is always 36 characters long should be char(36) not VARCHAR(36) as VARCHAR(36) is actually 40 bytes long. Redshift will perform better on char in these scenarios.
Don't use VARCHAR for anything less than 6 bytes, you won't gain any space with VARCHAR(2)
Don't use CHAR or VARCHAR if you are using a flag as BOOL will be quicker and smaller

Data Compression

Compression, also known as Encoding, makes the column smaller. You can chose different types of compression for different scenarios, and some compression types can only be used on certain data types. In theory, compressing data too much can make it longer to read, however that's not often the case as Amazon makes sure it's compression methods balance out storage and reading. In some scenarios, compression actually can use up more space.

Compression Types

Type	Keyword	Applicable Data Types
Raw (no compression)	RAW	All Types
AZ64	AZ64	INT2/INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPZ
Byte Dictionary	BYTEDICT	All but BOOL & TEXT
Delta	DELTA	INT2/INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPZ
Delta 32k	DELTA32K	INT4/INT8/DECIMAL/DATE/TIMESTAMP/TIMESTAMPZ
LZO	LZO	All but BOOL and FLOAT/FLOAT8
Mostly8	MOSTLY8	INT2/INT4/INT8/DECIMAL
Mostly16	MOSTLY16	INT4/INT8/DECIMAL
Mostly32	MOSTLY32	INT8/DECIMAL
Run-length	RUNLENGHT	All but TEXT
Text	TEXT255, TEXT32K	VARCHAR Only
Zstandard	ZSTD	All but TEXT

What one do I use?

Short Answer

AZ64, unless it doesn't apply, then ZSTD for everything else. Most analysis won't require you to nail compression.

Long Answer

Each compression has their specific use cases. Some are more general in their storage while some work in a very specific manner which lead to some very specific downsides if not used properly.

Some TLDRs...

RAW: No compression, your supposed to not compress your first sort key, so use it there.
AZ64: An aggressive compression algorithm with good savings and performance. A general all rounder for Integers and Dates.
BYTEDICT: Creates a dictionary of values, and so repeated values take up no space. Only use on tables with values that repeat a lot. It's dictionary is limited to 256 values, before new ones get stored as RAW.
DELTA /DELTA32K: Compresses data by only recording the difference between values. Only good if the delta (difference) is small and incremental. Large differences can actually cause DELTA to use more space than if it was uncompressed.
LZO: An aggressive compression algorithm with good savings and performance. Used to be the go to, though now you should use ZSTD or AZ64 instead as they are newer and perform better.
MOSTLY8/16/32: This method is one you use when most of the values in a column are 8/16/32 bits, however their are some outlying larger values. So by specifying a Mostyl8, you are saying the majority of values can be compressed to 8 bytes and the outliers left as raw. Only use when you know you can compress most of the columns.
RUNLENGHT: Very similar to ByteDict in the fact that repeat values, are recycled with one value. It's not limited like ByteDict to a set number of values, however you shouldn't use this on any a sort key column .
TEXT255/TEXT32K: For text values only, works similarly to ByteDict in that it makes a dictionary of values. TEXT255 stores values in 1 byte indexes and is limited to the first 245 unique words in a column before new values are stored uncompressed. TEXT32k stores values in 2 byte indexes and will keep storing new words until it hits a hard limit of 32k bytes of data for the combined dictionary. Only low unique text values will result in good compression.
ZSTD: An aggressive compression algorithm with good savings and performance. Will seldom result in using more data than it saves unlike other compression method. Use this where AZ64 does not apply.

Pro-Tip: If sort key columns are compressed more aggressively than other columns in the same query, Redshift may perform poorly.

Redshift can tell you what it recommends.

If you build a table and run the below command, Redshift will recommend, per column, what the compression should be and will even include it's guess at how MUCH the new compression will help by.

analyse compression <table_name>

header image drawn by me

Who am I?

Ronsoak 🏳️‍🌈🇳🇿

@ronsoak

Who Am I❓
|🇳🇿 |🇬🇧 |🏳️‍🌈
📊 Senior Data Analyst for @Xero
⚠️ My words do not represent my company.
✍️ Writer on @ThePracticalDev (dev.to/ronsoak)
✍️ Writer on @Medium (medium.com/@ronsoak)
🎨 I draw (instagram.com/ronsoak_art/)
🛰️ I ❤️ space!

03:11 AM - 13 Oct 2019