One way to effectively implement multitenancy for SaaS while optimizing resources and maintenance is to store all data in a single table and add a tenant_id
as a key. By utilizing RLS (Row Level Security), each tenant will only be able to view their own data. This can be achieved by setting a GUC (Global User Configuration) session variable containing the tenant_id
and incorporating it into the RLS policy. This will ensure tenant isolation by automatically adding a WHERE clause to all queries.
I exposed the solution in a previous post.
In certain cases, the policy may allow access to a list of tenants instead of just one. Unfortunately, it's not possible to define a GUC session variable as an array. However, there are workarounds available to handle this situation effectively.
Demo table with tenant_id
I create a table consisting of 26 tenant_id
values, each representing a letter from A to Z, and ten million rows.
create extension if not exists orafce;
alter function dbms_random.string volatile;
create table demo as
select generate_series(1,10000000) value
, dbms_random.string('u',1) tenant_id
;
create index demo_tenant_id on demo(tenant_id, value);
I will explain how to limit the queries to specific tenants during the demo. In this case, we will be restricting the queries to five tenants: 'A', 'E', 'I', 'O', and 'U'.
Without RLS policy
Without using row level security (RLS), I need to include a WHERE clause utilizing an ARRAY with either IN()
or = ANY
.
yugabyte=# explain (costs off, analyze)
select tenant_id from demo
where tenant_id = ANY (ARRAY['A','E','I','O','U']::text[]) ;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Index Only Scan using demo_tenant_id on demo (actual time=7.716..2193.248 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY ('{A,E,I,O,U}'::text[]))
Heap Fetches: 0
Planning Time: 8.809 ms
Execution Time: 2346.935 ms
Peak Memory Usage: 18 kB
(6 rows)
With a temporary table
To compare the performance, I use a WITH clause to utilize a temporary table instead of an array:
yugabyte=> explain (costs off, analyze)
/*+ Set(yb_bnl_batch_size 1024) */
with allowed(tenant_id) as (values ('A'), ('E'), ('I'), ('O'), ('U') )
select tenant_id from demo
where tenant_id in ( select tenant_id::text from allowed) ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (actual time=12.210..2671.709 rows=1923830 loops=1)
Join Filter: (demo.tenant_id = allowed.tenant_id)
CTE allowed
-> Values Scan on "*VALUES*" (actual time=0.002..0.003 rows=5 loops=1)
-> HashAggregate (actual time=0.013..0.014 rows=5 loops=1)
Group Key: allowed.tenant_id
-> CTE Scan on allowed (actual time=0.005..0.007 rows=5 loops=1)
-> Index Only Scan using demo_tenant_id on demo (actual time=12.119..1626.838 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY (ARRAY[allowed.tenant_id, $2, $3, ..., $1024]))
Heap Fetches: 0
Planning Time: 0.565 ms
Execution Time: 2802.131 ms
Peak Memory Usage: 494 kB
(13 rows)
With batched nested loop this is the same as with a list if values.
With UNNEST array
I can do the same with a subquery:
explain (costs off, analyze)
/*+ NestLoop(unnest demo) Set(yb_bnl_batch_size 1024) */
select tenant_id from demo
where tenant_id in (
select * from unnest(ARRAY['A','E','I','O','U']::text)
) ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (actual time=3.202..3075.332 rows=1923830 loops=1)
Join Filter: (demo.tenant_id = unnest.unnest)
-> HashAggregate (actual time=0.013..0.014 rows=5 loops=1)
Group Key: unnest.unnest
-> Function Scan on unnest (actual time=0.009..0.009 rows=5 loops=1)
-> Index Only Scan using demo_tenant_id on demo (actual time=3.113..1994.448 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY (ARRAY[unnest.unnest, $1, $2, ..., $1023]))
Heap Fetches: 0
Planning Time: 0.569 ms
Execution Time: 3206.993 ms
Peak Memory Usage: 494 kB
(11 rows)
Without the Cost Based Optimizer enabled, I needed to force the YB Batched Nested Loop Join
. Performance remained the same, so I chose to use the string representation of the array which will be easier to set in a variable.
From a session variable
My goal is to set the array as a session parameter but session variables are scalar. I can set it as a string representation of an array:
yugabyte=# set rls.tenant_id='{A,E,I,O,U}';
SET
The value can be read in a query with current_setting()
WHERE clause with current_setting()
Until a RLS policy is defined, this string can be read and cast as an array in the WHERE clause:
yugabyte=# explain (costs off, analyze)
select tenant_id from demo
where tenant_id = ANY ( current_setting('rls.tenant_id')::text[] ) ;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Index Only Scan using demo_tenant_id on demo (actual time=2.991..2439.003 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY ((current_setting('rls.tenant_id'::text))::text[]))
Heap Fetches: 0
Planning Time: 1.600 ms
Execution Time: 2557.392 ms
Peak Memory Usage: 24 kB
(6 rows)
Even if execution time is the same, casting text to an array at runtime is not optimal. Query planner can make better decisions with known values for selectivity.
Immutable PostgreSQL function
I create a function to read the session variable and cast it to an ARRAY. I define the function as immutable so that it can be executed before query planning.
yugabyte=#
create or replace function get_tenant_id_array(setting text) returns text[] as $$
select current_setting(setting)::text[]
$$ language sql immutable;
CREATE FUNCTION
Now, the casting is done once before execution and the result is visible in the WHERE clause
yugabyte=# explain (costs off, analyze)
select tenant_id from demo
where tenant_id = ANY ( get_tenant_id_array('rls.tenant_id') ) ;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Index Only Scan using demo_tenant_id on demo (actual time=3.958..2338.873 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY ('{A,E,I,O,U}'::text[]))
Heap Fetches: 0
Planning Time: 1.125 ms
Execution Time: 2454.236 ms
Peak Memory Usage: 14 kB
(6 rows)
This function has been defined and optimized for use in a WHERE clause.
Row Level Security policy
I am now ready to implement this as a RLS policy:
yugabyte=> create POLICY tenant_id ON demo FOR ALL TO public
USING (
tenant_id = ANY ( get_tenant_id_array('rls.tenant_id') )
);
CREATE POLICY
yugabyte=> ALTER TABLE demo ENABLE ROW LEVEL SECURITY;
ALTER TABLE
To test it, I create a user with the session parameter already set, grant SELECT on my table, and connect with this user
yugabyte=> create user franck password 'bsky-social-hhae3-zsl2a';
CREATE ROLE
yugabyte=> grant select on demo to franck;
GRANT
yugabyte=> \c - franck
psql (16.0, server 11.2-YB-2.18.3.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
You are now connected to database "yugabyte" as user "franck".
Now, running the same query without a WHERE clause but the session parameter set to the list of tenants I want to be restricted by the RLS policy:
yugabyte=> set rls.tenant_id='{A,E,I,O,U}';
SET
yugabyte=# explain (costs off, analyze)
select tenant_id from demo;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Index Only Scan using demo_tenant_id on demo (actual time=3.029..2302.233 rows=1923830 loops=1)
Index Cond: (tenant_id = ANY ('{A,E,I,O,U}'::text[]))
Heap Fetches: 0
Planning Time: 0.178 ms
Execution Time: 2417.699 ms
Peak Memory Usage: 14 kB
(6 rows)
My list of tenants was added automatically by the Row Level Security policy with the array known at query planning time.
Datatype, Array definition, and indexing
In order to cast the string of array values, you can use any datatype and change the return datatype for the function accordingly. For example, you can replace all text[]
by uuid[]
:
yugabyte=> alter table demo add tenant_uuid uuid unique;
ALTER TABLE
yugabyte=> create or replace function get_tenant_uuid_array(setting text) returns uuid[] as $$
select current_setting(setting)::uuid[]
$$ language sql immutable;
CREATE FUNCTION
yugabyte=> set rls.tenant_uuid='{d1ab011c-dead-face-c0de-dec1a551f1ed,ca5caded-face-dead-c0de-c1a551f1ab1e}';
SET
yugabyte=> explain select * from demo where tenant_uuid = ANY ( get_tenant_uuid_array('rls.tenant_uuid'));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_tenant_uuid_key on demo (cost=0.00..4.12 rows=1 width=36)
Index Cond: (tenant_uuid = ANY ('{d1ab011c-dead-face-c0de-dec1a551f1ed,ca5caded-face-dead-c0de-c1a551f1ab1e}'::uuid[]))
(2 rows)
I prefer using PostgreSQL array literal to define arrays. It involves enclosing the values in curly braces and separating them with a comma. Alternatively, you can use STRING_TO_ARRAY. However, it requires two arguments: the string and the delimiter.
yugabyte=> create or replace function get_tenant_uuid_array(setting text, delimiter text) returns uuid[] as $$
select STRING_TO_ARRAY(current_setting(setting),delimiter)::uuid[]
$$ language sql immutable;
CREATE FUNCTION
yugabyte=> set rls.tenant_uuid='d1ab011c-dead-face-c0de-dec1a551f1ed ca5caded-face-dead-c0de-c1a551f1ab1e';
SET
yugabyte=> explain select * from demo where tenant_uuid = ANY ( get_tenant_uuid_array('rls.tenant_uuid',' '));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_tenant_uuid_key on demo (cost=0.00..4.12 rows=1 width=36)
Index Cond: (tenant_uuid = ANY ('{d1ab011c-dead-face-c0de-dec1a551f1ed,ca5caded-face-dead-c0de-c1a551f1ab1e}'::uuid[]))
(2 rows)
Note that you cannot use ARRAY as a constructor in this case, as it takes multiple scalar values rather than a single character string that can be defined in a session variable.
That technique is optimal and can be used in both PostgreSQL and YugabyteDB. In particular, YugabyteDB has the advantage of using only a single Index Only Scan to seek and read the next rows within the five ranges, similar to a loose index scan (thanks to Hybrid Scan).
Optimized Policy For Singletons
The previous RLS policy always applies an = ANY
condition. With YugabyteDB, this is optimal as one Index Scan can go to multiple ranges in one call, with an index skip scan. The order of the rows returned are those if the index. I've added value
to the index, defined as (tenant_id HASH, value ASC)
and this helps to use an Index Only Scan:
yugabyte=# set rls.tenant_id='{A,E,I,O,U}';
SET
yugabyte=# explain (costs off, analyze)
select tenant_id from demo order by value
;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Sort (actual time=159.815..180.212 rows=192772 loops=1)
Sort Key: value
Sort Method: external merge Disk: 3416kB
-> Index Only Scan using demo_tenant_id on demo (actual time=3.406..83.238 rows=192772 loops=1)
Index Cond: (tenant_id = ANY ('{A,E,I,O,U}'::text[]))
Heap Fetches: 0
Planning Time: 0.159 ms
Execution Time: 219.653 ms
Peak Memory Usage: 5768 kB
(9 rows)
Because if scanning multiple range, an additional Sort operation was needed for order by value
.
Unfortunately, if a set a singleton array, the query planner doesn't realize that there's only one range, which is sorted on value
and still adds a Sort
operation:
yugabyte=# set rls.tenant_id='{Y}';
SET
yugabyte=# explain (costs off, analyze)
select tenant_id from demo order by value
;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Sort (actual time=47.869..49.779 rows=38674 loops=1)
Sort Key: value
Sort Method: quicksort Memory: 3179kB
-> Index Only Scan using demo_tenant_id on demo (actual time=1.502..36.325 rows=38674 loops=1)
Index Cond: (tenant_id = ANY ('{Y}'::text[]))
Heap Fetches: 0
Planning Time: 0.156 ms
Execution Time: 57.721 ms
Peak Memory Usage: 3516 kB
(9 rows)
If you expect singletons, it may be useful to handle two cases in the RLS policy:
yugabyte=# \c - yugabyte
psql (16.0, server 11.2-YB-2.19.3.0-b0)
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# drop POLICY tenant_id ON demo;
DROP POLICY
yugabyte=# create POLICY tenant_id ON demo FOR ALL TO public
USING (
case
when array_length(get_tenant_id_array('rls.tenant_id'),1) = 1
-- equals to the first element when there's only one
then tenant_id = (get_tenant_id_array('rls.tenant_id'))[1]
-- use =ANY when there's more elements
else tenant_id = ANY ( get_tenant_id_array('rls.tenant_id') )
end
);
CREATE POLICY
yugabyte=# \c - franck
psql (16.0, server 11.2-YB-2.19.3.0-b0)
You are now connected to database "yugabyte" as user "franck".
The same as before is generated for multi-values array:
yugabyte=# set rls.tenant_id='{A,E,I,O,U}';
SET
yugabyte=# explain (costs off)
select tenant_id from demo order by value;
QUERY PLAN
---------------------------------------------------------------
Sort
Sort Key: value
-> Index Only Scan using demo_tenant_id on demo
Index Cond: (tenant_id = ANY ('{A,E,I,O,U}'::text[]))
(4 rows)
An optimized version is generated with a singleton:
yugabyte=# set rls.tenant_id='{Y}';
SET
yugabyte=# explain (costs off)
select tenant_id from demo order by value;
QUERY PLAN
----------------------------------------------
Index Only Scan using demo_tenant_id on demo
Index Cond: (tenant_id = 'Y'::text)
(2 rows)
With this new policy, the Sort operation is avoided when not needed.
Top comments (6)
Hi Franck,
Nice informative article, thanks! I have discovered in practice (also found this discussion as well in the meantime) about some limitations of using RLS on more complex queries. Maybe an idea for a Part 2 of this article, featuring some queries that trip the planner up when using RLS, and what to do about it?
Thanks Mircea. I'll look at your case. I've already added a paragraph with a policy that is optimized when there's only one item in the array (from another user feedback)
Hi Franck,
We do use rls in our projects. We use similar type of rls in our project both in oracle and postgres. What I observed is that oracle is more advanced to PostgreSQL in Row level security area. For example, In our projects some times we want to by-pass the policies and return the all the data in the table. In this case, I think with above method by-passing rls is not possible in Postgres. Is there way we can by-pass the rls to return all the rows using above approach?
Thanks,
Ramnath
The ways I know to bypass RLS in general are: connect with the owner, or with a user with BYPASSRLS. For a single user with the same user that is used for RLS, you can also create a view on the table and grant it for select. Then reading from the view will not apply the RLS.
Hi, Frank, first thanks for your awewome blog, little question
On this row security level, i suppose we still have some statistic mixup on histogram
on tenand by id, or there is some way to escape this ?
regards marc
Thanks Marc. Yes, with only per-column statistics, the estimations will probably be underestimated. For example, WHERE tenant='a_customer_in_UK' and customer_name='Smith' will not know the correlation.
It may be good to CREATE STATISTICS on column groups