Patrick Wendo

Posted on Jun 23, 2022

Databases, Relation Algebra and Set Theory.

#programming #math #tutorial #database

This will be a post covering the common ground between database systems, set theory and relational algebra. It is intended as a basic refresher or an intro depending on the reader. I do assume some basic set theory notation knowledge on the part of the reader. Questions and clarifications are welcome.

Terminologies used

Set : These are collections that are completely characterized by their elements. Two sets are equal only if they contain the same elements.

$A = [1,2,3] B = [3,2,1]$ The sets A and B above are equal since they contain the same elements.
Relation (AKA table or entity set) : Given sets $D_1, D_2, ..., D_n$ A relation is a subset of $D_1 \times D_2 \times ... \times D_n$ . It is set of n-tuples $(a_1, a_2, ..., a_n)$ where $a_i \in D_i$
Each relation has a fixed number of columns that are explicitly names where each attribute name within a relation is unique. Rows/tuples in a relation have no ordering associated with them. A database has multiple relations.
Tuple (AKA row or entity instance): It is an element in a relation.
Attribute (AKA column): It describes a tuple in a relation.
Query Language : It is a language in which a user requests information from a database.

Operators

We will cover the 6 basic operators used in relational algebra. These operators can later be used in compositions to form higher order operations.

Select ( $\sigma$ )
Project ( $\Pi$ )
Union ( $\cup$ )
Set Difference ( $-$ )
Cartesian Product ( $\times$ )
Rename ( $\rho$ )

These operations take one or two relations as inputs and produce a new relation as a result. This is a property known as closure.

Select ( $\sigma$ )

\sigma_p(R)

in this example p is called the selection predicate. R is the relation upon which we perform the select operation. This operation will always return a subset of the initial relation. (remember that a set is also a subset of itself, (

R \in R

))

Formally, select is defined as

\sigma_p(R) = { t | t \in R \ and \ p(t) }

Example
Relational Algebra: $\sigma_{firstName="John"}({Employees})$

SQL : SELECT * EMPLOYEES WHERE firstName='John';

Will return all employees who's first name is John

Project ( $\Pi$ )

\Pi_{A_1, A_2,...,A_k}(R)

A_1, A_2,...,A_k

are attribute names and R is the relation.
The result is defined as the relation of K columns obtained by erasing the columns that are not listed. Duplicate rows are removed since relations are sets.

Example
Relational Algebra: $\Pi_{firstName, lastName}({Employees})$

SQL: SELECT firstName, lastName FROM EMPLOYEES;

Will return all employees, but only with their firstname and lastname

Union Operator

{R} \cup {S} = t | t \in R \ or \ t \in S

This requires two relations R and S and returns a relation in which every element is either a member of R or S.

For ${R} \cup {S}$ to be valid the following conditions have to be met.
(i) R and S must have the same [arity]https://en.wikipedia.org/wiki/Arity)(same number of columns)
(ii) The attribute domains must be the same. That is, the corresponding columns must have the same data type.

Example
If we have two relations, Sciences and Humanities and we want to find the courses offered in both we could run this query
Relational Algebra:

\Pi_{courseName}(Sciences) \cup \Pi_{courseName}(Humanities)

SQL : SELECT courseName FROM Sciences UNION SELECT courseName FROM Humanities;
It will return a relation with only the course names from both Sciences and Humanities

Set Difference

R - S = t | t \in R\ and\ t \notin S

This takes in two relations R and S and will return all the rows in R but not in S.

The shaded areas represent $R - S\ and \ S - R$ respectively

if A = [ 1, 2, 3] \ B = [2, 4, 5]\ then \ A - B = [1, 3]

Similar to Unions, set differences must be taken between relations with the same arity.

Cartesian Product

R \times S = tq | t \in R\ and\ q \in S

In this situation we have to assume that the attributes of R and S are disjoint, that is $R \cap S = \emptyset$ . If they are not disjoint then renaming must be used.

The cartesian product will concatenate each row of R with each row of S. The resultant relation will have #rows_in_R x #rows_in_S. (If R has 2 rows, and S has 3 rows then the result will have 3x2=6 rows.)

Rename Operator

It allows us to rename and therefore refer to the result of a relational algebra expression.

\rho_{x}(E)

This will return the expression E under the name X.
Similarly if an expression E has an arity n, then

\rho_{A_1, A_2,...,A_n}(E)

And this will return the result of expression E under the name X with the attributes renamed to

A_1, A_2,...,A_n

Note There are additional operators that do not add to the power of relational algebra, but help to simplify queries. These operators are compositions of the 6 basic operators. They are:

Set Intersection
Natural Join
Division
Assignment. I will explain these in a later post and link it here once I do so.

That's all folks

I'm trying to sharpen my database skills and as such, I figured I will start from the basics of the basics. More to come. Stay tuned.
Cheers,
W3NDO

DEV Community

Databases, Relation Algebra and Set Theory.

Terminologies used