Data independence
The fundamental socle on which modern data management is based (Edgar Codd 1950).
Data independence means the independence of the logical view on the data and physical storage.
A database management system stack is four-layer:
- A logical query language
- A logical model
- A physical compute layer that processes the query
- A physical storage layer where the data is physically stored
In the past two decades, the physical layer has changed: clusters with thousands of machines rather than a single machine. However, the look and feel of querying data remain the same to the end user.
Formal prerequisites
Sets and relations
set, inclusion: are trivial concepts
Cartesian product: AxB
relation: a relation R on a family of sets is a subset of their cartesian product.
partial function: A partial function p between two sets A and B is a relation is a relation that does not associate any element of A to more than an element of B
function: a partial function that is "injective"
Sets commonly used:
natural integers, relative integers, decimals, rational numbers, real numbers, a set of all values
Rational database management systems (RDBMS)
main concepts
RDBMS is based on a tabular data format. Its first class citizens:
- table: a collection of records
- attribute: a property that records can have
- row: a record in a collection. Also called: record, entity, document, item, business object
- primary key: attribute(s) tha uniquely identify a record
formalism behind relational model
Tables can be defined formally (as purely mathematical objects)
Relational integrity
A collection T fulfills relational integrity of all records have identical support.
Domain integrity
The values associated with each attribute should be restricted to a domain.
Atomic integrity
values are only atomic values: the collections does not contain any nested collections or sets or lists or anything has a structure of its own.
If these constraints are relaxed, we enter the world of NoSQL databases:
Relational Algebra
Selection
Projection
Grouping
Renaming
Changes the name of a column
Extended projections
Compute values and assign results to a new column.
Cartesian products
Joins
Join is a "filtered Cartesian proect" in which we only combine directly related tuples and omit all other non-matching pairs.
Normal Forms
Anomalies might occur in poorly designed databases. To avoid anomalies, best practice is to follow normal forms.
First normal form: atomic integrity
Second normal form: each column in a record contains information on the entire record. A table should be only about one type of thing. Each non key field must be about the same thing as the primary key.
Third normal form: forbids functional dependency on anything else than the primary key
SQL language
SQL is a declarative language. SQL is also a functional language (to some extent).
Example
SELECT c.century AS cent,
COUNT(c.name) AS num_captains,
SUM(s.id) AS ships
FROM captains c, ships s
WHERE c.id = s.captain
GROUP BY century
HAVING COUNT(c.name) >= 3
Internally:
Languages
Transactions
ACID:
- Atomicity: either an update is applied completely, or not at all
- Consistency: Properties of database stays the same
- Isolation: the system "feels like" the user is the only one using the system, however in fact thousands are using concurrently.
- Durability: all data written to database is durably stored and will not be lost (e.g. electricity shortage or a disk crash)
Scaling up and out
If we have a lot of rows: object storage, column storage, distributed file systems, massive parallel processing.
If we have a lot of columns: column storage.
If we have a lot of nesting: syntax, data models, document stores, querying.
So, if you do not fully understand normal forms, 3 integrities, and ACID. It is OK for this course, we will very soon break these conventions and enter the Big Data world.
Top comments (0)