Tony Ochieng

Posted on Mar 1, 2023

SQL 101:INTRODUCTION TO SQL FOR DATA ANALYSIS

#sql #database

What is SQL?
SQL (Structured Query Language) is a powerful programming language that is widely used for managing and manipulating data in a relational database. It provides a set of commands that allow you to query, filter, and transform data stored in tables, which makes it an essential tool for data analysts.
SQL is a standard language that is supported by many database management systems (DBMS) such as MySQL, PostgreSQL, Oracle, SQL Server, and many more. It offers a range of features that enable data analysts to extract, analyze, and visualize data, including selecting specific columns, filtering rows based on criteria, aggregating data using functions, and sorting and joining tables.
Why do data analysts need to learn SQL?
SQL is used extensively in data analysis because it allows analysts to retrieve and manipulate large datasets quickly and efficiently. It is also used for database administration tasks such as creating tables, modifying schema, and adding or deleting data. With SQL, data analysts can easily extract valuable insights from data, which is critical for making informed business decisions.
Here are some of the basic I got to understand in regards to SQL:-
Types of SQL commands

Data Definition Language(DDL)
Used to manipulate database structure

CREATE create a database or table.
ALTER Change the structure of the table such as changing table names, adding and deleting columns.
DROP delete database or table.

Data Manipulation Language(DML)
used to manipulate data in a database

INSERT insert new data into a table.
SELECT selects and displays columns.
DELETE delete data from a table.
UPDATE change or edit data in a table.

Data Control language(DCL)
used to control and manipulate the database permissions

GRANT used to grant admin permissions to user.
REVOKE used to revoke the access rights of a user.

Transaction Control Language
Deals with transactions in the database

COMMIT used to permanently store transactions in the database.
ROLLBACK used to return database to the last commit.
SAVEPOINT allows commands executed after they are set to be rolled back

Relational database management system(RDBMS)
Is a program that allows us to create, update and manage a relational database.

Types of database relationship

one-to-one each record in one table corresponds to exactly one record in another table and vice versa.
one-to-many each record in one table can correspond to one or more record in another table, but each record in the second table corresponds to only one record in the first table.
many-to-many each record in one table can correspond to one or many records in another table.

In SQL, a key is a field or combination of fields that uniquely identifies a row in a table. Keys are used to establish relationships between tables and to ensure data integrity by preventing duplicate rows.

There are several types of keys in SQL, including:

Primary key: A primary key is a field or combination of fields that uniquely identifies each row in a table. It cannot contain null values and each table can have only one primary key.

Foreign key: A foreign key is a field in one table that refers to the primary key of another table. It is used to establish relationships between tables.

Candidate key: A candidate key is a field or combination of fields that can be used as a primary key. It is unique and can be used to identify each row in a table.

Composite key: A composite key is a key that consists of multiple fields. It can be used to uniquely identify a row in a table when no single field can provide a unique identifier.

Unique key: A unique key is a key that ensures that the values in a field or combination of fields are unique. It can be used to prevent duplicate rows in a table.

Comments
Comments are used to enhance the readability of code. When written between code lines the interpreter identifies them and does not execute them. Comments can span across one or multiple lines.
Single line comment: You can use — — (two dashes) to comment out everything to the right of them on a given line

SELECT *  --This comment won't affect the way the code runs
FROM tablename

Multi line Comment : Comments across multiple lines use /* to begin the comment and */ to close.

SELECT *  /* Here's a comment so long and descriptive that
it could only fit on multiple lines. Fortunately,
it, too, will not affect how this code runs. */
FROM tablename

Creating a Database
One can use the CREATE DATABASE statement example;

CREATE DATABASE Customers;

CREATING A TABLE

CREATE TABLE customers(
id INTEGER,
first_name TEXT,
last_name TEXT
);

Fetching all columns from a table

SELECT *
FROM customers;

Inserting data into a table
Using the INSERT INTO statement. Basic syntax:

INSERT INTO customers (id, first_name, last_name, email)
VALUES (1, 'Tony', 'Ochieng', 'Tonyochieng@example.com');

This statement will insert a new row into the "customers" table with an "id" of 1, "first_name" of 'Tony', "last_name" of 'Ochieng', and "email" of 'Tonyochieng@example.com'.

Filtering Data
Filtering output in SQL is done using the SELECT statement and the WHERE clause. The WHERE clause allows you to specify conditions that the selected data must meet. Here's the basic syntax of the SELECT statement with the WHERE clause:

SELECT column1, column2, ...
FROM table_name
WHERE condition;

The condition in the WHERE clause is a logical expression that evaluates to true or false. Only the rows that satisfy the condition are returned in the output.

For example, let's say you have a table called "employees" with columns for "id", "name", "department", and "salary". If you want to select all employees in the "sales" department with a salary greater than $50,000, you can use the following statement:

SELECT id, name, salary
FROM employees
WHERE department = 'sales' AND salary > 50000;

This statement will select the "id", "name", and "salary" columns from the "employees" table where the "department" is 'sales' and the "salary" is greater than 50000. Only the rows that meet these conditions will be returned in the output.

You can also use other operators in the WHERE clause to create more complex conditions, such as OR, NOT, and comparison operators like =, <>, <, >, <=, and >=.

Querying Multiple tables
Querying multiple tables in SQL is done using the JOIN operation, which allows you to combine data from two or more tables based on a common column. There are several types of joins in SQL, including INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

Here's an example to illustrate how to join two tables named "customers" and "orders". The "customers" table has columns for "id", "first_name", "last_name", and "email", and the "orders" table has columns for "id", "customer_id", "product", and "price".

To select all orders made by a customer with a given email address, you can use an INNER JOIN like this:

SELECT customers.first_name, customers.last_name, orders.product, orders.price
FROM customers
INNER JOIN orders
ON customers.id = orders.customer_id
WHERE customers.email = 'Tonyochieng@example.com';

This statement will select the "first_name" and "last_name" columns from the "customers" table and the "product" and "price" columns from the "orders" table where the customer's email address is 'Tonyochieng@example.com'. The INNER JOIN is used to join the two tables based on the "id" column in the "customers" table and the "customer_id" column in the "orders" table.

If you want to include all customers in the output, even if they haven't made any orders, you can use a LEFT JOIN like this:

SELECT customers.first_name, customers.last_name, orders.product, orders.price
FROM customers
LEFT JOIN orders
ON customers.id = orders.customer_id
WHERE customers.email = 'Tonyochieng@example.com';

This statement will return all customers in the "customers" table, along with any orders they have made, if any. If a customer has not made any orders, the "product" and "price" columns in the output will be NULL. The LEFT JOIN is used to include all rows from the "customers" table, even if there is no matching row in the "orders" table.

RIGHT JOIN is a type of join operation that returns all the rows from the right table and matching rows from the left table. If there are no matching rows in the left table, the result will contain NULL values.

The syntax for a RIGHT JOIN:

SELECT column1, column2, ...
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;

In this syntax, table1 is the left table and table2 is the right table. The ON clause specifies the condition for the join.
FULL OUTER JOIN (or simply a FULL JOIN) is a type of join operation that returns all the rows from both the left and right tables, including any non-matching rows. If there are no matching rows in one of the tables, the result will contain NULL values.

The syntax for a FULL OUTER JOIN:

SELECT column1, column2, ...
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;

Here's an example to illustrate how a FULL OUTER JOIN works. Let's say you have two tables, "orders" and "customers", with the following columns:

"orders" table: "order_id", "customer_id", "product", "price"
"customers" table: "customer_id", "first_name", "last_name", "email"
If you want to retrieve all the customers and all the orders, including any non-matching rows, you can use a FULL OUTER JOIN like this:

SELECT customers.first_name, customers.last_name, orders.product, orders.price
FROM orders
FULL OUTER JOIN customers
ON orders.customer_id = customers.customer_id;

In conclusion, SQL is a powerful and widely used language that data analysts must have in their toolkit. It allows analysts to work with large datasets, manipulate data, and extract valuable insights that can drive business decisions. With this introduction to SQL, you are now ready to start learning more about its features and capabilities.

Top comments (1)

Aaron Reese • Mar 1 '23

Good article. Well done. A couple of minor points...
1) inline comments using the -- can be risky. If you save the code and it gets reloaded without the line breaks you could end up changing the logic of the query. E.g. if you commented out and AND subclause in the WHERE clause.
2) in 30 years of coding I have never come across a legitimate use of RIGHT JOIN in production code. It is a code-smell that you don't understand the data. E.g. if you wanted all orders and their customer details, then refactor the code to select orders first and LEFT JOIN to customers.
3) in MSSQL the INNER and FULL keywords are redundant. LEFT and RIGHT are not.
4) you neglected CROSS JOIN. This is a more complex setup so it may have been deliberate as this is an intro post but thought it was worth mentioning
5) although you gave a good overview if keys and relationships you didn't mention Third Normal Form which is important for keys to be effective. In you SELECT examples you join customer to order and get product details and price. In a 3NF design you are unlikely to hold product details on the order and there could be a 1:many relationship (orderLine.order_id = order.id) and order value should not be stored if it can be calculated from summing the order lines as the two values can get out of sync.

But overall a good overview of basic SQL

DEV Community

SQL 101:INTRODUCTION TO SQL FOR DATA ANALYSIS

Top comments (1)

Read next

Master Bidirectional One-to-One Relations in 5 Steps: Boost Spring Data JPA Efficiency

GitCone.com: Chat with all the repositories from this challenge, or with any other repository

Local AI Knowledge Base with Next.js, Ollama, and PostgreSQL

Database Monitoring: Importance, Tools, and Best Practices for Performance Optimization