DEV Community

Cover image for Essential SQL Commands for Data Science
Hassie Mike Perekamoyo
Hassie Mike Perekamoyo

Posted on

Essential SQL Commands for Data Science

SQL (Structured Query Language) is a programming language used to manage and manipulate relational databases. It is used to create, modify, and query databases, as well as perform data manipulation operations such as inserting, updating, deleting, and retrieving data.

In this playlist you will get to know about different sql commands, queries and all the sql functions for the data science

favicon youtube.com

SQL is a standard language, meaning it can be used with different relational database management systems (RDBMS) such as MySQL, Oracle, Microsoft SQL Server, and others. The syntax and commands used in SQL may vary slightly depending on the specific RDBMS being used, but the basic structure and principles of the language remain the same.

SQL is a declarative language, meaning it focuses on specifying what data should be retrieved or manipulated, rather than how it should be done.

Some common tasks performed using SQL include creating tables to store data, inserting data into those tables, querying data to retrieve specific information, and updating or deleting existing data. SQL can also be used to perform more advanced operations, such as joining multiple tables, aggregating data, and creating views and stored procedures.

SQL (Structured Query Language) is a powerful tool for data science that allows you to manage and manipulate relational databases. Here are some essential SQL commands for data science:

1. Simple Data Retrieval

SELECT:
The SELECT keyword is used to retrieve data from one or more database tables. When used in a SQL query, it specifies the columns that should be included in the query result.
For example, let's say we have a table called "customers" with columns "customer_id", "first_name", and "last_name". To retrieve all data from this table, we would use the following query:

SELECT *
FROM customers;
Enter fullscreen mode Exit fullscreen mode

This would retrieve all columns from the "customers" table. Alternatively, if we only wanted to retrieve the "customer_id" and "first_name" columns, we would use the following query:

SELECT customer_id, first_name
FROM customers;
Enter fullscreen mode Exit fullscreen mode

FROM:
The FROM keyword is used to specify the table or tables from which to retrieve data. It is used in combination with the SELECT keyword to specify the source of the data being retrieved.
For example, if we wanted to retrieve data from the "orders" table, we would use the following query:

SELECT *
FROM orders;
Enter fullscreen mode Exit fullscreen mode

DISTINCT:
The DISTINCT keyword is used to remove duplicate rows from the query result. When used in a SQL query, it specifies that only unique values should be included in the query result.
For example, let's say we have a table called "orders" with columns "order_id", "customer_id", and "order_date". If we wanted to retrieve only the unique customer IDs from this table, we would use the following query:

SELECT DISTINCT customer_id
FROM orders;
Enter fullscreen mode Exit fullscreen mode

This would retrieve only the unique customer IDs from the "orders" table, and would exclude any duplicates.

In summary, the combination of SELECT, FROM, and DISTINCT in a SQL query allows you to retrieve specific data from one or more tables while removing any duplicates in the query result.

2. Data Retrieval with Simple Conditions

WHERE:
The WHERE keyword is used to filter data based on a specified condition. It is used in combination with the SELECT keyword to specify the criteria for selecting data from a table.
For example, if we wanted to retrieve only the orders placed by a specific customer with a customer ID of 123, we would use the following query:

SELECT *
FROM orders
WHERE customer_id = 123;
Enter fullscreen mode Exit fullscreen mode

This would retrieve only the rows from the "orders" table where the customer ID is equal to 123.

ORDER BY:
The ORDER BY keyword is used to sort the query result by one or more columns in ascending or descending order. It is used in combination with the SELECT keyword to specify the sorting order for the data being retrieved.
For example, if we wanted to retrieve all orders from the "orders" table and sort them by order date in descending order, we would use the following query:

SELECT *
FROM orders
ORDER BY order_date DESC;
Enter fullscreen mode Exit fullscreen mode

This would retrieve all rows from the "orders" table and sort them in descending order based on the order date column.

LIMIT:
The LIMIT keyword is used to limit the number of rows returned by a query. It is used in combination with the SELECT keyword to specify the maximum number of rows to retrieve.
For example, if we only wanted to retrieve the first 10 orders from the "orders" table, we would use the following query:

SELECT *
FROM orders
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

This would retrieve the first 10 rows from the "orders" table, based on the default sorting order of the table.

In summary, the combination of WHERE, ORDER BY, and LIMIT in a SQL query allows you to filter, sort, and limit data retrieval based on specific conditions and criteria.

3. Aggregations

GROUP BY:
The GROUP BY keyword is used to group data by one or more columns in a table. It is used in combination with the SELECT keyword to specify how the data should be grouped.
For example, if we wanted to count the number of orders for each customer in the "orders" table, we would use the following query:

SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
Enter fullscreen mode Exit fullscreen mode

This would group the data by customer ID and count the number of orders for each customer.

COUNT():
The COUNT() function is used to count the number of rows in a table or the number of non-null values in a specific column. It is used in combination with the SELECT keyword to specify what should be counted.
For example, if we wanted to count the total number of orders in the "orders" table, we would use the following query:

SELECT COUNT(*)
FROM orders;
Enter fullscreen mode Exit fullscreen mode

This would count the total number of rows in the "orders" table.

SUM():
The SUM() function is used to calculate the sum of values in a specific column. It is used in combination with the SELECT keyword to specify which column should be summed.
For example, if we wanted to calculate the total revenue from all orders in the "orders" table, we would use the following query:

SELECT SUM(order_total)
FROM orders;
Enter fullscreen mode Exit fullscreen mode

This would calculate the sum of the "order_total" column in the "orders" table.

AVG():
The AVG() function is used to calculate the average of values in a specific column. It is used in combination with the SELECT keyword to specify which column should be averaged.
For example, if we wanted to calculate the average order total for each customer in the "orders" table, we would use the following query:

SELECT customer_id, AVG(order_total)
FROM orders
GROUP BY customer_id;
Enter fullscreen mode Exit fullscreen mode

This would group the data by customer ID and calculate the average order total for each customer.

HAVING:
The HAVING keyword is used to filter data based on a condition applied to an aggregated column. It is used in combination with the GROUP BY keyword to specify the criteria for selecting data.
For example, if we wanted to retrieve only the customers with a total order value greater than $1000 from the "orders" table, we would use the following query:

SELECT customer_id, SUM(order_total)
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
Enter fullscreen mode Exit fullscreen mode

This would group the data by customer ID and sum the order totals for each customer, and then only retrieve the rows where the sum is greater than $1000.

MIN():
The MIN() function is used to calculate the minimum value in a specific column. It is used in combination with the SELECT keyword to specify which column should be used.
For example, if we wanted to retrieve the minimum order total from the "orders" table, we would use the following query:

SELECT MIN(order_total)
FROM orders;
Enter fullscreen mode Exit fullscreen mode

This would retrieve the minimum value in the "order_total" column of the "orders" table.

Alias:
An alias is a shorthand name given to a table or column in a SQL query. It is used to make the query more readable and can be used to rename columns or tables.
For example, if we wanted to rename the "order_total" column in the "orders" table to "total", we would use the following query:

SELECT customer_id, SUM(order_total)
Enter fullscreen mode Exit fullscreen mode

4. Joins

Joins are used in SQL to combine data from multiple tables based on a common column. The most common type of join is the INNER JOIN, also known as the JOIN operation.

The INNER JOIN operation returns only the rows that have matching values in both tables being joined. To perform an INNER JOIN, we need to specify the tables to be joined and the columns on which to join them.

For example, suppose we have two tables "customers" and "orders" with the following data:
customers table:

customer_id customer_name customer_email
1 John Smith john@gmail.com
2 Jane Doe jane@gmail.com
3 Bob Johnson bob@gmail.com

orders table:

order_id order_total customer_id
1 100 1
2 200 1
3 150 3
4 50 2

We can join these tables using the following query:

SELECT *
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id;
Enter fullscreen mode Exit fullscreen mode

This query would return the following result:

customer_id customer_name customer_email order_id order_total customer_id
1 John Smith john@gmail.com 1 100 1
1 John Smith john@gmail.com 2 200 1
2 Jane Doe jane@gmail.com 4 50 2
3 Bob Johnson bob@gmail.com 3 150 3

In this example, we joined the "customers" and "orders" tables on the "customer_id" column. The resulting table contains all columns from both tables where the "customer_id" values match.

The ON keyword is used to specify the join condition. In this case, we specified that the "customer_id" column in the "customers" table should match the "customer_id" column in the "orders" table.

Note that when joining tables, it is important to use unique column names in each table, or to use table aliases to disambiguate column names. This ensures that the correct columns are used in the join condition and in the SELECT clause.

5. Changing Data Types

In SQL, we can use the CAST function to convert a value from one data type to another. The syntax of the CAST function is as follows:

CAST(expression AS data_type)
Enter fullscreen mode Exit fullscreen mode

Here, the expression is the value that we want to convert, and the data_type is the data type to which we want to convert the expression.

For example, suppose we have a table "sales" with a column "sale_amount" that contains decimal values. We want to convert these values to integers. We can use the CAST function as follows:

SELECT CAST(sale_amount AS INTEGER)
FROM sales;
Enter fullscreen mode Exit fullscreen mode

This query will return the "sale_amount" values converted to integers.

Another useful function for working with numeric data is the ROUND function. The ROUND function is used to round a numeric value to a specified number of decimal places. The syntax of the ROUND function is as follows:

ROUND(expression, decimal_places)
Enter fullscreen mode Exit fullscreen mode

Here, the expression is the value that we want to round, and the decimal_places is the number of decimal places to which we want to round the expression.

For example, suppose we have a table "products" with a column "price" that contains decimal values. We want to round these values to two decimal places. We can use the ROUND function as follows:

SELECT ROUND(price, 2)
FROM products;
Enter fullscreen mode Exit fullscreen mode

This query will return the "price" values rounded to two decimal places.

6. Complex Conditions

In SQL, we can use complex conditions to create more advanced filtering and logical expressions. Complex conditions are created using logical operators such as AND, OR, and NOT, and can be combined with parentheses to group expressions together.

For example, suppose we have a table "employees" with columns "first_name", "last_name", and "salary". We want to select all employees who have a salary greater than $50,000 and whose first name is not "John". We can use the following query:

SELECT *
FROM employees
WHERE salary > 50000 AND NOT first_name = 'John';
Enter fullscreen mode Exit fullscreen mode

This query will return all rows from the "employees" table where the "salary" column is greater than 50,000 and the "first_name" column is not "John".

CASE Statement:

The CASE statement is used in SQL to perform conditional logic in the SELECT statement. It allows us to define different output values based on the evaluation of one or more conditions.

The syntax of the CASE statement is as follows:

CASE
   WHEN condition_1 THEN result_1
   WHEN condition_2 THEN result_2
   ...
   ELSE default_result
END
Enter fullscreen mode Exit fullscreen mode

For example, suppose we have a table "orders" with columns "order_id" and "order_total". We want to create a new column called "order_type" that categorizes orders as "low", "medium", or "high" based on their total amount. We can use the following query:

SELECT order_id, order_total,
   CASE
      WHEN order_total < 100 THEN 'low'
      WHEN order_total < 500 THEN 'medium'
      ELSE 'high'
   END AS order_type
FROM orders;
Enter fullscreen mode Exit fullscreen mode

This query will return all rows from the "orders" table, with a new column "order_type" that categorizes each order as "low", "medium", or "high" based on its "order_total" value.

Subqueries:

A subquery is a SQL query that is nested inside another query. Subqueries are useful when we need to perform a separate query to retrieve data that we will use in the main query.

For example, suppose we have two tables "customers" and "orders". We want to find all customers who have placed an order in the last 30 days. We can use the following query:

SELECT *
FROM customers
WHERE customer_id IN (
   SELECT customer_id
   FROM orders
   WHERE order_date >= DATEADD(day, -30, GETDATE())
);
Enter fullscreen mode Exit fullscreen mode

This query will return all rows from the "customers" table where the "customer_id" is in the result set of a subquery that retrieves all "customer_id" values from the "orders" table where the "order_date" is within the last 30 days.

Common Table Expressions (CTEs):

A Common Table Expression (CTE) is a temporary named result set that we can reference within a SQL statement. CTEs are useful for breaking down complex queries into smaller, more manageable parts.

The syntax of a CTE is as follows:

WITH cte_name AS (
   SELECT ...
)
SELECT ...
FROM cte_name ...
Enter fullscreen mode Exit fullscreen mode

For example, suppose we have a table "orders" with columns "order_id" and "order_total". We want to calculate the total sales for each customer, and then find the customers with the highest total sales. We can use the following query with a CTE:

WITH customer_sales AS (
   SELECT customer_id, SUM(order_total) AS total_sales
   FROM orders
   GROUP BY customer_id
)
SELECT *
FROM customer_sales
WHERE total_sales = (
   SELECT MAX(total_sales)
   FROM customer_sales
);

Enter fullscreen mode Exit fullscreen mode

In summary, understanding essential SQL commands is crucial for data scientists to effectively manipulate and analyze data stored in relational databases. These commands enable data scientists to retrieve, filter, group, join, aggregate, and sort data, and to perform various analytical tasks on it.

Top comments (1)

Collapse
 
rouilj profile image
John P. Rouillard

In section 3 under alias, should:

SELECT customer_id, SUM(order_total)
Enter fullscreen mode Exit fullscreen mode

be

SELECT customer_id, SUM(order_total) as total
Enter fullscreen mode Exit fullscreen mode

?