DEV Community

Cover image for SQL Queries That Every Data Scientist Should Know!
Abdul Raheem
Abdul Raheem

Posted on • Updated on

SQL Queries That Every Data Scientist Should Know!

What is SQL

SQL (Structured Query Language) is a programming language used to manage and manipulate relational databases. It is used to insert, update, and retrieve data from databases, as well as to create, modify, and manage database structures. In simple terms, SQL is a way to talk to databases and get information from them or put information into them.

In Data science, we used SQL to read data and manipulate it to get our desired result.

Example
Let's say you have a database of all the students in your school, and each student has a name, an age, and a grade level. You could use SQL to ask the database for a list of all the names of the students in your grade. Or you could use SQL to change a student's grade level if they got promoted.

Why we should use SQL?
SQL is the most common and easy method to access data in databases. In Data science, we will use SQL to read data and manipulate it to get our desired result. Here we will be focused on manipulating data rather than the creation and removal of data.

Some Key Features of SQL

  • Easy to understanding- SQL is a simple and intuitive language to learn and use.
  • Direct data access- Traditional databases allow users to easily access and retrieve specific data.
  • Data audit and replication: It is easy to audit and replicate data in traditional databases.
  • Multi-table analysis- SQL is powerful for analyzing data from multiple tables at once.
  • Complex analysis- SQL allows users to analyze more complex data and questions than dashboard tools like Google Analytics.

SQL Queries

To understand and execute queries first we need to configure Parch and Posy Database

Database Schema:

Image description

1. SELECT & From

This statement is used to query a database and retrieve specific data from one or more tables. It is one of the most commonly used SQL commands. The basic syntax of this statement is as follows:

Query
SELECT column1, column2 From table_name;

In this query, the SELECT keyword is used to specify that you want to retrieve data from the database. The column1, column2, ... are the names of the columns that you want to retrieve data from. You can also use the * wildcard to select all columns. The FROM keyword is used to specify the table that you want to retrieve data from.

Query
SELECT * FROM orders;
This statement will retrieve all the columns in the order table as shown in figure below.

Image description

2. LIMIT

The LIMIT statement allows you to retrieve only a specific number of rows from a table, which can be useful when you only need to see the initial data and don't need to load the entire dataset. This can be much quicker for loading, as it reduces the amount of data that needs to be loaded.

LIMIT Statement will always used at the last of query.

Query
SELECT * FROM orders LIMIT 10;
This command will limit first 10 rows of all the columns of table.

Image description

3. ORDER BY

The ORDER BY statement in SQL allows sorting of query results based on data in any column. However, the sorting effect is only temporary and specific to that query, unlike sorting in spreadsheet software which permanently alters the data. This difference highlights the purpose of a SQL query.

DESC can be used after the column in your ORDER BY statement to sort column in descending order, as the default ORDER BY query will sort in ascending order.

Query
SELECT id, sales_rep_id
FROM accounts
ORDER BY id
LIMIT 10;

In this statement the columns in the table will be sorted in ascending order with respect to id column.
Image description

We can even sort by multiple columns using ORDER BY by providing a list of columns. The sorting process first uses the leftmost column in the list, then the next column and so on. Additionally, it is still possible to reverse the order using the DESC keyword.

Query
SELECT account_id, total_amt_usd
FROM orders
ORDER By account_id, total_amt_usd DESC;

Here account _id will be sorted in ascending order as usual and total_amt_usd will be sorted in descending order as shown in figure below.

Image description

4. WHERE

The WHERE statement in SQL is used to filter the results of a query. It allows you to specify certain conditions that the data in your query must meet in order to be included in the final output.

Common symbols used in WHERE statements include:

> (greater than)
< (less than)
>=(greater than or equal to)
<=(less than or equal to)
=(equal to)
!=(not equal to)

Query
SELECT *
FROM orders
WHERE account_id = 4251
ORDER BY occurred_at
LIMIT 1000;

In this statement, we are extracting the information, where account_id is equal to 4251.

Image description

WHERE statement will work with non-numeric data as well.

Query
SELECT *
FROM accounts
WHERE name = 'Walmart;'

Image description

5. Arithmetic Operators

In SQL, arithmetic operators are used to perform mathematical operations on values in a query. The most common arithmetic operators in SQL are:

* (Multiplication)
+ (Addition)
- (Subtraction)
/ (Division)

Order of all the arithmetic operation will follow PADMAS rule

Query
SELECT id, (standard_amt_usd/total_amt_usd)*100
FROM orders
LIMIT 10;

In the output, you will notice that it created a new column after multiplication and the name of column is unknown. This can be solved using derived column.

Image description

Derived Column:
A derived column, also known as a calculated or computed column, is a new column created by combining existing columns in a table. This new column can be given a name, known as an alias, using the AS keyword.

Query
SELECT id, (standard_amt_usd/total_amt_usd)*100 AS std_percent
FROM orders
LIMIT 10;

Here, we are dividing the dollar amount of standard paper by the total order amount to calculate the percentage of standard paper used in the order. We named this new column "std_percent" using the AS keyword.

Image description

6. Logical Operators

In SQL, logical operators are used to combine multiple conditions in a query. Some of the commonly used logical operators are:

(i) Like

The LIKE operator in SQL is used to match a specific pattern in a column, it allows you to perform operations similar to using WHERE and =, but when you don't know the exact value.

It's particularly useful for working with text data. The LIKE operator is frequently used with % which means search for the text that is in between % no matter what comes on either side of the text.

It's case-sensitive, so searching for 'T' is different from searching for 't'

Query:
SELECT *
FROM accounts
WHERE website LIKE '%google%';

This query will select row that contains the google in it, no matter what comes before or after the google.

Image description

(ii) IN

The IN operator is useful for working with multiple values in both numeric and text columns, it allows you to check for one, two or many values within the same query. It's similar to using = but for multiple values of a particular column. It's a cleaner way of writing queries compared to using the OR operator which also allows you to perform similar tasks.

Query:

SELECT *
FROM orders
WHERE account_id IN (1001,4251);

It will extract the row from 1001 to 4251.

Image description

(iii) NOT

The NOT operator in SQL is used to negate a condition, it can be used in combination with other operators like IN and LIKE to retrieve rows that do not match specific criteria.

For example, the NOT IN operator can be used to retrieve rows that do not have a specific value in a column and the NOT LIKE operator can be used to retrieve rows that do not match a specific pattern in a column. It's useful for filtering and retrieving data that does not meet specific conditions.

Query:

SELECT *
FROM orders
WHERE account_id NOT IN (1001,4251);

In this query, all the row that not from 1001 and 4251 will be retrieved.

Image description

(iv) AND

In SQL, the AND operator is used to combine multiple conditions in a query. The AND operator is used to match rows where all conditions are true.

Query:
SELECT *
FROM orders
WHERE account_id NOT IN (1001,4251) AND id IN(17,26);

In this query, two conditions will be checked if both of them are true then it will retrieve the information as shown in figure.

Image description

(iv) BETWEEN

In SQL, the BETWEEN operator is used to match a range of values within a column. The BETWEEN operator is used to match rows where a column value is between two specified values, inclusive of the specified values.

Query:
SELECT *
FROM orders
WHERE account_id BETWEEN 1001 AND 4251

In this query, all the row between 1001 and 4251 in account_id will be retrieved.

Image description

(v) OR

The OR operator in SQL is used to combine multiple conditions in a query, it matches rows where at least one condition in query is true.

Query:
SELECT standard_qty, gloss_qty
FROM orders
WHERE (standard_qty = 0 OR gloss_qty = 0);

In this query, it will select all the rows where either gloss quantity is zero or standard quantity is zero.

Image description

These operator can be used in combination with other operators such as arithmetic operators (+, *, -, /)


Best Practice For Formatting Query

Following are the best practices that one should follow while writing the SQL queries:

1. Capitalize SQL commands like SELECT and FROM, and keep everything else in lower case to make the query more readable.
SELECT account_id
FROM orders;

2. Use underscores instead of spaces in column and table names.
SELECT account_id, standard_qty
FROM orders;

3. Always include a semicolon at the end of each statement, as it may be required in some SQL environments.
SELECT account_id
FROM orders;

4. Use double quotes or square brackets to reference tables and columns that have spaces in their names.
SELECT "full name", "age"
FROM "employee information";

5. Use comments in your query for better understanding of your code.
-- This query selects all information from 'customers' table
SELECT * FROM customers;

6. Be consistent in formatting throughout your queries and scripts.

7. Use white space in queries to make them more readable.

8. Use indentation to make the query structure more clear.

9. Use meaningful name for columns, tables and variables in order to make it more understandable and readable.

10. Avoid using too many subqueries and joins, use them only when it's necessary.

Top comments (0)