DEV Community

Cover image for How to use SQL to directly query files
Arctype Team for Arctype

Posted on • Originally published at arctype.com

How to use SQL to directly query files

Have you ever been sent a file and asked to find important information buried within it? Your coworkers would be very impressed if you could query the files in a quick and efficient manner. But… how exactly are you going to achieve such a feat?

As you probably know, SQL allows you to modify database data quickly and easily. When trying to work with data files, developers usually load data into a database and manage it via SQL. In a perfect world, you could just query the database to get the information your company needs. But in real life, there’s a catch: data loading is often not straightforward.

It would be extremely convenient if SQL queries could be run directly on files, skipping the database setup step. It turns out, other people have noticed this fact and have built tools to query your files directly with SQL. In this article, we'll look at a few tools and list some different SQL file query scenarios that they can handle. We'll also compare the tools and share some general considerations. Let’s begin.

TextQL

TextQL might be a suitable fit for you if you're looking for a simple tool to query .csv or .tsv files. TextqQL allows you to execute SQL on structured text with ease. It also allows you to use quote-escaped delimiters. When running a query, TextQL can automatically detect numeric and datetime data in appropriate formats, which helps you make your work easier. TextQL lets you list as many files and folders as you want. You can load the files or directories you’re working with by listing them at the end of the command.

Installation

Let's look at a few ways to install TextQL. One of the simplest methods you can use to get started is to use Homebrew. Just run the command below:

brew install textql
Enter fullscreen mode Exit fullscreen mode

If you're running Docker, you can also install TextQL with a single short command:

docker build -t textql .
Enter fullscreen mode Exit fullscreen mode

Lastly, if you're using Linux, you may be familiar with the AUR tool. AUR installation is also straightforward:

yaourt textql-git
Enter fullscreen mode Exit fullscreen mode

Now that we've gotten TextQL set up and working, let's see what it can do.

Example

We're going to examine an example to understand how TextQL works. Assume we have the following data.

cat sample_data.csv
Id,name,value,timestamp
1, Jack,5,1643004723 
1, John,11,1643114723 
1, James,-3,1645596723
Enter fullscreen mode Exit fullscreen mode

Now, excluding the header, let's count the number of data items we have.

textql -header -sql "select count() from sample_data" sample_data.csv

#output
3
Enter fullscreen mode Exit fullscreen mode

You should see an output of "3," which is exactly what we'd expect. We can also use TextQL to find the maximum value of this data. We can do that by running the command below:

textql -header -sql "select max(value) from sample_data" sample_data

#output
11
Enter fullscreen mode Exit fullscreen mode

Again, we see the expected value of 11. Very nice!

q

You might consider using q to query your files if you are working with .csv or .tsv files and need something that works faster than TextQL. q aims to bring SQL’s expressive capability to the Linux command line by offering direct access to multi-file SQLite3 databases and simple access to text as actual data.

q uses the SQLite engine. It allows you to run SQL-like statements directly on tabular text data, with the data being auto-cached to speed up subsequent queries on the same file. Using q, SQL statements can be run directly on multi-file SQLite3 databases without requiring them to be merged or loaded into memory.

Installation

Like TextQL, it's easy to install q via homebrew. Simply run the command below:

brew install harelba/q/q
Enter fullscreen mode Exit fullscreen mode

q also has a standalone executable, which you can download from this link. If you want to use the Windows installer, you can run the executable and follow the prompts displayed on the screen.

You can also install q as an .rpm package. You can download the package from this link, then install it via the following commands:

rpm -ivh <package-filename>
Or
rpm -U <package-filename>
Enter fullscreen mode Exit fullscreen mode

Example

To get an idea of how q works, let's run a query on a file where columns are named a1, a2, ... aN.

q "select a1,a5 from file.csv"
Enter fullscreen mode Exit fullscreen mode

Now, we'll COUNT DISTINCT values in a specific field. In this case, let's try to see how many UUIDs we have with the following command:

q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./file.csv"
Enter fullscreen mode Exit fullscreen mode

The output of this command will be a numeric value, equal to the number of UUIDs in the file.

OctoSQL

OctoSQL is primarily a command-line application that allows you to query a variety of databases and filetypes using SQL in a single interface, as well as perform JOINS between them. OctoSQL is a fully expandable, fully-featured dataflow engine that can be used to provide a SQL interface for your applications. It validates and optimizes queries based on database types. It may process massive volumes of data and return partial results before completing the complete query. Out of the box, OctoSQL only supports .csv and JSON files. You'll need to install a plugin to query other file formats, like Excel or Parquet files.

Installation

You can run the following command to install OctoSQL with Homebrew.

brew install cube2222/octosql/octosql
Enter fullscreen mode Exit fullscreen mode

Installation with Go is also supported, as shown below:

go install -u github.com/cube2222/octosql
Enter fullscreen mode Exit fullscreen mode

As mentioned, OctoSQL has a robust plugin capability. You can install plugins using commands like the one shown below:

octosql plugin install postgres
Enter fullscreen mode Exit fullscreen mode

Example

Let's have a look at a simple example of invoice data. Suppose we have .csv of invoice data in the format shown below:

octosql "SELECT * FROM ./invoice2.csv"

cust_id,cust_name,value
121, Jack,599.00
122, John,1100.00
123, James,400.50
Enter fullscreen mode Exit fullscreen mode

Suppose we want to calculate a sum of the invoice values. In that case, we could use the following command to find the answer:

#Sum
octosql "SELECT id, SUM(value) as value_sum FROM ./invoices2.csv GROUP BY id ORDER BY value sum DESC”
Enter fullscreen mode Exit fullscreen mode

Similarly, we can find a count of all the entries using this command:

#Count
octosql "SELECT COUNT(*) FROM mydb.customer”
Enter fullscreen mode Exit fullscreen mode

We can also perform more advanced operations like joins. The command below will get us what we want:

octosql "SELECT inv_id, value, email FROM ./invoices.csv JOIN mydb.customers ON customer_id = customer_id”
Enter fullscreen mode Exit fullscreen mode

OctoSQL may be the ideal fit for you if you require a tool that can work with a variety of file formats. One drawback that few people consider is that OctoSQL uses a custom engine instead of SQLite, which indicates that it may be missing some features.

DSQ

DSQ is a command-line tool that lets you execute SQL queries. It supports a wide range of file formats, including .csv, JSON, .tsv, Excel, Parquet, and .ods.

Installation

To install DSQ on a Mac or Linux operating system, use the following command:

$ curl -LO "https://github.com/multiprocessio/dsq/releases/download/$VERSION/dsq-$(uname -s | awk '{ print tolower($0) }')-x64-$VERSION.zip"
Enter fullscreen mode Exit fullscreen mode

If you're using Windows, you can install DSQ using a different workflow. Just download the latest Windows release, unzip it, and add DSQ to your $PATH

You can also install DSQ in Go using the following command:

$ go install github.com/multiprocessio/dsq@latest
Enter fullscreen mode Exit fullscreen mode

Example

DSQ allows you to either stream data or provide a file name to work with. The example below shows how you can use either a .json or .ndjson format, depending on your requirements.

$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"
#or
$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"
Enter fullscreen mode Exit fullscreen mode

You can also use DSQ to work with data from different origin types. The example below connects a .csv dataset and a .json dataset.

$ dsq testdata/join/users.csv testdata/join/ages.json \
      "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
Enter fullscreen mode Exit fullscreen mode

If you need a tool that can handle a wide range of file formats and uses SQLite, this is the tool for you.

Cases for SQL file queries

To see these tools in action, let's look at a few SQL file query examples. We'll be using .txt or .csv files for these examples, since almost all of the tools we discussed above support this file format.

Filtering via SQL

Filtering allows you to view only the information you wish to see. Filters are useful for displaying only the desired records from a form or report. Using a filter, you may limit the data in a view without affecting the architecture of the underlying object.

For example, let's start by selecting records from a text file that contains the information below.

Class Name English History Math
1 James Kim 78 65 67
1 John White 87 61 79
2 Paige Davis 77 82 94
3 Edwin Henderson 65 78 94

Suppose we want to choose the students in class 1 in the above file. The file's first row provides column names, whereas the rest of the rows have comprehensive information that we can use to query the records we want. So, we could run the following command to get the information we want:

$select * from E:/txt/Students_file.txt where CLASS = 1
Enter fullscreen mode Exit fullscreen mode

In SQL, you may sort data in ascending or descending order by one or more columns. To try this out, let's sort the table of student scores by class in ascending order and total score in descending order. We can do that with the following command:

$select * from E:/txt/Students_file.txt order by CLASS,ENGLISH+HISTORY+MATH desc
Enter fullscreen mode Exit fullscreen mode

You can combine results using aggregation by grouping records depending on their value. Grouping can also be used to calculate the sum of many values in a group. The grouped aggregates give a summary of a set of rows. For example, you can use the command below to find the lowest English score, the highest History score, and the overall Math score.

$select CLASS,min(English),max(History),sum(Math) from E:/txt/students_file.txt group by CLASS
Enter fullscreen mode Exit fullscreen mode

As you can see, the ability to perform SQL queries directly on files can help us answer important questions from our data.

Conclusion

In conclusion, let's use the chart below to take a quick look at all of the tools we mentioned before.

Name Speed Supported Files Engine
q Fast .csv, .tsv SQLite
TextQL Ok .csv, .tsv SQLite
OctoQL Slow .csv, .json, Excel, Parquet Custom
DSQ Ok .csv, .tsv, .json, Parquet, Excel, logs SQLite

Without having to go through a database, you can use the tools above to execute SQL on your files for instant insights. Each of these tools has different pros and cons, so be sure to choose the option that best suits your needs. Happy querying.

Top comments (1)

Collapse
 
dcmoura profile image
Daniel Moura

Great article. You should also consider SPyQL (github.com/dcmoura/spyql) that allows to execute python code in the SQL queries and does not load the full dataset into an in-memory database. (disclaimer: I am the author of SPyQL)