sq data wrangler
Go to file
neilotoole 2d3f68bb47 merge
2024-02-20 16:28:23 -07:00
.github workflow debugging 2024-02-10 11:52:27 -07:00
.images Release wrap up v0.47.0 (#377) 2024-01-28 14:55:51 -07:00
cli #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
drivers #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
grammar #307: Ingest cache (#354) 2024-01-14 18:45:34 -07:00
libsq #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
packages/docker Docker install 2024-02-10 13:17:10 -07:00
testh #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
tools Add flag --src.schema (#326) 2023-11-18 17:05:48 -07:00
.editorconfig #199 - Config, refactoring (#204) 2023-04-18 23:28:09 -06:00
.generate-changelog-delta.sh workflow: tweak .generate-changelog-delta.sh 2023-03-25 20:59:33 -06:00
.generate-completions.sh Shell completion fixes (#388) 2024-01-30 23:04:10 -07:00
.gitattributes Added go test to github action; fixed Windows test issues (#52) 2020-08-07 23:10:41 -06:00
.gitignore .gitignore update 2024-02-15 11:05:32 -07:00
.golangci.yml #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
.goreleaser-darwin.yml CHANGELOG update for v0.42.0; tidy up worfklow files 2023-08-22 07:35:41 -06:00
.goreleaser-linux-amd64.yml CHANGELOG update for v0.42.0; tidy up worfklow files 2023-08-22 07:35:41 -06:00
.goreleaser-linux-arm64.yml CHANGELOG update for v0.42.0; tidy up worfklow files 2023-08-22 07:35:41 -06:00
.goreleaser-windows.yml #279: SQLite virtual tables (#304) 2023-08-21 10:05:17 -06:00
.goreleaser.yml Shell completion fixes (#388) 2024-01-30 23:04:10 -07:00
.manpages.sh Feature/manpages (#136) 2022-12-29 19:02:10 -07:00
.shellcheckrc Add flag --src.schema (#326) 2023-11-18 17:05:48 -07:00
CHANGELOG.md #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
CODE_OF_CONDUCT.md Create CODE_OF_CONDUCT.md 2022-12-27 10:39:24 -07:00
CONTRIBUTING.md Create CONTRIBUTING.md 2023-04-02 15:20:38 -06:00
cosign.pub Added cosign keys 2022-12-29 09:29:40 -07:00
go.mod Updated deps (#397) 2024-02-09 22:47:28 -07:00
go.sum #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
huzzah.txt db tools preliminary work; --src.schema changes (#392) 2024-02-09 09:08:39 -07:00
install.sh Add flag --src.schema (#326) 2023-11-18 17:05:48 -07:00
LICENSE Multiplatform goreleaser (#106) 2022-11-06 22:49:05 -07:00
main.go db tools preliminary work; --src.schema changes (#392) 2024-02-09 09:08:39 -07:00
Makefile db tools preliminary work; --src.schema changes (#392) 2024-02-09 09:08:39 -07:00
README.md Docker install 2024-02-10 13:17:10 -07:00
sakila_a_tail.txt #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
sakila_a.txt #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
sakila_b_tail.txt #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
sakila_b.txt #353 Diff performance (#399) 2024-02-20 16:26:45 -07:00
sakila-start-local.sh #191: XLSX driver auto-detects header row (#284) 2023-07-08 09:21:27 -06:00
sakila-stop-local.sh #191: XLSX driver auto-detects header row (#284) 2023-07-08 09:21:27 -06:00
SECURITY.md Create SECURITY.md 2023-03-29 19:15:32 -06:00
sq.json Scoop update for sq version v0.47.4 2024-02-09 22:58:46 -07:00

Go Reference Go Report Card License Main pipeline

sq data wrangler

sq is a command line tool that provides jq-style access to structured data sources: SQL databases, or document formats like CSV or Excel. It is the lovechild of sql+jq.

sq

sq executes jq-like queries, or database-native SQL. It can join across sources: join a CSV file to a Postgres table, or MySQL with Excel.

sq outputs to a multitude of formats including JSON, Excel, CSV, HTML, Markdown and XML, and can insert query results directly to a SQL database.

sq can also inspect sources to view metadata about the source structure (tables, columns, size). You can use sq diff to compare tables, or entire databases. sq has commands for common database operations to copy, truncate, and drop tables.

Find out more at sq.io.

Install

macOS

brew install neilotoole/sq/sq

Linux

/bin/sh -c "$(curl -fsSL https://sq.io/install.sh)"

Windows

scoop bucket add sq https://github.com/neilotoole/sq
scoop install sq

Go

go install github.com/neilotoole/sq

Docker

The ghcr.io/neilotoole/sq image is preloaded with sq and a handful of related tools like jq.

Local

# Shell into a one-time container.
$ docker run -it ghcr.io/neilotoole/sq zsh

# Start detached (background) container named "sq-shell".
$ docker run -d --name sq-shell ghcr.io/neilotoole/sq
# Shell into that container.
$ docker exec -it sq-shell zsh 

Kubernetes

Running sq in a Kubernetes environment is useful for DB migrations, as well as general data wrangling.

# Start pod named "sq-shell".
$ kubectl run sq-shell --image ghcr.io/neilotoole/sq
# Shell into the pod.
$ kubectl exec -it sq-shell -- zsh 

See other install options.

Overview

Use sq help to see command help. Docs are over at sq.io. Read the overview, and tutorial. The cookbook has recipes for common tasks, and the query guide covers sq's query language.

The major concept is: sq operates on data sources, which are treated as SQL databases (even if the source is really a CSV or XLSX file etc.).

In a nutshell, you sq add a source (giving it a handle), and then execute commands against the source.

Sources

Initially there are no sources.

$ sq ls

Let's add a source. First we'll add a SQLite database, but this could also be Postgres, SQL Server etc., or a document source such Excel or CSV.

Download the sample DB, and sq add the source.

$ wget https://sq.io/testdata/sakila.db

$ sq add ./sakila.db
@sakila  sqlite3  sakila.db

$ sq ls -v
HANDLE   ACTIVE  DRIVER   LOCATION                         OPTIONS
@sakila  active  sqlite3  sqlite3:///Users/demo/sakila.db

$ sq ping @sakila
@sakila       1ms  pong

$ sq src
@sakila  sqlite3  sakila.db

The sq ping command simply pings the source to verify that it's available.

sq src lists the active source, which in our case is @sakila. You can change the active source using sq src @other_src. When there's an active source specified, you can usually omit the handle from sq commands. Thus you could instead do:

$ sq ping
@sakila  1ms  pong

Tip

Document sources such as CSV or Excel can be added from the local filesystem, or from an HTTP URL.

$ sq add https://acme.s3.amazonaws.com/sales.csv

sq inspect remote See the sources docs for more.

Query

Fundamentally, sq is for querying data. The jq-style syntax is covered in detail in the query guide.

sq query where slq

The above query selected some rows from the actor table. You could also use native SQL, e.g.:

sq query where sql

But we're flying a bit blind here: how did we know about the actor table?

Inspect

sq inspect is your friend.

sq inspect

Use sq inspect -v to see more detail. Or use -j to get JSON output:

sq inspect -j

Combine sq inspect with jq for some useful capabilities. Here's how to list all the table names in the active source:

$ sq inspect -j | jq -r '.tables[] | .name'
actor
address
category
city
country
customer
[...]

And here's how you could export each table to a CSV file:

$ sq inspect -j | jq -r '.tables[] | .name' | xargs -I % sq .% --csv --output %.csv
$ ls
actor.csv     city.csv	    customer_list.csv  film_category.csv  inventory.csv  rental.csv		     staff.csv
address.csv   country.csv   film.csv	       film_list.csv	  language.csv	 sales_by_film_category.csv  staff_list.csv
category.csv  customer.csv  film_actor.csv     film_text.csv	  payment.csv	 sales_by_store.csv	     store.csv

Note that you can also inspect an individual table:

sq inspect actor verbose

Read more about sq inspect.

Diff

Use sq diff to compare metadata, or row data, for sources, or individual tables.

The default behavior is to diff table schema and row counts. Table row data is not compared in this mode.

sq diff

Use --data to compare row data.

sq diff data

There are many more options available. See the diff docs.

Insert query results

sq query results can be output in various formats (text, json, csv, etc.). Those results can also be "outputted" as an insert into a database table.

That is, you can use sq to insert results from a Postgres query into a MySQL table, or copy an Excel worksheet into a SQLite table, or a push a CSV file into a SQL Server table etc.

Tip

If you want to copy a table inside the same (database) source, use sq tbl copy instead, which uses the database's native table copy functionality.

Here we query a CSV file, and insert the results into a Postgres table.

sq query insert inspect

Cross-source joins

sq can perform the usual joins. Here's how you would join tables actor, film_actor, and film:

$ sq '.actor | join(.film_actor, .actor_id) | join(.film, .film_id) | .first_name, .last_name, .title'

But sq can also join across data sources. That is, you can join an Excel worksheet with a Postgres table, or join a CSV file with MySQL, and so on.

This example joins a Postgres database, an Excel worksheet, and a CSV file.

sq join multi source

Read more about cross-source joins in the query guide.

Table commands

sq provides several handy commands for working with tables: tbl copy, tbl truncate and tbl drop. Note that these commands work directly against SQL database sources, using their native SQL commands.

$ sq tbl copy .actor .actor_copy
Copied table: @sakila.actor --> @sakila.actor_copy (200 rows copied)

$ sq tbl truncate .actor_copy
Truncated 200 rows from @sakila.actor_copy

$ sq tbl drop .actor_copy
Dropped table @sakila.actor_copy

UNIX pipes

For file-based sources (such as CSV or XLSX), you can sq add the source file, but you can also pipe it:

$ cat ./example.xlsx | sq .Sheet1

Similarly, you can inspect:

$ cat ./example.xlsx | sq inspect

Drivers

sq knows how to deal with a data source type via a driver implementation. To view the installed/supported drivers:

$ sq driver ls
DRIVER     DESCRIPTION                          
sqlite3    SQLite                               
postgres   PostgreSQL                           
sqlserver  Microsoft SQL Server / Azure SQL Edge
mysql      MySQL                                
csv        Comma-Separated Values               
tsv        Tab-Separated Values                 
json       JSON                                 
jsona      JSON Array: LF-delimited JSON arrays 
jsonl      JSON Lines: LF-delimited JSON objects
xlsx       Microsoft Excel XLSX                 

Output formats

sq has many output formats:

CHANGELOG

See CHANGELOG.md.

Acknowledgements