As part of learning more about common table expressions (CTEs) in SQLite I chose to look at some complex queries I needed to make for a hierarchical data set of employees. See the the first article in this series for the setup.
NOTE: I recommend checking out my Today I Learned on SQLite schema work to learn more.
Entity relationships and schema
The basic premise in my Investment Tracker is that a Person can invest some percentage of their time, between 0% and 100%, in an Initiative.
An Initiative is a project or a program or any other bundle of work that a number of people, or parts of an organization, are working on (“investment their time in”).
The is tracked as a staff Investment of effort.
And to keep things organized, Investments can have tags like enterprise
and keep-the-lights-on
.
As I mentioned in the opening article of the series, this data all lives in a couple of YAML files. Before I transformed the data from YAML into SQLite I needed to create the schema. Here’s what I eventually came up with after trying a few different shapes. Remember again, this is the basic data I was going to pull in from YAML. There’s more to SQL to come before I can answer my questions around this hierarchical data.
|
|
Get data from YAML to SQLite
I experimented with a few different ways to get the data from YAML into SQLite.
For example, I started with some fancy (read: complicated) code using yaml
attributes on Go structs to unmarshal the YAML into a Go struct.
Then I used the database/sql
package to insert the data into the database.
But to be honest there was some pre-processing that I wanted to do beforehand, and I eventually ditched all that yaml
-annotate Go code in favor of using the yq
command line tool to transform the YAML into CSV, which I could then more easily import into SQLite.
I kind of regret that decision just because the yq
incantations were a real struggle to get right.
I spent hours trying to get the subsets and pivots and joins and filters just right.
Nonetheless, it’s an incredible tool following in the footsteps of jq
.
Here’s a reminder what the input formats look like.
First, each person record:
|
|
And then the investments that link to people:
|
|
Now remember, I’m trying to get CSV files fit for import into the SQL schema defined above.
Get the people
And I was doing this in a Taskfile.yaml so watch out for the {{
and }}
syntax in the yq
commands.
The yq
steps look like this.
|
|
- The
.[]
creates an array of the top-level keys in the YAML. - The
|
pipes the output of the previous step into the next step. - The
[key, .name, .email, .title, .employment_type, .cost_center, .manager]
is a list of the fields I want to keep. - The
| @csv
converts the list into a CSV string. - The
>
redirects the output to a file.
So in one step I’m flattening the YAML, selecting the fields I want, and writing the output to a CSV file. I have a row-per-person in a CSV like this:
...
bitsbyd,Dylan Thomas,bitsbyd@github.com,Senior Director Technical Program Management, \
employee,XXXXDEPT,Alice
...
Get the investments
The second transform gets basic info on each investment:
|
|
The same pattern basically, leaving me with a line-per-investment like so:
...
super-secret,Super Secret,https://www.github.com,"Something neat *goes here*."
...
Get the person investments by initiative
This is where is got mind-bendingly complicated…for me at least. You know when you just read the docs super carefully, then even more deliberately craft the command, and it does something entirely unexpected…like nothing, or something just weird? That was a few hours (hours!) of my life at this point.
|
|
The whole keys,keys,.[]
. I honestly don’t know how I even landed on that as I reread it.
But it works so…I’ll take it.
The output is this with one line for each unique person-investment-initiative combination.
Here we see that BitsByD (me) is saying that he has invested 100% of his time on the super-secret
project.
...
super-secret,bitsbyd,1.0
...
Tags, anyone?
The final couple of commands are relatively straightforward.
They extract unique tag
values from across all investments (like a DISTINCT
), and then I grab the tags-per-initiative as well:
|
|
End result? I have several CSV files containing normalized data ready to be loaded into my SQLite database.
Get the CSVs into SQLite
When I first started tinkering on this project I used the sqlc
tool to generate Go code for the SQL schema.
It’s really freaking cool, and that’s where the strict separation of SQL schema from any other query or data modification code comes from.
I eventually found it too limiting for SQLite queries and Go codegen, so I ditched it (if I use Postgres I’ll try it again; the support seems much richer and deeper).
So I have a good SQL-based schema. Now I need to actually spin up a SQLite database and load the data into it.
That’s where my import.sql
comes in:
|
|
This is a script for SQLite specifically, run by as follows, right after I create the schema in which to load data:
|
|
Salient points:
- It’s very, very simple to load CSV data into SQLite if the schema and the CSV files match.
That’s why I was so careful to have my
yq
commands spit out CSV files that match my target schema in terms of column order and column name. - I liked using the
UNION
to get a quick summary of the data I just loaded. That helped me spot easy mistakes like…nothing got imported, or having twice as many employees as expected! - The SQLite pragma for
.mode table
made that a breeze to see on the console. Check it out:
|
|
Woohoo! I’ve got my YAML all CSV-ified and into SQLite where I can start querying it.
The Common Table Expressions (CTE)
Finally! One of the prime motivators for doing this work in the first place was to learn more about hierarchical data in SQL, and how to use CTEs to simplify more complex logic into simpler individual steps. And even more: combine the two into recursive CTEs. Woohoo!
This is going to be an eye-chart. I created three views that I want to use to answer the questions I posed at the beginning of this series. First, I want to get a seamless view of employees, with their hierarchy, and the total number of people on that persons entire organizational team; not just the people they directly manage, but everyone under them in the org chart.
The really interesting pattern I found with CTEs in general, was the ability to decompose overly complex logic into simpler steps, and then combine them back together into a single view. It’s not like I discovered that! Decomposition appears to be one of the fundamental benefits of CTEs: make it easy to reason about. While writing these CTEs it was also really nice to get little chunks of logic working first and then string them together for the big result.
The organizing has a hierarchy
OK, so look at the first highlight. This is a recursive CTE.
You can see that from the WITH
opening statement and the immediate RECURSIVE
keyword.
For recursive CTEs the second notable thing is the UNION
statement.
The first part of the query before the UNION
tells us where we’ll start our hierarchical walk.
In my case, WHERE id = (select id from people where manager IS NULL)
on line 24 tells us to start at the CEO: the person who has no manager.
The second part of the query after the UNION
tells us how to walk the hierarchy from that starting point…recursively.
I do that by joining the row(s) in the second section to the parent manager from the first section: JOIN subordinate s ON p.manager = s.id
.
As a side note, the json_array()
function is a really nice way to build up a JSON array in SQLite.
I wanted this because it made my eventual JSON output easier to work with.
For example, if Bob reports to Alice, and Alice reports to John, and John reports to Janet who is the CEO, then the value of the org_tree
column for Bob would be ["Janet", "John", "Alice"]
.
A JSON array of strings.
But it keeps going.
Look at line 40 – notice the ,
comma indicating that our next CTE is coming in hot on the heels of the first one.
And sure enough, the team_size
CTE begins on line 57.
This is mostly the same but the recursion is used to add up the total number of people in any single employee’s org hierarchy: SELECT COUNT(*) FROM cte WHERE manager IS NOT NULL
.
So now I have, for my imaginary Bob, a column called org_count
containing a recursive count of employees reporting to Bob…or just zero if Bob is what we call ‘an individual contributor’ (no one reports to them).
Finally, on line 85 we have a much simpler LEFT JOIN
that pulls the information from the subordinate
CTE and team_size
CTE into a single view. Literally a view since back up on line 5 I opened with CREATE VIEW IF NOT EXISTS org
.
|
|
Let’s look at some data (I’m fudging the data to protect the innocent):
|
|
id | name | title | cost_center | manager | org_level | org_tree | manager_name | org_size | |
---|---|---|---|---|---|---|---|---|---|
bitsbyd | Dylan Thomas | bitsbyd@github.com | Senior Director Technical Program Management | ORG NAME | ALICE | 3 | [“The Boss”,“Another Boss”,“My Boss”] | Alice | 18 |
The salient points are the reporting structure as a JSON array, the total org size under me, and my level within the org (which comes in next).
Investments
Next up I want to figure out where all these employees are investing their time. The CTE pattern looks exactly the same now: multiple little-ish queries that I can reason about ‘in the small’, and then combine them into a single view.
I put quite a few inline comments in these queries because I’m baking business rules into SQL. Depending on who you believe, that’s either heresy or just not a big deal. I’m the one writing this code, so it’s not a big deal!!!
The first point of interest is the use of json_each()
to iterate over the org_tree
column.
That really took me a while to unpack.
There’s some good documentation on the SQLite JSON functions but to be honest I think the nuances is how these group/array functions work compared to other SQL-like set-based operations work.
At least, it had me turned around a few times.
The rest of it is pretty straightforward.
Sure take your time with it and ready it carefully, but apart from the CASE
statement and my usual habit of getting the truthiness of WHEN...THEN
wrong, it’s pretty straightforward.
|
|
Let’s look at some data:
|
|
I have committed 100% of my team of 18 people, which means that I have 18 FTEs invested in this initiative.
And it says TRUE
(1
) in the included
column, which means that my investment is included in the total investment for this initiative, because nobody above me in the company has said they’re investing in this initiative.
initiative | person | investment | org_size | included | org_level | fte |
---|---|---|---|---|---|---|
super-secret | bitsbyd | 1.0 | 18 | 1 | 3 | 18.0 |
The final view is really simple and I’m just including it here for completeness. Basically, fix some empty things and add up the FTEs for each initiative.
|
|
And the data shows that my investment and that of others (totally fabricated) is included in the total investment for this initiative.
|
|
id | name | url | description | total_investment |
---|---|---|---|---|
super-secret | Super Secret | https://www.github.com | Something neat goes here. | 24.4 |
Using in a Go program
I’ll stop here and keep focused on the SQL bits in this post. But as a bit of a teaser, I’m going to show how I’m using this data in a Go program. I decided to use GORM to access the data. As an ORM most of the work was done using object-like interfaces, like getting a person using their handle. But I’ve also included an example where dropping straight into some SQL is useful.
|
|
That’s it for now. Next up, using this stuff in a Go web app, and some GORM-isms that I learned along the way.