Python, SQL, and data eng

Data Engineering audiocast #

This is a conversation that Jason McCollum and I recorded last week. Light on technical content, but hopefully putting data engineering into context with the more widely known data science term.

Downloading photos using Python #

While someone was tackling this coding exercise for a Woolpert role, I tried it out in Python. First time attempt; I’m not sure how idiomatic or even performant the concurrent download bit is. But I had fun doing it.

Learning parallelism on the fly with Python #

starmap is a very new concept to me. Took a few minutes to get the correct syntax for calling a function with multiple arguments, but got there in the end.
Tried to use a dataclass at first to make the code easier to read. Then figured I was making things too complicated, then wished I had stuck with it instead of the crummy key['value'] crap all over the place.

main.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import requests
import logging
import pathlib
import shutil
import os
from multiprocessing.pool import ThreadPool
import sys


# https://www.delftstack.com/howto/python/python-logging-stdout/
Log_Format = "%(levelname)s - %(message)s"
logging.basicConfig(
    stream=sys.stdout, filemode="w", format=Log_Format, level=logging.INFO
)
logger = logging.getLogger()


def call_github_api():
    URL = "https://api.github.com/search/users?q=followers:%3E10000+sort:followers&per_page=50"

    r = requests.get(URL)

    if r.status_code == 403:
        logger.warn("Hitting a GitHub API usage error")
        return {}

    data = r.json()["items"]
    return data


def get_users():
    data = call_github_api()
    users = []

    for user in data:
        u = [user["login"], user["avatar_url"]]
        users.append(u)

    logger.info(f"Found {len(users)} users")
    return users


def download_photo(login: str, avatar_url: str):
    response = requests.get(avatar_url)
    if response.status_code == 200:
        file = f"photos/{login}.jpg"
        logger.info(f"Downloading {file}...")
        with open(file, "wb") as f:
            f.write(response.content)


def downloadPhotos(users):
    dirpath = "photos"
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    p = pathlib.Path(dirpath)
    p.mkdir(parents=True, exist_ok=True)

    # starmap needs an array of arguments mapped from the list
    # one mini-list matching arguments needed by the download function
    # So the structure [['bob', 'https://picture.jpeg'], ['alice', 'https://picture2.jpeg']] *just works* in this context.
    #
    # https://stackoverflow.com/a/5442981
    ThreadPool(10).starmap(download_photo, users)


# Sequential blocking downloads
#  for u in users:
#    downloadPhoto(u['user'], u['photo'])


def main():
    users = get_users()
    downloadPhotos(users)


if __name__ == "__main__":
    main()

SQL coding exercise #

This short set of SQL exercises was really fun to write.

Thanks to Repl.it version control I shared the incomplete exercises with candidates. However, my version of the answers are there as well.

For example, here’s the music.sql answer I can up with. And of course, the first person to look at this proposed a more direct version without the common table expression (CTE), which I was using for clarity’s sake.

music.sql

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
/*
 TODO - print the artist name and album count
 
 ArtistName  AlbumCount
 ----------  ----------
 Lost        3         
 Creedence   2         
 The Office  2  
 
 ONLY those artists who have released:
 
 - at least 2 albums
 - each having at least 20 tracks on them.
 
 Tip: the .tables and .schema [table] commands are handy!
 */
.open sample.db
.headers on
.mode column

WITH tracks_and_artists AS (
	SELECT t.albumid,
		albums.artistid,
		artists.name,
		COUNT(t.trackid) as track_count
	FROM tracks t
		INNER JOIN albums on albums.albumid = t.albumid
		INNER JOIN artists on artists.artistid = albums.artistid
	GROUP BY t.albumid,
		albums.title,
		albums.artistid
	HAVING track_count >= 20
)
SELECT name as ArtistName,
	COUNT(albumid) as AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;

I love, love, love the fact that Repl.it supports SQLite as a first class project type. Makes it so much easier to share an idea without needing a whole dev environment.

Data Engineering audiocast #

Downloading photos using Python #

Learning parallelism on the fly with Python #

SQL coding exercise #

See Also