November 11, 2021

November 11, 2021
replit python data engineering sql sqlite

Data Engineering audiocast

This is a conversation that Jason McCollum and I recorded last week. Light on technical content, but hopefully putting data engineering into context with the more widely known data science term.

Downloading photos using Python

While someone was tackling this coding exercise for a Woolpert role, I tried it out in Python. First time attempt; I’m not sure how idiomatic or even performant the concurrent download bit is. But I had fun doing it.

Learning parallelism on the fly with Python

starmap is a very new concept to me. Took a few minutes to get the correct syntax for calling a function with multiple arguments, but got there in the end.
Tried to use a dataclass at first to make the code easier to read. Then figured I was making things too complicated, then wished I had stuck with it instead of the crummy key['value'] crap all over the place.

main.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import requests
import logging
import pathlib
import shutil
import os
from multiprocessing.pool import ThreadPool
import sys


# https://www.delftstack.com/howto/python/python-logging-stdout/
Log_Format = "%(levelname)s - %(message)s"
logging.basicConfig(
    stream=sys.stdout, filemode="w", format=Log_Format, level=logging.INFO
)
logger = logging.getLogger()


def call_github_api():
    URL = "https://api.github.com/search/users?q=followers:%3E10000+sort:followers&per_page=50"

    r = requests.get(URL)

    if r.status_code == 403:
        logger.warn("Hitting a GitHub API usage error")
        return {}

    data = r.json()["items"]
    return data


def get_users():
    data = call_github_api()
    users = []

    for user in data:
        u = [user["login"], user["avatar_url"]]
        users.append(u)

    logger.info(f"Found {len(users)} users")
    return users


def download_photo(login: str, avatar_url: str):
    response = requests.get(avatar_url)
    if response.status_code == 200:
        file = f"photos/{login}.jpg"
        logger.info(f"Downloading {file}...")
        with open(file, "wb") as f:
            f.write(response.content)


def downloadPhotos(users):
    dirpath = "photos"
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    p = pathlib.Path(dirpath)
    p.mkdir(parents=True, exist_ok=True)

    # starmap needs an array of arguments mapped from the list
    # one mini-list matching arguments needed by the download function
    # So the structure [['bob', 'https://picture.jpeg'], ['alice', 'https://picture2.jpeg']] *just works* in this context.
    #
    # https://stackoverflow.com/a/5442981
    ThreadPool(10).starmap(download_photo, users)


# Sequential blocking downloads
#  for u in users:
#    downloadPhoto(u['user'], u['photo'])


def main():
    users = get_users()
    downloadPhotos(users)


if __name__ == "__main__":
    main()

SQL coding exercise

This short set of SQL exercises was really fun to write.

Thanks to Repl.it version control I shared the incomplete exercises with candidates. However, my version of the answers are there as well.

For example, here’s the music.sql answer I can up with. And of course, the first person to look at this proposed a more direct version without the common table expression (CTE), which I was using for clarity’s sake.

music.sql

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
/*
 TODO - print the artist name and album count
 
 ArtistName  AlbumCount
 ----------  ----------
 Lost        3         
 Creedence   2         
 The Office  2  
 
 ONLY those artists who have released:
 
 - at least 2 albums
 - each having at least 20 tracks on them.
 
 Tip: the .tables and .schema [table] commands are handy!
 */
.open sample.db
.headers on
.mode column

WITH tracks_and_artists AS (
	SELECT t.albumid,
		albums.artistid,
		artists.name,
		COUNT(t.trackid) as track_count
	FROM tracks t
		INNER JOIN albums on albums.albumid = t.albumid
		INNER JOIN artists on artists.artistid = albums.artistid
	GROUP BY t.albumid,
		albums.title,
		albums.artistid
	HAVING track_count >= 20
)
SELECT name as ArtistName,
	COUNT(albumid) as AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;

I love, love, love the fact that Repl.it supports SQLite as a first class project type. Makes it so much easier to share an idea without needing a whole dev environment.

Data Engineering audiocast

Downloading photos using Python

Learning parallelism on the fly with Python

SQL coding exercise

See Also