This is a conversation that Jason McCollum and I recorded last week.
It is light on technical content, but hopefully it puts data engineering into context alongside the more widely known term, data science.
While someone was tackling this coding exercise for a Woolpert role, I tried it out in Python.
It was a first attempt; I’m not sure how idiomatic or even performant the concurrent download bit is.
But I had fun doing it.
starmap is a very new concept to me.
Took a few minutes to get the correct syntax for calling a function with multiple arguments, but got there in the end.
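A minimal sketch of the difference between `map` and `starmap` on a `ThreadPool`. The `describe` function and the example.com URLs are placeholders made up for illustration, not part of the exercise.

```python
from multiprocessing.pool import ThreadPool


def describe(login: str, avatar_url: str) -> str:
    """Hypothetical two-argument function standing in for a download."""
    return f"{login} -> {avatar_url}"


# One mini-list per call, in the same order as the function's parameters.
users = [
    ["bob", "https://example.com/bob.jpeg"],
    ["alice", "https://example.com/alice.jpeg"],
]

with ThreadPool(2) as pool:
    # map hands each mini-list over as ONE argument, so it has to be unpacked by hand.
    via_map = pool.map(lambda pair: describe(*pair), users)
    # starmap unpacks each mini-list into positional arguments for us.
    via_starmap = pool.starmap(describe, users)

print(via_starmap)
```

Both calls return results in the same order as the input list, so the two lists come out identical.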
Tried to use a dataclass at first to make the code easier to read.
Then figured I was making things too complicated, then wished I had stuck with it instead of the crummy key['value'] crap all over the place.
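For comparison, a sketch of what the dataclass route might have looked like. `GithubUser`, `parse_users` and `photo_path` are hypothetical names, and the sample items only mimic the two fields the exercise reads from the GitHub response.

```python
from dataclasses import dataclass
from multiprocessing.pool import ThreadPool


@dataclass(frozen=True)
class GithubUser:
    """Hypothetical record holding only the two fields the exercise needs."""
    login: str
    avatar_url: str


def parse_users(items):
    # Pick the two fields out of each raw API item once, up front.
    return [GithubUser(item["login"], item["avatar_url"]) for item in items]


def photo_path(user: GithubUser) -> str:
    # Attribute access instead of user["login"] lookups.
    return f"photos/{user.login}.jpg"


# Made-up sample shaped like the "items" list in the search response.
sample_items = [
    {"login": "bob", "avatar_url": "https://example.com/bob.jpeg", "id": 1},
    {"login": "alice", "avatar_url": "https://example.com/alice.jpeg", "id": 2},
]

users = parse_users(sample_items)
with ThreadPool(2) as pool:
    # Each worker takes a single object, so plain map is enough here.
    paths = pool.map(photo_path, users)

print(paths)
```

With one object per user, the worker function takes a single argument, so `starmap` is no longer needed.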
    import logging
    import os
    import pathlib
    import shutil
    import sys
    from multiprocessing.pool import ThreadPool

    import requests

    # https://www.delftstack.com/howto/python/python-logging-stdout/
    Log_Format = "%(levelname)s - %(message)s"
    logging.basicConfig(stream=sys.stdout, format=Log_Format, level=logging.INFO)
    logger = logging.getLogger()


    def call_github_api():
        URL = "https://api.github.com/search/users?q=followers:%3E10000+sort:followers&per_page=50"
        r = requests.get(URL)
        if r.status_code == 403:
            logger.warning("Hitting a GitHub API usage error")
            # Return an empty list so the caller can still iterate over the result
            return []
        data = r.json()["items"]
        return data


    def get_users():
        data = call_github_api()
        users = []
        for user in data:
            u = [user["login"], user["avatar_url"]]
            users.append(u)
        logger.info(f"Found {len(users)} users")
        return users


    def download_photo(login: str, avatar_url: str):
        response = requests.get(avatar_url)
        if response.status_code == 200:
            file = f"photos/{login}.jpg"
            logger.info(f"Downloading {file}...")
            with open(file, "wb") as f:
                f.write(response.content)


    def downloadPhotos(users):
        # Start from an empty photos directory on every run
        dirpath = "photos"
        if os.path.exists(dirpath) and os.path.isdir(dirpath):
            shutil.rmtree(dirpath)
        p = pathlib.Path(dirpath)
        p.mkdir(parents=True, exist_ok=True)

        # starmap needs an array of arguments mapped from the list:
        # one mini-list matching the arguments needed by the download function.
        # So the structure [['bob', 'https://picture.jpeg'], ['alice', 'https://picture2.jpeg']]
        # *just works* in this context.
        #
        # https://stackoverflow.com/a/5442981
        with ThreadPool(10) as pool:
            pool.starmap(download_photo, users)

        # Sequential blocking downloads
        # for login, avatar_url in users:
        #     download_photo(login, avatar_url)


    def main():
        users = get_users()
        downloadPhotos(users)


    if __name__ == "__main__":
        main()
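On the question of how idiomatic the concurrent part is: `ThreadPool.starmap` does the job, and the other standard-library option is `concurrent.futures.ThreadPoolExecutor`, which makes it easy to handle failures per download. The sketch below is an alternative, not a measured improvement. `fake_download` and `download_all` are hypothetical names, and the worker is a stand-in that touches neither the network nor the disk.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(login, avatar_url):
    """Stand-in for download_photo: returns the path it would have written."""
    if not avatar_url.startswith("https://"):
        raise ValueError(f"bad URL for {login}")
    return f"photos/{login}.jpg"


def download_all(users, worker=fake_download, max_workers=10):
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which login each future belongs to.
        futures = {executor.submit(worker, login, url): login for login, url in users}
        for future in as_completed(futures):
            login = futures[future]
            try:
                saved.append(future.result())
            except Exception as exc:
                # One failed download does not stop the others.
                failed.append((login, str(exc)))
    # Completion order is not deterministic, so sort for stable output.
    return sorted(saved), sorted(failed)


saved, failed = download_all([
    ["bob", "https://example.com/bob.jpeg"],
    ["alice", "https://example.com/alice.jpeg"],
    ["mallory", "ftp://example.com/mallory.jpeg"],
])
print(saved, failed)
```

Swapping in the real download function would only mean passing it as `worker`, assuming it raises on failure instead of returning silently.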
Thanks to Repl.it version control I shared the incomplete exercises with candidates.
However, my versions of the answers are there as well.
For example, here’s the music.sql answer I came up with.
And of course, the first person to look at this proposed a more direct version without the common table expression (CTE), which I was using for clarity’s sake.
    /*
    TODO - print the artist name and album count

    ArtistName  AlbumCount
    ----------  ----------
    Lost        3
    Creedence   2
    The Office  2

    ONLY those artists who have released:
    - at least 2 albums
    - each having at least 20 tracks on them.

    Tip: the .tables and .schema [table] commands are handy!
    */
    .open sample.db
    .headers on
    .mode column

    WITH tracks_and_artists AS (
        SELECT
            t.albumid,
            albums.artistid,
            artists.name,
            COUNT(t.trackid) as track_count
        FROM tracks t
        INNER JOIN albums on albums.albumid = t.albumid
        INNER JOIN artists on artists.artistid = albums.artistid
        GROUP BY t.albumid, albums.title, albums.artistid
        HAVING track_count >= 20
    )
    SELECT
        name as ArtistName,
        COUNT(albumid) as AlbumCount
    FROM tracks_and_artists
    GROUP BY ArtistName
    HAVING AlbumCount >= 2
    ORDER BY AlbumCount DESC;
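The CTE-free version that was proposed is not reproduced in this post, so the following is only one guess at how such a query could look, using a correlated subquery to count tracks per album. It runs through Python's `sqlite3` module against a tiny made-up catalogue (the artist names and track counts are invented; the table and column names follow the query above) and checks that both queries agree.

```python
import sqlite3

CTE_QUERY = """
WITH tracks_and_artists AS (
    SELECT t.albumid, albums.artistid, artists.name, COUNT(t.trackid) AS track_count
    FROM tracks t
    INNER JOIN albums ON albums.albumid = t.albumid
    INNER JOIN artists ON artists.artistid = albums.artistid
    GROUP BY t.albumid, albums.title, albums.artistid
    HAVING track_count >= 20
)
SELECT name AS ArtistName, COUNT(albumid) AS AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;
"""

# Assumed alternative: filter albums with a correlated subquery, no CTE.
DIRECT_QUERY = """
SELECT artists.name AS ArtistName, COUNT(*) AS AlbumCount
FROM albums
INNER JOIN artists ON artists.artistid = albums.artistid
WHERE (SELECT COUNT(*) FROM tracks WHERE tracks.albumid = albums.albumid) >= 20
GROUP BY artists.name
HAVING COUNT(*) >= 2
ORDER BY AlbumCount DESC;
"""


def build_toy_db():
    """Create an in-memory database with invented artists, albums and tracks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artists (artistid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE albums (albumid INTEGER PRIMARY KEY, title TEXT, artistid INTEGER);
        CREATE TABLE tracks (trackid INTEGER PRIMARY KEY, albumid INTEGER);
    """)
    # Artist name -> number of tracks on each of their albums.
    catalogue = {"Alpha": [20, 22], "Beta": [25, 5], "Gamma": [20, 21, 30]}
    for name, track_counts in catalogue.items():
        artist_id = conn.execute(
            "INSERT INTO artists (name) VALUES (?)", (name,)
        ).lastrowid
        for number, count in enumerate(track_counts, start=1):
            album_id = conn.execute(
                "INSERT INTO albums (title, artistid) VALUES (?, ?)",
                (f"{name} {number}", artist_id),
            ).lastrowid
            conn.executemany(
                "INSERT INTO tracks (albumid) VALUES (?)", [(album_id,)] * count
            )
    return conn


conn = build_toy_db()
cte_rows = conn.execute(CTE_QUERY).fetchall()
direct_rows = conn.execute(DIRECT_QUERY).fetchall()
print(cte_rows)
print(direct_rows)
```

In the toy data, Beta has only one album with 20 or more tracks, so both queries leave it out.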
I love, love, love the fact that Repl.it supports SQLite as a first-class project type.
Makes it so much easier to share an idea without needing a whole dev environment.