This is a conversation that Jason McCollum and I recorded last week.
It's light on technical content, but hopefully it puts data engineering into context alongside the more widely known term, data science.
While someone was tackling this coding exercise for a Woolpert role, I tried it out in Python.
First attempt; I’m not sure how idiomatic or even performant the concurrent download bit is.
But I had fun doing it.
starmap is a very new concept to me.
Took a few minutes to get the correct syntax for calling a function with multiple arguments, but got there in the end.
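As a rough illustration of that syntax, here is a minimal sketch of starmap unpacking each inner list into a function's positional arguments. The `describe` function and the example.com URLs are made up for the demo; nothing is downloaded.

```python
from multiprocessing.pool import ThreadPool


def describe(login: str, avatar_url: str) -> str:
    # Hypothetical stand-in for a download function that takes two arguments.
    return f"{login} -> {avatar_url}"


# One inner list per call; starmap unpacks each one into (login, avatar_url).
pairs = [
    ["bob", "https://example.com/bob.jpg"],
    ["alice", "https://example.com/alice.jpg"],
]

with ThreadPool(2) as pool:
    # starmap blocks until every call has finished and keeps the input order.
    results = pool.starmap(describe, pairs)

print(results)
```

With plain `map`, each inner list would arrive as a single argument instead, which is why `starmap` is the better fit for a two-argument function.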
Tried to use a dataclass at first to make the code easier to read.
Then figured I was making things too complicated, then wished I had stuck with it instead of the crummy key['value'] crap all over the place.
import requests
import logging
import pathlib
import shutil
import os
from multiprocessing.pool import ThreadPool
import sys

# https://www.delftstack.com/howto/python/python-logging-stdout/
Log_Format = "%(levelname)s - %(message)s"

logging.basicConfig(stream=sys.stdout, filemode="w", format=Log_Format, level=logging.INFO)
logger = logging.getLogger()


def call_github_api():
    URL = "https://api.github.com/search/users?q=followers:%3E10000+sort:followers&per_page=50"
    r = requests.get(URL)
    if r.status_code == 403:
        # logger.warn is deprecated; logger.warning is the supported spelling
        logger.warning("Hitting a GitHub API usage error")
        # Return an empty list so get_users() can still iterate over the result
        return []
    data = r.json()["items"]
    return data


def get_users():
    data = call_github_api()
    users = []
    for user in data:
        u = [user["login"], user["avatar_url"]]
        users.append(u)
    logger.info(f"Found {len(users)} users")
    return users


def download_photo(login: str, avatar_url: str):
    response = requests.get(avatar_url)
    if response.status_code == 200:
        file = f"photos/{login}.jpg"
        logger.info(f"Downloading {file}...")
        with open(file, "wb") as f:
            f.write(response.content)


def downloadPhotos(users):
    # Start from an empty photos directory on every run
    dirpath = "photos"
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    p = pathlib.Path(dirpath)
    p.mkdir(parents=True, exist_ok=True)

    # starmap needs an array of arguments mapped from the list
    # one mini-list matching arguments needed by the download function
    # So the structure [['bob', 'https://picture.jpeg'], ['alice', 'https://picture2.jpeg']] *just works* in this context.
    #
    # https://stackoverflow.com/a/5442981
    with ThreadPool(10) as pool:
        pool.starmap(download_photo, users)

    # Sequential blocking downloads
    # for login, avatar_url in users:
    #     download_photo(login, avatar_url)


def main():
    users = get_users()
    downloadPhotos(users)


if __name__ == "__main__":
    main()
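For comparison, here is roughly what the dataclass idea could have looked like. This is a sketch, not the version I abandoned: the `GithubUser` class, its `photo_path` helper, and the sample items are all hypothetical, and only the `login` and `avatar_url` keys come from the code above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GithubUser:
    # The two fields the download step actually needs.
    login: str
    avatar_url: str

    def photo_path(self, directory: str = "photos") -> str:
        # Hypothetical helper: where this user's avatar would be saved.
        return f"{directory}/{self.login}.jpg"


def parse_users(items: list[dict]) -> list[GithubUser]:
    # Turn raw API items into typed objects once, at the edge of the program.
    return [GithubUser(item["login"], item["avatar_url"]) for item in items]


# Made-up items shaped like the "items" array the code above reads.
sample_items = [
    {"login": "bob", "avatar_url": "https://example.com/bob.jpg"},
    {"login": "alice", "avatar_url": "https://example.com/alice.jpg"},
]

users = parse_users(sample_items)
for user in users:
    print(user.login, user.photo_path())
```

With a single `GithubUser` argument, the download function could be called through `ThreadPool.map` rather than `starmap`, and the rest of the code would read `user.login` instead of indexing into lists or dicts.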
Thanks to Repl.it version control I shared the incomplete exercises with candidates.
However, my versions of the answers are there as well.
For example, here’s the music.sql answer I came up with.
And of course, the first person to look at this proposed a more direct version without the common table expression (CTE), which I was using for clarity’s sake.
/*
TODO - print the artist name and album count
ArtistName AlbumCount
---------- ----------
Lost 3
Creedence 2
The Office 2
ONLY those artists who have released:
- at least 2 albums
- each having at least 20 tracks on them.
Tip: the .tables and .schema [table] commands are handy!
*/
.open sample.db
.headers on
.mode column

WITH tracks_and_artists AS (
  SELECT
    t.albumid,
    albums.artistid,
    artists.name,
    COUNT(t.trackid) AS track_count
  FROM tracks t
  INNER JOIN albums ON albums.albumid = t.albumid
  INNER JOIN artists ON artists.artistid = albums.artistid
  GROUP BY t.albumid, albums.title, albums.artistid
  HAVING track_count >= 20
)
SELECT
  name AS ArtistName,
  COUNT(albumid) AS AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;
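I don't have the more direct version that was proposed to hand, so the following is only a guess at what a CTE-free form might look like: a correlated subquery filters albums by track count, and the grouping happens in the same statement. To keep it runnable without sample.db, the sketch builds a toy in-memory SQLite database in Python. The table and column names are taken from the query above; the toy rows and the `QUERY` name are made up.

```python
import sqlite3

# Assumed CTE-free rewrite: not necessarily the version that was proposed.
QUERY = """
SELECT artists.name AS ArtistName, COUNT(*) AS AlbumCount
FROM artists
INNER JOIN albums ON albums.artistid = artists.artistid
WHERE (SELECT COUNT(*) FROM tracks WHERE tracks.albumid = albums.albumid) >= 20
GROUP BY artists.artistid, artists.name
HAVING COUNT(*) >= 2
ORDER BY AlbumCount DESC;
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (artistid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE albums (albumid INTEGER PRIMARY KEY, title TEXT, artistid INTEGER);
CREATE TABLE tracks (trackid INTEGER PRIMARY KEY, albumid INTEGER);
""")

conn.executemany(
    "INSERT INTO artists VALUES (?, ?)",
    [(1, "Prolific"), (2, "Mixed"), (3, "OneAlbum")],
)

# (albumid, title, artistid, number of tracks to generate)
toy_albums = [
    (1, "P1", 1, 20),
    (2, "P2", 1, 22),
    (3, "M1", 2, 20),
    (4, "M2", 2, 5),   # too short, so "Mixed" has only one qualifying album
    (5, "O1", 3, 25),  # long enough, but "OneAlbum" has just this one
]
for albumid, title, artistid, track_total in toy_albums:
    conn.execute("INSERT INTO albums VALUES (?, ?, ?)", (albumid, title, artistid))
    conn.executemany(
        "INSERT INTO tracks (albumid) VALUES (?)", [(albumid,)] * track_total
    )

rows = conn.execute(QUERY).fetchall()
print(rows)
conn.close()
```

On the toy data only "Prolific" has two albums with at least 20 tracks each, so it should be the only artist returned. Whether this reads better than the CTE is a matter of taste; I still find the CTE easier to follow.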
I love, love, love the fact that Repl.it supports SQLite as a first class project type.
Makes it so much easier to share an idea without needing a whole dev environment.