This is a conversation that Jason McCollum and I recorded last week.
It's light on technical content, but hopefully it puts data engineering into context alongside the more widely known term, data science.
Downloading photos using Python
While someone was tackling this coding exercise for a Woolpert role, I tried it out in Python.
It was a first attempt; I'm not sure how idiomatic or even performant the concurrent download bit is.
But I had fun doing it.
Learning parallelism on the fly with Python
starmap is a very new concept to me.
Took a few minutes to get the correct syntax for calling a function with multiple arguments, but got there in the end.
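As a minimal sketch of that calling convention: `starmap` takes an iterable of argument lists and unpacks each one into the function's positional parameters. The `describe` function and the example.com URLs below are hypothetical stand-ins so the snippet runs without any network access.

```python
from multiprocessing.pool import ThreadPool


def describe(login: str, avatar_url: str) -> str:
    # Hypothetical stand-in for a download function: just formats its arguments.
    return f"{login} -> {avatar_url}"


# One inner list per call; each is unpacked as describe(login, avatar_url).
pairs = [
    ["bob", "https://example.com/bob.jpg"],
    ["alice", "https://example.com/alice.jpg"],
]

# starmap blocks until every call has finished and keeps the input order.
with ThreadPool(2) as pool:
    results = pool.starmap(describe, pairs)

print(results)
```

With plain `map`, each inner list would arrive as a single argument instead, which is why `starmap` is the one to reach for when the function takes more than one parameter.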
Tried to use a dataclass at first to make the code easier to read.
Then figured I was making things too complicated, then wished I had stuck with it instead of the crummy key['value'] crap all over the place.
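For comparison, here is a rough sketch of what the dataclass route could look like. The `User` class, the `to_users` helper, and the sample item are assumptions made up for illustration; only the `login` and `avatar_url` field names come from the exercise.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class User:
    login: str
    avatar_url: str


def to_users(items):
    # Keep only the two fields the download step needs.
    return [User(item["login"], item["avatar_url"]) for item in items]


# Hypothetical API item, trimmed down to a few keys.
sample_items = [
    {"login": "bob", "avatar_url": "https://example.com/bob.jpg", "id": 1},
]

users = to_users(sample_items)
print(users[0].login)  # attribute access instead of user["login"]

# astuple() turns each User back into (login, avatar_url), the shape starmap expects.
args = [astuple(u) for u in users]
print(args)
```

The attribute access reads more cleanly than string keys, at the cost of one extra conversion step before handing the data to `starmap`.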
import logging
import os
import pathlib
import shutil
import sys
from multiprocessing.pool import ThreadPool

import requests

# https://www.delftstack.com/howto/python/python-logging-stdout/
Log_Format = "%(levelname)s - %(message)s"
logging.basicConfig(
    stream=sys.stdout, filemode="w", format=Log_Format, level=logging.INFO
)
logger = logging.getLogger()


def call_github_api():
    URL = "https://api.github.com/search/users?q=followers:%3E10000+sort:followers&per_page=50"
    r = requests.get(URL)
    if r.status_code == 403:
        # logger.warn is deprecated; logger.warning is the supported name
        logger.warning("Hitting a GitHub API usage error")
        return []
    data = r.json()["items"]
    return data


def get_users():
    data = call_github_api()
    users = []
    for user in data:
        u = [user["login"], user["avatar_url"]]
        users.append(u)
    logger.info(f"Found {len(users)} users")
    return users


def download_photo(login: str, avatar_url: str):
    response = requests.get(avatar_url)
    if response.status_code == 200:
        file = f"photos/{login}.jpg"
        logger.info(f"Downloading {file}...")
        with open(file, "wb") as f:
            f.write(response.content)


def downloadPhotos(users):
    # Start from an empty photos directory on every run
    dirpath = "photos"
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    p = pathlib.Path(dirpath)
    p.mkdir(parents=True, exist_ok=True)
    # starmap needs an array of arguments mapped from the list
    # one mini-list matching arguments needed by the download function
    # So the structure [['bob', 'https://picture.jpeg'], ['alice', 'https://picture2.jpeg']] *just works* in this context.
    #
    # https://stackoverflow.com/a/5442981
    ThreadPool(10).starmap(download_photo, users)
    # Sequential blocking downloads
    # for u in users:
    #     download_photo(*u)


def main():
    users = get_users()
    downloadPhotos(users)


if __name__ == "__main__":
    main()
Thanks to Repl.it version control I shared the incomplete exercises with candidates.
However, my versions of the answers are there as well.
For example, here’s the music.sql answer I came up with.
And of course, the first person to look at this proposed a more direct version without the common table expression (CTE), which I was using for clarity’s sake.
/*
TODO - print the artist name and album count
ArtistName  AlbumCount
----------  ----------
Lost        3
Creedence   2
The Office  2
ONLY those artists who have released:
- at least 2 albums
- each having at least 20 tracks on them.
Tip: the .tables and .schema [table] commands are handy!
*/
.open sample.db
.headers on
.mode column

WITH tracks_and_artists AS (
  SELECT
    t.albumid,
    albums.artistid,
    artists.name,
    COUNT(t.trackid) as track_count
  FROM tracks t
  INNER JOIN albums on albums.albumid = t.albumid
  INNER JOIN artists on artists.artistid = albums.artistid
  GROUP BY t.albumid, albums.title, albums.artistid
  HAVING track_count >= 20
)
SELECT
  name as ArtistName,
  COUNT(albumid) as AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;
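To make the CTE-versus-direct comparison concrete, here is a small Python sketch using the standard library's sqlite3 module. It is only a guess at what a CTE-free query could look like, not necessarily the version that was proposed. The in-memory tables and the 'Prolific' and 'Sparse' artists are invented sample data that copy just the columns the query touches; they are not the contents of sample.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (artistid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE albums (albumid INTEGER PRIMARY KEY, title TEXT, artistid INTEGER);
CREATE TABLE tracks (trackid INTEGER PRIMARY KEY, albumid INTEGER);
INSERT INTO artists VALUES (1, 'Prolific'), (2, 'Sparse');
INSERT INTO albums VALUES (1, 'A1', 1), (2, 'A2', 1), (3, 'B1', 2), (4, 'B2', 2);
""")

# Invented track counts per album: only albums 1, 2 and 3 reach 20 tracks.
track_counts = {1: 20, 2: 25, 3: 20, 4: 5}
for albumid, n in track_counts.items():
    conn.executemany("INSERT INTO tracks (albumid) VALUES (?)", [(albumid,)] * n)

# The CTE version from the post.
CTE_QUERY = """
WITH tracks_and_artists AS (
  SELECT t.albumid, albums.artistid, artists.name, COUNT(t.trackid) AS track_count
  FROM tracks t
  INNER JOIN albums ON albums.albumid = t.albumid
  INNER JOIN artists ON artists.artistid = albums.artistid
  GROUP BY t.albumid, albums.title, albums.artistid
  HAVING track_count >= 20
)
SELECT name AS ArtistName, COUNT(albumid) AS AlbumCount
FROM tracks_and_artists
GROUP BY ArtistName
HAVING AlbumCount >= 2
ORDER BY AlbumCount DESC;
"""

# Assumed alternative: a correlated subquery filters albums, so no CTE is needed.
DIRECT_QUERY = """
SELECT artists.name AS ArtistName, COUNT(*) AS AlbumCount
FROM artists
INNER JOIN albums ON albums.artistid = artists.artistid
WHERE (SELECT COUNT(*) FROM tracks WHERE tracks.albumid = albums.albumid) >= 20
GROUP BY artists.name
HAVING COUNT(*) >= 2
ORDER BY AlbumCount DESC;
"""

cte_rows = conn.execute(CTE_QUERY).fetchall()
direct_rows = conn.execute(DIRECT_QUERY).fetchall()
print(cte_rows)
print(direct_rows)
```

On this sample data both queries return a single row for 'Prolific' with two qualifying albums. Whether the direct form is easier to read than the CTE is a matter of taste.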
I love, love, love the fact that Repl.it supports SQLite as a first class project type.
Makes it so much easier to share an idea without needing a whole dev environment.