Vicki Boykis is exactly right.
I’m going to steal her post idea and give you my reasons to learn each: git, SQL, and the CLI.
SQL
I’m starting with SQL, because if we’re talking data-centric code, we’re talking SQL. Databases talk SQL. Data stores that don’t talk SQL have SQL interfaces. You will interact with databases everywhere you go. And importantly, even if you’re not writing SQL directly, SQL’s impact means that most APIs for interacting with tabular data borrow from SQL.
You will see select, where, group by, and * join everywhere. Sometimes there’s a word substitution (like dplyr using filter for where), but understanding the basics of a SELECT query in SQL will teach you how to access data anywhere.
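For a concrete picture of those four pieces working together, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the rows are all made up for illustration; the only point is the shape of the query.

```python
import sqlite3

# In-memory database with two small, hypothetical tables and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES
        (1, 1, 20.0), (2, 1, 5.0), (3, 2, 40.0), (4, 3, 15.0), (5, 3, 2.0);
    """
)

# select / join / where / group by in one query:
# count and total the orders of at least 10, per customer region.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount >= 10
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)
conn.close()
```

The same shape carries over to other tools. In dplyr, for example, it would read roughly as an inner_join, then filter, then group_by and summarise.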
SQL also teaches you about data organization and design by its very nature. By understanding how joins, filters, and aggregations work, you start to understand the principles behind good ways to structure and store data for analytic tasks.
A day spent writing SQL is almost always a good day.
Git
Do you want to understand your code? Do you want others to understand it? The one computer science class I took taught that the way you accomplished this was writing comments. This was wrong. When you change your code, the comments don’t change. When you write your comments, they may not actually describe what’s happening. Comments have a place, but they are far from your first line of defense.
First, you should strive to write really obvious, clear code. Use descriptive nouns for all of your variables. Name your functions with descriptive verbs. Don’t be clever. It should be obvious what your code is doing simply by reading the code itself.
But your second line[1] of defense is git, where true documentation lives. Why do we use git? The main reason folks turn to distributed version control is that it makes it easy to work on the same code in the same files as someone else at the same time and still recombine that work. But the process of writing commit messages means that git can also serve as the best way to document your code. Think of a commit message as a comment that is specific to a collection of code that can span multiple files, time-stamped, and with author attribution. You can (and should) use a commit message to explain a logical collection of code changes meant to accomplish one goal. The result, combined with cleanly written code, is documentation about who did something, when, and why. Code comments too often simply try to describe an isolated how and end up being some kind of imperative pseudocode that adds very little the code itself doesn’t reveal. The limitation of comments living in-line in a single file strongly encourages the wrong behavior. A commit lets the code author define a unit of change and what is accomplished by that unit.
Git lets you travel through time and see past code and changes as they happened, revert to a known good working state, try out new ways of doing things and easily discard that work, and make huge sweeping changes without ever worrying about finding your way back to your “last known good state”. Have you ever edited a long piece of prose, moving around paragraphs and sentences to get things right? Do you paste sentences after a whole bunch of white space at the end of a document or hit undo frantically to try to get back to the point before you made things worse? Git makes all that easy for code.
If you’ve ever found a reproducible regression and written a failing test, then gone ahead and used git-bisect to find out exactly which commit broke the behavior, then found a solid commit message explaining what was done and why, you’ve known true joy.
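The idea behind that workflow is a binary search over history, driven by your failing test. The sketch below illustrates only that search idea. It is not how git implements bisect (real history is not always a straight line), and the commit names and the failing_test function are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and every commit after the first bad
    one is bad too.
    """
    good, bad = 0, len(commits) - 1  # indexes of a known-good and a known-bad commit
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


# Hypothetical linear history in which the regression arrived in "c6".
history = [f"c{i}" for i in range(1, 11)]
checked = []  # records which commits the test was actually run against


def failing_test(commit: str) -> bool:
    """Stand-in for checking out a commit and running the failing test."""
    checked.append(commit)
    return int(commit[1:]) >= 6


culprit = first_bad_commit(history, failing_test)
print(culprit, "found after", len(checked), "test runs")
```

Each test run halves the range of suspect commits, which is why the search stays cheap even across a long history.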
CLI
The CLI, or command-line interface, is a big area. When I say CLI (and I believe Vicki means the same), I’m talking about being proficient with Linux, Unix, and other POSIX-style systems. There are two separate reasons I believe in the CLI. The first is Vicki’s reason:
As a data developer, you will spend most of your time SSHing into servers, looking at stuff, and running code. This is especially true for companies that have moved to the cloud, but the pattern of, “your code lives on some remote production server and you need to get to it” is universally true. Command line is your best friend here
As soon as you plan to let computers work with your code while you’re not around rather than requiring you to hit a button to run code on your local machine, you’re going to want to have the basics of the CLI. This is how we interact with machines not in front of us.
But my second reason is even more important and harder to capture: the CLI is magic. When you first learn to write programs, it feels magical to command your computer to solve hard problems. The CLI is filled with battle-tested programs that solve a huge class of problems involved in interacting with a computer. They are blazing fast and easy to combine. Learning the CLI is learning the programming language of computer operating systems.[2]
Do you want to feel powerful? Learn how to set up an SSH tunnel on a local port on your machine, then use psql to connect to a remote database, edit your SQL in vim, and seamlessly read from and write to your local machine while running SQL on a server a thousand miles away without missing a beat. Schedule a cron job that runs a small bash script that coordinates fetching, moving, and renaming files, processing gigabytes of text with awk or sed seemingly instantly, and then loading that data into the database of a live application.
Almost all the programs I’ve written that save me so much time it feels like magic are actually just a series of command-line tools that process text[3] and work with the file system and/or operating system.
If you’re working with data, you’ll need to get it (SQL), you’ll want to process it, move it around, or use computer power somewhere you’re not and while you’re not pushing a button (CLI), and one day, you’ll want to know why you wrote the SQL or CLI script you did, how it works, get that code on another computer, and let someone else help out (git). This will be true literally everywhere you work.
[1] The third line of defense is writing tests. Whenever possible, you should be writing tests. But often tests don’t make sense in the context of a data analytics process.
[2] It’s also learning ridiculously powerful text processing tools, which is great for programmers.
[3] I spent a really long time writing
awk -F ',' '{split($2, a, "."); new2 = ""; for(i=1; i<=9;i++) new2 = new2 "," a[i]; print $1 new2 $3 }'
once, but it chewed through 350K lines in less than a second (along with several other sed and awk one-liners) and spit out the clean file I needed.