Make your code understandable.
Heuristics, Hunches, & Why the Heck We Care.
We hear many horror stories about big names having their results overturned because of problems in their code. The struggle is real. The best rules of thumb used in developer circles, even the simple ones, add a lot of value for social scientists, especially since this stuff is seldom mentioned in any graduate curriculum. Abiding by a few norms can go a long way in making data-driven research reproducible, sharable, and readable by collaborators.
In this small tutorial I cover some coding norms used by clean coding gurus and R developers. While I am talking to R users, I’m sure some of this is generalizable to Stata folks.
I’m not going to talk about relatively fancy-schmancy stuff like unit testing or object-oriented specifics. Most of us are social scientists and aren’t developing applications.
While “code-driven” research can learn a lot from the craft of programming, our needs are a bit different. Gentzkow and Shapiro make a huge point in their recent work on big data practices for economists: if professionals are paid to do it, it is likely important. However, I am not sure how much of it is practical for the social scientist.
So I’ll go out on a limb: researchers probably emphasize readability and reproducibility over writing slick, ultra-optimized code. We write to get the job done. The programming background of collaborators varies wildly, so understandability is a must. Seldom are we working with industrial-scale projects. In fact, most big data people would probably laugh at what we consider “big.”
The Broad Stuff: humbling “bang for your buck” rules.
A lot of this will seem like plain common sense, of course. Then again, many of us never think to do it.
First things first, consider the clean code theorem from Boswell and Foucher's "The Art of Readable Code."
A clean code theorem: "Code should be written to minimize the time it would take for someone else to understand it."
Consistency is key.
Consistency goes a long way in making code readable. This applies to naming rules, syntax, capitalization, white space, how we indent, etc. This type of rigidity makes our work more navigable. Consistency minimizes the “WTFs per minute” we face when staring into the black hole of code we wrote a year ago.
Comment & document like you’re at risk for a head injury.
It goes without saying that commenting and documentation matter. Many people who started off as research assistants have been admonished for not commenting enough. However, advice often stops there.
Comment often, but be brutally concise and to the point. Elucidate complex tasks. Think abstractly about your audience. Since their backgrounds vary, seemingly simple tasks may have to be elucidated.
More is not always better, however. For instance, comments easily become outdated. Clean code practice in other domains can reduce the need for us to explain everything to the user. As in the case of naming, code has the ability to speak for itself.
More generally, document your work. Make documentation a consistent feature of your script layouts, keeping up-to-date descriptions in headers.
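As a rough illustration of "concise and to the point," here is a minimal sketch contrasting a comment that restates the code with one that explains the reasoning. The object `household_income` and the missing-value code 999999 are hypothetical, invented for this example.

```r
# Hypothetical survey data; 999999 is an assumed code for refused answers.
household_income <- c(42000, 999999, 58000, 31000)

# Unhelpful comment (restates the code):
#   "Set values equal to 999999 to NA."

# More helpful comment (explains why the step exists):
# The survey codes refusals as 999999; treat them as missing so they
# do not inflate the mean.
household_income[household_income == 999999] <- NA

mean_income <- mean(household_income, na.rm = TRUE)
```

The second comment would still make sense to a collaborator who has never seen the codebook, which is the audience the advice above has in mind.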
Make names meaningful, informative.
Informative names make code infinitely more readable. Importantly, smart naming forces us to think deeper about our code and reduces the possibility of errors.
Use concrete, descriptive words and avoid ambiguity. Nouns are used for variables (as well as for classes and attributes) and describe what they contain. Similarly, use verbs to describe functions and the action (hopefully singular) they perform. The names of script files should explain what they do and end in a capital .R.
You would be surprised how much clean coding texts emphasize this. Consider an apt quote from the late computer scientist Phil Karlton:
“There are only two hard things in Computer Science: cache invalidation and naming things.”
Try longer names: they hold more information and save people from mysterious abbreviations. Contemporary code guides and R convention are moving toward long names. After all, solid IDEs, including new versions of RStudio, autocomplete variable names and reduce the cost of typing long names.
Note: By variables I mean the objects in R/Python/etc., not to be confused with variable names used in the final output of cleaned data.
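The naming advice above might look something like the following sketch. All object and function names are hypothetical, and the snake_case style is just one consistent choice among several.

```r
# Hypothetical example data.
hourly_wages <- c(12.5, 18.0, 22.75)
weekly_hours <- c(40, 35, 20)

# Hard to follow: the names say nothing about contents or purpose.
x <- hourly_wages * weekly_hours
f <- function(a) a / sum(a)

# Easier to follow: a noun for the object, a verb for the function.
weekly_earnings <- hourly_wages * weekly_hours
compute_shares <- function(amounts) {
  amounts / sum(amounts)
}

earnings_shares <- compute_shares(weekly_earnings)
```

Both versions compute the same thing; only the second can be read without scrolling back to find out what `x` and `f` are.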
Structure your script in a coherent, organized way.
A lot of time spent thinking about the structure of code, as well as writing documentation, can save heartache.
Consider Google’s R style guide suggestion for layouts, much of which can be applied to Stata and other languages:
1. Copyright statement comment
2. Author comment
3. File description comment, including purpose of program, inputs, and outputs
4. source() and library() statements
5. Function definitions
6. Executed statements, if applicable (e.g., print, plot)
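A skeleton following that ordering could look like the sketch below. The header fields are placeholders, and the data, column names, and `summarize_earnings()` function are hypothetical; `library(stats)` simply stands in for whatever packages a real project would load.

```r
# Copyright: <year> <holder> (placeholder)
# Author: <name> (placeholder)
# Description: Summarizes earnings by group.
#   Inputs:  a data frame with columns `group` and `earnings` (hypothetical)
#   Outputs: a data frame of group means, printed to the console

# source() and library() statements
library(stats)  # ships with R; a stand-in for a project's real dependencies

# Function definitions
summarize_earnings <- function(earnings_data) {
  aggregate(earnings ~ group, data = earnings_data, FUN = mean)
}

# Executed statements
earnings_data <- data.frame(
  group    = c("a", "a", "b"),
  earnings = c(100, 200, 400)
)
group_means <- summarize_earnings(earnings_data)
print(group_means)
```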
Write D.R.Y. Code: Don’t Repeat Yourself.
Avoid repetition and duplicated code. The habit of pasting giant chunks of code is ubiquitous in economics. However, this practice is a cardinal sin among developers. Errors propagate and multiply. Fixing errors becomes complicated.
Consider a bastardization of the well-known "rule of three" from Martin Fowler's seminal book on refactoring: First, we write code to get the job done. Second, we shudder and duplicate what we did. The third time, we think a little more deeply about how to rework (in coding parlance, "refactor") code so that it is more streamlined. Ask yourself: can I generalize what I'm doing in a concise way?
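To make the third step concrete, here is a minimal sketch of refactoring pasted code into one function. The `survey` data frame, its columns, and the `standardize()` helper are hypothetical examples, not taken from any of the sources cited here.

```r
# Hypothetical data frame with three variables to standardize.
survey <- data.frame(
  income    = c(10, 20, 30),
  age       = c(25, 35, 45),
  education = c(12, 16, 20)
)

# Copy-and-paste version (shown as comments): a typo in any one line is
# easy to miss, and a fix has to be repeated three times.
# survey$income_z    <- (survey$income - mean(survey$income)) / sd(survey$income)
# survey$age_z       <- (survey$age - mean(survey$age)) / sd(survey$age)
# survey$education_z <- (survey$education - mean(survey$education)) / sd(survey$education)

# Refactored version: the logic lives in exactly one place.
standardize <- function(values) {
  (values - mean(values)) / sd(values)
}

columns_to_standardize <- c("income", "age", "education")
for (column in columns_to_standardize) {
  survey[[paste0(column, "_z")]] <- standardize(survey[[column]])
}
```

Adding a fourth variable now means adding one name to `columns_to_standardize` rather than pasting and editing another line.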
Breaking code into understandable, re-usable, independent chunks translates into concise code that is easier to debug.
Functions play a key role in modularization. Use them often, keeping them short and specific to a task. (Note: I recommend Cosma Shalizi’s notes on writing good R functions and the Clean Code GitHub’s function tutorial.)
Limit the length of your script files. Split them into two files if necessary. At minimum, code should separate data preparation from analysis. Jonathan Nagler of NYU Polisci explains why:
"Separating data-manipulation and data-analysis is an example of modularity. … The logic for this is simple. Lots of things can go wrong. You want to be able to isolate what went wrong. You also want to be able to isolate what went right."
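One way to picture that separation is sketched below. In practice the two functions would sit in separate script files (the file names in the comments are hypothetical); they share a block here only so the example runs on its own. The data and function names are likewise invented for illustration.

```r
# Would live in a data-preparation script (hypothetical: 01_prepare_data.R).
prepare_data <- function(raw_data) {
  # Drop rows with missing outcomes; nothing here estimates anything.
  clean_data <- raw_data[!is.na(raw_data$outcome), ]
  clean_data$treated <- clean_data$treatment == "yes"
  clean_data
}

# Would live in an analysis script (hypothetical: 02_analyze_data.R).
analyze_data <- function(clean_data) {
  # Difference in mean outcomes; nothing here alters the data.
  mean(clean_data$outcome[clean_data$treated]) -
    mean(clean_data$outcome[!clean_data$treated])
}

raw_data <- data.frame(
  outcome   = c(5, 7, NA, 2, 4),
  treatment = c("yes", "yes", "yes", "no", "no")
)
clean_data <- prepare_data(raw_data)
mean_difference <- analyze_data(clean_data)
```

If the estimate looks wrong, `clean_data` can be inspected on its own before anyone touches the analysis step, which is the kind of isolation the quote describes.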
Refine and Refactor.
Code should improve over time. Clean code gurus repeat a code of conduct adapted from the U.S. Boy Scout dictum:
“Leave the code cleaner than you found it.”