Economic development, economic history, political economy, & messy data from messy places.

By Nathan Lane: an accidental economics PhD candidate

Tutorial: R Code Style for Empirical Economists

Make your code understandable. IBM’s data center in the 1960s, Toronto.

Heuristics, Hunches, & Why the Heck We Care.

We hear many horror stories about big names having their results overturned because of problems in their code. The struggle is real. The best rules of thumb used in developer circles, even the simple ones, have large value added for social scientists, especially since this stuff is seldom mentioned in any graduate curriculum. Abiding by a few norms can go a long way in making data-driven research reproducible, sharable, and readable by collaborators.

In this small tutorial I cover some coding norms used by clean coding gurus and R developers. While I am talking to R users, I’m sure some of this is generalizable to Stata folks.

I’m not going to talk about relatively fancy schmancy stuff, like unit testing or object-oriented specifics. Most of us are social scientists and aren’t developing applications.

While “code-driven” research can learn a lot from the craft of programming, our needs are a bit different. Gentzkow and Shapiro make a huge point in their recent work on big data practices for economists: if professionals are paying to do it, it is likely important. However, I am not sure how much is practical for the social scientist.

So I’ll go out on a limb: researchers probably emphasize readability and reproducibility over writing slick, ultra-optimized code. We write to get the job done. The programming background of collaborators varies wildly, so understandability is a must. Seldom are we working with industrial-scale projects. In fact, most big data people would probably laugh at what we consider “big.”

The Broad Stuff : humbling “bang for your buck” rules.

A lot of this will seem like plain common sense, of course. Then again, many of us never think to do it.

First things first, consider the clean code theorem from Boswell and Foucher’s "The Art of Readable Code."

A clean code theorem : "Code should be written to minimize the time it would take for someone else to understand it."

Consistency is key.

Consistency goes a long way in making code readable. This applies to naming rules, syntax, capitalization, white space, how we indent, etc. This type of rigidity makes our work more navigable. Consistency minimizes the “WTFs per minute” we face when staring into the black hole of code we wrote a year ago.

Comment & document like you’re at risk for a head injury.

It goes without saying that commenting and documentation matter. Many people who started off as research assistants have been admonished for not commenting enough. However, advice often stops there.

Comment often, but be brutally concise and to the point. Elucidate complex tasks. Think abstractly about your audience. Since their backgrounds vary, seemingly simple tasks may have to be elucidated.

More is not always better, however. For instance, comments easily become outdated. Clean code practice in other domains can reduce the need for us to explain everything to the user. As in the case of naming, code has the ability to speak for itself.

More generally, document your work. Make documentation a consistent feature of your script layouts, keeping up-to-date descriptions in headers.

Make names meaningful, informative.

Informative names make code infinitely more readable. Importantly, smart naming forces us to think deeper about our code and reduces the possibility of errors.

Use concrete, descriptive words and avoid ambiguity. Nouns are used for variables (as well as for classes and attributes) and describe what they contain. Similarly, use verbs to describe functions and the action (hopefully singular) they perform. The names of script files explain what they do and end in a capital R.

You would be surprised how much clean coding texts emphasize this. Consider an apt quote from the late computer scientist Phil Karlton:

“There are only two hard things in Computer Science: cache invalidation and naming things.”

Try longer names: they hold more information and save people from mysterious abbreviations. Contemporary code guides and R convention are moving toward long names. After all, solid IDEs (including new versions of RStudio) autocomplete variable names and reduce the cost of typing long names.

Note: By variables I mean the objects in R/Python/etc., not to be confused with variable names used in the final output of cleaned data.
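To make the naming advice concrete, here is a minimal sketch in base R. The data and every name in it (household_income, compute_mean_income_by_region, and so on) are made up for illustration; the only point is the contrast between a vague name and a noun/verb pair that explains itself.

```r
# Hypothetical toy data: household income by region.
household_income <- data.frame(
  region = c("north", "north", "south", "south"),
  income = c(10, 30, 20, 40)
)

# Vague version: what is "x"? What does "f" do?
# f <- function(x) tapply(x$income, x$region, mean)

# Clearer version: a verb for the function, nouns for the objects.
compute_mean_income_by_region <- function(income_data) {
  tapply(income_data$income, income_data$region, mean)
}

mean_income_by_region <- compute_mean_income_by_region(household_income)
print(mean_income_by_region)
```

The longer names cost a few keystrokes, but a collaborator can guess what the last two lines do without opening the function.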

Layouts Matter.

Structure your script in a coherent, organized way. A lot of time spent thinking about the structure of code — as well as writing documentation — can save heartache.

Consider Google’s R style guide suggestion for layouts, much of which can be applied to Stata and other languages:

Copyright statement comment
Author comment
File description comment, including purpose of program, inputs, and outputs
source() and library() statements
Function definitions
Executed statements, if applicable (e.g., print, plot)
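
As a rough illustration of that layout, here is a small skeleton script. The header fields are placeholders and the function and data are toy examples of my own, not anything prescribed by the style guide.

```r
# Copyright (c) <year> <copyright holder>   (placeholder)
# Author: <your name>                       (placeholder)
#
# Description: Cleans a toy price series and reports its mean.
# Inputs:  none (toy data is created inside the script)
# Outputs: prints the mean price to the console

# --- source() and library() statements ---
library(stats)  # attached by default; shown only to mark where library() calls go

# --- Function definitions ---
drop_missing_prices <- function(prices) {
  prices[!is.na(prices)]
}

# --- Executed statements ---
raw_prices   <- c(1.5, NA, 2.5, 3.5)
clean_prices <- drop_missing_prices(raw_prices)
mean_price   <- mean(clean_prices)
print(mean_price)
```

Keeping the same order in every script means a reader always knows where to look for dependencies, helpers, and the code that actually runs.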

Write D.R.Y. Code: Don’t Repeat Yourself.

Avoid repetition and duplicated code. The habit of pasting giant chunks of code is ubiquitous in economics. However, this practice is a cardinal sin among developers. Errors propagate and multiply. Fixing errors becomes complicated.

Consider a bastardization of the well-known "rule of three" from Martin Fowler's seminal book on refactoring: First, we write code to get the job done. Second, we shudder and duplicate what we did. The third time, we think a little more deeply about how to rework (in coding parlance, "refactor") code so that it is more streamlined. Ask yourself: can I generalize what I'm doing in a concise way?


Breaking code into understandable, re-usable, independent chunks translates into concise code that is easier to debug.

Functions play a key role in modularization. Use them often, keeping them short and specific to a task. (Note: I recommend Cosma Shalizi’s notes on writing good R functions and the Clean Code github’s function tutorial.)
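
A small sketch of what this looks like in practice, using base R only. The data frame and column names are hypothetical; the idea is to replace pasted, hand-edited lines with one short function applied to many columns.

```r
# Toy data with hypothetical variable names.
firm_data <- data.frame(
  sales   = c(10, 20, 30),
  exports = c(1, 2, 3),
  wages   = c(5, 5, 8)
)

# Repetitive version (copy, paste, edit; easy to mistype a name):
# firm_data$sales_z   <- (firm_data$sales   - mean(firm_data$sales))   / sd(firm_data$sales)
# firm_data$exports_z <- (firm_data$exports - mean(firm_data$exports)) / sd(firm_data$exports)

# One short, single-purpose function...
standardize <- function(values) {
  (values - mean(values)) / sd(values)
}

# ...applied to every column in one step.
columns_to_standardize <- c("sales", "exports", "wages")
standardized <- as.data.frame(lapply(firm_data[columns_to_standardize], standardize))
names(standardized) <- paste0(columns_to_standardize, "_z")
firm_data <- cbind(firm_data, standardized)
print(firm_data)
```

If the standardization formula ever changes, it changes in exactly one place.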

Limit the length of your script files. Split them into two files if necessary. At minimum, code should divide analysis and data preparation. Jonathan Nagler of NYU Polisci explains why:

"Separating data-manipulation and data-analysis is an example of modularity. … The logic for this is simple. Lots of things can go wrong. You want to be able to isolate what went wrong. You also want to be able to isolate what went right."

Refine and Refactor.

Code should improve through time. Clean code gurus repeat a code of conduct adapted from the U.S. Boy Scout dictum:

“Leave the campground code cleaner than you found it”

-Bob Martin’s "Clean Code: A Handbook of Agile Software Craftsmanship".

1972 Thailand & Burma

Copyright the Nick DeWolf Foundation from their fantastic Nick DeWolf Archive Flickr page here.

I stumbled upon the Nick DeWolf Archive, a project from the Nick DeWolf Foundation, which housed a magnificent collection of photo dumps from his travels across Southeast Asia in the 1970s.

Wozniakian Destruction – The History of Phone Hacking and its Influence on 1970s Silicon Valley

A bundle of blue boxes from

FiveThirtyEight’s Signals series had a great small documentary piece on the history of phone hacking in the 1960s-1980s, through the lens of Steve Wozniak and Steve Jobs’s pre-Apple bedroom-based manufacturing of “blue boxes,” devices used for manipulating the nation’s telecommunication infrastructure. The tidy punchline: “if we hadn’t made those little blue boxes, there might never have been an Apple computer.” Importantly, the piece pays homage to the classic Esquire piece on the rise of analog phone hacking, the predecessor to PC-based hacking culture.

From FiveThirtyEight Science:

Quick Notes – Coding Stata do-files with Sublime in Unix/Linux

I am used to writing code in notepad programs, such as N++ and the fantastic Sublime Text 3. Here’s a quick note on connecting a powerful coding notepad in Linux to Stata.

Sublime, like many of these programming-oriented editing notepads, has massively powerful tools that crush Stata’s default editor. Moreover, since many people are simultaneously juggling Python, R, and Stata (and more) scripts for a single project, the ability to work from one programming-oriented environment is nice.

While it is straightforward to run Stata do-files from Sublime Text in Mac OS and Windows, using packages like Sublime Stata Enhanced, it wasn’t obvious how to do so in Linux. The following is a little integration guide, which is indebted to this Github howto here.

Sublime, Stata & Unix Walk Into a Bar:

First, from your terminal create symbolic links for xStata and Stata commands. The gist of creating a link in the terminal is the following,

ln -s [target-filename] [symbolic-filename]
sudo ln -s /usr/local/stata14/xstata /usr/local/bin/xstata && sudo ln -s /usr/local/stata14/stata /usr/local/bin/stata
#[sudo will prompt you for your password]

Of course you can edit this to match the version of Stata (and flavor) you are using.
The following Stata package definitely works in Linux, so we’ll use it! Download it from .
Within the ZIP file from is a /Stata directory–find it and place it in the Sublime /Packages directory on your Linux system. If you’re new to Linux, this file is likely in the folder /[your user name]/.config/sublime-text-3/Packages. Notice, sometimes these files are hidden from the user in the terminal so they may be hard to find. Confirm the appearance of these files from your terminal with, ls -ld .?* in the command line.
Last, open the Stata.sublime-build file located in /.config/sublime-text-3/Packages/Stata/ directory. Replace all the text with the following,

 {
   "cmd": ["xstata do $file"],
   "file_regex": "^(...*?):([0-9]*):?([0-9]*)",
   "selector": "source.stata",
   "shell": true
 }

Seriously, just copy and paste over the stuff in the original text file. Save, restart Sublime Text for good measure, and you’re good to go.
Now when you use Sublime Text, simply typing ctrl+b executes Stata externally and runs the do-file you’re currently editing.
Note: for some reason I have run across some issues running do files in batch mode from the Unix terminal and such. I found adding an extra space at the end of my code, or a superfluous log close does the trick.


  • Sublime Text 3:
  • Rhocon’s github article for a similar approach:
  • Stata Enhanced for Sublime from rpowers (used on Linux systems):
  • Symbolic links in Unix:
  • Sublime+Stata usage in Windows and OSX:

    Data Janitors & Data Carpentry: value in the nitty gritty?

    “Report on the investigation of engineer and janitor service, Board of education, city of Chicago” (1913)

    The fantastic machine learning-oriented podcast, Talking Machines, had an interview with computer scientist David Mimno (also: his course syllabus on text mining for historians is awesome). They spent some time discussing a recent essay by Mimno that riffed off the New York Times “data janitorial” piece, arguing that data wrangling, data munging, or data janitorial work is not trivial grunt labor but rather integral to the craft of research, especially in fields utilizing machine learning. In particular, the intensive process of creating usable data sets is much less janitorial work and more akin to carpentry (a term already rolling through the data science lexicon):
    From Data Carpentry,

    Every data set has its idiosyncrasies. You can streamline the process, but you can’t avoid it. To draw out the analogy a bit more: sure, there’s Ikea, but the best furniture is still made by Amish carpenters.

    More broadly, on Talking Machines Mimno argues that knowing the intimate minutiae of data (in the same manner that humanists know an obscure corpus of work or the finalities of administrative Dutch) has broader benefits to scholars. This intimacy can inform the questions we ask.
    I couldn’t help thinking this view has application in economics, especially since I have been reading work from the great Zvi Griliches, who was a champion of the insights to be gleaned from the process of data collection. His emphasis on the value of data in economics is reflected in his 1994 presidential address to the American Economic Association:

    Zvi Griliches,

    “We ourselves do not put enough emphasis on the value of data and data collection in our training of graduate students and in the reward structure of our profession.”

    With that in mind, it’s hard to imagine many of Griliches’s insights, not least his work on productivity and technological adoption, without a granular appreciation born of wading through the muck of data. You certainly get a sense of this from his interview with Alan Krueger.

    Found Photos from Vietnam – a Digitization Project

    More at


    Newly digitized photos from Vietnam war photographer, Charlie Haughey, appearing in his fantastic new book, A Weather Walked In. See also recent coverage in the Atlantic.


    Video and story from A Weather Walked In’s crowd-source project page.

    Visualizing Interlocking Directorates, 1913

    From the St. Louis Fed’s FRASER site.


    “Exhibit 243: Diagram Showing Affiliations of J.P. Morgan & Co., National City Bank, First National Bank, Guaranty Trust Co. and Bankers Trust Co. of New York City with Large Corporations of the United States” from the FED economic history blog. From the Money Trust Investigation: Investigation of Financial and Monetary Conditions in the United States Under House Resolutions Nos. 429 and 504, Before a Subcommittee of the Committee on Banking and Currency.

    Taking geospatial data to R (& how to ditch ArcGIS)

    For R users it’s very straightforward to ditch ArcGIS (for most tasks) in favor of doing everything through an R script. There are many reasons to do this:


    • First, you can do GIS work on your Linux system or Mac without having to run things through a lame emulator.
    • Second, you can cut yourself loose from dealing with the clunky ArcGIS licensing system.
    • Third, the GIS/R user community is pretty dang big, with a growing collection of resources and libraries.
    • Fourth, you can escape the mysterious, temperamental nature of ArcGIS and have full control over data outputs. Most quant folks I know try to minimize their time processing things on ArcGIS, outsourcing data as soon as possible to Stata or R. Working entirely in R lets you skip the murky black box of ArcGIS.
    All this means there are many reasons to dump ArcGIS–something I should have done before my pal called me out on twitter.


    Here are just some aspects of working with raster and vector data in R for those wanting to migrate from ArcGIS, plus some tools that helped me with scripts to manipulate “large” data sets (say, a couple gigs of raster data).

    He’s totally right.

    To get started working with GIS data, a couple of R packages cover most ArcGIS tasks. I’d install the sp, raster, rgeostats, maptools, and rgdal packages, which cover a surprising number of bases. (Also: this is helpful to note if you’re a Linux user.)

    Starting with Raster Data

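    Let’s consider working with raster files first. You can think of loading GIS-based data just like you would any object, such as a .csv file. Specifically, the raster() function from library(raster) is enough to load raster-based images directly into R.
    rasterpopfile <- raster("/home/user/population_raster.tif")
    # Crop to the size of the Europe shapefile; the extent() function helps with this.
    # (europeshape is a hypothetical name for a shapefile already loaded in memory.)
    rasterpopfile <- crop(rasterpopfile, extent(europeshape))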


    Above, I used the extent() function to automatically use the dimensions of another file in our memory. Since we’re in R, we can easily save an extent to an object and reuse it.


    One thing that ArcGIS has over R, however, is that it is based thoroughly on a graphical user interface and allows you to see multiple layers seamlessly. Nonetheless, the raster package (as well as staples such as ggplot2) allows you to eyeball and visualize GIS tasks. To plot an individual raster layer, use plot(rasterpopfile):


    Similarly, it is fairly easy to plot multiple layers. Of course there are all sorts of wacky things you can do to visualize GIS objects, but this is pretty much what you need to graphically verify nothing wacky is going on. Hence, manipulating GIS data programmatically in R doesn’t mean flying blind.


    # Superimpose rasters and vectors by using "add=TRUE"
    plot(rasterpopfile); plot(countryshape, add=TRUE)


    Moreover, it’s pretty straightforward to perform common manipulations of raster data, such as changing the CRS. One can change the projection by using the projectRaster() function followed by the resample() function.

    # Say we have another raster with a different coordinate system.
    # We can save this coordinate system using the proj4string() function.
    target_raster <- raster("/home/user/target_raster.tif")
    target_crs <- proj4string(target_raster)
    # Reproject using the projectRaster() function and the target_crs.
    re_rasterpopfile <- projectRaster(rasterpopfile, crs = target_crs, method = "bilinear")
    # Resample onto the target grid using the resample() function.
    re_rasterpopfile <- resample(re_rasterpopfile, target_raster, method = "bilinear")
    The first manipulation changes the coordinate system of the current raster to match the target coordinate system; the second function is necessary so that the grid of the starting raster matches the grid of the target raster.
    Alternatively, you can easily specify nearest neighbor methods if you are working with categorical raster data. Now, both the target and starting raster layers should have the same resolution.
    While resample() allows us to align the grids of the two files, if the target raster is much more coarse–at a much lower resolution–we should use the aggregate() function, which lets us aggregate the cells of the fine raster to the larger raster; disaggregate() does just the opposite.
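
Here is a minimal sketch of aggregate() and disaggregate() on a synthetic raster, assuming the raster package is installed. The 10 x 10 grid and its values are made up so the example runs without any files.

```r
library(raster)

# A fine 10 x 10 toy raster with cell values 1..100 (filled row by row).
fine_raster <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(fine_raster) <- 1:100

# Aggregate each 2 x 2 block of cells into one coarser cell, taking the mean.
coarse_raster <- aggregate(fine_raster, fact = 2, fun = mean)

# disaggregate() goes the other way, splitting each cell into 2 x 2 copies.
refined_raster <- disaggregate(coarse_raster, fact = 2)

print(res(fine_raster))
print(res(coarse_raster))
print(res(refined_raster))
```

The choice of fun matters: mean suits continuous data such as temperature, while sum is usually what you want for counts such as population.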


    Shapefiles & Vector Manipulations


    The rgdal package is fantastic for reading vectorized data and shapefiles into R; its readOGR() function loads shapefiles directly.


    Besides liberating yourself from ArcGIS wackiness, you can manipulate shapefile objects similar to the way you manipulate dataframes. This is because points, lines, and polygon shapes can be recognized as special SpatialPointsDataFrame, SpatialLinesDataFrame, or SpatialPolygonsDataFrame classes. Each type, or class, of layer contains an attributes table. An advantage of this is that you can use these attributes to select parts of the shapefile as you would select a subset of a dataset.
    The rgdal package assists in loading vector-based GIS data into R, and comfortably handles ESRI shapefiles. The library’s readOGR() function is enough to get started. Here we load a standard shapefile of country polygons and create a shapefile layer for Sweden:


    # Read in with readOGR since it preserves CRS projections.
    globeshape<-readOGR(dsn="/home/user/countries.shp", layer = "countries") 

    # Subset the polygons for Sweden.
    swedeshape <- globeshape[globeshape$COUNTRY=="Sweden", ]
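
The same attribute-based subsetting can be tried without a shapefile on disk. The sketch below assumes the sp package is installed and builds a tiny SpatialPointsDataFrame by hand; the object names and the approximate city coordinates are illustrative only.

```r
library(sp)

# Toy attribute table and approximate city coordinates (lon, lat).
city_attributes <- data.frame(
  COUNTRY = c("Sweden", "Sweden", "Norway"),
  CITY    = c("Stockholm", "Gothenburg", "Oslo")
)
city_coordinates <- cbind(
  lon = c(18.07, 11.97, 10.75),
  lat = c(59.33, 57.71, 59.91)
)

cityshape <- SpatialPointsDataFrame(coords = city_coordinates, data = city_attributes)

# Subset rows by attribute, just as with a data frame.
swedecities <- cityshape[cityshape$COUNTRY == "Sweden", ]
print(swedecities$CITY)
```

The geometry travels with the rows, so the subset is still a spatial object that can be plotted or overlaid.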



    Ingredients to Manipulating GIS Data En Masse


    Sure R is a free, programmatic solution to working with spatial data. Sure it also gives you transparent control over transforming spatial data. However, a big benefit of using R is being able to manipulate giant chunks of geographic data–and to do so in a way that is reproducible via a script.

    Work with Brick and Stack Objects 

    RasterBricks and RasterStacks are your friends when working with big datasets. For instance, weather data often comes in the compact NetCDF format, where a common NetCDF file may contain hundreds of layers of daily weather data, each dimension representing geocoded raster data for a single day. RasterBricks are useful in this case, and store a multi-layered raster file in a single object that can be manipulated.


    With the ncdf4 library, you can load a 365 layer NetCDF raster file directly into R. Together with the brick() function (from the raster package), you can work with large, multi-dimensional raster files as if they were one single raster file. In other words, you can apply raster manipulations to a block of raster data directly by defining it as a brick (or stack() as well, though raster bricks and raster stacks are treated a bit differently in memory). This is handy when you want to resize, crop, or transpose an entire set of raster layers all at once instead of looping through each individual raster layer.


    For instance, using the ncdf4 library I can load a giant NetCDF file directly; together with the raster library’s brick() function, you can load an entire NetCDF file and recognize it as a RasterBrick with minimal fuss.


    library(ncdf4); library(raster)
    # brick() reads the multi-layer NetCDF file as a RasterBrick.
    netdata <- brick("/home/user/")


    And we chop hundreds of layers to an appropriate size in one go,




    We can also multiply a multi-dimensional brick object by a single raster layer, effectively multiplying a hundred rasters contained in the brick with the singleton layer. I find this extremely useful for calculating population-based weights.

    weatherXpop <-overlay(crop_dailyweather,population_raster)
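
To see cropping and multiplying a whole brick in one go, here is a self-contained sketch on synthetic data, assuming the raster package is installed. It uses plain raster arithmetic (the * operator) as a stand-in for overlay(), and the grid, layer values, and crop window are all made up for illustration.

```r
library(raster)

# Template 4 x 4 grid; all values here are toy placeholders.
template <- raster(nrows = 4, ncols = 4, xmn = 0, xmx = 4, ymn = 0, ymx = 4)

# Three "daily" layers with constant values 1, 2 and 3.
day1 <- setValues(template, rep(1, ncell(template)))
day2 <- setValues(template, rep(2, ncell(template)))
day3 <- setValues(template, rep(3, ncell(template)))
dailyweather <- brick(stack(day1, day2, day3))

# Crop every layer to the lower-left 2 x 2 window in a single call.
crop_dailyweather <- crop(dailyweather, extent(0, 2, 0, 2))

# A single "population" layer on the same cropped grid (constant 10).
population_raster <- crop_dailyweather[[1]]
values(population_raster) <- rep(10, ncell(population_raster))

# Multiply every layer of the brick by the single layer.
weatherXpop <- crop_dailyweather * population_raster
print(nlayers(weatherXpop))
print(maxValue(weatherXpop))
```

No loop over layers is needed: the single layer is reused for each layer of the brick.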


    Parallelization, foreach/plyr, apply, and data.table

    If you’re trying to programmatically manipulate many files at once, you can speed things up tremendously with a number of R libraries and features, especially those that support parallelization.

    For instance, the omnipresent plyr package and the handy foreach package allow you to parallelize time-consuming manipulations of GIS data, especially GIS tasks that you would normally loop, such as repetitive calculations of zonal statistics. I’m not sure what geoprocessing tools are supported by ArcGIS’ own parallel processing environment, but R certainly allows you to flexibly use the power of your multi-core processors for intensive tasks.

    Moreover, if you are manipulating GIS data and assembling the results into a dataset (e.g. “growing” a panel dataset of annual mean temperature readings across municipalities), the data.table package can be very helpful in speeding along processes that are usually quite inefficient in R.
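
As a rough sketch of the parallel pattern, the example below uses base R’s parallel package rather than plyr or foreach, simply because it ships with R. The compute_zone_mean() function is a hypothetical stand-in for a real zonal-statistics step, and two workers is an arbitrary choice.

```r
library(parallel)

# Toy stand-in for zonal statistics: the mean of simulated cell values per zone.
compute_zone_mean <- function(zone_id) {
  cell_values <- seq_len(100) * zone_id
  data.frame(zone = zone_id, mean_value = mean(cell_values))
}

zone_ids <- 1:4

# Start two worker processes, farm out the zones, then shut the workers down.
cluster <- makeCluster(2)
zone_results <- parLapply(cluster, zone_ids, compute_zone_mean)
stopCluster(cluster)

# Assemble the per-zone results into one dataset.
zone_means <- do.call("rbind", zone_results)
print(zone_means)
```

The structure is the same as an ordinary lapply() loop, which makes it easy to develop sequentially and switch to parallel once the function works.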

    Soekarno (Indonesia) and Khrushchev (USSR), 1960

    Indonesian President, Soekarno, and USSR's Khrushchev, 1960. Life Magazine.

    Quick Note – Growing Datasets (More) Efficiently in R

    From the Field Museum archives, 1920, Photographer Herbert P. Burtch, Oriental Institute. “Men moving Totem Pole outside Field Museum by train.”

    The Usual Mumbo Jumbo.

    Note: This was originally some notes to RAs but I figured it may be useful for other people out there.
    I’ve had some discussion with econ folks and RAs who are working with giant datasets in R for the first time, in particular those having to “harvest” or “grow” unwieldy datasets. R is notoriously slow when it comes to expanding datasets, such as when you want to incrementally append rows to a file with results from a scraping API, or combine a giant stack of raw text files from another text mining project.
    The usual “good” method for concatenation uses the do.call function with the rbind function. This method essentially takes a list of stuff and passes them as arguments all at once to rbind. In other words, you can take a list of data.frames and bind the rows together in one motion: do.call("rbind", <<<A list of data.frames>>>).
    A common task I encounter is grabbing a chunk of files from a directory and combining them into a dataset. Such a task requires three steps. First, generating a list of files from a directory that match a pattern (e.g. all the .csv files in a directory) using the list.files function. Next, looping over this list of files and loading them into R with lapply, applying the read.csv function to a list of files. Then, finally, using do.call to rbind, or stack, all the loaded .csv files into a single dataset.

    Something like this:
    # Grab the list of files in the directory "/home/user/foo" that end in ".csv"
    csvlist<-list.files(path = "/home/user/foo",
    pattern = "\\.csv$", all.files = FALSE,
    full.names = TRUE, recursive = FALSE)
    # "Apply" the read.csv function to the list of csv files.
    csvloaded<-lapply(csvlist, read.csv)
    # Append the loaded .csv files into a single dataset.
    dataset<-do.call("rbind", csvloaded)

    This is all great, but it can still take a ton of time. Below I condense the lapply function and the do.call line into one:

    ptm <- proc.time()
    dataset1<-do.call("rbind",lapply(csvlist, read.csv))
    proc.time() - ptm
    > user system elapsed
    > 48.840 0.148 50.241

    If you’re doing more complicated tasks or working with large sets of data, processing time can balloon.

    A faster method.

    Using the data.table package can speed things along if we’re trying to get big data into R efficiently (I highly recommend checking out the github for the project).
    The rbindlist function included in the package is incredibly fast and written in C. In addition the fread function is built to efficiently read data into R.
    Below I replace the normal read.csv function with fread, and replace do.call with rbindlist.

    library(data.table)
    ptm <- proc.time()
    dataset2<-rbindlist(lapply(csvlist, fread))
    proc.time() - ptm
    > user system elapsed
    > 4.044 0.084 4.144

    Both methods deliver identical datasets but there are some real efficiency gains when using fread and rbindlist from the super useful data.table package.

    > TRUE

    This can have pretty amazing payoffs when trying to load massive data sets into R to process.
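
For anyone who wants to try the three steps without a directory of real files, here is a self-contained sketch. It writes three tiny made-up .csv files to a temporary directory, combines them with base R, and repeats the exercise with data.table only if that package happens to be installed. All file and object names are placeholders.

```r
# Write three small toy .csv files to a temporary directory.
csv_directory <- file.path(tempdir(), "csv_demo")
dir.create(csv_directory, showWarnings = FALSE)
for (file_number in 1:3) {
  toy_data <- data.frame(id = 1:2, source_file = file_number)
  write.csv(toy_data,
            file.path(csv_directory, paste0("part", file_number, ".csv")),
            row.names = FALSE)
}

# Step 1: list the files that end in ".csv".
csvlist <- list.files(path = csv_directory, pattern = "\\.csv$", full.names = TRUE)

# Steps 2 and 3 (base R): read each file, then stack the results.
dataset_base <- do.call("rbind", lapply(csvlist, read.csv))
print(dim(dataset_base))

# Same steps with data.table, run only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  dataset_fast <- data.table::rbindlist(lapply(csvlist, data.table::fread))
  print(dim(dataset_fast))
}
```

On files this small the timing difference is invisible; the gap reported above only shows up with many or large files.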

    Deng Xiaoping in the U.S., 1979


    Deng Xiaoping speaks in D.C., the year relations between the People’s Republic of China and the U.S. normalized.

    [Citation: via the Asia Society blog]

    Hanoi, 1989

    David A Harvey of Magnum Photos

    Workers commute: Hanoi, 1989.

    [Photo: David Alan Harvey of Magnum Photos, 1989]

    Useful new R programs from Stockholm.

    There’s a lot of useful R programming that comes out of the Stockholm University economics community, like Mahmood Arai’s code for estimating clustered standard errors: small programs that go a long way in making R more comfortable for Stata-minded econometrics folks.

    Whelp, my pal Sirus Dehdari, a metrics guy and fellow Ph.D. candidate in economics, has some fresh code for producing both regression tables (with spatially correlated errors and other useful stuff) and regression discontinuity plots–in the vein of Outreg and Binscatter in Stata, respectively. Check out rddplot.R and rdd.R.

    Whew–the first post after 4 months of trauma after my computer crash.

    Data cleaning music: Nina Simone – Baltimore

    Suicides & Churches in Early 20th Century Seattle

    From the Making Maps blog. Fantastic plots from old sociology research, originally published in Calvin F. Schmid “Notes on Two Multiple-Variable Spot Maps” Social Forces, Vol. 6, No. 3. (Mar., 1928), pp. 378-382.

    Review: Ezra Vogel’s The Four Little Dragons – The Spread of Industrialization in East Asia

    Singapore, 1967.  Location: “North Bridge road just after Capitol theatre.” Copyright David Ayer.


    Sometimes the jacket covers of books are so dated that they obscure contents that still hold up in 2014. The Four Little Dragons was published in 1991, but the narrative still matters. Ezra Vogel’s The Four Little Dragons is a comparative primer (lecture) that distills key narratives and lessons from four “Asian miracle” economies. Vogel skillfully narrates the rapid growth episodes of Hong Kong, Singapore, South Korea, and Taiwan in a slim volume. Remarkably, he synthesizes the varied experiences of these late developers, drawing key insights from a rich comparative setting. This guide is a solid starting place for those wanting to understand “what happened.”

    While concise, and written in an almost aphoristic style, there is no shortage of ideas and bibliographic material. Western scholars can view the Asian growth miracle as one big blob of statist industrial strategies. But paces, policies, and timelines vary, and Vogel’s brevity amplifies differences among the four dragons. For instance, his juxtaposition of Singapore and Hong Kong will frustrate grand theorizers of the post-war Asian experience and those looking for crisp models of structural change.

    Vogel wrangles insights from these experiences, but you won’t find a simple, grand explanation here. His final theoretic chapter is written in a chunky, pragmatic style, much like the preceding case studies, and highlights a number of key takeaways. Common patterns have bite: The post-war era found societal hierarchies reshuffled and old elites were often supplanted. Each regime presided over industrialization with a sense of urgency; trauma undergirded the rise of the region’s KMTs or PAPs and binding political threats loomed. Importantly, new political elites had a template for industrialization: Japan’s Meiji era. Finally, as a sociologist, Vogel forcefully highlights the importance of the Confucian past in allowing the four dragons to produce competitive bureaucracies.

    In sum, Four Little Dragons is a great introduction to “what happened” across post-war Asia. Instead of a giant comparative tome or slick theories, Vogel delivers the key issues any scholar of industrialization has to confront. For this reason, The Four Little Dragons is great reading for popular economics readers or for a college/graduate course syllabus.

    The Four Little Dragons
    Published: 1991, Harvard University Press
    Author: Ezra Vogel


    Micro Maps of Conflict & the Weak States of Williamsburg


    NYTimes infographics ahead of the times: a 1970s geography of Brooklyn street gangs.

    From this New York history blog and a spin-off piece from the upcoming documentary, A Most Violent Year.

    From Iceland with Love

    Tutorial: Training an OCR Engine

    In a previous tutorial I covered the basics of digitizing old stats with ABBYY FineReader (& alternative digitization tools). Now, I dig into some important digitization nitty gritty: training optical character recognition software to properly read historical content.
    Most historical digitization projects will entail training. Old statistical documents often use long-gone proprietary typefaces. While modern OCR software can easily read Arials and Times New Romans, it needs help with more exotic typography; this is where training comes in. If you have ever worked with machine learning-type projects and/or text analysis, training software to properly classify stuff is a familiar concept. And luckily, training ABBYY FineReader’s engine is pretty easy.

    Gotta train 'em.

    Say you’re working with an old scanned document and you wish to extract its tables. If you OCR the document using the default pattern recognition settings you will likely be disappointed. Below is an example of a historic document OCR’d using FineReader’s default recognition schemes; many numbers have been replaced by letters or strange characters.

    FineReader gone wrong.
    Training is used to tell FineReader, “Hey, don’t do that. That British pound sign is really a number. Same with that W. Hey, don’t do that.” Anyone who has ever had to get their hands dirty with machine learning will immediately recognize the intuition. And if not, don’t worry.
    Using custom User Patterns, we train ABBYY FineReader’s OCRing algorithm to correctly classify characters in our historic data. Old documents will have many peculiarities that will be missed by FineReader unless we point the software in the right direction.

    Note: Somewhere buried in the sparse FineReader User Guide is a line reading, “oh no, you seldom have to train FineReader,” etc. This left us scratching our heads. I have never worked with historical data that did not entail some degree of training.

    1. Setup.

    It is a good idea to select a handful (or more) of representative pages from the document you wish to analyze. We will use these pages to train ABBYY’s recognition capabilities.
    Save your “training pages” as a distinct file and open them in FineReader.
    Pre-process these training pages, but do not “Read” them yet. Instead, click the Options button on the main toolbar (or Tools > Options), and then click the Read tab.

    From the Training section, select either Use built-in and user patterns or Use only user patterns. I typically go with the built-in and user patterns options. You can (and should) experiment to see which is best for your job.
    In the Training section, click the Pattern Editor button. The Pattern Editor box will open. From the Pattern Editor box, click New. The Create Pattern box will open; enter a name for the User Pattern. Click OK.
    Last, click the Read with training box. This will put ABBYY FineReader in a type of “training mode.” Click OK.

    2. Reading in Pattern Training mode.

    ABBYY is now in a zombie-like training mode!
    Click the main Read button.
    Instead of reading the document per usual, a Pattern Training mode box appears and ABBYY FineReader asks your advice about characters it is unsure about.
    Adjust the green frame around each character FineReader selects, making sure the box completely encompasses each character. The program is usually good about selecting the full character, but sometimes it needs some help.


    In many old documents, printed letters may be misread using FineReader’s default OCRing pattern. In my document (seen in the image above), the letter D is always poorly printed. Without direction, the OCR engine systematically reads these letters as Us or Os.
    Confirm or correct the highlighted character using the Enter the character enclosed by the frame box and click Train. FineReader saves the corrected character and moves to the next one.
    Repeat the Pattern Training process for your training pages, until there are no more characters left to train.
    Important: If working with old data, FineReader often mistakes 1s for Is, zeros for letter Os, etc. Hence, it is worth making sure that FineReader has been extensively trained to recognize your document’s digits. These errors are tedious to correct in raw output, so keep training until FineReader reliably tells digits from letters.
    People argue about the optimal number of pages to train ABBYY on, but it will probably take at least two pages to train FineReader to read the document adequately. However, since we’re dealing with a messy environment, we may wish to train many additional pages. It’s really an iterative process.
    There may be gross characters you don’t want ABBYY FineReader to store. For instance, in the image below the word “TABLE” is cut off. Feeding ABBYY bad examples can diminish recognition. If letters/numbers are cut off, use the Skip button.

    Err on the side of caution and skip weird stuff. For example, if you are uncertain what a letter/number is, also Skip it. It is better to skip a letter you are unsure about versus training FineReader incorrectly.
    Once you’re done, Save all your hard work. Click the Options button and select the Read tab; click Save to File under the Training and Language section.
    When you’re finally done with training (and have checked your pattern, see Section 3.), Load the document you wish to analyze. Now when you press Read, ABBYY will use your User Pattern in its OCRing process. Recognition should improve!
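If a few 1/I or 0/O mix-ups still slip through training, they can be patched in the raw output. Here is a minimal sketch, assuming the OCRd cells arrive as plain strings; the look-alike mapping and the rule for what counts as a "numeric-looking" cell are my own illustrative assumptions, not anything FineReader provides.

```python
import re

# Characters commonly confused with digits in OCR output.
# This mapping is an illustrative assumption; extend it for your own documents.
LOOKALIKES = str.maketrans({"I": "1", "l": "1", "O": "0", "o": "0"})

# A cell counts as "meant to be numeric" if it holds only digits, the
# look-alike letters above, and common number punctuation.
NUMERIC_LIKE = re.compile(r"[0-9IlOo.,\-]+")


def fix_lookalikes(cell):
    """Swap digit look-alikes for digits, but only in numeric-looking cells."""
    stripped = cell.strip()
    if NUMERIC_LIKE.fullmatch(stripped) and any(ch.isdigit() for ch in stripped):
        return stripped.translate(LOOKALIKES)
    return cell


# Made-up cells: two damaged numbers and two words that must stay untouched.
row = ["Saigon", "1O2", "I,250", "Oil"]
print([fix_lookalikes(cell) for cell in row])
# → ['Saigon', '102', '1,250', 'Oil']
```

Requiring at least one real digit keeps the function from rewriting short words made entirely of look-alike letters. It is a patch for stragglers, not a substitute for training.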

    3. Editing Patterns, Rooting Out Errors.

    Just like ABBYY, we make mistakes. Before we implement the pattern we trained, make sure your User Pattern is correct.
    Return to the Options window (Options button or Tools > Options) and click the Read tab. Then click the Pattern Editor… button.
    The User Pattern box will open. In the training process we may accidentally mis-characterize letters. Browse the patterns we have recorded during the training process and root out typos. These typos are important to delete! We don’t want ABBYY FineReader’s OCR engine to interpret characters incorrectly.
    If you find an incorrect character (for example, the L that has been mis-characterized as E below), click the letter image and press Delete. Be sure to really check for these types of errors, or any matches that seem iffy.
    Press OK when you are done, then save the corrected User Pattern.

    The VC & ARVN, re-enacted.


    We’re sifting through historic Vietnamese data and Pablo, my co-author, sent this along the way.

    A new model of the labor market?

    Link: Watch a stampede of idiots endlessly run straight into a spinning metal thing.

    “Artist Dave Fothergill—who’s done visual effects for a bunch of National Geographic documentary films—has created a simple 3D animation of a crowd stampeding straight into a rotating metal thing. Watch as they all get threshed into a chaotic, flailing pile. It’s mesmerizing…”

    From Death and Taxes Mag.

    Tutorial: Manipulating PDFs in Python (to Scrape Them).

    When digitizing old data, we often start with a pile of scanned documents we must reorganize. Much time is spent manually trudging through scans, deducing what variables exist, and selecting the tables we eventually wish to turn into machine-readable data. When you have hundreds of multi-page PDFs, this can be a painful experience. However, automating PDF manipulation with Python can save major time.

    The Problem

    We start with scans of old, provincial statistical yearbooks for a Southeast Asian country.

    Each yearbook page corresponds to a variable we want: page 1 has land statistics, page 2 has rice statistics, page 3 irrigation, etc.

    We want to reorganize this stack of yearbooks into something that is easier to digitize and organized by variable, not province. In essence (to the data scientist) we need to “reshape” our scanned PDF data.

    To restate the problem.

    • Have: Province x Variable PDFs

      . Most old historical data comes in the following format: hard copy volumes organized by province, state, region, etc., each with the same set of tables (cough variables).

    • Need: Variable x Province PDFs

      . And most of the time we want to pull certain pages from each geographic volume and create a new document for each variable.
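Before touching any PDFs, the reshape itself can be sketched with plain dictionaries. This is only an illustration of the bookkeeping: the function name plan_reshape and the volume filenames are hypothetical, and it assumes every volume puts the same variable on the same page.

```python
from collections import defaultdict


def plan_reshape(page_counts):
    """Map each page number to the (volume, page) pairs that should feed it.

    page_counts: dict of volume filename -> number of pages in that volume.
    Returns a dict of page number (counting from 1) -> list of (volume, page).
    """
    by_variable = defaultdict(list)
    for volume in sorted(page_counts):
        for page in range(1, page_counts[volume] + 1):
            by_variable[page].append((volume, page))
    return dict(by_variable)


# Hypothetical volumes, each with three pages (land, rice, irrigation).
plan = plan_reshape({"province_a.pdf": 3, "province_b.pdf": 3})
for page, sources in plan.items():
    print(page, sources)
```

Each key of the result is one "variable" document to build; its list says which page to pull from which provincial volume.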

    The Code

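    The following is written for Python 3 and uses the PyPDF2 package (a fork of the original pyPdf package), as well as the os module for directory manipulation. The class names used here (PdfFileReader, PdfFileWriter, PdfFileMerger) are those of PyPDF2 releases before 3.0.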

    The code starts with a directory containing our multi-page PDFs and creates a sub-folder to store individual pages, /splits.

    #The only two modules you need:
    import os
    from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
    #The directory of your (multipage) PDF files.
    start_dir = "D:\\Vietnam Provinces" # Main working directory with PDFs to chop/clean
    #Make the splits sub-folder if it doesn't exist.
    splits = os.path.join(start_dir, "splits")
    if not os.path.exists(splits): os.makedirs(splits)

    Second, our stack-o-PDFs are read, chopped, and their pages are placed in (page) numbered folders.

    The following code chunk begins at start_dir, the folder containing our original PDF files. We read each scan and then loop over its pages with the line, for i in range(in_file_pdf.numPages). Each page i is saved to a variable folder, corresponding to its page number: first pages are saved in start_dir/splits/1; second pages, in start_dir/splits/2, etc.

    for filename in os.listdir(start_dir):
        #Run the following on PDFs only.
        if filename.endswith('.pdf'):
            #Show current multi-page PDF.
            print("Splitting "+filename)
            #Define input files, paths.
            in_file = os.path.join(start_dir,filename)
            with open(in_file, "rb") as in_stream: #(be explicit about binary file)
                in_file_pdf = PdfFileReader(in_stream)
                for i in range(in_file_pdf.numPages):
                    page_num = str(i+1) #Pages are indexed from 0; folders are numbered from 1.
                    output = PdfFileWriter()
                    #Make subfolder for each page, but only once.
                    num_path = os.path.join(splits,page_num)
                    if not os.path.exists(num_path): os.makedirs(num_path)
                    #Add page i to output, define output path, save, close outputstream.
                    output.addPage(in_file_pdf.getPage(i))
                    out_file_pdf = os.path.splitext(filename)[0]+page_num+".pdf" #Add page number to new name.
                    out_file = os.path.join(splits,page_num,out_file_pdf)
                    print("Saving "+out_file)
                    with open(out_file, "wb") as outputStream:
                        output.write(outputStream)

    Third, after chopping and saving, we combine the separated pages into variable-based PDFs.

    The following code loops over each page folder (/splits/1, /splits/2,...). Using PyPDF2’s PdfFileMerger class, we combine pages within each folder into a single PDF file.

    Hence, the first page of each provincial yearbook is combined into a new file (i.e. the pages in /splits/1 become 1.pdf), which we can then scrape/digitize/pre-process/whatever.

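    for page_dir in sorted(os.listdir(splits)):
        dirname = os.path.join(splits, page_dir)
        #Only merge the page folders, not stray files (e.g. earlier merged PDFs).
        if not os.path.isdir(dirname): continue
        merger = PdfFileMerger()
        #Sort so provinces are appended in a predictable order.
        for filename in sorted(os.listdir(dirname)):
            if filename.endswith('.pdf'):
                merger.append(os.path.join(dirname, filename))
        out_file = os.path.join(splits, page_dir+".pdf")
        print("Merging into "+out_file)
        #Write the merged PDF, then release the files the merger opened.
        with open(out_file, "wb") as outputStream:
            merger.write(outputStream)
        merger.close()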

    Importantly, your project will probably look much different from this, but combining the os module with the PyPDF2 package in Python can make many splitting/merging tasks trivial. Digitizing old data often entails mind-numbing file manipulation, so a little Python can go a long way.

    On developmentalism, planning, & early big data in Allende’s Chile.

    “In Allende’s Chile, a futuristic op room was to bring socialism into the computer age.”

    Link: From the New Yorker – `The Planning Machine’ By Evgeny Morozov

    “The consultant, Stafford Beer, had been brought in by Chile’s top planners to help guide the country down what Salvador Allende, its democratically elected Marxist leader, was calling `the Chilean road to socialism.’ Beer was a leading theorist of cybernetics—a discipline born of midcentury efforts to understand the role of communication in controlling social, biological, and technical systems. Chile’s government had a lot to control: Allende, who took office in November of 1970, had swiftly nationalized the country’s key industries, and he promised “worker participation” in the planning process. Beer’s mission was to deliver a hypermodern information system that would make this possible, and so bring socialism into the computer age. The system he devised had a gleaming, sci-fi name: Project Cybersyn.”


    B-29 Attacks in WWII Japan

    (From the Maps on the Web blog.)

    Tutorial: A Beginner’s Guide to Scraping Historic Table Data

    This is a simple introduction to scraping tables from historic (scanned) documents. It is by no means definitive. Instead, this is a broad overview aimed at researchers with minimal programming experience tackling smaller digitization projects (say, nothing more than 200 pages). I focus on OCRing material with ABBYY FineReader, a popular commercial program; it has a gentle learning curve and, importantly, straightforward table functionality.

    For those more comfortable with the command line and programming, or for open source advocates, I suggest free programmatic alternatives for each tutorial step. Larger complex digitization projects often entail more technical elbow grease and advanced use of such tools.


    Oh, the enraging heritage of old data. (From the Mad Men Mondays “Data’s First Class Economy Set” Repository: Hartman Center, Rubenstein Library, Duke University.)


    First, Some OCRing Tips.

    • Try to work with scans that are at least 300 DPI, saved in TIFF format. PDFs are often unavoidable, but use less “lossy” formats when possible.
    • The older the text, the harder OCRing will be. This is especially true for text from the early 1900s and prior.
    • Cleaning improves text recognition. Pre-process scans to remove stains and borders; fix page orientation; deskew; and normalize page illumination.
    • The straighter your page, the more likely programs are to recognize tabular content.
    • Experiment with different color settings for your project. People debate the efficacy of binarized (black-white) versus color formats. Explore what works best for your project; the payoff is high.
    • OCR quality suffers at very low and very high resolutions. If you scan at a higher resolution, drop the resolution before OCRing.
    • OCR software is poor at reading small text.
    • OCR software needs your help, especially for weird typefaces: invest in training software prior to OCRing.
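The resolution tip in the list above comes down to simple arithmetic. Here is a rough sketch of it; the function name and the 300 DPI default target are assumptions for illustration (300 is only the floor suggested in the first tip, and the best target depends on your OCR engine).

```python
def downsample_size(width_px, height_px, scan_dpi, target_dpi=300):
    """Return (width, height) in pixels after resampling to target_dpi.

    Scans already at or below target_dpi are returned unchanged, since
    upscaling adds no real detail.
    """
    if scan_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    if scan_dpi <= target_dpi:
        return width_px, height_px
    scale = target_dpi / scan_dpi
    return round(width_px * scale), round(height_px * scale)


# A US Letter page (8.5 x 11 in) scanned at 600 DPI.
print(downsample_size(5100, 6600, scan_dpi=600))
# → (2550, 3300)
```

The resulting pixel dimensions can be handed to whatever image tool you use for the actual resampling.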

    1. Convert Scans: split PDFs & convert to TIFF.

    Most digitization projects don’t start with clean TIFF files. They start with nasty, multi-page PDFs produced from harried library scanning sessions. Before OCRing our scans, we must pre-process (i.e. batch clean) our images, and before we pre-process our images, we must break apart our multi-page PDF and convert it to TIFF format. This is the format used by our pre-processing software, as well as by other OCRing tools.

    Using Adobe Acrobat Pro.

    If you have the luxury of Adobe Acrobat Pro, you can easily convert multi-page scans or a set of combined PDF images into TIFFs. (File: Save As > Image > TIFFs). Better yet, you can use their GUI to extract and export subsets of the PDF document to TIFF format.



    Splitting and converting a multi-page PDF to TIFF in Acrobat Pro.


    Open source alternatives:

    If you don’t have access to Acrobat Pro, you’re in luck. There are oodles of open source tools for breaking apart PDFs and/or batch conversion. ImageMagick is the workhorse open source tool for command line-based image manipulation and is incorporated into many digitization projects. It is cross-platform and can be integrated into most major programming languages. For OSX, Pdf-Splitter by ProPublica’s Jeff Larson is a simple command line tool that utilizes native OSX libraries. Frustration with PDFs has also produced many Python packages for deconstructing documents. For instance, the pdfserenity package converts multi-page PDFs into TIFFs.

    The PyPDF2 package in Python is especially useful for manipulating PDFs, which I cover in this post here. Also, it’s pretty darn fast!
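For the ImageMagick route, here is a hedged sketch of how a PDF-to-TIFF conversion might be scripted from Python. It only builds and prints the command; the file and folder names are placeholders, and it assumes the classic convert program is on your path (ImageMagick 7 installs it as magick) along with Ghostscript, which ImageMagick relies on to read PDFs.

```python
import shlex
from pathlib import Path


def pdf_to_tiff_command(pdf_path, out_dir, dpi=300, program="convert"):
    """Build an ImageMagick command that writes one TIFF per PDF page."""
    pdf_path = Path(pdf_path)
    # %03d is filled in by ImageMagick with the page index (000, 001, ...).
    out_pattern = Path(out_dir) / (pdf_path.stem + "-%03d.tif")
    # -density comes before the input so the PDF is rasterized at that DPI.
    return [program, "-density", str(dpi), str(pdf_path), str(out_pattern)]


# Placeholder names; nothing is executed here.
cmd = pdf_to_tiff_command("yearbook.pdf", "tiffs")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting trouble with folder names that contain spaces.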

    2. Pre-process TIFFs with ScanTailor.

    Pre-processing images–straightening documents, splitting pages, removing distortions, and de-staining pages–can make or break character recognition. In the preceding step we converted our files so that we could pre-process our images using ScanTailor, an awesome open source tool specifically for batch cleaning scanned documents. It only accepts TIFF and does a limited number of basic automated tasks, but it does them damn well. Hence, it has become a staple of the digitization and hacker/text scraping community.

    While many OCRing suites, like ABBYY FineReader, also have solid pre-processing tools, ScanTailor has key advantages. Importantly, I have found it is much better at straightening pages than most tools, which is crucial for recognizing table structure. ScanTailor also comes with a command line version that can be used to script larger tasks.


    Batch pre-processing images in ScanTailor’s GUI-based version.

    Other open source and programmatic tools for pre-processing.

    While most people use ImageMagick for basic image conversion, many also use its powerful features in batch document cleaning scripts. GIMP, the popular open source graphic suite, also has promising batch pre-processing capabilities: people have had success with Nuvola tools for cleaning up greyscale scans.

    3. OCRing with ABBYY FineReader 12.

    Now that we have a pile of cleaned TIFFs, we load the files into ABBYY FineReader for further 1) pre-processing, 2) training, 3) table analysis/OCRing, and 4) error verification.

    First, use ABBYY’s pre-processing tools to further clean and select the optimal OCR resolution.

    Once loaded into FineReader, you will likely want to further clean the scans using the built-in pre-processing tools (“Edit Image”). One tool that is particularly useful is FineReader’s optimal resolution tool, which scales the resolution of the image to maximize recognition.

    But before you OCR, train.

    Training is the next crucial step. With historic data, you will likely get poor results if you neglect to train the OCR software and jump straight into OCRing. The ability to fully train OCR engines distinguishes professional software from less sophisticated OCRing tools.

    In general, training improves the ability of OCR algorithms to correctly classify characters by “tuning” the algorithm on your document’s typeface. In FineReader, training is a trivial task, where you walk the program through recognizing a sample set of characters from your document. You can easily append and save these training files, called “User Patterns.”


    Training ABBYY FineReader’s OCR engine on a sample document.

    “Analyzing” & “Reading” – Table recognition & OCRing in FineReader.

    In FineReader, layout recognition and OCR are known as “analyzing” and “reading”, respectively. Unlike straightforward digitization of textual material, here we want to make sure FineReader recognizes our table layouts, and to correct any mishaps, before it reads the content of individual table cells. First, the Analyze Selected Pages command detects the content of our pages (i.e. finds our tables). We then confirm that FineReader has recognized tables and table cells correctly, adjusting mistakes “by hand” with the built-in table editing tools. Second, we OCR the table contents with the Read Selected Pages command.


    A properly recognized table in ABBYY FineReader with OCRd content.

    Check for mistakes, tweak, & repeat.

    OCRing is never perfect. Once FineReader has “read” your document, you will want to check for errors. For each page, FineReader gives a rough error rate, reporting the number of characters it is uncertain about on each document page. The “Verify Text” tool allows us to easily check and correct each uncertain character.

    In general, after the first OCRing it is best to get a sense of how successful character recognition was and the types of errors that occur. Often, exploring pre-processing and improved training can improve text recognition.
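One quick way to get that sense is to tally which unexpected characters show up in cells that should be numeric. Here is a small sketch, assuming the OCRd column has been exported and loaded as a list of strings; the set of "expected" characters and the sample cells are made up for illustration.

```python
from collections import Counter

# Characters we expect inside a numeric table cell (an assumption; adjust
# for your own tables, e.g. if they use parentheses for negatives).
EXPECTED = set("0123456789.,- ")


def tally_unexpected(cells):
    """Count every character in cells that is not in EXPECTED."""
    counts = Counter()
    for cell in cells:
        counts.update(ch for ch in cell if ch not in EXPECTED)
    return counts


# Made-up cells standing in for one OCRd column.
column = ["1,204", "3O7", "88l", "4^2", "9O1"]
for char, n in tally_unexpected(column).most_common():
    print(repr(char), n)
```

A tally dominated by one or two letters suggests more training on those characters; a long tail of odd symbols points back to pre-processing.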

    4. Post-processing.

    Ultimately, FineReader will spit out .csv or .xlsx files, but newly digitized content still needs to be tidied up.

    Especially if you’re working with old documents, OCRing produces some junk output; dust, scratches, and page discoloration can get picked up as weird symbols: *, ^, \, etc. You can easily correct these blemishes using regular expressions in your preferred scripting language (sub/gsub in R, re.sub in Python). Better yet, OpenRefine provides some extremely flexible tools for wrangling OCRd output, making most cleaning tasks trivial while also supporting advanced regular expression use.

    Clean up weird OCR output using OpenRefine and regular expressions.
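For the Python route, a minimal sketch of this kind of cleanup is below. The junk-symbol set is an assumption built from the examples mentioned above plus a couple of similar characters, and the sample cells are invented; tune both to what your own output actually contains.

```python
import re

# Symbols that dust and scratches tend to turn into. This set is an
# assumption; tune it to your own output.
JUNK = re.compile(r"[*^\\~|]")
EXTRA_SPACE = re.compile(r"\s+")


def clean_cell(cell):
    """Remove junk symbols, then collapse leftover whitespace."""
    without_junk = JUNK.sub("", cell)
    return EXTRA_SPACE.sub(" ", without_junk).strip()


# Made-up cells with typical blemishes.
raw = ["1,204*", "^ 307", "Province \\ A", "  88  "]
print([clean_cell(cell) for cell in raw])
# → ['1,204', '307', 'Province A', '88']
```

Compiling the patterns once and applying them cell by cell keeps the rules in one place, which helps when the list of junk symbols grows.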


    Ending Note. A bit more on OCR software and getting advanced.

    Why ABBYY FineReader? First, the learning curve is lower than that of the open source options. The closest open source competition comes from Google’s tesseract OCR program, which is powerful and useful for those comfortable with the command line or OCRing from a preferred programming language. For advanced projects, tesseract offers unmatched flexibility and customization. However, in my experience, tesseract is frustrating to train and has poor table recognition ability.

    Although it is relatively easy, ABBYY FineReader has downsides. For instance, the Mac version isn’t as functional as the complete “Professional” PC version. Moreover, multi-core support is limited for both Professional and basic Corporate versions, making large projects slow and unwieldy (in my experience).

    While FineReader provides tools for table area recognition, other times we have to pursue more programmatic methods of extracting tables. Common approaches can be seen in Dan Nguyen’s ProPublica guide and Dr. Rex Douglass’ (UCSD Polisci) method, both of which use computer vision techniques to “cut up” tables, OCRing individual cells before reassembling the table. I recommend taking a peek at both to understand alternative workflows for table scraping. Other users have opted to detect tables after OCRing: first, recognizing text in PDF files and then stripping the OCRd content using PDF table extraction tools like Tabula. These methods hint at the growing hacker community interested in scraping PDF content. The recent PDF Liberation Hackathon website features some great tools to this end.

    Feel free to shoot me any feedback or share your experiences with digitizing historic data: nathaniel.lane AT

    Text Analysis, (Non-) Experts, and Turning “Stuff” into Social Science Data

    Social scientists often use experts to “code” datasets; experts read some stuff and code that stuff using a coding rule (a right-left political spectrum, etc.). But this is a slow, painful process for constructing a dataset.

    Below is an awesome, succinct Kickstarter seminar on how to categorize (political) text using crowd-sourcing from quant-y political scientist, Drew Conway.

    The great lil’ seminar is related to his recent working paper, “Methods for Collecting Large-scale Non-expert Text Coding”:


    The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified “experts”. While productive, this method is time consuming, resource intensive, and introduces bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affects the quality of output…

    A Deep Learning Bibliography

    A fantastic and extensive bibliography plus github cataloging deep learning resources/code/libraries, etc. from An amazing time vortex.


    Political Economy and Rock n’ Roll in Pre-Pol Pot Cambodia

    (A respite from data.) In the 1960s and 70s Cambodia was home to an emergent rock scene, inspired by Western rock that was rolling its way into the region. The rise of the Khmer Rouge put a swift end to a subversive subculture. As in the clips below, youth culture isn’t quite compatible with authoritarianism…much less a utopian regime attempting a hard reboot of society to “year zero.”

    A new documentary, “Don’t Think I’ve Forgotten: Cambodia’s Lost Rock and Roll”, has captured the full story:

    A nod to the archivist and digitization effort–and how (literal) records survived a regime that wiped out property and artists alike:



    Photo from: KI Media blog.
    Project, one of many: Don’t Think I’ve Forgotten – Cambodia’s Lost Rock & Roll

    A great primer on cleaning OCRd data with Python & Regular Expressions.

    Link: Cleaning OCR’d text with Regular Expressions

    Often the pain of optical character recognition isn’t the OCRing procedure itself, it is cleaning the tiny, little inconsistencies that plague OCRd content. This is especially true when we OCR historical material: even high quality scans can have a speckle or two that get recognized as gibberish.

    Adept use of Regular Expressions (regex) coupled with simple Python (or Ruby scripts–or heck, even Notepad++) can be a powerful means of removing nasty errors from OCRd text/CSV files.

    Here’s an awesome little primer from Laura Turner O’Hara at The Programming Historian on using Python to clean nasty OCRd content using regexes. Great sample (verbose) code for helping turn mush into data. Importantly, the primer breaks down a lot of the regex components, which is helpful for those getting started with this brand of data cleaning.

    Of course, regex+Python won’t be perfect. While there are preferred ways of using regex to wrangle text, Python gives most of us quick, programmatic means of cleaning nasty OCRd spreadsheets and the like. Most importantly, however, errors in our OCRd content are seldom systematic, which makes completely automating OCRd data cleaning tricky. There will be hand polishing involved. But as the primer notes, the point isn’t perfection; it’s to let regex+Python “do the heavy lifting.”
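Since some hand polishing is unavoidable, it helps to have the script say where to polish. The following is a rough sketch, not the primer's method: after automated cleaning, it flags any cell that still fails a simple "looks like a number" check. The pattern and the sample table are my own assumptions.

```python
import re

# A "clean" numeric cell: digits with optional thousands separators and an
# optional decimal part. This pattern is an assumption about the table format.
CLEAN_NUMBER = re.compile(r"\d{1,3}(,\d{3})*(\.\d+)?|\d+(\.\d+)?")


def flag_for_review(rows):
    """Return (row index, column index, value) for cells that fail the check."""
    flagged = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not CLEAN_NUMBER.fullmatch(value.strip()):
                flagged.append((r, c, value))
    return flagged


# Made-up numeric block of a table, after automated cleaning.
table = [["1,204", "307"], ["88", "4S2"], ["", "12.5"]]
print(flag_for_review(table))
# → [(1, 1, '4S2'), (2, 0, '')]
```

The flagged list becomes the hand-checking to-do list: regex does the heavy lifting, and a person only looks at the cells it could not vouch for.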

    An Investigative Journalist’s Guide to Geolocating Media

    Link: An Investigative Journalist’s Guide to Geolocating Media

    Bellingcat, a crowd-funded start-up of Middle East wonks, investigative journalists, and researchers, has made some waves for demonstrating how to geo-locate an Iraqi ISIS training camp.

    Here are a couple of fantastic guides to some of the techniques they use to geolocate media, combining picture/video data alongside common geographic tools. One of the interesting techniques used lately consists of extracting and mapping metadata from photographs using tools like Panoramio.

    From the Historical Times

    historicaltimes:19 inch color TV draws a crowd in Shantou, China. 1983


    Letter from Senor Don Enrique Dupuy de Lôme to Senor Don Jose Canelejas, 1898.

    Item From: General Records of the Department of State. (09/1789-)

    This letter, written by the Spanish Ambassador to the United States, criticized President McKinley by calling him weak. The publication of the letter caused the public to support a war with Spain over the independence of the colony of Cuba.



    Two men struggle to free their scooter from a barbed-wire barricade in Saigon, South Vietnam, 1965. Photograph by W. E. Garrett, National Geographic Creative

    Guides:’s Archives Reviews

    The Fresh from the Archives Guide – from

    [Photo: Poor students gazing into microfilm from the Special Collections Department, ISU Library]

    From the US NARA tumblr.


    Berlin Wall Reinforced. Under The Watchful Eye of Communist Police, East German Workers Near The Brandenburg Gate Reinforce The Wall Dividing The City, 10/1961.

    Item From: Records of the U.S. Information Agency. (1982-1994).

    A signature of an era, the Berlin Wall symbolized the descending of the Iron Curtain. Construction on the Wall started on this date in 1961.



    Bikini Kill’s Kathleen Hanna reads The Riot Grrrl Manifesto.

    The data on white anxiety over Hispanic immigration


    Historic Aggregate Data for Korea, 1910-1945 — and beyond.


    A bibliography on land, peasants, and politics for Malaysia, Indonesia and the Philippines


    Dataset: Philippine Municipalities Created by Executive Order

    Oh boy. Doing panel econometrics and economic history in developing countries is awful. One problem, ALWAYS, is when we want to match historical data to contemporary data. Doing so entails historic shapefiles or datasets that allow us to track administrative boundary changes; these things are seldom available for places like the Philippines. 

    So here is a little dataset that lists new municipalities created by executive order (only) from 1936-1965

    Philippine Executive Order Municipalities, 1936-1965 – Table on Mode.

    Note this does not list the “mother” municipalities. 
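To see why the “mother” municipalities matter, here is a hedged sketch of how a crosswalk would be used once you have one. The crosswalk must come from another source (this table does not supply it), and the names and numbers below are placeholders, not taken from the dataset.

```python
from collections import defaultdict


def aggregate_to_historic(values, crosswalk):
    """Sum contemporary values up to historic ("mother") units.

    values: dict of contemporary municipality -> number.
    crosswalk: dict of new municipality -> mother municipality. Units missing
    from the crosswalk are treated as unchanged and map to themselves.
    """
    totals = defaultdict(float)
    for unit, value in values.items():
        totals[crosswalk.get(unit, unit)] += value
    return dict(totals)


# Placeholder names and numbers, not taken from the dataset.
crosswalk = {"New Town": "Old Town"}
population = {"Old Town": 1000, "New Town": 250, "Other Town": 400}
print(aggregate_to_historic(population, crosswalk))
# → {'Old Town': 1250.0, 'Other Town': 400.0}
```

This handles a single split only; a municipality carved out of one that was itself carved out earlier would need the crosswalk applied repeatedly.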

    A great little bit on the practice/work style of my friend, Erin Riley, a textile artist.

    How do you start your day? Is there something you do every morning?

    I usually just make sure I have a decent amount of coffee and water within reach and make sure I have all the colors I need to work for a while. I turn on all the lights, fans, take my vitamins, check my email and then get to weaving.

    Non-parametric Econometrics and Quantile Regressions Online.

    On the tail of some cool new econometric papers are a couple of cool new Stata programs.

    The Bastards Book of Regular Expressions by Dan Nguyen
