03 January 2016

Programming

Part of my writing this past summer and autumn involved a lot of programming. In fact, that's been going on for a long time. I won't take you all the way back to 1995. In spring 2014 I decided that I was going to do all of my dissertation analyses in Python, and I started at the beginning with the free interactive course at Codecademy. It was a good course and I highly recommend it. One of the exercises was to write a little Battleship game, which I took up again after I finished the course and expanded to include most of the rules in the real board game. It even tells you if you get a near-miss!

Anyway, I chose Python because it could be a catch-all for the various things I needed to do for my research work. Parts of my workflow involve external programs, image processing, mathematical modeling, statistical analyses, visualization, large-array I/O, etc., all requiring repeatability for different locations and times. ArcGIS can't do all that, and it's expensive anyway. Perhaps I could do it in R, but I'd be learning that from scratch as I already did with MATLAB, which can't do all of what I want. Fortran, my "home language" for a long time, would be too cumbersome. Python has this interactivity, this flexibility, and this lack of a compile step that were greatly appealing. It's called a "glue language" by many programmers for good reason. One Codecademy course and one O'Reilly book (Python for Data Analysis) and I was off and running up the learning curve.

One other consideration: this autumn my advisor and I decided together that my analysis code would be published with the papers that I would be producing. We're going for transparency and repeatability, but we're also looking to make something useful for the community. The things that I'm looking at around Lake Superior, other people might want to examine around Lake Michigan or in the Appalachians or who knows where. Something with Python that you really don't get with compiled programming languages is readability. If you know the basics (again, the Codecademy course is just about enough) then you can read the whole script and know what it's doing.

Now, of course, you might say "Don't you want to keep that for yourself, get those publications for yourself, have the University sell it to generate research money?" Nope. I don't have time to analyze every interesting place on Earth, I'll get enough publications through my own research interests (which are constantly evolving, too), and money (specifically, the lack of it) is so often an obstacle to good science. Lots of universities have strict rules about publishing software. The creator of the GNU operating system has a good story about MIT's licensing requirements and why he quit his job there before developing it. I've noticed that anything developed within the University of California system (including the BSD code underlying my Mac's OS) is copyrighted to the university system's Regents, not the people that actually wrote the software. As for me, I prefer to put my brain out there for free, and my University allows that. It's just how I want the science to be.

So, for that paper I mentioned a couple days ago, we will be posting the accompanying analysis scripts to a GitHub repository for anyone and everyone to pull down, reproduce our results (with the provided raw dataset, same as I used to develop the paper), generate new results for other places, find bugs, make improvements, add features and methods, and collaborate to make it even better. That's 5500 lines of Python in 24 scripts, through which about 60 MB of input data generates something like 280 GB of analytical output (including tons of plots and maps). While also writing the paper, I spent this past autumn prepping, commenting, reorganizing, modularizing, simplifying, and streamlining that package so that it will be as easy as I can make it for a new user to get started. I'm still working on the README (that is, instructions for use) but I figure I'll have the time while the paper is in review to get that finished.

We intend to do the same for the next two papers in The Plan—this is just "Part 1." Publishing the code is certainly more work than just getting the results and publishing the paper. But this way, interested researchers can dig in to their hearts' content. Reproduce my analyses and tell me if I missed something. Use it for your own analyses for that area you're interested in, cite my paper in your own publication, and then take the time to tell me how you think the code can be improved for more general use. Everybody wins.

Oh, and about Fortran, I resurrected from my archives some code that I wrote for a paper that I published in 2008, and I replaced a rather opaque 300-line F90 module with an equivalent 20-line Python/NumPy function for the analyses that went into this new paper. I'm pretty sure I have a new "home language" now.
