Type Slowly

Feb 15
“regexes and our use of them suggest that we’re not allergic to density in information per character but to something else. One culprit is is simply unfamiliarity. Regexes are okay because we’re familiar with them, other forms of density are bad because we’re not familiar with them.”

James Iry has some interesting thoughts on the subject of verbosity and clarity in programming languages - arguing that the continued popularity of regular expressions shows that density isn’t an intrinsic barrier to readability, and rather, that our understanding of the underlying concepts encoded in a certain language construct is the greater culprit.

I think this is a key point - regexes simply encode finite state machines, a readily understood concept. Similarly, the ‘+’ and ‘*’ operators of arithmetic encode the ideas of sums and products, readily understood by everyone from primary school onwards.

Density becomes a problem then, when then underlying concepts aren’t familiar to the reader. For instance, sums and products on non-numeric data types can be confusing to people with no previous exposure to the idea of them as generalised algebraic operations. Similarly, the combinators associated with typeclasses inspired by category theory (as popularised in Haskell), seem foreign to many who haven’t come across the concepts before. However, in this instance, use of ‘bind’ as opposed to ‘»=’ or something like ‘liftedFApply’ instead of ‘<*>’ would not improve the readability of the code, as the underlying concepts are still foreign.

However, this isn’t to suggest that these underlying concepts should be avoided because of this - the increase in both expressiveness and opportunity for code reuse afforded by these sorts of generic structures is incredibly valuable in my opinion. However, this increases the importance of writing decent documentation and comments, in order to aid the understanding of programmers who might be less familiar with the concepts (and therefore notation) being used.

Having said that though, this isn’t meant to deny that there’s a tradeoff between familiarity and genericity that needs to be taken into account - and the optimum balance might well fall at a different point for others. (For instance, I find scala’s ‘:/’ and ‘/:’ syntax for foldLeft and foldRight baffling). However, I think James is absolutely right in stating that brevity isn’t in itself a bad thing.


Jan 30

Thrift installation woes part II: Using Thrift with a kerl’d erlang.

More notes on thrift setup, mostly for my own reference, but hopefully they’ll be useful to anyone else in the same situation.

So, following my last post, you’ve got thrift set up on your local machine. However, you’re using the highly useful kerl tool to work with different erlang versions, and you want to use both this and thrift with your project. This means you need to recompile and install the thrift erlang libraries to work with your kerl installation. 

Thankfully this isn’t too painful, and doesn’t involve recompiling all of thrift again, you just need to pass a few extra options to the configure script, and ensure that you only make and install the erlang bindings, rather than the whole thing:

First, you need to run the configure script, setting environment variables that refer to the location of erl, erlc, and your erlang installation’s lib directory, and also explicitly disabling all the libraries apart from erlang:

ERL=$PATH_TO_YOUR_KERLD_ERLANG/bin/erl \\
ERLC=$PATH_TO_YOUR_KERLD_ERLANG/bin/erlc \\
ERLANG_INSTALL_LIB_DIR=$PATH_TO_YOUR_KERLD_ERLANG/lib/ \\
bundle exec ./configure --without-cpp --without-c_glib --without-csharp \\
--without-java --without-python --without-perl --without-php \\
--without-php_extension --without-ruby --without-haskell --without-go

Then, cd into the lib/erlang directory in the thrift source tree, before running make and make install as normal (don’t use sudo for this if you’re installing into a directory inside your home):

cd lib/erlang
make
make install

Jan 17

Jruby on Elephants

A few months ago, I gave a talk at the excellent Ruby Manor conference about using Jruby with Mahout to do machine-learning (More specifically, to detect twitter messages containing pictures of kittens). There’s now a video of it online, as well as slides and code.

One small additional note - my explanation of Logistic Regression is a little confused, and is subtly but fundamentally incorrect. In the simple one dimensional example I show, the Y-axis corresponds to the feature to be learned, while the X-axis corresponds to the feature we’re learning from - Watching the video back, It looks like I’m describing the Y-axis as a feature,which would be silly, as then we wouldn’t actually be predicting anything. So, the logistic curve, in this 1D example gives us a means of predicting the classification (Y-high or Y-low) from the (real) value of the X axis. Sorry!


Installing Thrift on Ubuntu Oneiric, with Ruby 1.9.2 (via RBenv) and Haskell

A few notes, to save my sanity if and when I run into this again.

Get the thrift sources from SVN, untar the file, then add a Gemfile to the root of the source tree:

source "http://rubygems.org"
gem "mongrel", "=1.2.0.pre2"
gem "rspec", "=1.3.2"
gem "rake", "0.9.2"

Then run bundle install. The Haskell bindings require the Binary library to be installed in cabal. This needs to be done globally, as cabal by default doesn’t look at packages installed per-user:

 sudo cabal install binary --global

Then you need to do the configure / make / make install dance, running each command with bundle exec:

 bundle exec ./bootstrap
 bundle exec ./configure
 bundle exec make

However, there’s one last gotcha. ‘make install’ needs to be run as root, but in order to ensure it uses the bundled rake and gems, and rbenv’s selected ruby, you need to become root first, then run make install:

 sudo -s
 bundle exec make install
 exit

That should do it!


Jun 8

Automatically backing up a postgres database

I’ve got a web server I’m looking after that has a postgres database with some important information in it. The server’s filesystem itself is backed up nightly, but I want to make sure that there’s a dump of the database on disk before this happens. This clearly requires a cron job, and the use of the pg_dump command. However, I’ve got a few other requirements: I need backups for the last n days to be stored and automatically rotated, and the backups should be gzipped for storage. The result is this handly little bash script:

#!/usr/bin/env bash
DATABASE_NAME=somedatabase
BACKUP_PATH=/path/to/backup/dir
BACKUP_LIMIT=7
pg_dump $DATABASE_NAME | gzip -c > "$BACKUP_PATH"/"$DATABASE_NAME"_`date +"%Y%m%d%H%M"`.sql.gz
BC=$(ls $BACKUP_PATH | wc -l)
if [ "$BC" -ge "$BACKUP_LIMIT" ]; then
    ls -t $BACKUP_PATH | tail -$(($BC-$BACKUP_LIMIT)) | xargs -I{} rm $BACKUP_PATH/{}
fi

Apr 25

What’s the difference between the strict and lazy identity monads?

I came across this question while reading  Simon Marlow et al’s paper “Seq no more: better strategies for parallel Haskell”, which describes the Eval monad (representing the result of a parallel computation) as being equivalent to the Strict identity monad. 

Now, it may seem obvious that it’s possible to have identity monads with either lazy or strict semantics, but I’d never come across them in the wild before, and it took a little bit of thinking to fully understand the difference between them.

First, it’s important to note that when we describe a monad as being lazy or strict, we’re describing the semantics of the bind operation with respect to the monadic value it’s operating on. So, a strict monad when bound to a function will always evaluate its argument, whereas a lazy monad will only evaluate its argument if the bound function references it.

The key to implementing this lies in the fact that pattern-matching arguments to a function forces strict evaluation - in order to carry out a pattern-match, the argument must be evaluated. Therefore, if we define our bind function to pattern-match against its first argument, we are implicitly affording strict semantics to our monad. If, on the other hand, we do not pattern match against the first argument, our monad will have lazy semantics. In code:

import Control.Monad

data LazyID a = LazyID { runLazyID :: a }

instance Monad LazyID where
  return = LazyID
  m >>= f = f (runLazyID m)

data StrictID a = StrictID { runStrictID :: a }
instance Monad StrictID where
  return = StrictID
  (StrictID v) >>= f = f v

Mar 24

Making Nationwide’s online banking play nicely with Freeagent

I don’t like Barclays Bank. I don’t like their President, I don’t like their attitude to tax payments, and I definitely don’t like the extent to which they invest in the sale and manufacture of weapons. As a result of this, I’ve moved my bank account to Nationwide, who so far have been most impressive, apart from one small issue. I use the wonderful FreeAgent software to handle all my accounting, tax, and the like, and one of the awesome things it can do is import your bank statements in OFX format, which saves me doing a hell of a lot of very boring data entry, and leaves me with only a small amount of slightly less boring book-balancing to do.

However, Nationwide don’t support export of your banking data as an OFX file, opting instead for a rather arbitrary CSV-based format, delivered in an arcane character encoding that appears to be mostly ASCII, apart from £-signs, which are encoded as Latin-9.

All’s not lost though, since Freeagent accept CSV files (in their own, rather simpler format), it’s possibile to convert between the two with this handy Ruby script. Simply download it, set the executable bit, and run it with:

convertBankStatement your_nationwide_statement.csv file_to_write_to.csv

You’ll need Ruby 1.9.2. This is how it works:

#!/usr/bin/env ruby
# encoding: UTF-8
require 'csv'
require 'iconv'

@converter = Iconv.new("ISO8859-15", "utf-8")

date = /\d\d [A-Z][a-z][a-z] \d\d\d\d/
string = /.*/
nothing = /^$/
amount = /£\d+\.\d\d/
money_in = [date, string, nothing, amount, amount]
money_out = [date, string, amount, nothing, amount]

def normalise(t)
  t.map { |s| s.to_s.gsub("\xA3".force_encoding(Encoding.aliases["ISO8859-15"]), "£").force_encoding(Encoding.default_external) }
end

def transaction_type(array, pattern)
  !array.empty? && array.zip(pattern).inject(true) {|m, (string, pattern)| m && string =~ pattern}
end

def date(transaction)
  Date.parse(transaction[0]).strftime("%d/%m/%Y")
end

def description(transaction)
  transaction[1].strip[0..31]
end

def outgoing(transaction)
  transaction[2].gsub("£", "-")
end

def incoming(transaction)
  transaction[3].gsub("£", "")
end

CSV.open(ARGV[1], "wb") do |o|
  CSV.foreach(ARGV[0], :encoding => Encoding.aliases["ISO8859-15"]) do |t|
    t = normalise(t)
    if transaction_type(t, money_in)
      o << [date(t), incoming(t), description(t)]
    elsif transaction_type(t, money_out)
      o << [date(t), outgoing(t), description(t)]
    end    
  end
end


Mar 20
I normally try not to wear dorky t-shirts, but this one is just too good to pass up. The Y-combinator, set in the style of the other Y-combinator&#8217;s logo. If you like the lambda calculus, and either like or want to poke fun at web-startup VC type stuff, you can get one here.

I normally try not to wear dorky t-shirts, but this one is just too good to pass up. The Y-combinator, set in the style of the other Y-combinator’s logo. If you like the lambda calculus, and either like or want to poke fun at web-startup VC type stuff, you can get one here.


Mar 17

Fans of Arrested Development will doubtless already be familiar with Gob’s Program - but who would have thought it’d make such a good bit of instructional Haskell, covering infinite lists and lazy evaluation, list comprehensions, pattern matching, as well as Monadic IO and control flow, all in only 10 lines?

import System.Exit
import Control.Monad
main = do
  putStrLn "Gob's Program: Y/N?"
  ans <- getLine
  case ans of
    "Y" -> sequence_ $ map putStr penuses
    _   -> exitSuccess

penuses = ["Penus " | _ <- [1..]]

Feb 10

Monads, Mo’ Problems.

I don’t think I’ve really got anything new to offer that existing monad tutorials don’t, so I’m going to skip the seemingly compulsory step in the life of a fledgling Haskell programmer of writing a tutorial on monads. However, having recently started to properly ‘get’ what monads are and how they work, I thought I might write a little about how I got to that point - sort of like a literature review of the monad tutorials that are already out there; a ‘meta-tutorial’ of some sort.

So, in a few steps, here’s how I got to a point where I’ve started to feel like I properly understand and am comfortable using monads as an abstraction in my code:

  • Prerequisites: This stuff only properly started to fall into place once I had a good grasp of lambda abstractions and list comprehensions or their distant mathematical cousin, set-builder notation. (the latter for reasons that will become clear in a few steps time). Take some time to properly cover the basics first.
  • Read a monad tutorial in the language you use most day-to-day. This helped me understand the definitions of ‘bind’ and ‘return’, but how monads are actually a useful abstraction (and what on earth they have to do with state, IO, and all the other magic they do in Haskell) still eludes me.
  • Spend a while constantly coming across claims that tools I use everyday are actually Monads in disguise. Remain vaguely confused about why, but develop an intuitive sense of how to spot a Monad in the wild in any case.
  • Read Philip Wadler’s ‘Comprehending Monads’ paper (Did I mention that understanding list comprehensions will help you massively?). 
  • Attain a state of blissful Monad nirvana. Realise that Monads aren’t a technique for implementing stateful imperative programming in a pure functional language, but rather that imperative stateful programming is itself a monad.
  • Write a monad tutorial about your discovery.

…and if spurious real-world metaphors are your sort of thing, I found that Burritos were the most useful analogy (seriously, they are).


Feb 8

Testing Sunspot with Test::Unit

So, after singing the praises of Cucumber at LRUG last year, I’ve actually become a bit more ascetic with my testing practices of late, and am using Test::Unit for everything, including acceptance testing (albeit with a layer of contest and stories sugar on top). I could write about why I chose to do this in detail, but it really comes down to one thing, which is ‘speed’. I’ve been programming in Ruby for years, and to be honest, I can spec out a bit of behaviour faster in Ruby than I can in cucumber’s DSL. However, since most of the Rail community seem to be using RSpec and Cucumber, this can make some things a bit of a chore, as with some fairly common or general tasks, you have to go it alone, rather than just grabbing someone else’s solution.

A case in point: I’ve been working on getting full-text search (powered by Sunspot and Solr) working with a site I’m working on. I’ve been really impressed by how easy this is - except for when it came to getting my tests working nicely with it. The problem here is that the Solr service needs to be running in order for the search functionality to work,  but if it’s running for all your tests it slows everything down massively. Clearly, I need a way to only run Solr for those tests that need it, leaving the others to run without it. Since all my test cases are just plain ruby classes (this is Test::Unit), the idiomatic way of doing that seems like a mixin to me, and here it is: 

 module TestSunspot
  class << self
    attr_accessor :pid, :original_session, :stub_session, :server
    
    def setup
      TestSunspot.original_session = Sunspot.session
      Sunspot.session = TestSunspot.stub_session = Sunspot::Rails::StubSessionProxy.new(Sunspot.session) 
    end

  end
  def self.included(klass)
    klass.instance_eval do
      def startup
        Sunspot.session = TestSunspot.original_session
        rd, wr = IO.pipe
        pid = fork do
          STDOUT.reopen(wr)
          STDERR.reopen(wr)
          TestSunspot.server ||= Sunspot::Rails::Server.new
          begin
            TestSunspot.server.run
          ensure
            wr.close
          end
        end
        TestSunspot.pid = pid
        ready = false
        until ready do
          ready = true if rd.gets =~ /Started\ SocketConnector/ 
          sleep 0.5
        end
        rd.close
      end

      def shutdown 
        Sunspot.remove_all!
        Process.kill("HUP",TestSunspot.pid)
        Process.wait
        Sunspot.session = TestSunspot.stub_session
      end
    end
    def teardown
      Sunspot.remove_all!
    end
  end
end

We use this by calling TestSunspot.setup at the time our testsuite starts running, then including TestSunspot into any test cases that require Sunspot’s functionality.

Let’s have a closer look at what’s going on.

TestSunspot.setup saves the original sunspot session, and replaces it with a new stubbed session - this ensures that when CRUD operations on models that are indexed are called, they don’t fail as the server isn’t running. We could of course run the server for every test case, but this’ll slow down our test suite due to all the unnecessary updating of the index that’s going on. It’s not terrifically good practice either. The whole point of Unit tests is that they test Units of your codebase in isolation, and having all these tests depend on this massive cross-cutting concerns is both inelegant and almost certain to cause you headaches.

With that done, we don’t need to worry about sunspot interfering in our other tests, but we still haven’t solved the problem of how to get our tests that are actually concerned with searching to run Sunspot properly.

This is where the self.included method definition comes in. This allows us to execute an arbitrary block of code when the module is mixed into a class, and in our case, we’re using instance_eval to define two new class methods, startup and shutdown, which, as the Test::Unit documentation will tell you, are run once at the beginning of the set of tests defined in that class, and once at the end, respectively. This gives us our hooks to startup and shutdown the sunspot server, which is where things get a bit tricky.

First of all, we need to replace the current sunspot session with the ‘real’ one we stubbed out earlier. In addition, we need to start Sunspot in a different thread or process, as we obviously don’t want it to block our tests from executing. However, it takes a while to start up, and we need to delay executing our tests until it’s ready. So, we fork a new process to run Sunspot (storing the PID so that we can shut it down later), but we redirect it’s STDOUT and STDERR into a pipe. Then, the other end of the process polls the other end of this pipe, waiting for the message “Started SocketConnector” (which is helpfully  output by Sunspot when it’s ready) to appear, at which point it stops blocking and we can run the tests. In our shutdown callback we clear the index, terminate the sunspot process (waiting for it to close), and switch back to our stubbed session, so that any successive tests will run as normal.

This is a bit of a hack (in particular the STDERR redirecting business), but it’s working reliably for me, and offers a clean simple interface to testing search functionality in my application. 


Jan 26

Making happs-tutorial stop whinging about encodings

(I’m reposting this little tip, first posted back in May because I’ve somehow managed to delete it since then, and it appears to have been useful to a few other people. I’m afraid I never did submit that patch.)

So, as of last week, all of us at Harmonypark have started taking 10% time every Friday afternoon to work on our own projects, learn new things, and generally do interesting stuff that will make us better at our work.

This afternoon was my first ten percent time session, and my plan was to work my way through the Happstack tutorial, then to make something interesting for the web in Haskell (I’ve got some idea of what, but want to see how I get on before talking about it). However, for reasons that are unclear to me, my OS (Ubuntu 10.04 Lucid AMD64), version of GHC (6.12.1) and locale settings (en_GB.utf8) all conspired to stop the tutorial server from running, failing with the message:

./introductiontomacid.st: hGetContents: invalid argument (Invalid or incomplete multibyte or wide character)

The way to get around this is to batch convert all the template files to UTF-8 using iconv. As iconv won’t edit in place, this is a three-step operation. First convert all the files, creating new files with the extension ‘.utf8’, then delete all the old ones, then move the new ones into their place. To accomplish this, run these three commands from inside your checked-out happs-tutorial darcs repository:

find .  -name '*.st' -exec sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.utf8"' -- {} \;

find . -name '*.st' -exec rm {} \;

find . -name '*.utf8' -exec sh -c 'echo "$1" | sed s/\.utf8// | xargs mv "$1"' -- {} \;

This wasn’t all wasted time though, while I haven’t learnt much about Haskell and Happstack, I’ve got to grips with ‘find’ and learnt about some interesting bits of the shell I didn’t know about before. I’m gonna make a patch and submit it back to the maintainers shortly (I’m getting back to actually doing the tutorial for the rest of the afternoon instead), but hopefully this will help out anyone in the meantime.


Jan 2

Epic win: why surjective Set-functions are Epimorphisms.

I’ve been (slowly) working my way through Benjamin Pierce’s Basic Category Theory for Computer Scientists recently, with the imediate aim of understanding enough category theory to get to grips with David Spivak’s Simplical Databases paper, as well as for my general edification. It’s a really good, clear introduction to the subject, and is structured in such a way that it’s possible to pick it up for fifteen minutes and get to grips with a single concept (defined, explained and proved in a few paragraphs) at a time. However, I came a bit unstuck when I got to the assertion that the Epimorphic arrows in Set are the surjective total functions. It’s actually very simple to prove this, using the same reductio ad absurdum strategy as Pierce uses to prove that the Monomorphic Set-arrows are the injective total functions, however the proof is left out of the book:

By definition, an arrow $ f: A \longmapsto B $ is Epimorphic if, for any pair of arrows $ g:B \longmapsto C $ and $ h: B \longmapsto C $: $ g \circ f = h \circ f $ implies that $ g = h $:

$$ \exists f: \forall g, h: g \circ f = h \circ f \Longrightarrow g = h $$

So, let’s assume the opposite. Let $ f: A \longmapsto B $ be a surjective function, but let g and h be functions $ g: B \longmapsto C $ and $ h: B \longmapsto C $, such that $ g \circ f = h \circ f $ but $ g \neq h $:

$$ \exists f: \forall g, h: g \circ f = h \circ f \Longrightarrow g \neq h $$

What the last clause here stipulates is:

$$ \exists b \in B: g(b) \neq h(b) $$

Now, by the definition of surjectivity, there must be an element of A that f maps to this b, because f completely covers its codomain:

$$ \exists a \in A: f(a) = b $$

Putting these two together, we have the following:

$$ \exists a \in A: g(f(a)) \neq h(f(a)) $$

or alternatively:

$$ \exists a \in A: g \circ f(a) \neq h \circ f(a) $$

…which directly contradicts our original implication:

$$ \exists f: \forall g, h: g \circ f = h \circ f \Longrightarrow g \neq h $$

… Therefore, the surjective functions must form the Epimorphisms within Set!


Page 1 of 72