
Google Refine + Perl (English)

(Portuguese version here)

Google Refine is awesome. If you're unaware of what it is, visit its official page and watch at least the first screencast. You'll see it can be helpful for several ETL-related tasks.

Currently, I use it a lot, especially for simple (but boring) tasks, like loading a CSV, trimming out some outliers and saving it as JSON to be imported into MongoDB. Nothing a Perl one-liner couldn't do (something like the sketch below).
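For reference, here is a minimal sketch of the kind of one-liner I mean, assuming Text::CSV and JSON::PP are installed; the value column and its cut-off are made-up placeholders, not part of my real dataset:

perl -MText::CSV -MJSON::PP -E '
    my $csv = Text::CSV->new({ binary => 1 });
    open my $fh, "<", shift or die $!;
    my @head = @{ $csv->getline($fh) };          # header row
    while (my $row = $csv->getline($fh)) {
        my %rec; @rec{@head} = @$row;
        next if $rec{value} > 1000;              # hypothetical outlier cut-off
        say encode_json(\%rec);                  # one JSON document per line
    }' input.csv > records.json

The output is newline-delimited JSON, which mongoimport accepts directly.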

However, the opposite is not true: Perl one-liners are a lot more flexible than Google Refine. Now, what if we could combine the two?

  1. Google Refine can be easily integrated with any RESTful API.
  2. Perl can turn a one-liner into a RESTful webservice.
  3. PROFIT!!!

As a practical example, I'll use some georeferenced data I was working with. Let's suppose I have to deduplicate records, and one of the "duplicate" rules is their proximity on the map. Google Refine is far from a full-featured GIS and is unable to handle a two-dimensional coordinate system. Enter GeoDNA: an algorithm that reduces the dimensionality of geospatial data. As its FAQ says,

GeoDNA is a way to represent a latitude/longitude coordinate pair as a string. That sounds simple enough, but it's a special string format: the longer it is, the more accurate it is. More importantly, each string uniquely defines a region of the earth's surface, so in general, GeoDNA codes with similar prefixes are located near each other. This can be used to perform proximity searching using only string comparisons (like the SQL "LIKE" operator).

Another interesting property of GeoDNA is that when ordering a set of records by their GeoDNA codes, close locations are likely to appear in adjacent rows (sometimes close locations end up with very different prefixes, but similar prefixes always represent close locations).
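Just to make the prefix idea concrete, here is a tiny sketch that reuses the same encode_geo_dna() call as the webservice we'll build below; the coordinates are arbitrary examples:

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Geo::DNA qw(encode_geo_dna);

# Two points a few kilometres apart and one far away.
my $sp1 = encode_geo_dna(-23.5505, -46.6333);   # São Paulo, city centre
my $sp2 = encode_geo_dna(-23.5614, -46.6560);   # still in São Paulo
my $ny  = encode_geo_dna( 40.7128, -74.0060);   # New York

say for $sp1, $sp2, $ny;
# The two São Paulo codes should share a much longer common prefix
# than either does with the New York code.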

To incorporate GeoDNA into Google Refine, we'll use the Add column by fetching URLs option, reached by clicking the header of any column (it doesn't matter which one, since the expression references both coordinate columns anyway):

As the expression, we'll paste the following code (pay attention to the latitude/longitude column names, which here are assumed to be latitude and longitude):

'http://127.0.0.1:3000/?lat='+
row.cells['latitude'].value
+'&lon='+
row.cells['longitude'].value

Throttle delay can be zeroed, as our webservice is local. The final configuration should look like this (don't push the OK button yet):

Now, check that you have the Mojolicious and Geo::DNA Perl modules (install them from CPAN if not) and paste this into your terminal:

perl -MGeo::DNA -Mojo -E 'a("/"=>sub{my$s=shift;$s->render(json=>{geocode=>Geo::DNA::encode_geo_dna($s->param("lat"),$s->param("lon"))})})->start' daemon

If you prefer a "human-readable" version, save the following code as geocode-webservice.pl and start it with perl geocode-webservice.pl daemon:

#!/usr/bin/env perl
use Geo::DNA qw(encode_geo_dna);
use Mojolicious::Lite;    # also enables strict and warnings

# Answer any request to / with the GeoDNA code of the lat/lon
# passed as query parameters.
any '/' => sub {
    my $self = shift;
    $self->render(json => {
        geocode => encode_geo_dna(
            $self->param('lat'),
            $self->param('lon'),
        ),
    });
};

app->start;
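Either way, once the server is running you can sanity-check the endpoint from another terminal with one more ojo one-liner (the coordinates are just an example):

perl -Mojo -E 'say g("http://127.0.0.1:3000/?lat=-23.5505&lon=-46.6333")->body'

It should print a small JSON document containing the geocode key.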

The webservice reports Server available at http://127.0.0.1:3000 when it starts. Now, click OK in the Google Refine dialog and wait. Even with no delay it can be a bit slow; even so, this hack saved me a lot of time 😉

Comments (2) Trackbacks (0)
  1. Hey!

    I just stumbled across this. I’m the author of Geo::DNA. It’s so cool to see it in use like this!

    One tip: what you say about Geo::DNA codes with the same prefix representing close locations is totally true, and as I mention on the site, it can sometimes be tricky when your datapoints cross the boundaries of the bounding boxes – but you can easily get around that using the neighbours_within_radius() function to get all the codes within an area – any codes with those radius codes as their prefixes *are* locations nearby.

    There’s an interactive JS demo of this feature on http://www.geodna.org if you haven’t already seen it.

    Cheers and thanks for using it!

    Kyle

  2. Is GeoDNA any faster than simply picking a bounding box on lat/lon and finding all entries that fall in the min/max lat/lon ranges? Trig doesn’t seem to be a problem for most applications, but if you’re really math-averse, you can still pick your own “rectangles” of any size you like and just scan the lists. Such an approach has the advantage of having just one “disconnect” – the prime meridian – instead of disconnects at many levels.


