Sites, Applications, Solutions since 1995

Psyphi Blog v5

Picking up momentum
Posted by rmp at 22:20 4th Jul 2007

It seems people are fairly taken with the BarCamb idea. It's been lightly advertised internally at Sanger and has been picking up some interest via that and also on the Upcoming page (http://upcoming.yahoo.com/event/208327/).

I wonder how many of the people already signed up actually have something to present. Having been at the WTSI for nearly eight years now I've a number of things I could talk about, it's just a case of deciding which of them would be more interesting for people and that really depends on where attendees are coming from.

So... one or more of the following things I've been working on recently: Bio::Das::Lite & Bio::Das::ProServer, ClearPress or the new sequencing technology (NST). Now I'm not a biologist or a chemist, either by trade or by hobby, and I'm pretty certain that talking about NST is going to be asking for a whole bunch of biology- and chemistry-question trouble. I guess DAS-related things are the most useful to present as they have the widest scientific application.

Though there's nothing like a good bit of self-promotion so maybe something short on ClearPress would be a good thing too. Might need to improve the application builder and test-suite a bit more for that.

In related news, not wanting to be outdone by Matt's BarCamb, I coauthored and submitted a venue proposal for YAPC::Europe 2008 last week. Woohoo! Nail-biting stuff. The genome campus would be a great place to host it for all sorts of reasons: an integrated and well-supported conference centre, secured financial commitment, great science to talk about and a tremendous Perl resource to tap into, just to list a few.

All I need to do now is submit my travel application for YAPC::Europe Vienna later this year and see how it's done (again). It's been a while since I've been to a YAPC::Europe!

DECIPHERing Large-Scale Copy Number Variations
Posted by rmp at 22:09 24th Sep 2007

It's strange... Since moving from the core Web Team at Sanger to Sequencing Informatics I've been able to reduce my working hours from ~70-80/week all the way down to the 48.5 hours which are actually in my contract.

In theory this means I've more spare time, but in reality I've been able to secure sensible contract work outside Rentacoder which I've relied on in the past.

The work in question is optimising and refactoring for the DECIPHER project http://decipher.sanger.ac.uk/ which I used to manage the technical side of whilst in the web team.

DECIPHER is a database of large-scale copy number variations (CNVs) from patient arrayCGH data curated by clinicians and cytogeneticists around the world. DECIPHER represents one of the first clinical applications to come out of the HGP data from Sanger.

What's exciting, apart from the medical implications of DECIPHER's joined-up thinking, is that it also represents a valuable model for social, clinical applications in the Web 2.0 world. The application draws in data from various external sources as well as its own curated database. It primarily uses DAS (http://biodas.org/) via Bio::Das::Lite and Bio::Das::ProServer and I'm now working on improving interfaces, interactivity and speed by leveraging MVC and SOA techniques with ClearPress and Prototype.

It's a great opportunity for me to keep contributing to one of my favourite projects and hopefully implement a load of really neat features I've wanted to add for a long time. Stay tuned...

Hiring Perl Developers - how hard can it be?
Posted by rmp at 21:27 28th Sep 2007

All the roles I've had during my time at Sanger have more or less required the development of production quality Perl code, usually OO and increasingly using MVC patterns. Why is it then that very nearly every Perl developer I've interviewed in the past 8 years is woefully lacking, specifically in OO Perl but more generally in half-decent programming skills?

It's been astonishing, not in a good way, how many have been unable to demonstrate use of hashes. Some have been too scared of them (their words, not mine) and some have never felt the need. For those of you who aren't Perl programmers, hashes (aka associative arrays) are a pretty crucial feature of the language and fundamental to its OO implementation.
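
For anyone who hasn't met them, here's a minimal sketch of the sort of thing I'd hope an interviewee could write. It shows a plain hash and then the classic OO idiom of an object as a blessed hash reference. The Run package, its fields and the numbers are all made up purely for illustration.

```perl
use strict;
use warnings;

# A plain hash (associative array): keys map to values.
# The run ids and cycle counts are invented for this example.
my %cycles_for_run = (229 => 37, 230 => 36, 231 => 34);
$cycles_for_run{232} = 12;    # add another key
my @runs = sort { $a <=> $b } keys %cycles_for_run;

# Classic Perl OO: an object is usually a blessed hash reference.
# 'Run' is a hypothetical class, not part of any real codebase.
package Run;

sub new {
  my ($class, %args) = @_;
  my $self = { id_run => $args{id_run}, cycle => $args{cycle} || 0 };
  return bless $self, $class;
}

# Accessors simply read fields out of the underlying hash.
sub id_run { my ($self) = @_; return $self->{id_run}; }
sub cycle  { my ($self) = @_; return $self->{cycle}; }

package main;

my $run = Run->new(id_run => 231, cycle => 34);
printf "run %d is at cycle %d (%d runs known)\n",
  $run->id_run, $run->cycle, scalar @runs;
```

Nothing clever going on, which is rather the point: the object's state lives in a hash, so if hashes are scary then OO Perl isn't going to go well.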

Now I program in Perl sometimes more than 7-8 hours a day. For many years this also involved reworking other people's code. I can very easily say that if you claim to be a Perl programmer and have never used hashes then you're not going to get a Perl-related job because of your technical skills. With a good, interactive and engaging personality and a desire for self-improvement you might get away with it, but certainly not on technical merit.

It's also quite worrying how many of these interviewees are unable to describe the basics of object-oriented programming yet have, for example, developed and sold a commercial ERP system, presumably for big bucks. Man, these people must have awesome marketing!

Frankly a number of the bioinformaticians already working there have similar skills to the interviewees and often worse communication skills, so maybe I'm simply setting my standards too high.

I really hope this situation improves when Perl 6 goes public though I'm sure it'll take longer to become common parlance. As long as it happens before those smug RoR types take over the world I'll be happy ;)

7 utilities for improving application quality in Perl
Posted by rmp at 23:10 8th Oct 2007

I'd like to share with you a list of what are probably my top utilities for improving code quality (style, documentation, testing) with a largely Perl flavour. In loosely important-but-dull to exciting-and-weird order...

Test::More. Billed as yet another framework for writing test scripts, Test::More extends Test::Simple and provides a bunch of more useful functions beyond Simple's ok(). The ones I use most are use_ok() for testing compilation, is() for testing equality and like() for testing similarity with regexes.

ExtUtils::MakeMaker. Another one of Mike Schwern's babies, MakeMaker is used to set up a folder structure and associated 'make' paraphernalia when first embarking on writing a module or application. Although developers these days tend to favour Module::Build, I still prefer MakeMaker for some reason (probably fear of change) and get regular mileage out of it.

Test::Pod::Coverage - what a great module! Check how good your documentation coverage is with respect to the code. No, just a subroutine header won't do! I tend to use Test::Pod::Coverage as part of...

Test::Distribution. Automatically run a battery of standard tests including pod coverage, manifest integrity, straight compilation and a load of other important things.

perlcritic, Test::Perl::Critic. The Perl::Critic set of tools is amazing. It's built on PPI and implements the Perl Best Practices book by Damian Conway. Now I realise that not everyone agrees with a lot of what Damian says but the point is that it represents a standard to work to (and it's not that bad once you're used to it). Since I discovered perlcritic I've been developing all my code as close to perlcritic -1 (the most severe) as I can. It's almost instantly made my applications more readable through systematic appearance and made faults easier to spot even before Test::Perl::Critic comes in.

Devel::Cover. I'm almost ashamed to say I only discovered this last week after dipping into Ian Langworth and chromatic's book 'Perl Testing'. Devel::Cover gives code exercise metrics, i.e. how much of your module or application was actually executed by your tests. It collates stats from all modules matching a user-specified pattern and dumps them out in a natty coloured table, very suitable for tying into your CI system.

Selenium. Ok, not strictly speaking a tool I'm using right this minute but it's next on my list of integration tools. Selenium is a non-interactive, automated, browser-testing framework written in JavaScript. This tool definitely has legs and it seems to have come a long way since I first found it in the middle of 2006. I'm hoping to have automated interface testing up and running before the end of the year as part of the Perl CI system I'm planning on putting together for the new sequencing pipeline.
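
Going back to the first item on the list, here's a small sketch of the three Test::More functions I mentioned, run as a standalone test script. It deliberately sticks to List::Util, which ships with Perl, so there's nothing to install. The values being tested are invented for the example.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# use_ok(): does the module compile and load?
use_ok('List::Util');

# is(): exact equality, with a readable diagnostic on failure.
is(List::Util::max(1, 37, 34), 37, 'max() picks out the largest value');

# like(): match a value against a regex.
like('run 231, cycle 34', qr/cycle\s+\d+/, 'string mentions a cycle number');
```

Save it as something like t/00-basics.t and run it with prove or plain perl. The file name is only a suggestion.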

Great pieces of code
Posted by rmp at 15:25 3rd Feb 2008

A lot of what I do day-to-day is related to optimisation. Be it Perl code, SQL queries, Javascript or HTML there are usually at least a couple of cracking examples I find every week. On Friday I came across this:

SELECT cycle FROM goldcrest WHERE id_run = ?


This query is being used to find the number of the latest cycle (between 1 and 37 for each id_run) in a near-real-time tracking system and is used several times whenever a run report is viewed.

EXPLAIN SELECT cycle FROM goldcrest WHERE id_run = 231;
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref   | rows   | Extra       |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
|  1 | SIMPLE      | goldcrest | ref  | g_idrun       | g_idrun |       8 | const | 262792 | Using where | 
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+


In itself this would be fine but the goldcrest table in this instance contains several thousand rows for each id_run. So for id_run 231, say, this query happens to return approximately 588,000 rows to determine that the latest cycle for run 231 is number 34.

To clean this up we first try something like this:

SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = ?


which still scans the 588,000 rows (keyed on id_run, incidentally) but doesn't actually return them to the user, only one row containing both values we're interested in. Fair enough, the CPU and disk access penalties are similar but the data transfer penalty is significantly reduced.

Next I try adding an index against the id_run and cycle columns:

ALTER TABLE goldcrest ADD INDEX(id_run,cycle);
Query OK, 37589514 rows affected (23 min 6.17 sec)
Records: 37589514  Duplicates: 0  Warnings: 0


Now this of course takes a long time and, because the tuples are fairly redundant, creates a relatively inefficient index, also penalising future INSERTs. However, casually ignoring those facts, our query performance is now radically different:

EXPLAIN SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL |    NULL | NULL | NULL | Select tables optimized away | 
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
+------------+------------+
| MIN(cycle) | MAX(cycle) |
+------------+------------+
|          1 |         37 | 
+------------+------------+
1 row in set (0.01 sec)


That looks a lot better to me now!
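
Why does the composite index help so much? Here's a rough toy model in Perl. To be clear, this isn't how MySQL implements it internally, just the idea as I understand it: with entries ordered by (id_run, cycle), the minimum and maximum cycle for a run sit at the two ends of that run's slice, so two lookups replace a scan of every row. The data and sub names below are made up.

```perl
use strict;
use warnings;

# Toy stand-in for an index on (id_run, cycle): entries held in sorted
# order, id_run first and cycle second. The rows are invented.
my @index;
for my $run (229 .. 233) {
  push @index, map { [ $run, $_ ] } 1 .. 37;
}

# Binary search: position of the first entry whose id_run is >= $run.
sub lower_bound {
  my ($idx, $run) = @_;
  my ($lo, $hi) = (0, scalar @{$idx});
  while ($lo < $hi) {
    my $mid = int(($lo + $hi) / 2);
    if ($idx->[$mid][0] < $run) { $lo = $mid + 1 } else { $hi = $mid }
  }
  return $lo;
}

# MIN(cycle) and MAX(cycle) are the first and last entries of the
# run's slice, found without touching anything in between.
sub min_max_cycle {
  my ($idx, $run) = @_;
  my $first = lower_bound($idx, $run);
  my $last  = lower_bound($idx, $run + 1) - 1;
  return if $first > $last;    # no entries for this run
  return ($idx->[$first][1], $idx->[$last][1]);
}

my ($min, $max) = min_max_cycle(\@index, 231);
print "min=$min max=$max\n";
```

The cost of each lookup grows with the logarithm of the table size rather than with the number of rows per run, which fits with the drop to 0.01 sec above.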

Generally I try to steer clear of the mysterious internal workings of database engines, but with much greater frequency I come across examples like this:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;
  my %clone_type;
  if($clone and $clone_type) {
    $clone_type{$clone} = $clone_type;
    return $clone_type{$clone};
  }
  return;
}


Thankfully this one's pretty quick to figure out - they're usually *much* more convoluted, but still.. Huh??

Pass in a clone_type scalar, create a local hash with the same name (Argh!), store the clone_type scalar in the hash keyed at position $clone, then return the same value we just stored.

I don't get it... maybe a global hash or something else would make sense, but this works out the same:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;
  if($clone and $clone_type) {
    return $clone_type;
  }
  return;
}

and I'm still not sure why you'd want to do that if you have the values on the way in already.
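
If you don't want to take my word for it, here's a quick sanity check that the two versions agree. It's only a sketch: I've renamed the subs so both can live in one file (the names are mine) and the inputs are an arbitrary handful of true and false values.

```perl
use strict;
use warnings;

# The sub as found, renamed for the comparison.
sub clone_type_original {
  my ($self, $clone_type, $clone) = @_;
  my %clone_type;
  if($clone and $clone_type) {
    $clone_type{$clone} = $clone_type;
    return $clone_type{$clone};
  }
  return;
}

# The cut-down version, also renamed.
sub clone_type_simplified {
  my ($self, $clone_type, $clone) = @_;
  if($clone and $clone_type) {
    return $clone_type;
  }
  return;
}

# Arbitrary ($clone_type, $clone) pairs covering true and false inputs.
my @cases = (['bac', 'c1'], ['bac', undef], [undef, 'c1'],
             ['', 'c1'], ['bac', 0], [0, 0]);

my $mismatches = 0;
for my $case (@cases) {
  my $old  = clone_type_original(undef, @{$case});
  my $new  = clone_type_simplified(undef, @{$case});
  my $same = (!defined $old && !defined $new)
          || (defined $old && defined $new && $old eq $new);
  $mismatches++ if !$same;
}
print "mismatches: $mismatches\n";
```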

Programmers really need to think around the problem, not just through it. Thinking through may result in functionality, but thinking around results in both function and performance, which means a whole lot more in my book and is, incidentally, why it seems so hard to hire good programmers.

ClearPress-99
Posted by rmp at 22:10 3rd Mar 2008

Last week saw the latest release of ClearPress (http://search.cpan.org/~rpettett/ClearPress/). ClearPress is a basic, RESTful, MVC Perl application framework I've developed in tandem with my work at the Sanger Institute (http://www.sanger.ac.uk/).

The original aim of ClearPress was to provide a RESTful MVC framework which integrated with the Sanger website's single sign-on. Having proved its usefulness with the first release of the tracking system I developed, ClearPress was spun off into a project of its own, with its dependencies abstracted out of the Sanger-specific environment.

ClearPress sports a MySQL-backed ORM, automatic, extensible content-negotiation and easily-templated HTML, XML, Atom, RSS, JSON, iCal, YAML, PNG and other format views. It can run standalone, as CGI or under ModPerl::Registry.

I'm using ClearPress in most of my projects these days, both work and non-work. Blogs, document management, laboratory tracking and various other standalone apps. Hopefully soon there'll even be a dedicated site together with examples. For now you can check out the application-builder and example distributed with the package.

Web frameworking
Posted by rmp at 23:47 31st Mar 2008

It seems to be the wrong time to be reading such things, but over on InfoQ there's a nice article introducing web development of RESTful services using Erlang and the Yaws high performance web server.

I say "the wrong time" as this week has kicked off the "Advancing with Rails" course by David A. Black of Ruby Power and Light fame. The course is fairly advanced in terms of required Rails knowledge so it's a bit of a baptism by fire for me and a few others, having never written any Ruby before.

Rails is proving moderately easy to pick up but, as I've remarked to a couple of people, it doesn't seem any easier coding with Rails than with Perl. Perhaps it's because I've never done it before but I reckon it's a lot harder spending my time figuring out what the heck DHH intended something to do than it is doing it myself.

Even though it's nowhere near as mature, I do reckon my ClearPress framework has a lot going for it - it's pretty feature-complete in terms of ORM, views and templating (TT2). It has similar convention-over-configuration features, meaning it's not designed for plugging in alternative layers, but it is absolutely possible to do (and I suspect without as much effort as is required in Rails). I still need to iron out some wrinkles in the autogenerated code from the application builder and provide some default authorisation and authentication mechanisms, some of which may come in the next release. But in the meantime it's easy to add these features, which is exactly what we've done for the new sequencing run tracking app, NPG, to tie it to the WTSI website single sign-on (MySQL and LDAP under the hood).


ClearPress-146
Posted by rmp at 23:15 29th Apr 2008

Latest release of ClearPress (v146) out to the CPAN yesterday. The ClearPress data model now implements belongs_to_through, belongs_to, has_many and has_many_through entity relationships for all you ActiveRecord lovers.

Two ClearPress-derived projects are using a half-decent test fixture system. It's really making a big difference to the development of both DECIPHER and NPG so I'm planning to bundle what can be bundled with an upcoming release.
