There are many solutions for distributed batch processing, with Hadoop being the best known. However, in this article I will present a much simpler and equally stable approach which might be enough for simple tasks where you:

  • don't operate on big data
  • don't need data warehouse capabilities

GNU Parallel

Some time ago I discovered GNU parallel. It allows you to run any of your scripts in parallel. It will:

  • Collect your input
  • Execute jobs on any list of servers
  • Balance the load according to the number of cores and computation time
  • Collect results on standard output, taking care that they don't get mixed

It is really simple and powerful.

You can also run it locally. The following example downloads the Hadoop and Parallel websites in parallel. {} is used as a placeholder that gets replaced with each of the parameters specified after :::.

parallel wget {} 2>/dev/null ::: 'http://hadoop.apache.org/' 'http://www.gnu.org/software/parallel/'

Problem

I currently work in the advertising industry at Fyber, so I chose an example from my domain.

Let's imagine you have a file with a list of <user_id>:<offer_id> pairs. Your goal is to generate a list of how many times users clicked offer A after clicking offer B. As a result you expect a list of <offerA_id>-<offerB_id>:<count> entries. For the purpose of this article I created a naive solution to this problem. It consists of 3 scripts:

  1. divide.rb - divides the input file into smaller files based on user ids. [link] Complexity: O(n)
  2. aggregate.rb - does the main computation part. Complexity: O(n^2)
  3. collect.rb - combines the results from the computation. Complexity: O(n * log n)

(2) is the part that I want to run in parallel. In this article I won't cover how this algorithm actually works.
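
To make the formats concrete, here is a hypothetical fragment of an input file (the ids are made up), where each line represents one click of a user on an offer:

1:100
1:101
2:100
2:101

Under one plausible reading, both users clicked offer 101 after offer 100, so the corresponding output line would be 101-100:2; the exact counting semantics are those implemented in aggregate.rb.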

Solution with GNU Parallel

My input file with the initial list is called clicks_10_000_000.txt. I will perform the processing with this script:

#!/bin/bash

INPUT_FILE=$1
NUMBER_OF_JOBS=$2

rm -f input_*.txt # Remove old input files

# List of servers on which jobs will be run. ':' denotes the local machine.
SERVERS=":,192.168.1.10,sandbox2.carriles.pl"

for server in ${SERVERS//,/ }; do
  [ "$server" = ":" ] && continue # Nothing to copy for the local machine
  scp aggregate.rb "$server:" # Transfer the job script to each server
done

ruby divide.rb $INPUT_FILE $NUMBER_OF_JOBS \
&& parallel \
    --eta \
    --transfer \
    -S $SERVERS \
    ruby aggregate.rb \
    ::: input_*.txt \
| ruby collect.rb \
| sort --field-separator=':' --numeric-sort --key=2 \
> output_parallel.txt

What is what

Firstly, I need to transfer the script to all servers on which I want to run the computation. All dependencies of the script must already be installed there (in my case Ruby).

for server in ${SERVERS//,/ }; do
  [ "$server" = ":" ] && continue # Nothing to copy for the local machine
  scp aggregate.rb "$server:" # Transfer the script to each server
done
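
As a quick sanity check (not part of the original setup), parallel's --nonall option runs a command once on every server without feeding it any arguments, which is handy for verifying that Ruby is available everywhere:

parallel --nonall -S $SERVERS ruby --version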

Secondly, I prepare the set of input files. The divide.rb script will create a series of input_*.txt files with clusters of users taken from $INPUT_FILE.

ruby divide.rb $INPUT_FILE $NUMBER_OF_JOBS

The fun starts here: parallel itself is executed.

parallel \
  --eta \
  --transfer \
  -S $SERVERS \
  ruby aggregate.rb \
  ::: input_*.txt

When this command runs:

  1. an ssh connection to each server is established and the number of cores on each server is detected
  2. jobs are run on the servers while the input files are transferred to them
  3. as soon as a job finishes, its result is sent to standard output

Used options:

  • --eta makes parallel display progress information
  • --transfer makes parallel upload the input files to the servers
  • -S specifies the list of servers on which the jobs should be executed
  • ruby aggregate.rb is the command to execute on the remote servers
  • ::: <list of files> lists the input files that should be processed, one per job, by the aggregate.rb script
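
One detail worth knowing: --transfer leaves the copied input files on the remote machines. If that is not desired, it can be combined with --cleanup, which removes the transferred files once a job finishes (a variant not used in my script):

parallel --eta --transfer --cleanup -S $SERVERS ruby aggregate.rb ::: input_*.txt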

As the last step I need to re-aggregate the results from all jobs, as I am interested in the final counts and not the per-cluster ones. collect.rb receives all cluster results on standard input.

ruby collect.rb

I sort the results to get nicer output:

sort --field-separator=':' --numeric-sort --key=2

A nice feature of parallel is that the output of each job is sent to standard output as soon as the job finishes, yet outputs from different jobs never get mixed.

The load is distributed correctly: each server gets a new job only if it has a free "CPU slot".
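
The number of slots per server defaults to the number of detected CPU cores, but it can be overridden. For example, -j limits how many jobs run simultaneously on each machine (an illustrative variant, not used in my script):

parallel -j 2 --transfer -S $SERVERS ruby aggregate.rb ::: input_*.txt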

Because parallel writes to standard output, I can use the pipe operator to run even more computations in parallel. Parallel can also accept input on standard input, which makes it possible to run several parallel processes that pipe information to each other.
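
For instance, when no ::: list is given, parallel reads its arguments from standard input, so the file list could just as well be piped in (a minimal sketch, not taken from the original scripts):

ls input_*.txt | parallel --transfer -S $SERVERS ruby aggregate.rb | ruby collect.rb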

Showtime

In my benchmark I used my desktop (4 cores), a laptop (4 cores, LAN connection) and a remote server (8 cores, DSL connection). The parallel script was executed on the desktop.

Let's first see a run without parallel. I execute aggregate.rb on the whole data set before clustering; this way I don't have to re-aggregate.

➜  parallel git:(master) ✗ time ruby aggregate.rb clicks_10_000_000.txt | sort --field-separator=':' --numeric-sort --key=2 > output_non_parallel.txt
ruby aggregate.rb clicks_10_000_000.txt  1321,47s user 1,27s system 99% cpu 22:03,42 total
sort --field-separator=':' --numeric-sort --key=2 > output_non_parallel.txt  0,09s user 0,01s system 0% cpu 22:03,50 total

With parallel

➜  parallel git:(master) ✗ time ./run.sh clicks_10_000_000.txt 500

Computers / CPU cores / Max jobs to run
1:192.168.1.10 / 4 / 4
2:local / 4 / 4
3:sandbox2.carriles.pl / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s 0left 0.40avg  192.168.1.10:0/85/17%/2.4s  local:0/232/46%/0.9s  sandbox2.carriles.pl:0/183/36%/1.1s    
./run.sh clicks_10_000_000.txt 500  728,61s user 33,93s system 369% cpu 3:26,53 total

Does it work?

➜  parallel git:(master) ✗ diff -s output_parallel.txt output_non_parallel.txt
Files output_parallel.txt and output_non_parallel.txt are identical

It works! The point of this article is not to provide an accurate benchmark; still, you can see the expected speedup of roughly 6x.

Conclusion

I find GNU parallel very useful in my professional life. A typical use case for me is ad hoc information retrieval from log files. I usually use a group of testing servers that stay idle nearly all the time.

The first selling point for me is that it has almost no dependencies. Parallel is written in Perl, and in all cases so far I managed to run it just by copying the parallel script from my desktop to the server, without any installation.

Secondly, its simplicity is mind-blowing. Parallel ships with a tutorial (man parallel_tutorial) that you can go through in an hour. The majority of use cases can be implemented as one-liners; no scripts or configuration are needed. What is more, your jobs can be written in any language, as parallel uses the shell to execute them, and the whole interaction with jobs happens through shell arguments and standard input / output.
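
To illustrate the kind of one-liner I mean, the following hypothetical command (path and pattern are made up) counts occurrences of a string in a log file on each of the idle servers, using --nonall to run it once per machine:

parallel --nonall -S $SERVERS 'grep -c "offer_id=1234" /var/log/app/production.log'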

Published on 11/08/2014 at 05h26.

Last week I wrote my first application using Backbone.js and Compojure. Compojure is a web framework written in Clojure.

My application, whose code name is Burning Ideas, is very simple: it does not even support user accounts or sessions. It allows multiple users to brainstorm together and rate their ideas. The rating of ideas decreases with time. All users are brainstorming together (for now ;)). The code base is available at github. I will write about some conclusions from this not yet finished project.

My first attempt was to use SproutCore. After writing some frontend code I gave up, as I found it quite inconvenient to use my custom HTML markup there. I switched to Backbone.js, which is very simple, clear and understandable, at least for somebody with my background. Out of the box it talks to RESTful JSON APIs, which are easily implementable in the majority of modern web frameworks. It uses jQuery, which I also find very cool.

Compojure is also very simple and clear. It is more Sinatra than Rails, but it does its job. For development I used Leiningen with some plugins, which bootstraps a Clojure project and lets me perform various tasks on it. We can run our server with a simple command:
lein ring server
What amazed me is that it can also prepare a war file automatically:
lein ring uberwar
This allowed me to deploy the application to CloudBees in a few clicks. As I prefer Sass, Haml and CoffeeScript, I used Sprockets to compile and serve my assets in development mode. I found myself quite productive with the Compojure / Backbone combination and will definitely use it more in the future. Feel free to submit any feedback or write about your experience with Backbone.js and Compojure.

Published on 30/01/2012 at 17h46.

While working on my current project I use Bundler. There are many resources on the web which say how to use it together with Capistrano.

Most of them, however, fail when it comes to:

  • rolling back a Capistrano transaction, as well as
  • the Capistrano deploy:rollback task.

Let's say you already use a solution similar to this one.

namespace :bundler do
  task :create_symlink, :roles => :app do
    shared_dir = File.join(shared_path, 'bundle')
    release_dir = File.join(current_release, '.bundle')
    run("mkdir -p #{shared_dir} && ln -s #{shared_dir} #{release_dir}")
  end
 
  task :bundle_new_release, :roles => :app do
    bundler.create_symlink
    run "cd #{release_path} && bundle install --without test"
  end
end

1. Rollback Capistrano transaction

The default deployment path includes a transaction with the update_code and symlink tasks.

task :update do
  transaction do
    update_code
    symlink
  end
end

If you hook the bundle_new_release task before these tasks, or within the transaction, and do not provide a rollback action for the bundle install, you will get into trouble.

If any task that runs after bundle_new_release fails, you will end up with a mismatch between the required and installed bundle versions.

Why bother? Maybe update_code and symlink are not the most likely tasks to fail, but you probably already have plenty of hooks placed before and after them which may fail…

2. Rollback Capistrano task

You can roll back the last Capistrano deployment by simply running

cap deploy:rollback

Simple, huh?

Unfortunately, if you made any changes to your bundle in the just-deployed commits, they will not be reverted, which will make your application fail to restart.

So now it is time for the solution.

namespace :bundler do
  task :create_symlink, :roles => :app do
    shared_dir = File.join(shared_path, 'bundle')
    release_dir = File.join(release_path, '.bundle')
    run("mkdir -p #{shared_dir} && ln -s #{shared_dir} #{release_dir}")
  end

  task :install, :roles => :app do
    run "cd #{release_path} && bundle install"

    on_rollback do
      if previous_release
        run "cd #{previous_release} && bundle install"
      else
        logger.important "no previous release to rollback to, rollback of bundler:install skipped"
      end
    end
  end

  task :bundle_new_release, :roles => :db do
    bundler.create_symlink
    bundler.install
  end
end

after "deploy:rollback:revision", "bundler:install"
after "deploy:update_code", "bundler:bundle_new_release"

Here are some points to notice:

  • it is useful to put the bundle_new_release task within the transaction
  • provide code to run on the transaction rollback (on_rollback)
  • provide code to run on the Capistrano deploy:rollback task

As a hint, I can also say that I find it useful to hook my BackgrounDRb restart task into the transaction in the default deployment path.

This way I can be sure that when the transaction has finished, the bundle is installed correctly and the application boots up. This is important, as I use Unicorn and restart it by sending the USR2 signal, which makes Unicorn die silently if the application cannot boot…

Published on 04/08/2010 at 14h58.

I was very surprised to see that my model was missing some of its attributes after the first request in development mode.

The error occurred while evaluating nil.include?

With the backtrace ending with:

/var/lib/gems/1.8/gems/activerecord-2.3.2/lib/active_record/attribute_methods.rb:142:in 'create_time_zone_conversion_attribute?'

Such magic… ;-) Here is the recipe to reproduce it. Place this in your controller:

def index
  @posts = Rails.cache.fetch(:all_posts) { Post.all }
end

Go to your view page and press refresh.

The reason is that ActiveRecord stores some information in so-called class inheritable attributes. These are stored on your model class itself (Post in this case). Let's see…

>> before_refresh = Post
>> before_refresh.object_id
=> 70091079977500
>> before_refresh.inheritable_attributes
=> {:skip_time_zone_conversion_for_attributes=>[], :record_timestamps=>true, :reject_new_nested_attributes_procs=>{}, :default_scoping=>[], :scopes=>{:scoped=>#<Proc:0x00007f7eb4450c10@/var/lib/gems/1.8/gems/activerecord-2.3.2/lib/active_record/named_scope.rb:87>}}
>> before_refresh.inheritable_attributes.object_id
=> 70091079977380

>> reload!

>> Post.object_id
=> 70091079300300 # different!
>> before_refresh.inheritable_attributes
=> {} # different!
>> before_refresh.inheritable_attributes.object_id
=> 70091099674080 # different! but similar ;-)

It looks like ActiveRecord clears inheritable_attributes before reload… But why? I do not know… I am still learning ;-)

Anyway, here are some other reasons why you should not cache your models, described in the context of the problem of requiring models.

If you still want to cache your models (why not?), here is a tip for disabling caching in development mode (which is otherwise not so easy): set memcached as the cache store and give it an invalid port number. Cache reads will miss every time, and no exception is raised when writing to the cache.

Edit:
Michał Kwiatkowski found a better solution.

Published on 29/06/2009 at 18h28.

Some time ago I ran into performance problems connected with acts_as_list.

My acts_as_list in the EmailAddress model was defined as follows:

acts_as_list :scope => :user

Unfortunately, most email addresses in the database had user_id set to NULL.

acts_as_list takes all records from the table whose scope attribute is NULL and creates one huge list out of them… Oops… In my situation this means almost a

SELECT * FROM email_addresses

query on each new EmailAddress creation, which causes very poor performance.

Here is my fix. It makes it possible to write:

acts_as_list :scope => :user, :ignore_nil => true

After doing so, a newly created EmailAddress will not be added to any list unless user is set to a non-NULL value.

In my opinion the original behaviour is quite odd, as acts_as_list already supports listable records which are not on any list…

Published on 21/04/2009 at 20h49.
