debuggable

 
Contact Us
 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Programmer Productivity

Posted 3 weeks, 5 days ago by Felix Geisendörfer

90% of the code is written by 10% of the programmers.

-- Robert C. Martin

I have decided to go on a journey for greater productivity. This is my first post on the subject, and I intend to write about it every day for at least 30 days now.

Just in case you wonder, this whole series won't be a magic guide of universal wisdom that you just have to inhale in order to become a Super Mutant Ninja Developer. Nor do I plan on an artificial motivational boost that will stop as soon as the series is over.

No, my plan is to go deep and fix certain anti-patterns I have developed over the years. I want to look into time / task management, flow, sleep, music, diet, tools, email, work environment, free time, social life, projects, vacations, meditation, habits, money, etc. and achieve a greater understanding how these things affect my daily output.

A lot of what I write might be very individual, so should you be inclined to read it, please draw your own conclusions. This whole thing is going to be more about the process and analysis, than any particular strategy I might derive from it.

To put some science into the whole thing, my quest for today will be to determine a few good metrics for measuring my productivity and implementing tools that will help collecting them. After that I will set some initial goals for improvement and perform various productivity experiments throughout the coming weeks.

--fg

PS: If you are interested in joining me, just link me to any blog posts or leave comments about your own productivity experiences, I will gladly highlight them.

 

Announcing transloadit.com

Posted on 13/7/10 by Felix Geisendörfer

Today we are thrilled to announce the commercial availability of transloadit.com.

If you have a web (or mobile) application that needs file uploading you should consider integrating transloadit. Transloadit will handle the upload process, resizing of images, encoding of videos, and final storage of your content on Amazon S3 for you.

Our plans start at $19 / month which includes 3.5 GB of usage. This is enough for ~72 video encodings or ~717 image resizes per month.

This project has been almost two years in the making, with over 150 people participating in testing various versions. The version we are shipping now has already executed 55.000 internal jobs, each spawning 2-5 command line scripts on our servers.

We are also one of the first commercial software / infrastructure as a service product built on node.js. After experimenting with various technologies, we found it to be the perfect fit for our uploading and processing requirements.

Another thing we are very proud of is the ~95% of test coverage of the service's code base. We have an extensive suite of unit, integration and system tests that have already proven incredibly reliable for detecting problems, be it in our code, or changes to our stack.

If you are a long time reader of this blog, we would feel incredibly grateful if you would spread the word about our service to your boss, co-workers and geek-friends.

Otherwise we would be very happy to hear as much feedback, ideas and questions as you can come up with!

--fg

PS: I also want to use this opportunity to thank my co-founders Tim and Kevin for being the best partners in this business I can imagine. I love you guys.

 

Parsing file uploads at 500 mb/s with node.js

Posted on 31/5/10 by Felix Geisendörfer

A few weeks ago I set out to create a new multipart/form-data parser for node.js. We need this parser for the new version of transloadit that we have been working on since our setback last month.

The result is a new library called formidable, which, on a high level, makes receiving file uploads with node.js as easy as:

var formidable = require('formidable')
  , http = require('http')
  , sys = require('sys');

http.createServer(function(req, res) {
  if (req.url == '/upload' && req.method.toLowerCase() == 'post') {
    // parse a file upload
    var form = new formidable.IncomingForm();
    form.parse(req, function(fields, files) {
      res.writeHead(200, {'content-type': 'text/plain'});
      res.write('received upload:\n\n');
      res.end(sys.inspect({fields: fields, files: files}));
    });
    return;
  }

  // show a file upload form
  res.writeHead(200, {'content-type': 'text/html'});
  res.end
    ( '<form action="/upload" enctype="multipart/form-data" method="post">'
    + '<input type="text" name="title"><br>'
    + '<input type="file" name="upload" multiple="multiple"><br>'
    + '<input type="submit" value="Upload">'
    + '</form>'
    );
});

Essentially this works similar to other platforms where file uploads are saved to disk before your script is invoked with a path to the uploaded file.

What's nice about this however, is that you can hook into the whole thing on a lower level:

form.parse(req);
form.addListener('file', function(field, file) {
  // file looks like this:
  // {path: '...' , filename: '...', mime: '...'}
});

We use that interface for processing HTML5 multi-file uploads as they come in, rather than waiting for the entire upload to finish.

You could even overwrite the onPart handler, which gives you direct access to the raw data stream:

form.onPart = function(part) {
  part.addListener('data', function(chunk) {
    // do cool stuff, like streaming incoming files somewhere else
  });
};

All of this is possible thanks to the underlaying multipart parser which makes heavy use of node.js buffers.

Buffers in node are basically just (efficient) arrays of raw memory that you can access byte by byte. The parser works by looping over each incoming buffer chunk, while maintaining as little state as possible to do its work:

// simplified excerpt from MultipartParser.write
// chunk = Buffer of incoming data

for (var i = 0; i < chunk.length; i++) {
  var character = chunk[i];
  switch (this.state) {
    case 'BOUNDARY-BEGIN':
      if (character != this.boundary[i]) {
        // unexpected character, abort parsing
        return 0;
      }

      if (i == this.boundary.length) {
       // emit event, advance to next state
       this.onPartHeaderBegin();
       this.state = 'HEADER-BEGIN';
      }
      break;
    case 'HEADER-BEGIN':
      // ...
      break;
  }
}

But, as you can imagine, this approach turned out to be somewhat limited in speed. I was only able to get about 16-20 mb/s out of this. My goal however was to get somewhere around 125 mb/s, enough to saturate a 1 gbit network connection.

So I started to look for ways to optimize. The data I was parsing looks like this:

--AaB03x
content-disposition: form-data; name="title"

A picture of me on my unicycle!
--AaB03x
content-disposition: form-data; name="pic"; filename="unicycle.jpg"
Content-Type: image/jpeg

... binary data ...
--AaB03x--

The sequence here being:

  1. Series of boundary characters (--AaB03x), announcing the beginning of a new part
  2. Headers for that part
  3. \r\n\r\n
  4. Data for this part
  5. Series of boundary characters, announcing the end of the part or end of the stream

What stands out is the fact that there is no parsing needed for step #4. All the data of a part itself is a plain octet stream. So after talking to Ryan about it, he recommended me to look into the Boyer-Moore algorithm to speed things up.

The algorithm is usually the fastest method to find a sub string within a string of text. It basically works by analyzing the needle string and building a lookup table of its characters to efficiently skip over parts of the haystack.

But implementing it, was not exactly easy. The algorithm is not trivial, and many of the example implementations I found were wrong. That however was not the problem. The real challenge was that I was working with a stream of data, rather than a big string I had full access to.

This meant keeping lots of additional state in the parser, as well as creating a very complicated piece of code. I like challenges, but I also like efficiently using my time, so I started looking for a shortcut.

And then it hit me. Most of the boyer-moore algorithm is designed to improve the performance of the worst-case scenario. The worst-case scenario for this problem is the case where you hit a character in your haystack that is also a character in the needle string. Boyer-moore deals with this case by knowing the offset of each character in the needle string, so it can maximize the number of characters to skip in any case.

But file uploads rarely cause these worst-case scenarios! With human text, character repetition is pretty high. But file uploads are binary data, so most bytes are likely to fall outside the ASCII range of characters usually used for the boundary.

That made the solution much simpler. All I had to do was generating a list of all characters in the boundary, and whenever I hit a character that was not in that list, I knew I could safely skip the full length of the boundary:

while (i + boundary.length <= chunk.length) {
  if (chunk[i + boundary.length - 1] in boundaryChars) {
    // worst case, go back to byte by byte parsing until a non-matching char occurs
    break;
  }
  i += boundary.length;
}

This resulted in an incredible speed up, allowing to parse uploads at 500 mb/sec. The parsing can be even faster if a longer boundary sequence is used. Short boundaries and hitting the worst-case scenario frequently will slow things down.

The benchmark I created is using an actual boundary created by Mozilla Firefox. Your milage may vary slightly.

The whole thing could still be optimized further, but at this point I believe it is fast enough to make other parts of the system more likely to become the bottleneck.

Formidable is licensed under the MIT license. Questions & suggestions regarding the library, node.js or the parser would be most welcome.

--fg

 

Streaming UTF-8 (with node.js)

Posted on 18/5/10 by Felix Geisendörfer

UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.

But if I want to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢' for example consists of: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to understand if you look at the binary representation of 0xC2:

11000010

As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:

11100010

The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.

As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.

To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.

This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).

As of yesterday, node.js's net / http modules are now fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented, API may change) you can see below:

var Buffer = require('buffer').Buffer;

var Utf8Decoder = exports.Utf8Decoder = function() {
  this.charBuffer = new Buffer(4);
  this.charReceived = 0;
  this.charLength = 0;
};

Utf8Decoder.prototype.write = function(buffer) {
  var charStr = '';
  // if our last write ended with an incomplete multibyte character
  if (this.charLength) {
    // determine how many remaining bytes this buffer has to offer for this char
    var i = (buffer.length >= this.charLength - this.charReceived)
      ? this.charLength - this.charReceived
      : buffer.length;

    // add the new bytes to the char buffer
    buffer.copy(this.charBuffer, this.charReceived, 0, i);
    this.charReceived += i;

    if (this.charReceived < this.charLength) {
      // still not enough chars in this buffer? wait for more ...
      return;
    }

    // get the character that was split
    charStr = this.charBuffer.slice(0, this.charLength).toString();
    this.charReceived = this.charLength = 0;

    if (i == buffer.length) {
      // if there are no more bytes in this buffer, just emit our char
      this.onString(charStr)
      return;
    }

    // otherwise cut of the characters end from the beginning of this buffer
    buffer = buffer.slice(i, buffer.length);
  }


  // determine how many bytes we have to check at the end of this buffer
  var i = (buffer.length >= 3)
    ? 3
    : buffer.length;

  // figure out if one of the last i bytes of our buffer announces an incomplete char
  for (; i > 0; i--) {
    c = buffer[buffer.length - i];

    // See http://en.wikipedia.org/wiki/UTF-8#Description

    // 110XXXXX
    if (i == 1 && c >> 5 == 0x06) {
      this.charLength = 2;
      break;
    }

    // 1110XXXX
    if (i <= 2 && c >> 4 == 0x0E) {
      this.charLength = 3;
      break;
    }

    // 11110XXX
    if (i <= 3 && c >> 3 == 0x1E) {
      this.charLength = 4;
      break;
    }
  }

  if (!this.charLength) {
    // no incomplete char at the end of this buffer, emit the whole thing
    this.onString(charStr+buffer.toString());
    return;
  }

  // buffer the incomplete character bytes we got
  buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
  this.charReceived = i;

  if (buffer.length - i > 0) {
    // buffer had more bytes before the incomplete char, emit them
    this.onString(charStr+buffer.slice(0, buffer.length - i).toString());
  } else if (charStr) {
    // or just emit the charStr if any
    this.onString(charStr);
  }
};

I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!

--fg

PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!

 

Understanding node.js

Posted on 29/4/10 by Felix Geisendörfer

Node.js has generally caused two reactions in people I've introduced it to. Basically people either "got it" right away, or they ended up being very confused.

If you have been in the second group so far, here is my attempt to explain node:

  • It is a command line tool. You download a tarball, compile and install the source.
  • It let's you run JavaScript programs by typing 'node my_app.js' in your terminal.
  • The JS is executed by the V8 javascript engine (the thing that makes Google Chrome so fast).
  • Node provides a JavaScript API to access the network and file system

"But I can do everything I need in: ruby, python, php, java, ... !".

I hear you. And you are right! Node is no freaking unicorn that will come and do your work for you, sorry. It's just a tool, and it probably won't replace your regular tools completely, at least not for now.

"Get to the point!"

Alright, I will. Node is basically very good when you need to do several things at the same time. Have you ever written a piece of code and said "I wish this would run in parallel"? Well, in node everything runs in parallel, except your code.

"Huh?"

That's right, everything runs in parallel, except your code. To understand that, imagine your code is the king, and node is his army of servants.

The day starts by one servant waking up the king and asking him if he needs anything. The king gives the servant a list of tasks and goes back to sleep a little longer. The servant now distributes those tasks among his colleagues and they get to work.

Once a servant finishes a task, he lines up outside the kings quarter to report. The king lets one servant in at a time, and listens to things he reports. Sometimes the king will give the servant more tasks on the way out.

Life is good, for the king's servants carry out all of his tasks in parallel, but only report with one result at a time, so the king can focus. *

"That's fantastic, but could you quit the silly metaphor and speak geek to me?"

Sure. A simple node program may look like this:

var fs = require('fs')
  , sys = require('sys');

fs.readFile('treasure-chamber-report.txt', function(report) {
  sys.puts("oh, look at all my money: "+report);
});

fs.writeFile('letter-to-princess.txt', '...', function() {
  sys.puts("can't wait to hear back from her!");
});

Your code gives node the two tasks to read and write a file, and then goes to sleep. Once node has completed a task, the callback for it is fired. But there can only be one callback firing at the same time. Until that callback has finished executing, all other callbacks have to wait in line. In addition to that, there is no guarantee on the order in which the callbacks will fire.

"So I don't have to worry about code accessing the same data structures at the same time?"

You got it! That's the entire beauty of JavaScripts single-threaded / event loop design!

"Very nice, but why should I use it?"

One reason is efficiency. In a web application, your main response time cost is usually the sum of time it takes to execute all your database queries. With node, you can execute all your queries at once, reducing the response time to the duration it takes to execute the slowest query.

Another reason is JavaScript. You can use node to share code between the browser and your backend. JavaScript is also on its way to become a really universal language. No matter if you did python, ruby, java, php, ... in the past, you've probably picked up some JS along the way, right?

And the last reason is raw speed. V8 is constantly pushing the boundaries in being one of the fastest dynamic language interpreters on the planet. I can't think of any other language that is being pushed for speed as aggressively as JavaScript is right now. In addition to that, node's I/O facilities are really light weight, bringing you as close to fully utilizing your system's full I/O capacities as possible.

"So you are saying I should write all my apps in node from now on?"

Yes and no. Once you start to swing the node hammer, everything is obviously going to start looking like a nail. But if you're working on something with a deadline, you might want to base your decision on:

  • Are low response times / high concurrency important? Node is really good at that.
  • How big is the project? Small projects should be fine. Big projects should evaluate carefully (available libraries, resources to fix a bug or two upstream, etc.).

"Does node run on Windows?"

No. If you are on windows, you need to run a virtual machine (I recommend VirtualBox) with Linux. Windows support for node is planned, but don't hold your breath for the next few months unless you want to help with the port.

"Can I access the DOM in node?"

Excellent question! No, the DOM is a browser thingy, and node's JS engine (V8) is thankfully totally separate from all that mess. However, there are people working on implementing the DOM as a node module, which may open very exciting possibilities such as unit testing client-side code.

"Isn't event driven programming really hard?"

That depends on you. If you already learned how to juggle AJAX calls and user events in the browser, getting used to node shouldn't be a problem.

Either way, test driven development can really help you to come up with maintainable designs.

"Who is using it?"

There is a small / incomplete list in the node wiki (scroll to "Companies using Node"). Yahoo is experimenting with node for YUI, Plurk is using it for massive comet and Paul Bakaus (of jQuery UI fame) is building a mind-blowing game engine that has some node in the backend. Joyent has hired Ryan Dahl (the creator of node) and heavily sponsors the development.

Oh, and Heroku just announced (experimental ) hosting support for node.js as well.

"Where can I learn more?"

Tim Caswell is running the excellent How To Node blog. Follow #nodejs on twitter. Subscribe to the mailing list. And come and hang out in the IRC channel, #node.js (yes, the dot is in the name). We're close to hitting the 200 lurker-mark there soon : ).

I'll also continue to write articles here on debuggable.com.

That's it for now. Feel free to comment if you have more questions!

--fg

*: The metaphor is obviously a simplification, but if it's hard to find a counterpart for the concept of non-blocking in reality.

 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9