Sunday, May 29, 2011

What is MapReduce?

MapReduce is a parallel programming technique made popular by Google. It is used for processing very large amounts of data. Such processing can be completed in a reasonable amount of time only by distributing the work across multiple machines in parallel, with each machine processing a small subset of the data. MapReduce is a programming model that lets developers focus on writing the code that processes their data without having to worry about the details of parallel execution.

MapReduce requires modeling the data to be processed as key,value pairs. The developer codes a map function and a reduce function.

The MapReduce runtime calls the map function for each key,value pair. The map function takes a key,value pair as input and produces zero or more output key,value pairs.

The MapReduce runtime sorts and groups the output from the map functions by key. It then calls the reduce function once for each intermediate key produced by the map function, passing it the key and the list of values associated with that key. The output from the reduce function is a key,value pair. The value is generally an aggregate or something else calculated by processing the list of values passed in for the input key. The output from the reduce function is the required result.
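To make the map, sort/group and reduce steps concrete, here is a minimal single-process sketch in Python. It is only a stand-in for a real MapReduce runtime: the names run_mapreduce, map_fn and reduce_fn and the "longest word per first letter" job are assumptions made up for illustration, and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the MapReduce runtime (illustrative only)."""
    # Map phase: call map_fn once per input (key, value) pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Sort and group phase: bring all values for the same key together.
    intermediate.sort(key=itemgetter(0))

    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce phase: one call per intermediate key.
        results.extend(reduce_fn(key, values))
    return results


# Hypothetical job: length of the longest word for each first letter.
def map_fn(line_no, line):
    for word in line.split():
        yield word[0], len(word)


def reduce_fn(letter, lengths):
    yield letter, max(lengths)


records = [(0, "apple avocado banana"), (1, "blueberry cherry")]
print(run_mapreduce(records, map_fn, reduce_fn))
# [('a', 7), ('b', 9), ('c', 6)]
```

The developer only writes map_fn and reduce_fn. Everything inside run_mapreduce is what a real runtime would do across many machines.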

As an example, let us say you have a large number of log files that contain audit logs for some event, such as access to an account. You need to find out how many times each account was accessed in the last 10 years.
Assume each line in the log file is an audit record. We are processing the log files line by line. The map and reduce functions would look like this:
map(key, value) {
    // key = byte offset in log file
    // value = a line in the log file
    if (value is an account access audit log) {
        account number = parse account from value
        output key = account number, value = 1
    }
}

reduce(key, list of values) {
    // key = account number
    // list of values {1,1,1,1.....}
    count = 0
    for each value in list of values
        count = count + value
    output key, count
}

The map function is called for each line in each log file. Lines that are not relevant are ignored. The account number is parsed out of relevant lines and output with a value of 1. The MapReduce runtime sorts and groups the output by account number. The reduce function is called for each account. The reduce function aggregates the values for each account, which is the required result.
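For readers who want something runnable, the same job can be sketched in Python. The log line format ("<timestamp> ACCESS account=<number>") is purely an assumption for this sketch, since the post does not define one, and count_accesses is a hypothetical single-machine stand-in for the runtime.

```python
from collections import defaultdict

# Assumed (hypothetical) record format: "<timestamp> ACCESS account=<number>"
PREFIX = "account="


def map_fn(offset, line):
    """Emit (account_number, 1) for access records; ignore every other line."""
    fields = line.split()
    if len(fields) == 3 and fields[1] == "ACCESS" and fields[2].startswith(PREFIX):
        yield fields[2][len(PREFIX):], 1


def reduce_fn(account, counts):
    """Add up the 1s emitted for a single account."""
    return account, sum(counts)


def count_accesses(lines):
    """Stand-in for the runtime: group map output by key, then reduce each key."""
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        for account, one in map_fn(offset, line):
            grouped[account].append(one)
        offset += len(line) + 1  # approximate byte offset, counting the newline
    return dict(reduce_fn(acct, vals) for acct, vals in sorted(grouped.items()))


log = [
    "2011-05-01T10:00:00 ACCESS account=1001",
    "2011-05-01T10:05:00 LOGIN user=bob",
    "2011-05-02T09:30:00 ACCESS account=1002",
    "2011-05-03T14:12:00 ACCESS account=1001",
]
print(count_accesses(log))  # {'1001': 2, '1002': 1}
```

The LOGIN line is not an account access record, so the map function emits nothing for it.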

MapReduce jobs are generally executed on a cluster of machines. Each machine executes a task, which is either a map task or a reduce task. Each task processes a subset of the data. In the above example, let us say we start with a set of large input files. The MapReduce runtime breaks the input data into partitions called splits or shards. Each split or shard is processed by a map task on a machine. The output from each map task is sorted and partitioned by key. The outputs from all the maps are merged to create partitions that are input to the reduce tasks.

There can be multiple machines, each running a reduce task. Each reduce task gets a partition to process. A partition can contain multiple keys, but all the data for a given key is in one partition. In other words, each key is processed by exactly one reduce task.
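One common way to guarantee that a key always lands in the same partition is to hash the key and take the result modulo the number of reduce tasks. The sketch below assumes that scheme; the function names partition and shuffle are made up for illustration and are not any particular runtime's API.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reduce task for a key using a stable hash."""
    # zlib.crc32 is used instead of the built-in hash() because hash() of a
    # str can differ from one Python process to the next.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route (key, value) pairs into one bucket per reduce task."""
    buckets = [dict() for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)].setdefault(key, []).append(value)
    return buckets


map_output = [("1001", 1), ("1002", 1), ("1001", 1), ("1003", 1)]
buckets = shuffle(map_output, num_reducers=2)
for task_id, bucket in enumerate(buckets):
    print(task_id, bucket)
```

Whichever bucket "1001" ends up in, both of its values are there together, so a single reduce task sees the complete list for that account.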

The number of machines, the number of map tasks, the number of reduce tasks and several other things are configurable.

MapReduce is useful for problems that require processing large data sets, where the algorithm can be broken into map and reduce functions. The MapReduce runtime takes care of distributing the processing to multiple machines and aggregating the results.

Apache Hadoop is an open source Java implementation of MapReduce. Stay tuned for a future blog post / tutorial on MapReduce using Hadoop.

Tuesday, May 24, 2011

Comparison of different PaaS providers

Found this interesting link comparing different PaaS providers. Check the link here

Monday, May 23, 2011

VMware CloudFoundry Architecture

A few days back we covered a brief introduction to CloudFoundry. Let's now explore the architecture of VMware's latest PaaS offering, CloudFoundry.

Just to clarify, this article is based entirely on my understanding and is not an official document about CloudFoundry in any way. Feel free to let me know if my understanding is wrong.

Let's first answer a question. Since we learned previously that PaaS is a super cool solution, one might wonder...

Why aren't a lot of companies providing PaaS solutions?

Building a Platform as a Service (PaaS) is fairly complicated: it involves building, deploying and maintaining many moving parts, orchestrating all the services internally, abstracting all of that work away, and finally marketing, selling and maintaining the product. Because of the heavy investment involved, very few companies have considered building their own PaaS solution. Interestingly, VMware has made the Cloud Foundry service open source.

What has CloudFoundry been orchestrated in?

Interestingly, it is orchestrated entirely in Ruby! No Erlang, no JVMs, all Ruby under the hood.
For a nice technical overview, check out this webinar

Who orchestrates all the components in CloudFoundry?

To orchestrate all of these moving components, the "brain" of the platform is a Rails 3 application (Cloud Controller) whose role is to store the information about all users, provisioned apps, services, and maintain the state of each component. When you run your CLI (command line client) on a local machine, you are, in fact, talking to the Cloud Controller. Interestingly, the Rails app itself is designed to run on top of the Thin web-server, and is using Ruby 1.9 fibers and async DB drivers - in other words, async Rails 3!

The Rails application works hand in hand with the Health Manager, a standalone daemon that imports all of the CloudController ActiveRecord models and actively compares what is in the database against the chatter between the remaining daemons. When a discrepancy is detected, it notifies the Cloud Controller: a simple and effective way to keep all the distributed state information up to date.
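The comparison the Health Manager performs can be pictured with a small Python sketch. This is not CloudFoundry code (which is Ruby): the function name find_discrepancies, the shape of the data and the app names are all assumptions chosen only to illustrate "expected state in the database versus what the daemons are actually reporting".

```python
def find_discrepancies(expected, heartbeats):
    """Compare desired instance counts with instances seen via heartbeats.

    expected:   {app_name: desired_instance_count}, e.g. loaded from the database
    heartbeats: iterable of (app_name, instance_index) pairs heard recently
    """
    running = {}
    for app, index in heartbeats:
        running.setdefault(app, set()).add(index)

    report = {}
    # Apps the database knows about but that run too few or too many instances.
    for app, wanted in expected.items():
        seen = len(running.get(app, ()))
        if seen != wanted:
            report[app] = {"expected": wanted, "running": seen}

    # Apps that send heartbeats but are unknown to the database.
    for app in running.keys() - expected.keys():
        report[app] = {"expected": 0, "running": len(running[app])}
    return report


expected = {"blog": 2, "shop": 1}
heartbeats = [("blog", 0), ("shop", 0), ("old-app", 0)]
print(find_discrepancies(expected, heartbeats))
# {'blog': {'expected': 2, 'running': 1}, 'old-app': {'expected': 0, 'running': 1}}
```

In the real system the report would be sent on to the Cloud Controller, which decides what to start or stop.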

How is Orchestration of the CloudFoundry platform done?

The remainder of the CloudFoundry platform follows a consistent pattern: each service is a Ruby daemon which queries the CloudController when it first boots, subscribes to and publishes to a shared message bus, and also exposes several JSON endpoints for providing health and status information. Not surprisingly, all of the daemons are also powered by Ruby EventMachine under the hood, and hence use Thin and simple Rack endpoints.

The router is responsible for parsing incoming requests and redirecting the traffic to one of the provisioned applications (droplets). To do so, it maintains an internal map of registered URLs and the provisioned applications responsible for each. When you provision or decommission an app server instance, the router table is updated and the traffic is redirected accordingly. For small deployments one router will suffice, and in larger deployments traffic can be load-balanced between multiple routers.
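The router's internal map can be sketched as a small Python class. This is an illustration only: the class and method names are hypothetical, the real router is a Ruby daemon, and the round-robin choice between instances is an assumption, since all we know is that traffic goes to "one of" the provisioned applications.

```python
import itertools


class Router:
    """Maps registered URLs to the app instances that serve them (toy sketch)."""

    def __init__(self):
        self._table = {}    # url -> list of "host:port" instances
        self._cursors = {}  # url -> round-robin iterator over those instances

    def register(self, url, instance):
        """Called when a new app instance is provisioned."""
        self._table.setdefault(url, []).append(instance)
        self._reset(url)

    def unregister(self, url, instance):
        """Called when an app instance is decommissioned."""
        self._table[url].remove(instance)
        if self._table[url]:
            self._reset(url)
        else:
            del self._table[url], self._cursors[url]

    def _reset(self, url):
        # Rebuild the iterator from a copy so later table edits cannot affect it.
        self._cursors[url] = itertools.cycle(list(self._table[url]))

    def route(self, url):
        """Return the next instance for url, or None if nothing is registered."""
        cursor = self._cursors.get(url)
        return None if cursor is None else next(cursor)


router = Router()
router.register("blog.example.com", "10.0.0.1:8001")
router.register("blog.example.com", "10.0.0.2:8001")
print([router.route("blog.example.com") for _ in range(3)])
# ['10.0.0.1:8001', '10.0.0.2:8001', '10.0.0.1:8001']
router.unregister("blog.example.com", "10.0.0.1:8001")
print(router.route("blog.example.com"))  # 10.0.0.2:8001
```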

The DEA (Droplet Execution Agent) is the supervisor process responsible for provisioning new applications: it receives the query from the CloudController, sets up the appropriate platform, exports the environment variables, and launches the app server.

Finally, the services component is responsible for provisioning and managing access to resources such as MySQL, Redis, RabbitMQ, and others. Once again, very similar architecture: a gateway Ruby daemon listens to incoming requests and invokes the required start/stop and add/remove user commands. Adding a new or a custom service is as simple as implementing a custom Provisioner class.

What glues all these moving pieces together?

Each of the Ruby daemons above follows a similar pattern: on load, query the Cloud Controller, and also expose local HTTP endpoints to provide health and status information about itself. But how do these services communicate with each other? Well, through another Ruby-powered service, of course! The NATS publish-subscribe message system is a lightweight topic router (powered by EventMachine) which connects all the pieces! When each daemon first boots, it connects to the NATS message bus, subscribes to topics it cares about (e.g. provision and heartbeat signals), and also begins to publish its own heartbeats and notifications.
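The publish-subscribe pattern itself is easy to picture with a toy, in-process Python sketch. To be clear, this is not the NATS API: MessageBus, Daemon and the "heartbeat" topic name are hypothetical, and a real bus delivers messages over the network between separate processes.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process topic router standing in for a real message bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every callback subscribed to this topic.
        for callback in list(self._subscribers[topic]):
            callback(message)


class Daemon:
    """A component that subscribes on boot and publishes its own heartbeats."""

    def __init__(self, name, bus):
        self.name, self.bus, self.seen = name, bus, []
        bus.subscribe("heartbeat", self.on_heartbeat)

    def on_heartbeat(self, message):
        if message["from"] != self.name:  # ignore our own heartbeat
            self.seen.append(message["from"])

    def beat(self):
        self.bus.publish("heartbeat", {"from": self.name})


bus = MessageBus()
router, dea = Daemon("router", bus), Daemon("dea", bus)
router.beat()
dea.beat()
print(router.seen, dea.seen)  # ['dea'] ['router']
```

Neither daemon holds a reference to the other. They only know the bus and the topic, which is what makes it easy to add or remove components.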

This architecture allows CloudFoundry to easily add and remove new routers, DEA agents, service controllers and so on. Nothing stops you from running all of the above on a single machine, or across a large cluster of servers within your own datacenter.

Distributed Systems with Ruby? Yes!

Building a distributed system with as many moving components as CloudFoundry is no small feat, and it is really interesting to see that the team behind it chose Ruby as the platform of choice. If you look under the hood, you will find Rails, Sinatra, Rack, and a lot of EventMachine code. If you ever wondered whether Ruby is a viable platform for building a non-trivial distributed system, then this is a great case study and a vote of confidence by VMware. Definitely worth a read through the source!

Another interesting read would be How Cloud Foundry works when a new Application is Deployed

Next time I will try to explain CloudFoundry dynamics with a use case and go into more technical depth on each block.

Wednesday, May 18, 2011

What is Infrastructure as a Service?

Infrastructure as a service (IaaS) describes providing access to a computer infrastructure through one management console. The computer infrastructure is traditionally hosted offsite, and companies pay per use or per account to access the service. Wait a minute... I am getting all confused. First PaaS, SaaS and now IaaS. Give me a wider picture of where each of these fits.

Here you go, hope this picture answers all the questions..

Okay fine, I get it, but what could one possibly gain from using IaaS?

In a traditional company, computer infrastructure included personal machines, a server room full of computer systems, storage space and cooling devices, and specialized staff to maintain it all. This can run into the tens of thousands of dollars and until recently has been a required part of almost every business. Infrastructure as a service now gives companies the chance to set up their own computing system without the initial expenditure involved in purchasing it. Workers simply log in from their own computer and access everything they need from the off-site company at a pay-per-use price.

Looks interesting. Can I benefit from IaaS if I want to run my own startup?

The benefits of using infrastructure as a service clearly make this a desirable option for many businesses. For a start, there is no initial cost in setting up computer infrastructure, which saves the company thousands of dollars a year. There are no costly upgrades or maintenance fees, as this is all included within the package price and is taken out of the hands of the company using the service. Staffing costs are kept down, as there is no need for an intensive IT department.

Is the server room in my office for me to play basketball?

Physical space within the office is maximized, with large server rooms becoming obsolete and workers being able to log onto the infrastructure from their own laptops. This also means that infrastructure as a service makes it easier to work from home or when traveling. There are also environmental benefits, with fewer companies running their own resource-intensive server rooms, and many of the companies that offer infrastructure as a service sell the space in such a way that there is very little unused computer space or idle server capacity.

The benefits of infrastructure as a service greatly outweigh the negatives, and even some previous issues such as information protection, security and downtime have been completely revolutionized, meaning that those working on infrastructure as a service not only save money, but can also increase their productivity. Many systems are being developed as green virtualization solutions so that clients using IaaS can know that they are doing their best to help the environment while they work.

The idea here is to build a base for understanding, in simple terms, what VMware vCloud Director is (I will try to explain it in the next post).

Saturday, May 14, 2011

What is Cloud Foundry?

One can see a lot of enthusiasm about Cloud Foundry, VMware's latest offering.

Okay, cut it short. In layman's terms, what is it?

To understand this, let's first look at some terminology.

What are SaaS and PaaS? I hear these too often these days.

Using Salesforce as an example: they offer the platform, which provides a database, a programming language, integration features and so on. You can use this platform to build whatever you need or like. This is platform-as-a-service.

Salesforce also offers its own prebuilt CRM applications. This is software-as-a-service: the application has been built for you, and you simply start using it.

PaaS provides you with the components and tools to build something; SaaS provides you a prebuilt application you can pick up and use straight away. The line can be blurred - again, using the Salesforce example, you can tailor their SaaS offerings by using some of their PaaS technologies.

Okay fine, we get an idea of PaaS and SaaS.

But what exactly are these public and private clouds?

A public cloud is offered as a service via web applications/web services (usually over an Internet connection). Private clouds and internal clouds are deployed inside the firewall and managed by the user organization.

There is another type of cloud, the hybrid cloud. A hybrid cloud environment consisting of multiple internal and/or external providers will be typical for most enterprises.
Check this interesting link

Okay, this warmed me up a bit. So what is Cloud Foundry now? Should I even care?

Well, you should definitely care because of VMware's Midas touch :) on any product. But on a serious note, it is a great effort to bring multiple frameworks, services, clouds etc. under a single roof. Let me explain...

For development of applications running in the cloud there are various options. Google App Engine and Microsoft Azure are some big names. They all use their own development platform, which means customers on that platform cannot switch to another cloud provider without having to rewrite the application code. The code is stuck to the platform of the provider. It is a bit like having an electric appliance which can only be used with a specific electricity company. If the electricity company has many outages or raises the price of power, you cannot switch to another company. The common name for this is vendor lock-in, or Hotel California: you can check in but you cannot check out. The provider will lure you with low prices, and when enough customers are in, prices are raised to be able to have a healthy economic model and deliver a good return for investors and shareholders.

VMware announced its own application development platform called Cloud Foundry in April. It is based on open source software. It can be described as a layer which sits between the cloud provider infrastructure (computing resources) and the software used for development.

As it is open source, the application can run on any cloud provider that supports Cloud Foundry, even when the cloud provider does not run VMware vSphere as its underlying virtualization platform. So if the cloud provider does not comply with its SLAs or raises the price, the app can be moved quite easily to another provider without having to rewrite the app. Wow, isn't this super cool? I feel it is kind of like moving from dictatorship to democracy :).

Cloud Foundry supports MySQL, MongoDB and Redis, and in the coming months they claim they will add support for other application services. In the initial release, Spring for Java, Rails and Sinatra for Ruby, and Node.js are supported. The system also supports other JVM-based frameworks such as Grails.

Is it Open Source? But Of Course

Cloud Foundry is an open source project with a community and source code today at

You can also check this link

In the next post I will try to go into more detail about the Cloud Foundry architecture.

Again, the aim here is to explain the technology in simple terms. Feedback is always appreciated.