Hadoop And Friends

If you want to learn Hadoop, try these.

This presentation was originally written for a LUG, so it aims to educate a broad audience rather than to train Hadoop developers. The target length was one hour. In the event, it ran to about 40 minutes (but I did rush a little).

Creative Commons License (c) Robert Burrell Donkin

Please feel free to remix, copy and present!

This is arranged into three loose stanzas with a rally point at the start of each:

  1. Utility Computing
  2. Friends And Relations
  3. Apache

A brief introduction to MapReduce lies between 1 and 2.

  1. Hadoop And Friends...

    To achieve the desired effect, it is necessary to open this page then press the back button. All icons will disappear leaving just Hadoop and setting up the first visual play on words. The green kitchener image needs to be tuned so that it appears in the middle of the screen.

    My name is Robert Burrell Donkin. I'm a Member of the Apache Software Foundation but not a committer on Hadoop or any of the other projects mentioned in this talk. I am excited about the open source nexus emerging around utility computing. By the end of this talk, I hope you'll know a little more about why this is and about these projects.

  2. SmartFrog And The Elastic Compute Cloud

A Frog, up close and personal

    Couldn't resist this as a starting line!

    Hopefully, start with a laugh. Be careful to tune the CSS so that this slide displays correctly. Might be necessary to scale the picture as well.

As a term, grid computing has been around for a number of years. The initial hype has not yet been justified: the grid has not yet gone mainstream. Two important trends are likely to change this.

The first is the switch to multiple cores. It is now common for commodity servers to ship with 4 or 8 cores. Next year, they will probably ship with 16 or 32. Grid computing is no longer out of reach.

    The second is the rise of outsourced commodity computing. This now offers a real choice between running a single core in serial or multiple cores in parallel without worrying about capital costs. This section deals with this topic by example.

  3. SmartFrog...

• Created by HP
    • Is Free Software (LGPL)
    • Describes...
    • Activates...
    • Manages...
    • ...Distributed systems
    • Including The Elastic Compute Cloud

SmartFrog is an open source framework for describing, activating and managing distributed systems. Most of the software in this talk is developed at Apache and is licensed under the Apache License, Version 2. SmartFrog was created by HP and is licensed under the LGPL, but it is an open development project.

SmartFrog takes an interesting approach. It combines a domain specific language describing components and their operations with an extensible modular runtime written in Java.

SmartFrog addresses the problem of managing heterogeneous distributed systems. Once the deployment, activation and management of an instance are scripted, scaling the number of distributed instances devoted to a task becomes easy. All this can be automated.

  4. ...The Elastic Compute Cloud

    • Amazon EC2 is:
      • Controllable
      • Elastic
      • Cheap
      • How Cheap? June 2008 per instance hour:

        $0.10 - Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform

        EC2 Compute Unit (ECU) - One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Amazon Web Services. (Yes, that Amazon.) Why would an online bookshop offer on-demand distributed computing? Because that's what they're good at.

The Amazon Elastic Compute Cloud (EC2) is an experiment in mass server commoditisation. Amazon provides the virtual hardware, you provide the server image. Each server is managed through a web service interface, at just ten cents per instance-hour.

It's elastic: start new instances as you need them, switch them off when you don't. If you automate with SmartFrog, then it's as easy to start and shut down ten instances as one.

  5. As Many As You Want!

    • 10 nodes...
    • 100 nodes...
    • 1000 nodes...
    • 10,000 nodes...
    • 100,000 nodes...? (Maybe)
    • Processing power is a democratic commodity

EC2 reduces friction. Amazon supplies utility processing power with standard, non-discriminatory terms and conditions. No contract negotiations: just pay and play. No initial capital investment in servers. No need to hire expensive server rooms. No need to worry about UPS and backup generators. No need to wait for spare parts to be delivered.

Amazon aren't alone. This is just the start of a major movement in the industry: the democratisation of massively parallel processing. Processing power as a commodity. It's now as easy to hire 1000 cores for one hour as 1 core for 1000 hours. Suddenly, parallelism is affordable, but to exploit parallelism on this scale we must travel beyond clustering.
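Using the June 2008 Small Instance price quoted earlier, the arithmetic behind that claim is easy to sketch (the function name is illustrative):

```python
# June 2008 price for a Small Instance: $0.10 per instance-hour.
RATE = 0.10

def bill(instances, hours, rate=RATE):
    """Total cost of running `instances` nodes for `hours` each."""
    return instances * hours * rate

serial = bill(1, 1000)     # one core for a thousand hours
parallel = bill(1000, 1)   # a thousand cores for one hour
print(serial, parallel)    # same bill; the parallel answer arrives much sooner
```

Either way the bill is $100; only the waiting time differs.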

  6. Beyond Clustering...

Typical enterprise clusters are deployed on a relatively small number of nodes: a single-digit number of instances on a single subnet. Enterprise clusters tend to be chatty and exploit subnet characteristics. The small number of nodes means that failure of a node is a rare event but has a high impact.

    This approach to clustering starts to run into difficulties as the number of nodes scales. Effectively, each peer needs to communicate with every other peer. The effort of maintaining the cluster scales poorly as the number of nodes rises.
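The poor scaling is easy to quantify: in a fully connected cluster the number of peer links grows quadratically with the number of nodes. A quick sketch:

```python
def mesh_links(nodes):
    """Peer-to-peer links in a full mesh: n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2

for n in (8, 100, 1000):
    print(n, mesh_links(n))  # 8 nodes need 28 links; 1000 need 499,500
```

Chatty protocols that are fine across 28 links become unmanageable across half a million.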

In order to exploit the parallelism available through multicore servers and commodity outsourced distributed computing, new approaches are needed. As in the diagram, it's usual to label these grid computing. Throughout the rest of this talk, cluster will be used to refer to classic enterprise clustering, while grid will be used for approaches that scale better as the number of nodes increases.

    The diagram indicates roughly where some technological approaches lie in terms of nodes and data.

At the top right hand of the diagram is GRID computing, of which seti@home is a famous example. These approaches are characterised by the very large quantities of data and nodes involved. For example, on Sunday 15th June 2008, BOINC reported 346,870 volunteers contributing 588,532 computers with a 24-hour average of 1,271.53 TeraFLOPS. This is an order of magnitude bigger than anything else on the diagram (for example, Yahoo! reported running a Hadoop grid with 10,000 cores in early 2008).

A good example of this approach is BOINC. BOINC is an open source framework for volunteer distributed computing. It is best known for massive public projects (such as seti@home) but is also suitable for desktop GRID computing in the enterprise. BOINC uses a classic client-server architecture. A client running locally drives the process: it downloads new work from the server and schedules its execution.

This suits the primary use case, volunteer computing, very well. It also fits related enterprise use cases in the commercial and academic sectors: exploiting spare CPU cycles on desktops. But it does not fit grid computing on servers dedicated to the task. The other applications on the diagram do address this use case.

The next application on the diagram is an open source enterprise grid computing framework. It sits towards the enterprise end of the spectrum, offering an environment familiar to enterprise developers. It is messaging based. This hits the sweet spot just above classic enterprise clustering, but as the quantity of data rises, data locality starts to become important. As the number of nodes rises, they can no longer all be conveniently located on a single quick subnet and network topology starts to become important.

  7. ...To Utility Computing

    • Grid computing is a broad church
    • Allen Wittenauer is the Senior Systems Architect for Yahoo!'s Grid Computing group
    • He defines utility computing as a grid
      sharing one owner's resources by multiple, independent customers
    • He runs grids with Petabytes of data and 1000's of nodes:
      • Data locality is important
      • Machines fail constantly
      • Machine utilisation is high

I was very pleased to chair Allen Wittenauer's session at ApacheConEU 2008. If you ever have the opportunity to hear him speak, take it! His presentation is recommended and offers an insight into what it's like operating at this scale.

Down nodes are a constant, not an occasional event. A classic enterprise cluster is expected to be robust in the event of a failure, but these failures are expected to be infrequent. It is acceptable for servers to spend time rebuilding a cluster in the event of failure. Many techniques which may be appropriate when faced with occasional failures lead to poor performance when failures are constant.

In addition, natural variability means that some machines take longer than others to complete the same task. Statistical approaches tend to work better at this scale. For example, rather than actively monitor nodes, accept that there will be a long tail and reschedule tasks on servers that are taking longer than their peers.
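A minimal sketch of that statistical idea: flag tasks running well past the median as candidates for speculative re-execution. (The median-based threshold here is an illustrative assumption, not Hadoop's actual heuristic.)

```python
import statistics

def stragglers(elapsed, factor=1.5):
    """Return task ids running longer than factor * median elapsed time:
    candidates for speculative re-execution on another node."""
    median = statistics.median(elapsed.values())
    return sorted(t for t, secs in elapsed.items() if secs > factor * median)

running = {"task-1": 10.0, "task-2": 11.0, "task-3": 9.5, "task-4": 40.0}
print(stragglers(running))  # only task-4 sits in the long tail
```

No per-node health checks are needed: the long tail identifies itself.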

The high levels of server utilisation achieved are impressive.

  8. Hadoop

    • Scaling Data
      • Distribution
      • Redundancy
      • Locality
    • Scaling Nodes
      • Compose simple functions
      • Long tail scheduling
    • Hadoop = MapReduce + DistributedFileSystem

Time to start looking at how Hadoop approaches the problem. Just using messaging runs into issues as the data size rises. The distribution of the data becomes an increasing problem, especially once network topology needs to be factored in. As well as being distributed, the data needs to be stored in a redundant fashion over multiple nodes. Data locality, performing calculations close to the data, becomes important.

    File systems are well known and easy to understand. A distributed file system spreads file data across nodes. Hadoop uses a redundant distributed file system. Computations are located close (ideally co-located) to the data. This solution is easily understood but powerful.

    Providing utility computing at scale means simplifying the development model. It isn't worthwhile for a framework to attempt to understand complexity. Tasks need to be factored into the composition of finely grained operations which can be executed in parallel. These operations need to be free of local side effects. The MapReduce model has proved both simple and powerful. This is what Hadoop uses.

  9. MapReduce

• Term popularised by Google researchers Jeffrey Dean and Sanjay Ghemawat with archetypal use cases:
      • Distributed grep
      • Log analysis
      • Word count
      • Inverted index
    • Turns out that lots more applications fit
      • Big Data Stores
      • Numeric analysis
      • Data mining
      • Machine learning
      • ...

    Here's the bit that everyone's been waiting for: an introduction to MapReduce.

    Google popularised the name with MapReduce: Simplified Data Processing on Large Clusters. This model builds parallel applications by composing just two simple functional operations. This simplicity means that MapReduce does not require the boutique programming skills associated with other approaches to parallelism.

Four-core machines are easily affordable today. Being able to scale by the addition of more cores is becoming increasingly important. It has been argued strongly that MapReduce is also a very suitable development paradigm for programming a single machine with many cores.

There are a number of good MapReduce tutorials around. In particular, see Google Code University.

  10. Two Minute Intro

    MapReduce is composed of

    • A Map Function
    • Followed by Reduce Functions

    MapReduce applications are composed of two simple functions. The map function takes data and produces key-value pairs. The reduce function merges all intermediary values with the same key to produce key-value pairs.

  11. Map

    • Takes a slice of input...
    • ...And produces intermediary key-value pairs

  12. Reduce

• Merges values for a particular key...
    • ...And produces intermediary key-value pairs
    • The output of a reduce can be used as input for later reduce stages
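The two functions above can be sketched in a few lines of Python using the classic word-count example. This is a toy, serial simulation of the model, not the Hadoop API:

```python
from collections import defaultdict

def map_fn(line):
    """Map: take a slice of input, emit intermediary (key, value) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all intermediary values sharing the same key."""
    return key, sum(values)

def map_reduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map phase (parallel on a real grid)
        for key, value in map_fn(line):
            grouped[key].append(value)  # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

print(map_reduce(["the quick brown fox", "the lazy dog"]))
```

On a grid, the map calls run in parallel across nodes and the framework performs the grouping; the program logic stays this simple.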

  13. MapReduce

  14. History: In The Beginning, There Was Jakarta...

    • Err... Well, actually no
• But Lucene started in the Jakarta project
    • Lucene
      • The Library
        • Indexes and searches documents
• Primary language is Java
        • C Port
        • .NET Port
      • The Project
        • Is home to search software
        • Graduated to top level in 2005

    This is the start of the second stanza: introducing Hadoop and related projects. Will return to MapReduce in the context of Hadoop later but start with a little history. See:

A couple of reasons for choosing this oblique approach. Firstly, to give MapReduce a chance to sink in. Secondly, to emphasise the current momentum behind Hadoop and related projects. For example:

Yahoo!'s involvement with Hadoop is considerable, but analysts often make the mistake of confusing contribution with control. The most important reason why corporations choose to involve themselves with these projects is the open development model. Hadoop and the rest are open to contributions from anyone who's interested. More on this in the third section.

Jakarta and Lucene are the obvious places to start. Jakarta is well known to Java developers but may not be to others. May need to gauge the audience.

  15. The Lucene Project

    • Spawned many others...
    • Tika
      • Detects and extracts meta-data
      • Started Incubation in 2007
    • Solr
      • Extends Lucene
• Is an Enterprise Search Framework
      • Is an XML-over-HTTP Enterprise Search Server
      • Started Incubation in 2006
      • Graduated in 2007

    Solr is a fully featured enterprise search framework which isn't as widely known as it deserves to be.

  16. Nutch

    • Nutch
      • Extends Lucene
      • Searches the web
      • Crawls links
      • Parses and indexes documents
      • Scores and ranks pages
      • 2002-2004 Scaling problems...

Nutch is where the Hadoop story starts. Nutch is an interesting project with a big ambition: to create components from which an open source search engine capable of indexing the web can be built. Nutch uses Lucene to perform its searching and indexing.

    But Nutch had problems. The web is really quite a big place. Scaling up to data that size proved to be tricky...

  17. Scaling Nutch

    • Scaling Nutch
      • 2004: Google Research publishes MapReduce And GFS Papers
      • And Nutch Implements...
      • ...and Hadoop begins
      • 2005 Nutch Graduates
      • 2006 Hadoop Split Out
      • 2008 Hadoop Promoted To Top Level Project
      • 2008 Yahoo! generates production search index with Hadoop

The turning point came when Google researchers published papers describing how they approach this problem:

MapReduce and distributed file system implementations are created quickly and soon show promise. Suddenly, Nutch starts to scale. These implementations go on to become the basis of Hadoop.

  18. Hadoop Revisited

    • Hadoop Distributed File System stores input and output data on compute nodes
    • InputFormat
      • Validates input
• Splits data into InputSplits
      • Supplies RecordReader
    • OutputFormat
      • Validates output
      • Supplies RecordWriter

    Finally arrive back at Hadoop. Here starts a discussion of the general principles within the context of Hadoop.

    As might be expected, Hadoop is based around interfaces to allow extension.

    Input and output are similar so it's convenient to talk about them together. The initial data is loaded into and the output retrieved from the HDFS. Nodes both store data and perform computations. See

  19. MapReduce with Hadoop

    • Map functions implement Mapper interface
• Reduce functions implement Reducer interface
    • Optional combiner for local aggregation
    • Optional compression for intermediate outputs
    • Optional reporter to monitor progress

    Introduce some details about the Hadoop MapReduce implementation. See

  20. Languages

  21. Hadoop

Pipes is a SWIG-compatible C++ API. This means it isn't JNI based. The C++ code is executed in a separate process. Communication is socket based.

The streaming API communicates with arbitrary executables through STDIN and STDOUT. See:
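A streaming word-count mapper might look like the sketch below: raw text lines arrive on STDIN and tab-separated key-value pairs go out on STDOUT. This is an illustrative sketch, not code from the Hadoop distribution; a matching reducer would sum the counts for each word.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: read raw lines on STDIN,
# write tab-separated (word, count) pairs on STDOUT.
import sys

def map_lines(lines):
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

if __name__ == "__main__":
    for pair in map_lines(sys.stdin):
        print(pair)
```

Because the contract is just STDIN and STDOUT, the same script can be tested from a shell pipeline without a grid at all.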

  22. Pig

    Pig Logo
    • Analyses data
    • Is the Pig Latin language
    • And an expression evaluation infrastructure
    • Primary use case is Hadoop MapReduce

    Pig is a domain specific language for data analysis and a language evaluation infrastructure that runs on Hadoop.

Accessible to those without high-level programming skills. Potential for automated optimisation. Pig is extensible.

  23. Task Management

    • Task control framework:
      • JobConf describes a task
      • JobClient runs and monitors tasks
    • Easy to understand
    • Basic control only
    • Single point of failure

    The job system is intentionally simple. Hadoop concerns itself only with the running and monitoring of tasks. Tasks are simple and easy to understand. Coordination is left to other applications.

  24. Zookeeper

    • Coordinates distributed systems
    • Hadoop is primary use case
    • Donated to Apache in 2008 by Yahoo!

    One question frequently asked by people with an enterprise background is: 'Why isn't Hadoop clusterable?'. The short answer is in the diagram: Hadoop occupies a space beyond clustering.

Each Hadoop grid does have single points of failure: the NameNode and the JobTracker. This is not such a problem for grids as it would be for classic clustering technology. The chance of failure is not high and the cost is low: once the nodes have been restarted, some calculations may need to be rerun. It is cleaner and clearer to separate these concerns and use a coordination infrastructure.

Zookeeper aims to create a high availability coordination infrastructure for distributed systems, with Hadoop as its primary use case.

  25. HBase

HBase is a sparse, distributed, persistent multi-dimensional sorted map

    • Is not a relational database
    • Stores Big Tables...
    • ...Very Big Tables
    • Billions of rows by millions of columns (one day soon)
    • Is like Google's BigTable which powers:
      • Google Earth
      • Google Analytics
      • (and 58 others)


HBase is the Hadoop database: an open source, distributed, column-oriented store modelled after the Google paper, Bigtable: A Distributed Storage System for Structured Data.

The principles of big tables are set out in Bigtable: A Distributed Storage System for Structured Data:

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map

    Sparse - though the total size of the data may be big, many (or perhaps most) elements are empty. Distributed - runs on a large number of commodity servers. A single table will not fit onto a single machine. Data locality becomes important. Persistent - reliable data storage.
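A toy model of that data model, with cells keyed by (row, column, timestamp); the function names are illustrative, not the HBase API:

```python
# Sparse, multi-dimensional sorted map: cells are keyed by
# (row, column, timestamp); absent cells are simply not stored,
# which is what makes huge, mostly empty tables cheap to hold.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Most recent value for a cell, or None if the cell is empty."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if (r, c) == (row, column)]
    return max(versions)[1] if versions else None

put("com.example.www", "contents:html", 1, "<html>v1</html>")
put("com.example.www", "contents:html", 2, "<html>v2</html>")
print(get("com.example.www", "contents:html"))  # the latest version wins
```

The real system distributes this map across many servers and keeps cells sorted by key; the shape of the interface is the same.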

  26. Mahout

    Mahout Logo
    • Launched by the Lucene Project in January 2008
    • Builds scalable machine learning on Hadoop
      • Machine learning is a branch of AI
      • Guesses patterns and rules
    • Is creating a library of algorithms
    • Taste collaborative filtering donated in April 2008
      • Estimates preferences about other items from past behaviour

    Mahout builds scalable machine learning libraries.

    Machine learning is quite a wide branch of AI. It has applications in areas such as data mining, rule discovery, planning, classification and recognition.

    It has been argued that the key to increased performance (and so adoption) is moving machine learning into the multicore era and that MapReduce is an entirely suitable medium. Over the next decade, it seems likely that AI techniques will increasingly move out of the laboratory and into the enterprise. Efficient performance on distributed, multicore machines will be key.

    The project started in January 2008 and is just starting to get going. So, now's a good opportunity to get involved. Should be interesting to see how Mahout progresses over the next few years.

    In April 2008, Taste was donated to Mahout. Taste is a collaborative filtering library. It is written in Java but exposes a web service interface for interoperability. A typical use case is recommending new purchases based on history.

  27. Hama

Hama will develop a parallel matrix computational package, which provides a library of matrix operations for the large-scale processing development environment and a Map/Reduce framework for large-scale numerical analysis and data mining, which need the intensive computational power of matrix inversion, e.g. linear regression, PCA, SVM, etc. It will also be useful for many scientific applications, e.g. physics computations, linear algebra, computational fluid dynamics, statistics, graphics rendering and many more.
    • Incubator accepted this proposal in May 2008

As of June 2008, Hama is something more than a proposal and something less than an open source project. It has been accepted into the Apache Incubator but is still bootstrapping.

  28. Get Involved!

    • All projects at Apache
    • (Except For SmartFrog)
    • All welcome participation!
    • Lots of activity

    Lead into general information about Apache.

Hadoop is on an upward curve. Related activity is increasing in both breadth and depth. Great opportunities to get involved at all levels.

  29. The Apache Software Foundation

    • 501(c)(3) Non Profit Foundation
    • Dedicated to the creation of Open Source Software
    • Language agnostic
    • In May 2008
      • 65 self-governing top level projects...
      • ...including HTTPD
      • 1765 committers
      • 662,663 commits
      • 5,995,132 requests, 2.9TB (daily average, www)
      • 1,572,052 requests, 11.96GB (daily average, svn)
      • 239 Members

    Some general facts about Apache.

  30. Open Development

    • In Public
    • Individuals not corporations
    • Elected meritocracy
    • Community not dictatorship
  31. Apache Incubator

    • Is the way that independent communities join Apache
    • Is Diverse. Typical use cases:
      • Closed source now opening up development and source
      • Existing open source moving to Apache
      • New open source with community within Apache
      • New open source with substantial community
    • Should I Use An Incubating Project?

    Many of the projects here are either still in the Apache Incubator or have graduated in the last year or two. The Incubator serves as a gateway for projects either new or new to Apache.

At Apache, top level projects are self-governing communities responsible for their own oversight. Organic growth can happen within a project by division into sub-projects, providing that the code is new and falls within the scope of the project. These are separate products developed by the same community with shared collective responsibility. Sometimes the same mailing lists are shared, sometimes not. Sub-projects which develop a sufficient level of size and independence may be promoted to top level. Hadoop was promoted to top level in January 2008.

  32. Hadoop User Group UK Meeting

    • Tuesday, August 19 2008 London UK
    • August 19th brings the first of many Hadoop User Group meetups in the UK. It will be hosted somewhere in London and we'll have presentations from both developers and users of Apache Hadoop.

      The event is free and anyone is welcome.

    • Speakers
      • Doug Cutting (Yahoo!, Hadoop, Lucene, Nutch, Apache Member)
      • Tom White (Lexemetech, Hadoop, Apache Member)
      • Steve Loughran (HP, Ant, Apache Member)
      • Martin Dittus and Johan Oskarsson (Last.fm)

    If you can make it, sign up. Hopefully, this talk will have whetted your appetite.

  33. This Presentation

    • Is Hadoop And Friends by Robert Burrell Donkin who is
      • A Member of the Apache Software Foundation
      • Involved in Jakarta, James, Commons and Incubator Projects
      • On the Legal Affairs committee
    • Is available under the Creative Commons Attribution 3.0 License.
      • Please:
        • Share
          • Copy
          • Perform
          • Distribute
        • Remix
      • Source online:
        • With Commentary
        • JSTW

    These days, it can be difficult to place work in the public domain. This work is therefore licensed under the liberal Creative Commons Attribution 3.0 license. The UK has interesting copyright laws so I opted for US jurisdiction. Minimal attribution in the source is fine by me.

Note that many of the logos are trademarks and are the property of their respective owners. They are used here to help users clearly identify the product commented on. I believe that this constitutes fair use.

    Share, remix and have fun!

  34. S5

    • Is A Simple Standards-Based Slide Show System
    • Runs This Presentation
    • Is HTML+CSS+JavaScript
    • Requires no server
    • Is easy to customise

S5 is great. Thanks to Eric Meyer and the rest of the S5 community for creating it.

  35. Thanks!

    • To the Speakers About Hadoop And Friends at ApacheConEU2008
      • Allen Wittenauer
      • Grant Ingersoll
      • Isabel Drost
      • Owen O'Malley
      • Steve Loughran
      • Tom White
    • To the communities which bring these projects to life

The Hadoop day at ApacheConEU was cool and I ended up chairing most of these sessions, which was great. Most of the content for this talk is inspired by those talks. All faults are mine alone.

  36. Thanks!

    • To

    Thank you and good night :-)