Monday, March 23, 2015

Hadoop Scalability Challenges

Hadoop is hot, not because it necessarily represents cutting-edge technology, but because it is being rapidly adopted by more and more companies as their way into the big data trend. It may be coming to your company sooner than you think.

The Hadoop framework is designed to facilitate the parallel processing of massive amounts of unstructured data. Originally intended to be the basis of Yahoo's search engine, it is now an open-source Apache project. Because Hadoop now has a broad range of corporate users, a number of companies offer commercial implementations of it.
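
To make "parallel processing of unstructured data" concrete, here is a minimal word-count sketch of the kind that can run under Hadoop Streaming, where ordinary scripts act as the mapper and reducer. The script name, sample data, and invocation are illustrative only, not anything from the Apache codebase.

```python
# Minimal word-count sketch in the Hadoop Streaming style: the framework
# splits the input across many mapper tasks, shuffles/sorts the emitted
# (word, count) pairs by key, and feeds them to reducer tasks in parallel.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word in this task's input split."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Sum counts per word; Hadoop delivers keys to the reducer already sorted."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    if stage == "map":
        for word, count in mapper(sys.stdin):
            print(f"{word}\t{count}")
    else:  # "reduce"
        split_lines = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, total in reducer((w, int(c)) for w, c in split_lines):
            print(f"{word}\t{total}")
```

You can sanity-check it locally with a shell pipeline standing in for the cluster, e.g. `echo "to be or not to be" | python3 wordcount.py map | sort | python3 wordcount.py reduce`; on a real cluster the same scripts would be handed to the hadoop-streaming jar (exact paths and options depend on your distribution).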

However, certain aspects of Hadoop performance, especially scalability, are not well understood. These include:

  1. So-called flat development scalability
  2. Superlinear or "super-scaling" performance
  3. New TPC big data benchmark

See "Hadoop Superlinear Scalability: The Perpetual Motion of Parallel Performance" for a more detailed discussion.

Therefore, I've added a new module on Hadoop performance and capacity management to the Guerrilla Capacity Planning course material, which also covers topics such as:

  • The only 3 performance metrics you need to know
  • How performance metrics are related to one another
  • How to quantify scalability with the Universal Scalability Law (see the fitting sketch below)
  • IT Infrastructure Library (ITIL) for Guerrillas
  • The Virtualization Spectrum from hyperthreads to hyperservices
  • Hadoop performance and capacity management
The course outline has more details.
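
As a taste of the USL module, the sketch below shows one way to quantify scalability from throughput measurements: fit the two USL coefficients with SciPy and read off the predicted peak cluster size. The cluster sizes and throughput numbers are invented for illustration (they are not course data), and the course material may use different tooling.

```python
# Fit the two USL coefficients to measured throughput, then extrapolate.
# The throughput numbers below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def usl(n, sigma, kappa):
    # Relative capacity C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1))
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))

# Hypothetical measurements: cluster size and throughput (jobs per hour).
nodes = np.array([1, 2, 4, 8, 16, 32], dtype=float)
throughput = np.array([105.0, 202.0, 370.0, 638.0, 965.0, 1205.0])

# Normalize to relative capacity so that C(1) = 1, as the USL expects.
capacity = throughput / throughput[0]

(sigma, kappa), _ = curve_fit(usl, nodes, capacity, p0=[0.01, 0.001],
                              bounds=([0.0, 0.0], [1.0, 1.0]))
print(f"sigma (contention) = {sigma:.4f}, kappa (coherency) = {kappa:.5f}")

# Peak of the USL curve, N* = sqrt((1 - sigma) / kappa), beyond which
# adding nodes is predicted to reduce throughput.
n_star = np.sqrt((1.0 - sigma) / kappa)
print(f"predicted peak cluster size ~ {n_star:.0f} nodes")
```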

Early bird registration ends in 5 days.

I'm also interested in hearing from anyone who plans to adopt Hadoop or has experience using it from a performance and capacity perspective.
