SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cascading





    Abstract: A rapid introduction to Hadoop architecture, MapReduce patterns, and best practices with Cascading.


    Hadoop is an open source implementation of the Google MapReduce processing model and has been widely embraced by startups and established companies like Yahoo! and Amazon. Cascading, also an open source project, is an alternative API to MapReduce that allows developers to rapidly create sophisticated applications on the Hadoop platform.

    Unfortunately, the MapReduce model can be difficult to work with when performing tasks that developers take for granted in relational-style databases, such as joins and secondary sorting of grouped values.

    Further, integrating Hadoop with external systems requires deep knowledge of its internals. Yet this is where Hadoop clusters offer the most value: off-loading data cleansing and data migration tasks from traditional tools and expensive, load-sensitive systems.

    Cascading is an API that replaces the “Map” and “Reduce” primitives and their associated Key/Value algebra with functions, filters, and aggregators, and links them together with a familiar columns-and-records model. It also provides key processing primitives familiar to developers.

    This presentation will cover the Hadoop architecture, how MapReduce influences that architecture and is used for common tasks, and how Cascading helps developers rapidly build sophisticated data processing and orchestration applications that are simple to test and execute.

    Bio: Chris K Wensel has been a Software and Systems Architect for over 15 years. He is the founder of Concurrent Inc., and the author of the Cascading data processing open-source project. He’s also a Principal at Scale Unlimited, a professional services company offering commercial training and consulting for Hadoop and related large architectures.


    Over the last 7 years he has deployed large and sophisticated data processing applications for companies providing geospatial, web content, and financial data services, both in the traditional enterprise data center and on Amazon EC2.


    Cubberley Community Center
    4000 Middlefield Road, Room H-1
    Palo Alto, CA



    6:30 - 7:00 p.m. Registration/Networking/Refreshments/Pizza
    7:00 - 9:00 p.m. Presentations



    $15 at the door for non-SDForum members
    No charge for SDForum members
    No registration required