Emerging Tech SIG: Big Data and Data Science Are Growing Up
Topic: Big Data and Data Science Are Growing Up
Hadoop has achieved great success, enabling online companies to store and process multiple petabytes of data at speeds that were not previously possible. In 2011, enterprises started to invest seriously in Hadoop, in HBase, and in the new analytic approaches known as data science. Think Big Analytics provides consulting, engineering, and data science services for Hadoop and related technologies, working with leading technology, advertising, and financial services companies.
In this talk we look at patterns of enterprise use of Big Data and Data Science:
• Programming: The Hadoop ecosystem offers a range of tools and techniques for development, such as Hive for SQL, Pig for data flow, and a number of frameworks to enhance productivity for MapReduce development. We look at commonly used approaches, when they are appropriate, and the skills they require.
• Integration into the Enterprise: Big Data needs to work well with existing management and monitoring tools and fit within security and privacy policies. Typically, enterprises want to import a variety of unstructured and structured data for analysis and to export results back out to external systems. What are the options for integration? How can one achieve High Availability and facilitate upgrades of software?
• Diverse Data: Beyond website and advertising event data, enterprises are processing diverse data from devices, IT infrastructure logs (server, network, and security data used for security analysis), financial time series, text for analysis, and scientific data sets. These diverse data sets can be processed by the same core technologies, but there are differences in how one configures, develops, and tunes the resulting systems.
• Data Science: This represents a new approach to data analysis, blending statistics, mathematics, and machine learning with programming and investigation skills to find signal and build models that explain large-scale raw data sets. This is a marked shift from traditional top-down models built on pre-defined summary data sets that have been applied in isolated domains. The discipline of data science allows deeper investigation and creation of value, drawing on more sophisticated analysts and on access to unprecedented data volumes and technology.
• Hosting Strategies: Like online giants but unlike startups, most enterprises are building their own clusters, not running in the cloud. We look at workload requirements, typical machine capabilities and scale for initial clusters, and growth over time.
• Blending Batch and Realtime: Hadoop grew up as a system for batch analysis of large-scale data sets. But it's increasingly common to process incoming events in near realtime using Big Data clusters, feeding data into NoSQL stores like HBase, and pushing derived models out to edge servers for immediate response. We look at patterns for integration between batch analysis clusters and realtime serving environments, and techniques for extending batch analysis to support near realtime analyses.
• Beyond Data Warehousing: Many organizations use Hadoop to extend existing data warehouses, relieving capacity and cost constraints for staging, cleaning, and transforming data before loading the warehouse. This allows for a new, flexible data analysis environment where more raw data can be stored and data science analysis can be done to find patterns and answer ad hoc questions without the cost of building new ETL pipelines to expose data in a highly structured format.
• The Data Marketplace: The increase in data volume, variety, and velocity is causing many companies to fundamentally re-examine their role in industry value chains. What data do they have that others find valuable? How can they package and monetize it? What data is available that will allow them to enhance existing products and services, or create new ones? Big Data and Predictive Analytics promise to revolutionize a number of industries, such as technology, financial services, energy, and government. We look at some examples of how.
• The Future: Increases in data storage and parallel processing capacity, relative to the slower growth in disk access rates, have led to new cluster-based architectures. Extrapolating from current growth rates, we sketch how clusters in 2015 will look.
6:30 - 7:00 p.m. Registration / Networking / Refreshments / Pizza
7:00 - 9:00 p.m. Presentation