COLA (A Cloud-Based System for Online Aggregation)
Cloud Group, WAMDM, Renmin University of China

      Online aggregation is a promising way to obtain fast, early responses to interactive ad-hoc queries that compute aggregates over massive data. MapReduce has become a popular paradigm for processing large datasets on large-scale computing clusters and is used in many data analysis applications. However, typical MapReduce implementations are geared towards batch processing and are therefore not well suited to interactive analytic tasks. As ad-hoc analytic queries over enormous datasets grow more common, processing aggregate queries with MapReduce in an online fashion is an emerging and important application need. We present COLA, a MapReduce-based online aggregation system that provides progressive approximate aggregate answers for queries over both a single table and multiple joined tables. COLA provides an online aggregation execution engine with novel sampling techniques to support incremental and continuous computation of aggregates and to minimize the waiting time before an acceptably precise estimate is available. In addition, COLA supports user-friendly SQL queries. Furthermore, COLA can implicitly convert non-OLA (non-online-aggregation) jobs into their online versions, so users do not have to write any special-purpose code to obtain estimates.


      In terms of structure, the architecture of COLA contains four tiers: data manager, online aggregation executor, query engine and user interface. The data manager uses HDFS to store and manage data. The online aggregation executor is the key module of COLA and runs our online query processing algorithm over MapReduce. It is called to process the sample data, produce approximate answers with their associated confidence intervals, and progressively refine those answers. In addition, the module predicts the residual completion time and estimates the amount saved so far. The query engine is responsible for compiling a SQL query into a directed acyclic graph of MapReduce jobs and for translating non-OLA jobs into their online versions. The user interface provides interactive and flexible interfaces: users can issue SQL queries through the SQL interface or submit MapReduce programs via the shell interface.
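
      The idea of producing an approximate answer with a confidence interval and refining it progressively can be sketched in a few lines. The example below is not COLA's algorithm and does not use MapReduce or HDFS; it is a minimal single-process illustration that assumes a uniformly random scan order, a normal (central limit theorem) approximation for the interval, and a finite population correction. The function name online_avg and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def online_avg(values, batch_size=1000, confidence=0.95, seed=0):
    """Yield (rows_seen, estimate, half_width) after each batch of a random scan.

    Illustrative only: assumes the aggregate is AVG and that a normal
    approximation is acceptable for the confidence interval.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random order, so every prefix is a uniform sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    total = len(order)
    n, mean, m2 = 0, 0.0, 0.0  # Welford's running count, mean, squared deviations
    for start in range(0, total, batch_size):
        for x in order[start:start + batch_size]:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        # Finite population correction: the interval collapses to zero
        # once every row has been processed.
        fpc = 1.0 - n / total
        half_width = z * math.sqrt(max(variance / n * fpc, 0.0))
        yield n, mean, half_width


if __name__ == "__main__":
    data = [float(i % 100) for i in range(10_000)]
    for rows, estimate, half_width in online_avg(data, batch_size=2_000):
        print(f"{rows:>6} rows: AVG = {estimate:.3f} +/- {half_width:.3f}")
```

      A caller can stop consuming the generator as soon as the half-width falls below an acceptable error, which is the early-response behavior online aggregation is meant to provide.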


COLA Architecture
      In terms of functionality, the architecture of COLA consists of the following five modules:


      1. OLA Translator

      2. Result Estimator

      3. State Manager

      4. Progress Predictor

      5. Data Sampler

WAMDM, Renmin University of China, All Rights Reserved CloudDB Last Updated : 2013/11/05