Mobile & Cloud Computing Themed Academic Report Series. Database Research Group, Renmin University of China (WAMDM)

Cloud Data Management

Introduction: Cloud Data Management

· Cloud-based Database System

Advances in communications, computation, and storage have produced huge collections of data that capture information of value to business, science, government, and society. Data volumes are currently growing faster than Moore's law, and this exponential growth is unlikely to stop. The sheer size of the data poses major challenges for storage infrastructure, which must scale economically beyond the petabyte level, support massively parallel query execution, and provide facilities for analytical processing. Meanwhile, the rise of large data centers and cluster computers has created a new business model, cloud-based computing, in which businesses and individuals rent storage and computing capacity rather than making the large capital investments needed to build and provision large-scale computer installations. Cloud-based data storage and management is a rapidly expanding business. We design a Cloud-based Database System, a data management solution built to support the next generation of information management and large-scale analytical processing. This project aims to research new database systems that can handle next-generation big data applications and can be applied in various areas, such as medicine and healthcare and mobile communications.

Research work and poster

· TaijiDB: A Cloud Data Management System

Social development and advances in information technology have led directly to explosive data growth, and the era of Big Data has arrived. Cloud computing, with its powerful computing and storage capacity, is considered one of the solutions for handling huge amounts of data. On this basis, research on cloud data management systems, as a concrete manifestation of cloud computing, is becoming increasingly popular among academic researchers and industry engineers.

· COLA: A Cloud-Based On-Line Aggregation System

In contrast to batch processing, online aggregation returns running estimates of a job's final result while the job is still being computed. Online aggregation is of paramount importance in the cloud because of the cloud's "pay-as-you-go" payment model. We implement COLA, a Cloud-based On-Line Aggregation System, which is designed to save substantial computing cost in the cloud by allowing users to stop a query early once the approximate result reaches sufficiently high accuracy.

· Benchmarking the Cloud Data Management Systems

Cloud-based data management systems are emerging as a scalable, fault-tolerant, and efficient solution to large-scale data management. The implementations of existing cloud data management systems represent a wide range of approaches. We conducted comprehensive experiments on several representative cloud data management systems to explore the relative performance of different implementation approaches. The results are valuable for further research and development of cloud data management systems.

· Index and Query Optimization on Cloud Data

In recent years, the data generated by Web 2.0, the Internet of Things, e-commerce, and other applications has grown exponentially, and traditional database technology struggles with data management at this scale. Cloud computing has been widely adopted in many applications because of its advantages in massive data storage and processing. Many challenges remain in cloud data management, such as massive data storage, indexing, query optimization, and query progress estimation.

System

· TaijiDB: A Cloud Data Management System

TaijiDB is a native cloud-based database system developed at Renmin University of China. TaijiDB uses HBase and Cassandra as its storage layer and combines the Master/Slave structure and the P2P structure to support efficient management and querying of massive data. This dual design accords with the meaning of Taiji, so we name our system "TaijiDB".

· COLA: A Cloud-Based On-Line Aggregation System

Online aggregation is a promising approach to delivering fast early responses for interactive ad-hoc queries that compute aggregates over massive data. MapReduce has been introduced into many data analysis applications as a popular paradigm for processing large datasets on large-scale computing clusters. However, typical MapReduce implementations are not well suited to interactive analytic tasks, since they are geared towards batch processing. With the increasing popularity of ad-hoc analytic query processing over enormous datasets, processing aggregate queries with MapReduce in an online fashion is therefore an important emerging application need. We present a MapReduce-based online aggregation system called COLA, which provides progressive approximate aggregate answers for queries over both a single table and multiple joined tables. COLA provides an online aggregation execution engine with novel sampling techniques to support incremental and continuous computation of aggregates and to minimize the waiting time before an acceptably precise estimate is available. In addition, COLA supports user-friendly SQL queries. Furthermore, COLA can implicitly convert non-OLA jobs into online versions, so users do not have to write any special-purpose code to obtain estimates.
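
The idea of progressive estimates with early stopping can be illustrated with a minimal single-machine sketch. It is not COLA's engine or its sampling technique: it assumes rows arrive in random order, uses Welford's running mean and variance with a normal-approximation confidence interval, and ignores the finite-population correction. The names `online_mean` and `estimate_until`, and the parameters `rel_error` and `min_rows`, are hypothetical and chosen only for this example.

```python
import math
import random
from typing import Iterable, Iterator, Tuple


def online_mean(values: Iterable[float], z: float = 1.96) -> Iterator[Tuple[int, float, float]]:
    """Yield (rows_seen, running_mean, ci_half_width) after every row.

    Assumes the rows arrive in random order, so the prefix seen so far
    behaves like a random sample. z = 1.96 gives roughly a 95% interval.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)
        else:
            half_width = math.inf  # no interval from a single row
        yield n, mean, half_width


def estimate_until(values: Iterable[float], rel_error: float = 0.01,
                   min_rows: int = 100) -> Tuple[int, float, float]:
    """Consume rows until the interval is within rel_error of the estimate."""
    last = (0, 0.0, math.inf)
    for n, mean, half_width in online_mean(values):
        last = (n, mean, half_width)
        if n >= min_rows and mean != 0 and half_width <= rel_error * abs(mean):
            break  # estimate is precise enough: stop paying for more rows
    return last


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
    rows, mean, half_width = estimate_until(data, rel_error=0.01)
    print(f"stopped after {rows} of {len(data)} rows: {mean:.2f} +/- {half_width:.2f}")
```

Under a pay-as-you-go model, the rows left unread after the early stop correspond to computation that is never billed, which is the saving online aggregation targets.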

Publication

· X. Zhang, Z. Wang, J. Ai, J. Lu, X. Meng: An Efficient Multi-Dimensional Index for Cloud Data Management. Accepted for publication in the proceedings of the CIKM Workshop on Cloud Data Management (CloudDB2009), November 2, 2009, Hong Kong, China. (Full paper)

· Y. Shi, X. Meng, J. Zhao, X. Hu, B. Liu, H. Wang: Benchmarking Cloud-based Data Management Systems. In Proceedings of the CIKM Workshop on Cloud Data Management (CloudDB2010): 47-54, October 30, 2010, Toronto, Canada.

· X. Hu, J. Zhao, X. Meng, Z. Wang, Y. Shi, B. Liu, H. Wang: TaijiDB: A Dual-Core Cloud-based Database System. Journal of Computer Research and Development, Vol. 47 (Suppl.): 433-437, Oct. 2010. (NDBC2010, Beijing) (the Best Demo Show Award)

· H. Wang, X. Meng, Y. Chai: Efficient Data Distribution Strategy for Join Query Processing in the Cloud. In Proceedings of the CIKM2011 Workshop on Cloud Data Management (CloudDB2011): 15-22, October 28, 2011, Glasgow, Scotland, UK.

· X. Han, W. Cao, X. Meng: Virtual Memory Management for Main-Memory KV Database Using Solid State Disk. Journal of Frontiers of Computer Science and Technology, Vol. 5, No. 8: 686-694, Aug. 2011. (NDBC2011, Shanghai)

· Y. Gan, Y. Shi, X. Meng: COLA: A Cloud-based On-Line Aggregation System (Demonstration). Journal of Computer Research and Development, Vol. 49 (Suppl.): 398-402, Oct. 2012. (NDBC2012, Hefei) (the Best Demo Show Award)

· Y. Shi, X. Meng, F. Wang, Y. Gan: HEDC: A Histogram Estimator for Data in the Cloud. In Proceedings of the Fourth International Workshop on Cloud Data Management (CloudDB2012), pages 51-58, Oct. 29, 2012, Maui, USA.

· X. Han, M. Wang, X. Zhang, X. Meng: Differentially Private Top-k Query over Map-Reduce. In Proceedings of the Fourth International Workshop on Cloud Data Management (CloudDB2012), pages 25-32, Oct. 29, 2012, Maui, USA.

· Y. Shi, X. Meng, F. Wang, Y. Gan: You can stop early with COLA: Online processing of aggregate queries in the cloud. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM2012), pages 1223-1232, Oct. 29 - Nov. 2, 2012, Maui, USA. (Full paper)

· Y. Ma, J. Rao, W. Hu, X. Meng, X. Han, Y. Zhang, Y. Chai, C. Liu: An Efficient Index for Massive IOT Data in Cloud Environment. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM2012), pages 2129-2133, Oct. 29 - Nov. 2, 2012, Maui, USA.

· Y. Zhang, Y. Ma, X. Meng: Efficient Processing of Spatial Keyword Queries on HBase. Journal of Chinese Computer Systems, Vol. 33(10): 2141-2146, Oct. 2012. (CCF CNCC2012, Dalian)

· B. Liu, X. Meng, Y. Shi: CloudBM: a Benchmark for Cloud Data Management Systems. Journal of Frontiers of Computer Science and Technology, Vol. 6(6): 504-512, 2012.

· Y. Shi, X. Meng, B. Liu: Halt or Continue: Estimating Progress of Queries in the Cloud. In Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA 2012), pages 169-184, April 15-19, 2012, Busan, South Korea.

· Y. Shi, X. Meng, F. Wang, Y. Gan: HEDC++: An Extended Histogram Estimator for Data in the Cloud. Journal of Computer Science and Technology (JCST2013).

· X. Meng, X. Ci: Big data management: Concepts, techniques and challenges. Computer Research and Development, 2013.

· Y. Gan, X. Meng, Y. Shi: COLA: A Cloud-based System for Online Aggregation (Demonstration). Accepted for publication in the 29th International Conference on Data Engineering (ICDE2013), April 8-12, 2013, Brisbane, Australia.

· H. Wang, X. Ci, X. Meng: Fast multi-fields query processing in bigtable based cloud systems. WAIM2013, June 2013: 142-154.

· Y. Ma, Y. Zhang, X. Meng: ST-HBase: A Scalable Data Management System for Massive Geo-tagged Objects. WAIM 2013: 155-166.

· Y. Ma, X. Meng, D. Jiang: Mobile Application Integration: Framework, Techniques and Challenges. Chinese Journal of Computers, Vol.26 No.7, July 2013.

· Y. Gan, X. Meng, Y. Shi: COLA: Processing Online Aggregation on Skewed Data in MapReduce. Accepted for publication in the 5th International Workshop on Cloud Data Management (CloudDB2013), October 27-November 2, 2013, San Francisco, USA.

Reference

· Timothy. (2008, July) Multiple experts try defining cloud computing. [Online]. Available: http://tech.slashdot.org/article.pl?sid=08/07/17/2117221

· M. Lynch. Amazon Elastic Compute Cloud (Amazon EC2). [Online]. Available: http://aws.amazon.com/ec2/

· IBM. IBM introduces ready-to-use cloud computing. [Online]. Available: http://www-03.ibm.com/press/us/en/pressrelease/22613.wss

· J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Communications of the ACM, vol. 51, pages 107-113, 2008.

· F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation, pages 205-218, 2006.

· S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pages 29-43, 2003.

· Hadoop. [Online]. Available: http://hadoop.apache.org

· HBase. [Online]. Available: http://hadoop.apache.org/hbase/

· HDFS. [Online]. Available: http://hadoop.apache.org

· HIVE. [Online]. Available: http://hadoop.apache.org/hive/

· Apache Cassandra. [Online]. Available: http://incubator.apache.org/cassandra/

· G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, pages 205-220, 2007.

Maintained by WAMDM Administrator