Advances in communication, computation, and storage technology have produced huge collections of data that capture information of value to business, science, government, and society. Data volumes are currently growing faster than Moore’s law, and this exponential growth is unlikely to stop. The sheer size of the data poses major challenges for storage infrastructure, which must scale economically beyond the petabyte level, execute queries in a massively parallel fashion, and provide facilities for analytical processing. Meanwhile, the rise of large data centers and cluster computers has created a new business model, cloud-based computing, in which businesses and individuals rent storage and computing capacity rather than making the large capital investments needed to construct and provision large-scale computer installations. Cloud-based data storage and management is a rapidly expanding business. We are designing a Cloud-based Database System, a data management solution built to support the next generation of information management and large-scale analytical processing. This project aims to research a new database system that can handle next-generation big data applications and be applied in various areas, such as medicine and healthcare, mobile communications, etc.
With the development of computer and communication technology, data is being produced on a very large scale. In the mobile communication industry, for example, a great number of data items are produced every day, and storing and analyzing them efficiently has become a challenging technical problem. Traditional data management techniques do not work well in these scenarios. To meet these requirements, a new platform, Cloud Computing, provides flexible and scalable services to users by building large computing centers. Managing such large-scale data efficiently on a cloud platform has become a key problem. This project focuses on key technologies for cloud-based data management.
Recently, cloud computing platforms have been attracting more and more attention as a new trend in data management. Several cloud computing products already provide various services. However, current cloud platforms support only simple keyword-based queries and cannot answer complex queries efficiently because they lack efficient indexing techniques. In this paper we propose an efficient approach to building a multi-dimensional index for Cloud computing systems. We use a combination of R-tree and KD-tree to organize data records, offering fast query processing and efficient index maintenance. Our approach efficiently processes typical multi-dimensional queries, including point queries and range queries. In addition, frequent data changes across a large number of machines make index maintenance a challenging problem; to cope with it, we propose a cost estimation-based index update strategy that updates the index structure effectively. Our experiments show that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data. Our approach is general, independent of the underlying infrastructure, and can easily be carried over to various Cloud computing platforms.
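To make the two query types concrete, the following is a minimal sketch of point and range queries over a plain in-memory KD-tree. It is an illustration only, not the index described in the paper: it omits the R-tree layer, the distribution across machines, and the cost estimation-based update strategy, and the names `KDNode`, `build_kdtree`, `range_query`, and `point_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    """One node of an illustrative KD-tree (hypothetical helper type)."""
    point: Point
    axis: int                       # dimension this node splits on
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a balanced KD-tree by splitting on the median, cycling the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_query(node: Optional[KDNode], low: Point, high: Point) -> List[Point]:
    """Return all points p with low[i] <= p[i] <= high[i] in every dimension."""
    if node is None:
        return []
    hits: List[Point] = []
    if all(l <= c <= h for l, c, h in zip(low, node.point, high)):
        hits.append(node.point)
    a = node.axis
    # Left subtree holds coordinates <= the split value on this axis.
    if low[a] <= node.point[a]:
        hits += range_query(node.left, low, high)
    # Right subtree holds coordinates >= the split value on this axis.
    if node.point[a] <= high[a]:
        hits += range_query(node.right, low, high)
    return hits


def point_query(node: Optional[KDNode], target: Point) -> bool:
    """A point query is a range query whose box has collapsed to one point."""
    return bool(range_query(node, target, target))
```

A range query prunes any subtree whose side of the split cannot intersect the query box, which is what lets a multi-dimensional index avoid scanning every record.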
We survey existing cloud data management systems and compare their application environments, their data models, and the methods they use to guarantee scalability, high availability, fault tolerance, etc. We also run benchmarks for cloud-based DBMSs to evaluate their performance.
Taiji is a Chinese cosmological term that describes two states forming a relatively unified whole. To leverage the advantages of cloud storage based on both Master/Slave and P2P structures, we propose a project called Taiji, a dual-core cloud-based database system. The system supports SQL for managing big data.
The dual-core architecture of this cloud-based database system is still under research.
The conceptual design of the cloud data management system contains three tiers: a data storage tier, a query processing and transaction tier, and an application tier.
Storage Tier:
The bottom tier is the Storage Management Tier, which supports the following features: replication, parallelism, fault tolerance, key partitioning, and synchronization. This tier includes the Master Cluster, the Data Cluster, Service Management, and Data Storage and Retrieval. A master server in the Master Cluster manages the file system namespace and regulates client access to files. Data nodes in the Data Cluster serve read and write requests from file system clients, and they also perform block creation, deletion, and replication upon instruction from the Master. Service Management covers provisioning, deployment, and health monitoring. The Data Storage and Retrieval module provides data storage organized as ordered or hash tables.
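As an illustration of how key partitioning and replication can work together, here is a minimal sketch that maps a record key to a fixed number of data nodes. It is not the project's actual placement scheme: the hash-modulo placement, the replication factor of 3, and the names `stable_hash` and `replica_nodes` are all assumptions made for the example.

```python
import hashlib
from typing import List, Sequence


def stable_hash(key: str) -> int:
    """Hash a key to an integer that is the same across processes and runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replica_nodes(key: str, nodes: Sequence[str], replicas: int = 3) -> List[str]:
    """Pick the data nodes that store `key` (hypothetical placement rule).

    The key hashes to a primary node; the remaining copies go to the
    nodes that follow it in the list, wrapping around at the end.
    """
    if not nodes:
        raise ValueError("at least one data node is required")
    count = min(replicas, len(nodes))   # cannot place more copies than nodes
    start = stable_hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    cluster = ["data-1", "data-2", "data-3", "data-4"]
    print(replica_nodes("user:42", cluster))
```

Because the placement is a pure function of the key and the node list, any client can locate a record without asking the master. The trade-off in this simple modulo form is that adding or removing a node remaps most keys, which is one reason real systems often prefer consistent hashing or range-based partitioning.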
Query Processing and Transaction Tier:
The Query Processing and Transaction Tier is an RDBMS-like tier. In this tier, we study how to manage massive data with high availability, fault tolerance, and relaxed consistency guarantees. The tier includes the Buffer Pool, Transaction Support, Query Optimization, and the Data & Schema API.
Application Tier:
The top tier is the Front-end Application Tier. In this tier, we will design a SQL-like language to support complex data management. With this query language and the Data & Schema API, our system can support a variety of web applications. We will also provide parallel data analysis in this tier.
- X. Zhang, Z. Wang, J. Ai, J. Lu, X. Meng: An Efficient Multi-Dimensional Index for Cloud Data Management. Accepted for publication in the proceedings of the CIKM Workshop on Cloud Data Management (CloudDB2009), November 2, 2009, Hong Kong, China. (full paper)
- Timothy. (2008, July) Multiple experts try defining cloud computing. [Online]. Available: http://tech.slashdot.org/article.pl?sid=08/07/17/2117221
- M. Lynch. Amazon elastic compute cloud (amazon ec2). [Online]. Available: http://aws.amazon.com/ec2/
- IBM. Ibm introduces ready-to-use cloud computing. [Online]. Available: http://www-03.ibm.com/press/us/en/pressrelease/22613.wss
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, vol. 51, pages 107–113, 2008.
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation, pages 205–218, 2006.
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pages 29–43, 2003.
- Hadoop. [Online]. Available: http://hadoop.apache.org
- HBase. Available at http://hadoop.apache.org/hbase/.
- HDFS. Available at http://hadoop.apache.org.
- HIVE. Available at http://hadoop.apache.org/hive/.
- Apache Cassandra. Available at http://incubator.apache.org/cassandra/.
- G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In SOSP, pages 205–220, 2007.

| Maintained by Zhongyuan Wang | Copyright © 2007-2008 WAMDM, All rights reserved |