COLA (A Cloud-Based System for Online Aggregation)
Cloud Group, WAMDM, Renmin University of China

 2013.06.21  Big Data on the cloud
  Set Similarity Join Using MapReduce Survey & Concerns 
Set similarity join is an important task that is widely used in many applications. In this report we survey the existing research on set similarity joins using MapReduce and analyse their pros and cons, then propose some new ideas to improve performance. Lastly, we point out some challenges in set similarity join processing using MapReduce.
  Spatio-Textual Similarity Join 
In recent years, with the popularity of smartphones and GPS, the quantity of spatio-textual data has been increasing rapidly and more and more applications are based on spatio-textual similarity join techniques. As a result, the study of these techniques has attracted growing attention. This presentation mainly introduces some state-of-the-art work on spatio-textual similarity join.
 2013.05.17  Data Management using MapReduce
  SAX-Based Similarity Join on High Dimensional Data Using MapReduce 
Similarity join on large-scale, high-dimensional data is a challenging task. Existing methods are mostly centralized and based on some kind of index structure, so they cannot deal with large-scale data efficiently. In this report, we introduce the related work on similarity joins using MapReduce and propose a new method: SAX-Based Similarity Join on High Dimensional Data Using MapReduce. Lastly, we point out some challenges in high-dimensional data join processing using MapReduce.
  Introduction to JVM 
A Java virtual machine is a program which executes certain other programs, namely those containing Java bytecode instructions. JVMs are most often implemented to run on an existing operating system, but can also be implemented to run directly on hardware. This talk aims to introduce some JVM concepts and some parameters to tune.
 2013.05.10  Big Data management
  Report of ICDE2013 
The 29th IEEE International Conference on Data Engineering (ICDE2013) was held in Brisbane, QLD, Australia, April 8-11, 2013. ICDE2013 received 443 paper submissions for the research track, 20 submissions for the industrial track, and 69 demo proposals. The research program features 95 papers, the industrial program 8 papers, and the demonstration program 27 demos. The conference program also includes 3 keynotes, 9 seminar tutorials and one panel.
  Probabilistic Data Structure for Big Data Part I: Cardinality Estimation 
In the emerging era of big data, many applications can be satisfied by estimates with a certain precision, which can greatly reduce the cost in time and space. The report takes cardinality estimation as a typical application and introduces corresponding algorithms that are adapted to big data.
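As a rough illustration of trading exactness for space, here is a minimal Python sketch of linear counting, one classic cardinality estimator. The abstract does not say which algorithms the talk covered, so the choice of technique, the function name `linear_count`, and the `num_bits` parameter are all assumptions made for this example.

```python
import hashlib
import math


def linear_count(items, num_bits=1 << 16):
    """Estimate the number of distinct items using a fixed-size bitmap.

    Hypothetical helper for illustration; num_bits must be a multiple of 8.
    """
    bitmap = bytearray(num_bits // 8)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        # Map the item to one bit position using the first 8 bytes of its hash.
        pos = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[pos // 8] |= 1 << (pos % 8)
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = num_bits - set_bits
    if zero_bits == 0:
        raise ValueError("bitmap saturated; use a larger num_bits")
    # Linear counting estimate: n is roughly -m * ln(fraction of zero bits).
    return -num_bits * math.log(zero_bits / num_bits)
```

The bitmap here occupies 8 KB regardless of how long the input stream is; the estimate degrades as the bitmap fills up, which is why the sketch refuses to answer once every bit is set.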

 2012.11.23  BIG DATA & Early Join
  Data Storage for Big Data: SQL, NoSQL or NewSQL? 
Data stores in the era of Big Data meet new challenges. Which is suitable for big data: SQL, NoSQL or NewSQL? The report gives a simple overview of this problem and introduces Bigtable and Spanner.
  Early Join: Non-blocking Join Algorithms 
Essential to the success of online aggregation is a good non-blocking join algorithm that enables both high early result rates with statistical guarantees and fast end-to-end query times. The existing non-blocking join algorithms can be categorized into two classes. The first class aims to generate early representative results for OLA; it includes Ripple Join, Hash Ripple Join, Sort-Merge-Shrink Join, and DBO. The second class aims to generate fast early results while ignoring the statistical properties of the results; it includes XJoin, Hash Merge Join, and RPJ.
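To make the idea of a non-blocking join concrete, the following is a minimal in-memory sketch in the spirit of a square ripple join: it alternately takes one new tuple from each input and joins it against everything already seen from the other side, so results start flowing before either input is fully read. The function name `ripple_join` and the `predicate` parameter are hypothetical, and the sketch leaves out the statistical estimation that the algorithms named above build on top of this access pattern.

```python
def ripple_join(r, s, predicate):
    """Yield joining pairs incrementally, one new tuple per side per step."""
    seen_r, seen_s = [], []
    for i in range(max(len(r), len(s))):
        if i < len(r):
            new_r = r[i]
            # Join the new R tuple with every S tuple seen so far.
            for old_s in seen_s:
                if predicate(new_r, old_s):
                    yield new_r, old_s
            seen_r.append(new_r)
        if i < len(s):
            new_s = s[i]
            # Join the new S tuple with every R tuple seen so far,
            # including the R tuple added in this same step.
            for old_r in seen_r:
                if predicate(old_r, new_s):
                    yield old_r, new_s
            seen_s.append(new_s)
```

Each pair of input tuples is compared exactly once, so consuming the generator to the end gives the same result set as a nested-loop join, only in a different order.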
 2012.11.09  Topic: Subgraph & Report of CIKM2012
  Report of CIKM2012 
CIKM2012 was held in Maui, USA. There were three keynotes this year, and some famous computer scientists were invited to give talks. Many people from all over the world attended the conference, which shows that CIKM has a significant influence on the field of computer science. The industry sessions were popular and some interesting talks were given.
 2012.11.02  Topic: Join Query & HBase Coprocessor
  Join Query Processing Using MapReduce 
Join query processing is an important basic operation, but joins on massive, complex datasets are costly. The MapReduce paradigm is good at large-scale data processing and data-intensive computing, but it cannot directly support complex join operations, and this flaw limits its wider application in many other fields. In this report we make a simple survey of the existing research on joins using MapReduce and give a detailed analysis of similarity joins using MapReduce. We also introduce the primary idea of high-dimensional data similarity join using MapReduce. Lastly, we point out some challenges in join processing using MapReduce.
  An Introduction to HBase Coprocessor 
HBase, a distributed and scalable big data store, has included an important functional component, Coprocessor, since version 0.92. HBase Coprocessor allows users to write their own code without modifying the HBase source code and run it on the server side of HBase, so that users can enhance or shield the original functions of HBase. This report mainly introduces the concept, implementation and some typical applications of HBase Coprocessor.
 2012.05.11  Topic: New Progress in Cloud
  Join Using MapReduce [pptx]
The MapReduce paradigm is good at large-scale data processing and data-intensive computing, but it cannot directly support complex join operations, and this flaw limits its wider application in many other fields. Some researchers have worked on solving this problem. In this report, we make a simple survey of the existing research on joins using MapReduce, and we introduce "Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT'2012]" and "Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD'2010]" in more detail.
  Some Data Store and Openstack 
Recently, a number of open-source data store systems have appeared. Some are oriented to key-value storage, and some aim to solve the scalability problems of RDBMSs. All of these data stores are designed to store large amounts of data efficiently. This talk introduces some stores of this kind.

 2012.03.23  Topic: Cloud & RDF
  Scalable RDF Store Based on HBase and MapReduce [pptx]
As RDF datasets grow, they become too large to store in a traditional RDBMS, and conventional RDF storage structures cannot satisfy storage and query needs. It is therefore urgent to put forward a highly efficient storage schema and query processing method.
  Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store 
Traditionally, RDF triples are stored on a single machine. However, as Big Data emerges, scalability becomes one of the most important features in storing RDF. In this paper, the authors introduce Jena-HBase, an efficient and scalable RDF triple store to solve this problem.
 2012.03.11  Topic: Introduction to XLDB2011
  Introduction to XLDB 
A brief introduction to XLDB, with a focus on XLDB 2011.
  Facebook Data Freeway [pptx]
We introduced the system architecture of Facebook's Data Freeway, which is used for log analysis. Facebook uses Scribe for log collection; Calligraphus labels the category of the logs and stores them into HDFS; Puma reads log lines from the storage system with Ptail, performs aggregation, and periodically flushes the aggregation results into HBase.

 2011.12.24  Topic: Series Reports on Flying Elephant in the Cloud I
  Update Efficient Indexing of Massive IoT Data in the Cloud 
Because of the high update frequency and large volume of IoT data, traditional DBMS techniques run into scalability trouble and cannot deal with high insert throughput, so we want to explore how to manage IoT data efficiently in the cloud environment. In this report, we mainly analyse the characteristics of IoT data and the shortcomings of existing cloud data management systems and the corresponding index solutions, and we propose a new index framework for the cloud environment that can support high insert throughput and efficient multi-dimensional range queries.
  Hadoop in SIGMOD 2011 [ppt]
To show the state of the art in Hadoop, we introduce some papers from SIGMOD 2011.
 2011.12.17  Topic: Series Reports on Flying Elephant in the Cloud: the Amazing MapReduce World
  Online Aggregation over MapReduce 
With the development of cloud computing, online aggregation (OLA), which was introduced in 1997, has regained interest. In this report, we discuss the challenges of implementing OLA in the cloud and try to propose an initial solution.
  Introduction and Application of MapReduce 
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers. Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). Nowadays, more and more applications dealing with big data are starting to use MapReduce to solve problems.
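For readers new to the model, a single-process Python sketch of the map, shuffle, and reduce steps for word counting may help. It is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are invented for this example, and a real framework would run the map and reduce calls on many machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be parallelised without coordination, which is what makes the model suit large clusters.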

 2011.10.29  Venue: FL1, Meeting Room, Information Building
  Internet of Things and Cloud Computing [ppt]
Since IBM proposed the concept of "Smarter Planet" in 2008, the Internet of Things (IoT) has been getting more and more attention. In general, the basic structure of the IoT is divided into three layers: RFID and sensor networks compose the perception layer; the Internet, Wi-Fi, 3G and other networks form the network layer; and applications for various social needs constitute the application layer. Cloud computing, a key technology in the IoT chain, will be an important cornerstone of IoT development.
  Introduction to Linux 
Mainly covered some basic, frequently used commands and software, along with some skills and experience in using them for testing.

 2011.09.24  Venue: FL1, Meeting Room, Information Building
  Index for Cloud Data Management [ppt]
Cloud Data Management Systems have attracted more and more attention because of their high scalability and high availability. Up to now, however, they only provide efficient queries on the rowkey and cannot support efficient non-rowkey or multi-dimensional queries. In this report we survey index techniques for Cloud Data Management, analyse their pros and cons, and finally point out future work.

 2011.05.27  Venue: FL1, Meeting Room, Information Building
  Join Algorithms Using MapReduce [ppt]
MapReduce, a useful parallel programming framework, enables easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines, but it cannot directly support processing multiple related heterogeneous datasets, as in join query processing.

 2011.04.22  Venue: FL1, Meeting Room, Information Building
  Introduction to Redis, a key-value memory store 
Redis is a key-value memory store. Because Redis holds and processes data in memory, it can reach high performance. Due to the limited capacity and volatility of memory, Redis also supports virtual memory management and data persistence. This talk covers the data procedure of Redis and a naive idea to improve its virtual memory management.
 2011.04.15  Venue: FL1, Meeting Room, Information Building
  Online Aggregation [ppt]
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and eventually, the final answer is returned. In this paper, the authors propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly.
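The contrast with batch mode can be sketched in a few lines of Python: rows are consumed in random order, and after every batch the caller gets a running average together with an approximate confidence interval that tightens as more rows are seen. This is only an illustrative sketch and not the paper's estimator; the function name `online_avg`, its parameters, and the simple normal-approximation interval (with no finite-population correction) are assumptions of this example.

```python
import math
import random


def online_avg(values, batch_size=1000, z=1.96, seed=0):
    """Yield (rows_seen, running_mean, half_width) after each batch."""
    rows = list(values)
    # Random processing order is what makes the early estimates representative.
    random.Random(seed).shuffle(rows)
    n, mean, m2 = 0, 0.0, 0.0
    for i, x in enumerate(rows, start=1):
        # Welford's update keeps a running mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i % batch_size == 0 or i == len(rows):
            var = m2 / (n - 1) if n > 1 else 0.0
            half_width = z * math.sqrt(var / n)
            yield n, mean, half_width
```

A caller can stop iterating as soon as the interval is narrow enough, which corresponds to the user controlling execution on the fly.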
 2011.04.02  Venue: FL1, Meeting Room, Information Building
  Estimating the Progress of Queries on the Cloud 
There are many challenges in estimating the progress of queries on the cloud, such as task parallelism, variable execution speed, concurrent workloads, task failure, data skew, etc. In this report, we introduce how the existing methods solve the problem, and then we propose our initial idea about progress estimation.
  System Performance Test Report of Cassandra and HBase 
A series of test cases for Cassandra and HBase, including data extension, multi-client, multi-table, consistency and so on.

 2011.01.14  Venue: FL1, Meeting Room, Information Building
  Introduction to UDT 
UDT performs much better than traditional network protocols such as TCP, although in some cases, when network latency is large, some parameters need to be tuned.
  Metadata Management 
In recent years, to meet the need for large-scale data storage, cluster storage has become more and more popular. How to provide high access performance with such a huge number of files and such large directories is a big challenge for cluster file systems, and research on metadata management aims to solve this problem. This report mainly introduces some existing methods in metadata management research and some possible research directions in TaijiDB.

 2010.11.26  Venue: FL1, Meeting Room, Information Building
  Research of query optimization in the cloud 
In cloud data management systems, data is partitioned into blocks and replicated. It is necessary to transfer some data blocks when we do some types of query processing, so we did some research on how to finish the query at low cost.
 2010.11.19  Venue: FL1, Meeting Room, Information Building
  Research on Query Processing 
Query processing is a difficult problem in both parallel databases and cloud-based databases. We briefly introduce the basic query processing steps in centralized and parallel databases, and talk about web-scale query processing, including the MapReduce debates, MapReduce-based join algorithms, etc. Finally, we introduce the main idea of our work and some future work.
 2010.11.06  Venue: FL1, Meeting Room, Information Building
  CIKM2010 Story 
In this talk, I presented some papers and one panel related to Cloud Data Management in CIKM2010. Then I gave some summary of CIKM2010.
  RHP: a new partitioner to improve the efficiency of range query in Cassandra 
The conflict between ensuring data-access load balancing and efficiently processing range queries means that Cassandra cannot support range queries very well. How to trade them off is therefore the key point.

 2010.10.30  Venue: FL1, Meeting Room, Information Building
  Survey of Object-based Storage [ppt]
Object-based Storage, a new approach to storage technology, is a subject of academic research and development in the storage industry. This survey describes the main points of object-based storage technology from five aspects: why the concept was introduced, what it is, how to take advantage of it, what its status is in both industry and academic research, and what we can do about it.

 2010.09.25  Venue: FL1, Meeting Room, Information Building
  Paper Summary of VLDB2010 
Papers from VLDB2010 about the cloud are classified into four categories: Cloud Data Management Systems, Benchmark, Query Processing and open questions. This report introduces their motivation, key technologies and inspiration for our research work.
 2010.09.18  Venue: FL1, Meeting Room, Information Building
  New Experience in MSRA 
Introduces personal life and impressions at MSRA.

 2010.06.05  Venue: FL1, Meeting Room, Information Building
  Index for cloud data management 
This report mainly introduces why we build indexes for cloud data management, some related work on indexes for cloud data management, and our progress on index research.
  NoSQL Overview [ppt]
This report gives a brief introduction to NoSQL: four reasons why the NoSQL concept was introduced, its history and definition, the three fundamental theories of NoSQL, and the categories of NoSQL databases.
 2010.05.08  Venue: FL1, Meeting Room, Information Building
  Benchmark results and analysis 
This report introduces the test results of benchmarks on cloud-based DBMSs and analyses the results.
  Architecture and Design of Distributed Database Systems [ppt]
This report introduces several kinds of architectures for Distributed Database Systems based on the relational data model. It also introduces two horizontal and one vertical fragmentation methods and the allocation model for DDBMS.

 2010.04.03  Venue: FL1, Meeting Room, Information Building
  System Environment and MapReduce Framework 
This report includes the introduction of the construction of our cloud data management platform and a brief talk about MapReduce framework.

 2010.03.27  Venue: FL1, Meeting Room, Information Building
  ICDE2010 Keynote - what's new in the cloud [ppt]
This report talks about why we should do cloud computing, how to do it, and what to do.
 2010.03.06  Venue: FL1, Meeting Room, Information Building
  Open Source Cloud-based DBMS Experiments 
This report introduces existing experiment benchmarks for cloud-based DBMSs. We describe the testbed of our experiment and show the tasks and results.
  System Architecture Design and Implementation of Cloud-based Database System 
The Cloud-based Database project at WAMDM aims at researching new storage and database systems that can support the next generation of data storage and management and be applied to mobile communications. This report introduces the architecture design and implementation of our cloud-based database system.

 2009.12.29 Invited Talks Venue: FL4, Meeting Room, Information Building
Haixun Wang   (MSRA)

Relational DBMS for Cloud Computing : An Eventual Consistency Approach

(Seminar Series on Mobile plus Cloud Computing III)


 2009.12.24 Invited Talks Venue: FL1, Meeting Room, Information Building
Xing Xie   (MSRA)

Building Intelligence from the Physical World

(Seminar Series on Mobile plus Cloud Computing II)


 2009.12.19  Venue: FL1, Meeting Room, Information Building
  Cassandra and SIGMOD contest [ppt]
Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. The task of the SIGMOD Programming Contest 2010 is to implement a simple distributed query executor built on top of the previous year's main-memory index.
 2009.12.12  Venue: FL1, Meeting Room, Information Building
  Survey on Data Management in the Cloud 
With the development of computer and communication technology, large volumes of data are being produced. A cloud-based database is one solution for efficiently storing and analyzing these data. In this talk, we present some cloud-based databases and summarize them from different aspects.
  Hive – A Warehousing Solution Over a MapReduce Framework [ppt]
Introduces a system, built on top of Hadoop, that supports managing and querying structured data, along with its query language.
 2009.12.09  Invited Talks Venue: FL1, Meeting Room, Information Building
Zhenkun Yang   (Baidu)

Cloud Database: The Up-coming Era

(Seminar Series on Mobile plus Cloud Computing I)


 2009.10.31  Venue: FL1, Meeting Room, Information Building
  An Efficient Multi-Dimensional Index for Cloud Data Management [ppt]
In this presentation, I introduced our work on a multi-dimensional index structure for Cloud Computing platforms.
 2009.04.25  Venue: FL1, Meeting Room, Information Building
  An Indexing Framework for Efficient Retrieval on the Cloud [ppt]
The emergence of the Cloud system has simplified the deployment of large-scale distributed systems for software vendors. The Cloud system provides a simple and unified interface between vendor and user, allowing vendors to focus more on the software itself rather than the underlying framework. Existing Cloud systems seek to improve performance by increasing parallelism. This paper explores an alternative solution, proposing an indexing framework for the Cloud system based on the structured overlay. Its indexing framework reduces the amount of data transferred inside the Cloud and facilitates the deployment of database back-end applications.
  Data Management in the Cloud - Limitations and Opportunities [ppt]
Analysed data management applications that are suitable to move to the cloud platform and discussed remaining challenges of such movement.
 2009.03.14  Venue: FL1, Meeting Room, Information Building
  Research Report on Map/Reduce Framework Based on Hadoop [ppt]
Map/Reduce is the crucial algorithm of Hadoop. It is an easy but powerful algorithm that can solve problems over massive data. In this report, I will introduce the concepts of Hadoop and Map/Reduce, then detail how the Map/Reduce framework runs jobs.
  Introduction to HBase [ppt]
As a sub-project of Hadoop, HBase focuses on providing storage for the Hadoop distributed computing environment. HBase is a column-oriented table store. Its three-layer file system provides a feasible scheme for distributed data storage, while its three-layer architecture solves the problems of region assignment and region location. To get an intuitive understanding of HBase, a comparison with MySQL was made in the test.
 2008.12.27 Invited Talks Venue: FL1, Meeting Room, Information Building
 (EMC Research China) Cloud based Personal Information Management – Introduction to EMC Cloud Computing [ppt]
With the requirements of automatic online storage and backup, round the clock access and securely sharing and publishing of personal digital information, it is inevitable that personal information management will migrate into the cloud. The goal of personal information cloud service is to securely access and organize all your information anytime, anywhere, using any device and never lose any of it. EMC is creating a new cloud services business called Decho ('digital echo' referring to the reverberating accesses to information in a user's digital environment) by joining Mozy (cloud backup) and Pi (personal information) together. It will use EMC data centres around the planet to store consumer and business files using Mozy's software front end to provide data ingest and access services and Pi's metadata software to manage and verify personal information. Decho can deliver on the promise of cloud-based personal information management and can help individuals everywhere preserve, manage and enrich the information most important to them.
 (Tsinghua University) Understanding and Comments on Cloud Computing [ppt]
Cloud computing is a new concept proposed in recent years. This talk first compares cloud computing with traditional distributed computing and grid computing to help in understanding the concept and characteristics of cloud computing, and then introduces some possible research directions in both cloud computing platforms and their combination with applications.
 2008.03.29 Invited Talks Venue: FL1, Meeting Room, Information Building
  Introduction to Cloud Computing [ppt]
This presentation introduced Cloud Computing, which looks like a classic disruptive technology. We also show the relationship among Web2.0, Grid Computing and Cloud Computing. Finally, we discuss several Cloud Computing cases and the future of Cloud Computing.
WAMDM, Renmin University of China, All Rights Reserved Last Updated : 2013/11/05