WAMDM Seminars (2009-2014)
 2014.12.30  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Whence and Whither streaming -- Introduction to Messaging System 
It has become clear that real-time query processing and stream processing are an immediate need in many practical applications. Messaging systems play an important role in ingesting large amounts of stream data from many different sources into a stream processing system such as Storm. In this report, we introduce some popular data ingestion tools, including Flume, Scribe, Sqoop, Chukwa, RabbitMQ, Kafka and Spring XD.
 (Web Group) Semantic Hierarchy Learning and Paraphrase Dictionary Building over Knowledge Base 
Semantic hierarchy learning is also known as hypernym-hyponym relation detection. We combine a method of learning semantic hierarchies with the needs of our recent work and propose a new method to build a paraphrase dictionary for knowledge bases.
 2014.12.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Scaling Behaviors of Weighted Clickstream Network 
The availability of big data, such as human online surfing records, makes it possible to probe into and quantify the regular patterns of users' long-range, complex interactions between Web sites. We construct a clickstream network whose nodes are websites and whose edges are formed by users switching between sites. By analyzing the circulation of collective attention, we discover a scaling relationship between the impact of sites and their traffic.
 (Web Group) Entity Linking with a Knowledge Base 
Plenty of data on the Web is in the form of natural language. Bridging Web data with knowledge bases is beneficial for many applications. This seminar discusses the key issues, techniques and solutions of entity linking with a knowledge base and gives a few ideas for future work.
 2014.12.16  Theme: Flash Caching
 (Cloud Group) Enhancing the Performance of Database Applications with Flash Caching
SSDs and HDDs exhibit different retrieval costs. HDDs are cost-effective for infrequently accessed data, and SSDs are well suited to data that are relatively hot. Using flash memory as an extended cache can reduce the performance gap between DRAM and HDD. We discuss how to use flash to accelerate the performance of database systems.
 (Cloud Group) Enterprise flash - the development and applications. 
Enterprise flash devices are more durable and offer higher performance than consumer flash. They also have good write performance and delay-jitter characteristics. In this report, I introduce the development of enterprise flash devices and two of the latest products.
 2014.12.09  Theme: Web Data Management
 (Cloud Data Management) VLDB2014 Overview 
This report gives an overview of VLDB2014, which included three keynotes and the research papers. It also introduces two papers from the conference: one provides several sort and join algorithms for persistent memory, and the other considers storage management in the NVRAM era.
 (Web Group) Uniqueness privacy preserving in location data publication 
During location data publication, the uniqueness issue may reveal sensitive information, such as personal profiles and political affiliation, to attackers. In this paper, we investigate the uniqueness issue in location data and propose a solution that addresses it and thus protects users' sensitive information.
 (Web Group) Plan Bouquets: Query Processing without Selectivity Estimation 
Selectivity estimates for optimizing OLAP queries often differ significantly from those actually encountered during query execution, leading to poor plan choices. The article proposes a new approach to address this problem, wherein the compile-time estimation process is completely eschewed for error-prone selectivities. Instead, a small “bouquet” of plans is identified from the set of optimal plans in the query’s selectivity error space, such that at least one among this subset is near-optimal at each location in the space. Then, at run time, the actual selectivities of the query are incrementally “discovered” through a sequence of partial executions of bouquet plans, eventually identifying the appropriate bouquet plan to execute. The duration and switching of the partial executions are controlled by a graded progression of isocost surfaces projected onto the optimal performance profile.
 2014.12.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) R-Store: A Scalable Distributed System for Supporting Real-time Analytics
It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs and requirements. Hence, OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed and loaded into the OLAP system for data analysis. With growing awareness that big data can provide enterprises with useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics are important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create new versions.
 (Cloud Group) Introduction of design principles and applications of Redis. 
Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.
 2014.11.29  Theme: Query Understanding
 (Web Group) Query Understanding over Knowledge Base 
With the growing popularity of knowledge bases, how to query them more efficiently and accurately has become a hot topic. The task faces three main challenges: (1) ambiguity, (2) coverage, and (3) scale. This report introduces both keyword queries and natural language queries, and summarizes and compares the methods for these two kinds of query.
 2014.11.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Adaptive Stream Partitioning Based on Temporal Positive Correlation 
We propose a method of adaptive stream partitioning based on temporal positive correlation. First, we obtain the maximal partition set from the user's query at compile time. Second, we merge partition keys by computing their temporal positive correlation at run time. Finally, we introduce dynamic partitioning based on grid density to improve the robustness of the method.
 (Cloud Group) Offloading Data Processing to Smart SSDs
The solid state drive (SSD) has emerged as a new kind of secondary storage medium. Transferring data is expensive when handling large-scale data sets, so moving code to data is far more efficient than moving data to code. As the computing capability of SSDs becomes more powerful, we discuss how to use smart SSDs to accelerate data-intensive applications.
 2014.11.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Research on Data Stream Partitioning Strategies 
On a distributed processing platform, stream data must be partitioned according to the user's query to improve the system's processing speed. We introduce three kinds of partitioning strategies: query-aware partitioning, TAD-based partitioning, and correlation-aware partitioning. Finally, we summarize the pros and cons of the three strategies.
 (Web Group) Technology of Private Information Retrieval 
This talk introduces methods that offer strong location privacy, by integrating private information retrieval functionality.
 2014.11.02  Theme: Big Data mining & SK Query
 (Web Group) A Survey on Big Data mining in Microblog 
First, we analyze the content and background characteristics of microblogs and derive the 1H-2S-3M-4V characteristics of microblog data. Second, we review related research from the perspectives of social attribute mining and content mining. Finally, we explore the challenges of mining microblog data and the new problems arising from social demands, according to these 10 characteristics of microblog data.
 (Mobile Group) A survey of Spatial-Keyword(SK) Query 
Geo-textual indices play an important role in spatial keyword querying. The existing geo-textual indices have not been compared systematically under the same experimental framework. This makes it difficult to determine which indexing technique best supports specific functionality.
 2014.10.28  Theme: Web Data Management
 (Web Group) Study of the Long-Range Evolution of Human Online Interests Based on Small Data
Human online behavior is a complex process that is often driven by interests. Despite recent efforts in behavioral targeting and Web user interest mining, little is known about the regular patterns of the human-interest process. The availability of big data, such as human online surfing records, e-commerce records and communication records, makes it possible to probe into and quantify the dynamics of human-interest behaviors. These data are called “small data” in the era of big data. In this presentation, we introduce some new thoughts on mining online behavior from these small data.
 (Web Group) Short text understanding 
Short text understanding is a hot yet challenging task. Unlike traditional full text, short text is often considered to lack syntactic features and context, so traditional approaches, including parsing, chunking, entity recognition and disambiguation, are mostly not applicable. This seminar gives an introduction to the problems, challenges and main techniques of short text understanding, as well as some popular machine learning models.
 2014.10.21  Theme: Latest Exchange Report
 (Cloud Group) Report of WI2014 
The 2014 IEEE/WIC/ACM International Conference on Web Intelligence (WI2014) was held in Warsaw, Poland on August 11-14, 2014. WI2014 received 242 paper submissions for the research track. The research program featured 85 papers, and the acceptance rate for regular papers was 35.1%. The conference program also included 8 keynotes, 7 tutorials and 4 panels.
 (Web Group) Introduction of 2014 massive data seminar in Hong Kong
To further promote research cooperation between the Chinese mainland and Hong Kong, the National Science Foundation of China (NSFC) and the Chinese University of Hong Kong (CUHK) jointly organized a symposium in Hong Kong on September 23-24, 2014. The theme of the symposium was massive data management. In this presentation, we introduce some new ideas about Big Data from this symposium.
 (Cloud Group) 2014 Academic symposium on Big Data 
An introduction to several seminars at this academic symposium, focusing on the special report "One-Pass AUC Optimization". Finally, there is a photo show.
 (Mobile Group) Report on a Visit to Hong Kong Baptist University
The report focuses on research progress and experience at Hong Kong Baptist University.
 2014.10.18  Theme: Big Data Management: Data Skew and Statistical Inference
 (Cloud Group) Data Skew in MapReduce-based Systems 
Data skew is inevitable in distributed systems. In this talk, I will introduce the definition of data skew, the types of data skew and the basic approaches for handling it. I will also summarize recent papers related to this topic.
 (Cloud Group) Bootstrap: a simulation-based statistical method
Statistical inference is very difficult. One difficulty is that in practice, outside of certain properties like the mean, it can be extremely difficult to infer characteristics of a distribution. Motivated by these sorts of problems and facilitated by the advent of inexpensive computational power, the last 25 years have seen the widespread acceptance of experimental or simulation-based confidence bounds in statistics.
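As an illustration of a simulation-based confidence bound, the following is a minimal sketch of the percentile bootstrap for the median, a statistic with no simple closed-form interval. The function name `bootstrap_ci`, the sample values and the resample count are assumptions chosen for the example, not taken from the talk.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical measurements, used only to exercise the function.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0]
low, high = bootstrap_ci(data)
print(low, high)
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the shortest to state.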
 2014.05.30  Theme: Storage Management
 (Flash Group) Impact of SSD on Different Workload Application
The solid state drive (SSD) has emerged as a new kind of secondary storage medium. Many applications have used SSDs to completely replace HDDs as their main storage. This raises an interesting research question: "What is the impact of SSDs on applications with different workloads?" In this presentation, we introduce two pieces of work, on multi-tenancy OLTP applications and on search engines. We discuss the key factors in designing SSD-based storage systems.
 (Flash Group) Implementation Technologies of Storage Management in PostgreSQL
PostgreSQL is an advanced object-relational database system. In this talk, we will briefly introduce the implementation technologies of storage management in PostgreSQL.
 2014.05.22  Theme: Privacy Protection
 (Web Group) Exploiting service similarity for privacy in location-based search queries
This paper proposes a user-centric location-based service architecture where a user can observe the impact of location inaccuracy on the service accuracy before deciding the geo-coordinates to use in a query. It also constructs a local search application based on this architecture.
 2014.05.16  Theme: Short Text Understanding
 (Web Group) Short Text Understanding 
Natural language processing has always been a hot topic, and the semantics of text in particular has received much attention. Short text is more challenging to understand because of its lack of syntax and context. This seminar discusses some recent work in this area and presents some interesting topics.
 2014.04.11  Theme: Web Data Management
 (Web Group) Utility Centric Sensitive Data Publication via Partition 
Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning. The data should be distributed so that an adversary, without colluding with other adversaries, cannot draw additional inferences about the private information.
 (Web Group) HTML5 Head First 
Introduces the background and some interesting attributes of HTML5.
 2014.04.04  Theme: Data management on New Storage
 (Flash Group) Cost-aware data management for Phase Change Memory [ppt]
Storage systems based on Phase Change Memory (PCM) devices are beginning to attract considerable attention in both industry and academia. PCM has high potential as a new component for enterprise storage systems in a multi-tiered environment. Our presentation describes the evaluation of phase change memory for enterprise storage systems.
 (Flash Group) An Introduction to OceanBase 
OceanBase is an extendable relational database system designed by Alibaba Group. This report introduces the architecture and design techniques of OceanBase.
 2014.03.28  Theme: Introduction to RDF Storage and Query 
 (Cloud Group) Introduction to RDF Storage and Query 
RDF (Resource Description Framework) is a general framework for describing resources that promotes automated processing of Web information and has been widely used in recent years. Although its triple structure is easy to understand, the study of RDF raises many problems, such as organizing RDF on top of relational databases and retrieval algorithms on RDF graphs, which have recently become hot issues. This report gives a brief introduction to RDF background knowledge and to the organization and querying of relational-database-based, triple-based and graph-based RDF stores.
 2014.03.14  Theme: Web Data Management
 (Web Group) User's online Behavior Data Mining 
Users' online behavior information plays an important role in personalized web applications. However, it is usually not easy to obtain this kind of personal behavior data. In this presentation, we introduce two algorithms to predict users' demographic attributes from their online browsing behaviors.
 (Web Group) Truth Discovery on Deep Web 
Information available on the Web is abundant but often inaccurate. Different information sources publish information with different degrees of correctness. For a novice user, it is not easy to identify incorrect data, so how to find the truth behind facts is an important issue. Recently there has been a lot of work on evaluating the truthfulness of facts and sources. Here, we analyze the work on truth discovery on the Deep Web.
 2014.03.07  Theme: Big Data Management: Stream Data and Extreme Value Theory
 (Cloud Group) Data Stream Processing Languages 
In recent years, with the rise of data stream applications, data stream processing languages have emerged for different platforms. This report introduces four processing languages, Stanford-CQL, IBM-SPL, StreamBase-StreamSQL and DBT-SQL, and compares them. Finally, it describes the architecture and challenges of PQSAL, which is being developed in our lab.
 (Cloud Group) An Introduction to Extreme Value Theory 
Traditional statistics focuses on the majority of the data, but in many applications the long tail of a dataset may be much more valuable. Extreme value theory models the tail of a dataset and analyzes the features of these extreme values. This report gives a simple introduction to extreme value theory.
 2014.01.10  Theme: System research report: Graphlab, spark, Pregel & Hama
 (Web Group) GraphLab + Spark
This report mainly introduces GraphLab and Spark in terms of system architecture, functional modules, implementation, and so on. In addition, we give a series of analyses comparing them with other similar systems.
 (Web Group) Large-Scale Graph Processing Systems: Pregel and Hama 
With the coming of the big data age, many practical computing problems concern large graphs. Graph processing technologies have been developed for a long time, but with the development of information technology and the explosion of information, the scale of graphs is growing, and efficient processing of large graphs is challenging. In this report, we will introduce two systems for large-scale graph processing: Pregel and Hama.
 2014.01.03  Theme: Protecting uniqueness in human mobility 
 (Web Group) Protecting uniqueness in human mobility 
Because coarse datasets provide little anonymity, new protection frameworks need to be designed to protect the privacy of individuals.

 2013.12.27  Theme: Data Stream Processing Systems 
 (Cloud Group) Data Stream Processing Systems 
Based on the different application requirements of data stream processing systems, this seminar introduces four new, widely used data stream processing systems. We present the background, architecture, performance and features of each system. Finally, we compare and analyze the high availability, load balancing and scalability of each system.
 2013.12.21  Theme: Data Management on SSD
 (Flash Group) Cost-aware data management for hybrid storage [ppt]
Flash-based hybrid storage is a hot research topic. There are many cost-effective storage options for different data applications. In this presentation, we first introduce some issues for flash-based extended caches, and then present a cost-aware data management policy for multilevel cache storage systems.
 (Flash Group) Bloom filters for SSD 
Bloom filters are widely used in many applications, including database management systems. Currently, Bloom filters are stored in main memory, but the limited memory available for allocating a Bloom filter may cause a high rate of false positives. This presentation mainly introduces techniques to reduce the memory requirement of Bloom filters with the help of solid state storage devices (SSDs).
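The link between available memory and the false-positive rate can be seen in a small in-memory sketch. This is a generic Bloom filter, not the SSD-assisted designs covered in the presentation; the class name, the bit sizes and the double-hashing scheme built on SHA-256 are assumptions chosen for illustration.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def expected_fp_rate(m, k, n):
    # Standard approximation for m bits, k hash functions and n inserted keys.
    return (1 - math.exp(-k * n / m)) ** k

members = [f"key-{i}" for i in range(1000)]
probes = [f"other-{i}" for i in range(10000)]  # none of these were inserted

def measured_fp_rate(num_bits, num_hashes=3):
    bf = BloomFilter(num_bits, num_hashes)
    for key in members:
        bf.add(key)
    assert all(key in bf for key in members)  # a Bloom filter has no false negatives
    return sum(p in bf for p in probes) / len(probes)

for bits in (4096, 16384):
    print(bits, measured_fp_rate(bits), expected_fp_rate(bits, 3, len(members)))
```

With the same 1000 keys, the smaller bit array reports far more false positives than the larger one, which is the memory pressure that motivates moving part of the structure to flash.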
 2013.12.13  Theme: System research report: MongoDB, VoltDB & CouchDB
 (Flash Group) MongoDB [ppt]
MongoDB is a document-based NoSQL database system that is widely used in many web applications. In this presentation, we introduce the development of MongoDB and describe the cluster setup. We also compare the performance of MongoDB with that of other popular database systems.
 (Flash Group) VoltDB Introduction [ppt]
VoltDB is a NewSQL relational database that supports SQL access and high-performance transaction processing. We will briefly introduce the features and internals of VoltDB in this talk.
 (Flash Group) CouchDB Research Report [ppt]
CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. This presentation introduces the key features and architecture of CouchDB.
 2013.11.29  Theme: Data management on the Web
 (Web Group) Differential Private Histogram Release 
This paper proposes a clustering-based method, called AHP, to release differentially private histograms. In AHP, sorting in ascending order and high-pass filtering of histogram counts are introduced to boost the accuracy of publication. Three incremental utility-based clustering algorithms, which rely on dynamic programming, empirical value clustering, and greedy clustering strategies, are proposed to partition the sorted histogram counts.
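For context, the usual baseline that such methods are measured against is the plain Laplace mechanism, sketched below. This is not AHP itself: it has no sorting, filtering or clustering step. The function names are illustrative, and the sketch assumes each record falls into exactly one bin, so adding or removing a record changes one count by 1 (sensitivity 1).

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_histogram(counts, epsilon, seed=None):
    """Release noisy counts under epsilon-differential privacy (assumed sensitivity 1)."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon  # noise scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# Hypothetical bin counts, used only to exercise the function.
counts = [120, 45, 0, 310, 87]
print(dp_histogram(counts, epsilon=0.5, seed=42))
```

Every bin receives noise of the same scale regardless of its count, so small counts are distorted the most in relative terms.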
 (Web Group) A Graph Mining Algorithm and Application in Social Computing 
Graph similarity with known node correspondence arises in numerous settings. In this presentation, we introduce some state-of-the-art methods of graph similarity and evaluate when these methods fail to detect crucial connectivity changes in graphs. We introduce DELTACON (SDM2013), a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes. Experiments on various synthetic and real graphs showcase the advantages of the algorithm over existing similarity measures.
 2013.11.22  Theme: Introduction to Bigdata, A Distributed RDF Database 
 (Cloud Group) Introduction to Bigdata, A Distributed RDF Database 
Bigdata is a horizontally scaled distributed RDF database that can run on a cluster composed of hundreds or thousands of commodity machines. It supports standard SPARQL and provides highly concurrent queries over large amounts of data. This presentation will briefly introduce the Bigdata system, mainly its distributed architecture, indexing and RDF database schema.
 (Cloud Group) Introduction to JVM 2 
Last time, I introduced the JVM memory management and garbage collection. In this talk, I will introduce the concurrency mechanism and corresponding memory control of JVM.
 2013.11.15  Theme: Interactive real-time processing system
 (Cloud Group) Report of CIKM2013 [ppt]
The 22nd ACM International Conference on Information and Knowledge Management (CIKM2013) was held in San Francisco, CA, USA, October 27-November 2, 2013. CIKM2013 received 848 paper submissions for the research track; 143 were accepted as full papers (acceptance rate 16.86%), and 106 were accepted as short papers (cumulative acceptance rate 29.36%). The conference program included 4 keynotes, 9 tutorials, 10 industry talks, one panel and 52 paper sessions.
 (Cloud Group) Interactive real-time processing system 
Real-time processing is a future trend in the development of data processing technology. It can be achieved in diverse ways and can be either exact or approximate, so different scenarios can choose different implementations.
 2013.11.08  Theme: Age of Big data: graph data and stream data management
 (Web Group) Graph Mining on Graphics Processing Unit (GPU): An Overview
Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit (GPU) has evolved from a graphics processor into a highly parallel, multithreaded, many-core processor for general-purpose parallel computing, with tremendous computational horsepower and very high memory bandwidth. This presentation introduces recent state-of-the-art work on graph mining with GPUs.
 (Cloud Group) A High-Performance SQL Compiler for Delta Processing in Data Stream 
This report introduces a high-performance compiler for delta processing in data streams, which optimizes the performance of certain specific queries through compilation.
 2013.11.01  Theme: Data management on the web
 (Web Group) Handle Big Data With SQL 
Introduces a big data system at Microsoft and the parallel processing language used to work with it.
 (Web Group) Joint Entity Resolution 
Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. A flexible, modular resolution framework was proposed where existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms.
 2013.10.25  Theme: Data management on the Cloud
 (Cloud Group) Scale-up or Scale-out for Hadoop: Time to rethink? 
In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. But now it is time to rethink this: the majority of real-world analytics jobs process relatively little data and can be processed well on a single server, which means that scale-up is sometimes better than scale-out.
 2013.10.18  Theme: Data management on flash memory
 (Flash Group) Using an SSD Buffer Pool to Enhance Recovery for DBMSs [ppt]
Flash-based solid state drives (SSDs) offer excellent data access performance compared to hard disks, and SSDs have received strong interest in data-intensive applications. However, given their price and physical characteristics relative to hard disks, SSDs cannot completely replace disks. Integrating SSDs and hard disks is an effective way to improve the performance of DBMSs, and using an SSD as an extended cache is a hot topic. The presentation describes how to improve recovery for DBMSs using a non-volatile SSD buffer pool.
 2013.10.11  Theme: Stream Data Management
 (Cloud Group) Introduction to stream processing systems 
In recent years, with the rise of big data, more and more applications have been developed that focus on fast processing of and real-time response to data streams. This report introduces the development of stream processing systems and surveys several important related systems.
 (Web Group) On Co-occurrence Pattern Discovery from Spatio-temporal Event Stream 
With the development of mobile positioning techniques, various positioning devices have appeared. These devices generate a great many event streams that contain both temporal and spatial information. This presentation aims to introduce our research on the methods and key challenges of discovering co-occurrence patterns from event streams.
 2013.06.28  Theme: MixSL: An Efficient Transaction Recovery Model for DBMSs Based on New Hardware
 (Flash Group) MixSL: An Efficient Transaction Recovery Model for DBMSs Based on New Hardware 
Transaction recovery is an important component of database systems, ensuring the atomicity and durability of transactions. In this report, two traditional recovery models, WAL and shadow paging, are introduced first. We then introduce implementations of logging and shadow paging technologies in DBMSs based on flash or PCM. According to the characteristics of MLC flash and PCM, we propose MixSL, an efficient transaction recovery model for DBMSs based on new hardware, which considers buffer management policy, concurrency control level and flash space utilization.
 2013.06.21  Theme: Big Data on the cloud
 (Cloud Group) Set Similarity Join Using MapReduce Survey & Concerns 
Set similarity join is an important task that is widely used in many applications. In this report, we survey existing research on set similarity joins using MapReduce and analyze its pros and cons, and then propose some new ideas to improve performance. Lastly, we point out some challenges in set similarity join processing using MapReduce.
 (Cloud Group) Spatio-Textual Similarity Join 
In recent years, with the popularity of smartphones and GPS, the quantity of spatio-textual data has been increasing rapidly, and more and more applications are based on spatio-textual similarity join techniques. The study of these techniques has therefore attracted growing attention. This presentation mainly introduces some state-of-the-art work on spatio-textual similarity join.
 (Flash Group) Improving Search Engine Cache Efficiency with SSDs 
Traditional large-scale search engines use hard disk drives, chosen for their capacity, to store massive indexes, snippets and documents, so their performance is limited by the relatively low I/O performance of HDDs. Recently, the SSD has emerged as a new kind of secondary storage medium whose random access latency is comparable to its sequential access latency. This report analyzes the I/O patterns of search engines and different cache management policies, and then introduces a data management policy based on a hybrid storage architecture.
 (Web Group) Step Into Microblog 
With the development of Web 2.0, a new type of social media, the microblog, has sprung up. This report introduces the newest research trends in microblog data streams at home and abroad.
 2013.05.31  Theme: Data Management on SSD and Web
 (Flash Group) Accelerating Enterprise Applications with SSDs 
Flash-based SSDs have outstanding I/O performance. With capacity increasing and prices gradually dropping, more and more enterprises are beginning to deploy SSDs to accelerate their key applications. This presentation introduces some work on accelerating enterprise applications, especially cloud computing applications.
 (Web Group) Introduction to Nginx 
A brief introduction to Nginx and some background knowledge: HTTP reverse proxy servers, FastCGI and common I/O models.
 2013.05.24  Theme: Deep Learning and Privacy on Big Data
 (Web Group) Deep Learning Introduction 
Deep learning is a new machine learning structure for big data, for example, audio and images.
 (Mobile Group) Two Findings Concerning Protecting Consumer Privacy Online 
As an essential consumer right, privacy attracts a great deal of attention. Beyond technology, legislation is also concerned with it. In online advertising, how to make a profit under privacy constraints is an urgent and important topic, and many related subjects, such as computational advertising, are emerging. In this paper, the researchers give economic suggestions for future business models, based on an analysis of the effects of privacy regulation on online advertising.
 2013.05.17  Theme: Data Management using MapReduce
 (Cloud Group) SAX-Based Similarity Join on High Dimensional Data Using MapReduce 
Similarity join on large-scale, high-dimensional data is a challenging task. Existing methods are mostly centralized and based on some kind of index structure, so they cannot deal with large-scale data efficiently. In this report, we introduce related work on similarity joins using MapReduce and propose a new method: SAX-based similarity join on high-dimensional data using MapReduce. Lastly, we point out some challenges of high-dimensional data join processing using MapReduce.
 (Cloud Group) Introduction to JVM 
A Java virtual machine is a program which executes certain other programs, namely those containing Java bytecode instructions. JVMs are most often implemented to run on an existing operating system, but can also be implemented to run directly on hardware. This talk aims to introduce some concepts in the JVM and some parameters to tune.
 2013.05.10  Theme: Big Data management
 (Cloud Group) Report of ICDE2013 
The 29th IEEE International Conference on Data Engineering (ICDE2013) was held in Brisbane, QLD, Australia, April 8-11, 2013. ICDE2013 received 443 paper submissions for the research track, 20 submissions for the industrial track, and 69 demo proposals. The research program featured 95 papers, the industrial program 8 papers, and the demonstration program 27 demos. The conference program also included 3 keynotes, 9 seminar tutorials and one panel.
 (Web Group) Report of DASFAA2013 
The 18th International Conference on Database Systems for Advanced Applications (DASFAA2013) was held on 22-25 April 2013 in Wuhan, China. DASFAA2013 received 208 paper submissions for the research track. The research program features 51 papers. The acceptance rate for regular papers is 24.5%. The conference program also includes 2 keynotes, 4 seminar tutorials and one panel.
 (Cloud Group) Probabilistic Data Structure for Big Data Part I: Cardinality Estimation 
In the era of big data, many applications can be served by estimates with a known precision, which greatly reduces the cost in time and space. The report takes cardinality estimation as a typical application and introduces the corresponding algorithms adapted to big data.
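The abstract does not name the algorithms covered, so the sketch below uses linear counting, one well-known cardinality estimator, purely as an illustration of trading exactness for space. The bitmap size `m` and the function name are assumptions.

```python
import hashlib
import math


def linear_count(items, m=4096):
    """Estimate the number of distinct items using an m-slot bitmap."""
    bitmap = [False] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % m
        bitmap[slot] = True
    zeros = bitmap.count(False)
    if zeros == 0:
        # Bitmap saturated: the estimator is undefined, m was too small.
        return float("inf")
    # Linear counting estimate: -m * ln(fraction of empty slots).
    return -m * math.log(zeros / m)
```

The memory used depends only on `m`, not on the length of the input stream; accuracy degrades as the number of distinct items approaches or exceeds `m`.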
 2013.04.19  Theme: Privacy protection on Web & mobile data
 (Web Group) Differentially Private Set-Valued Data Release against Incremental Updates 
Publishing private set-valued data provides enormous opportunities for counting queries and various data mining tasks. Compared with previous methods based on partition-based privacy models (e.g., k-anonymity), differential privacy provides strong privacy guarantees against adversaries with arbitrary background knowledge. However, the existing differentially private solutions for data publication are limited to static datasets and do not adequately address today's demand for up-to-date information. In this paper, we address the problem of differentially private set-valued data release in an incremental scenario, in which the data to be released are not static. We propose an efficient algorithm, called IncTDPart, to incrementally generate a series of differentially private releases.
 (Mobile Group) Feel Free to Check-in: Privacy Alert against Hidden Location Inference Attacks in GeoSNs 
In an environment with rich background information, adversaries may recover locations that users have visited but for which no location samples were left. To address this problem, this thesis proposes a new attack model called the trajectory reconstruction attack and a privacy alert mechanism against it. The attack model infers the probability of users' visits to hidden locations from their historical visit information and the visit information of their friends. Moreover, we design a privacy alert framework, implemented in road network space, to warn users of the most probable leaked locations and the leakage probability.
 2013.04.12  Theme: Data Management on Flash Memory & web
 (FlashGroup) Memory-efficient Data Management Policy for Flash-based Key-Value Store [ppt]
Key-Value (KV) stores have superseded traditional relational databases for many applications, such as data deduplication and online multi-player gaming. To meet high throughput demands, the performance of index access and KV pair (data) access is critical. Available memory space limits the maximum number of stored KV pairs, so keeping a minimal index structure in RAM and storing the rest of the index structure on SSD is an efficient approach for a KV store. We introduce two research works which use the Bloom filter and its variants to realize memory-efficient data management for flash-based key-value stores.
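As a reminder of why a Bloom filter helps here, the following is a minimal sketch of a plain Bloom filter, not of either system presented in the talk. A small in-RAM filter can answer "definitely absent" without touching the SSD. The class name, sizes, and hashing scheme are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Compact set membership test: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

In a flash-based store, a lookup would consult `might_contain` first and read the on-SSD index only when it returns True.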
 (Web Group) Linked Data Extraction, Construction, and Application Based on Web (1) 
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources has enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as YAGO, DBpedia, as well as industrial ones such as Freebase. This presentation presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.
 2013.03.29  Hybrid Storages & Query authentication
 (Flash Group) Cost-effective Hybrid Storages [ppt]
This talk introduces methods for improving the performance of hybrid storage systems that use a small SSD, with the goal of achieving a low cost per unit of throughput.
 (Mobile Group) Query authentication in outsourced databases [ppt]
Query authentication is a very important technology in the outsourced database model. This model consists of three entities: 1) the data owner; 2) the service provider; 3) the query user. The data owner outsources databases and indexes to an external service provider, which serves query users. Since the service provider is untrusted, it may tamper with user data and query results for its own benefit. If this happens and users are unable to verify the correctness of the results, the consequences can be serious. Hence, users need a method to efficiently verify that the returned results are genuine and complete.
 2013.03.22  Privacy & Patterns discovery on Graph & stream data
 (Mobile Group) Privacy-aware Query Processing in Large Graph 
Distance-related privacy on large graphs matters significantly in people's lives. For example, everyone wants navigation without compromising location privacy. Likewise, enterprises such as Facebook or Twitter could save a lot of money through cloud computing if their sensitive data did not leak on rented servers. Researchers have studied and solved part of these problems, and further work may make people's lives easier and benefit enterprises.
 (Mobile Group) Spatio-Temporal Co-occurrence patterns discovery on stream data 
With the development of location-based services and the popularity of mobile devices, large amounts of data with spatial and temporal contexts have become available. Mining spatio-temporal co-occurrence patterns from these data has extensive applications. Our work focuses on discovering spatio-temporal co-occurrence patterns in stream data and studies the spatial character of pattern discovery. With new metrics and novel methods, we find a new co-occurrence pattern in stream data and efficiently discover how co-occurrence patterns evolve over time.
 2013.03.15  Data Publication & Regression under Differential Privacy
 (Web Group) functional mechanism regression 
Regression analysis is one of the critical techniques under differential privacy. However, existing solutions for regression analysis are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to the two most widely used regression models, namely linear regression and logistic regression.
 (Web Group) Differentially Private Sequential Data Publication via Variable-Length N-Grams [ppt]
In this paper, the author develops a variable-length n-gram model, which extracts the essential information of a sequential database in terms of a set of variable-length n-grams. This approach makes use of a carefully designed exploration tree structure and a set of novel techniques based on the Markov assumption in order to lower the magnitude of added noise. The published n-grams are useful for many purposes. Furthermore, the author develops a solution for generating a synthetic database, which enables a wider spectrum of data analysis tasks.
 2013.01.04  human flesh search & Location Privacy
 (Web Group) Introduction of the human flesh search [pdf]
This talk gives a comprehensive introduction to an empirical study of human flesh search (HFS). As a crowd-powered search, HFS is a new form of problem solving that involves collaboration among a potentially large number of voluntary Web users. HFS has seen tremendous growth since 2001. It is a valuable test-bed for scientists to validate new theories in complex social network analysis (CSNA).
 (Mobile Group) Location Privacy in Vehicular Ad Hoc Networks 
Location privacy is a main concern in vehicular ad hoc networks, where vehicles have to broadcast routine traffic information frequently. A promising approach to prevent a vehicle from being tracked is to have vehicles change pseudonyms in regions called mix-zones, where the adversary cannot eavesdrop on vehicular communication. A statistics-based metric for evaluating and locating mix-zones is then introduced. Furthermore, a cost-efficient mix-zone deployment scheme is presented to guarantee that vehicles at any place can pass through an effective mix-zone within a certain driving time (DT), and that the extra overhead time (ET) of adjusting routes to cross the mix-zone is small.

 2012.12.28  Differential Privacy for Spatial OLAP & functional mechanism regression analysis
 (Web Group) Differential Privacy for Spatial OLAP 
A lot of low-timeliness data carries spatial location information. When such data is used for OLAP queries, this spatial information cannot be handled well. At the same time, the data often contains sensitive information, so how to process spatial OLAP queries in a differentially private way is an open question worth answering.
 (Web Group) functional mechanism regression analysis under differential privacy 
Regression analysis is one of the critical techniques under differential privacy. However, existing solutions for regression analysis are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to the two most widely used regression models, namely linear regression and logistic regression.
 2012.12.23  MySQL storage engine & A Column Database: C-store
 (FlashDB) MySQL storage engine and introduction of the technology 
Other databases use a single storage approach for all workloads, which means either accepting lower performance or spending hours or days tuning the database. MySQL storage engines, in contrast, provide different technologies for different workloads, which makes MySQL more efficient and flexible. Differences in storage mechanism, index technology, locking and so on account for the variety of storage engines. The presentation first introduces the basic concept of a storage engine, the different kinds of storage engines, and the architecture. Then it explains how to write your own storage engine. Finally, it presents the work I have done on hybrid storage and storage engines.
 (Flash Group) A Column Database: C-store 
C-Store is a column store developed by Stonebraker et al. in 2005. It pairs a write-optimized writeable store (WS) with a read-optimized store (RS). All inserted or updated data is stored in WS first, and at some later point the tuple mover moves the data from WS to RS. Moreover, C-Store does not physically store tables as such; it stores projections.
 2012.12.14  Flash-based SSD & Flash-aware Cache Management
 (Flash Group) Scan and Join Optimization by Exploiting Internal Parallelism of Flash-based SSD 
Flash-based SSDs have rich internal parallelism, but traditional scan and join algorithms do not exploit this characteristic. This work proposes a new scan operator called ParaScan and then designs a new parallel HashJoin algorithm to make full use of the internal parallelism of SSDs.
 (FlashGroup) Flash-aware Cache Management for Heterogeneous Storage Systems 
Heterogeneous storage, which integrates solid state drives and hard disks, is a hot topic. A hybrid architecture in which flash serves as a read and write cache for the hard disk can make full use of the properties of each device. The cache management policies proposed in our presentation guarantee a high cache hit ratio and flash-friendly write operations.
 2012.12.07  Event-based co-occurrence pattern & Friend Recommender
 (Mobile Group) Event-based co-occurrence pattern on hot regions 
An event-based social network is a new type of social network which contains both online and offline social interactions. It has many applications, such as friend recommendation, advertisement delivery, and social services. This information contains spatio-temporal patterns which may help us provide better services and enable other applications. In addition, hot regions are a topic that people always care about. Work combining these two topics could be of real value to society, so we plan to study it in depth.
 (Mobile Group) Friend Recommender: Privacy-Aware Proximity Service in Mobile Social Networks [ppt]
With the development of mobile devices, mobile social networks have become an important part of our daily life. Proximity service is a popular kind of service which aims at finding other users nearby, such as reminding the user of her nearby friends or finding new potential friends nearby. We present a new kind of proximity service, Friend Recommender, which recommends nearby potential friends to the user. In order to return more satisfactory results, we consider the similarity of users' personal profiles. However, the service provider is untrusted, so it is necessary to protect user privacy, such as location and profile, while the user enjoys the proximity service. We suggest two privacy protection technologies to protect location and profile privacy independently. Friend Recommender can be executed on the processed data.

 2012.11.30  Web Interactive Programming & Event Detection and association analysis
 (Web Group) Share on Web Interactive Programming 
This talk shares knowledge of web interactive programming, including HTTP, cookies and so on. It also gives a simple introduction to our existing lab systems.
 (Web Group) Event Detection and association analysis in the Microblogging Data Stream 
With the development of web 2.0, the new-style media have emerged in recent years. What's more, microblog has become one of the most popular social media with its own characteristics. The microblog data has the characteristics of real-time dynamics and content with a wide coverage, which make it possible for event detection and association analysis. However, the characteristics of the microblog data, such as short texts, noise texts, rich social information, real-time dynamics and so on, also bring challenges. This report analyzes the existing work and proposes a novel event-detection and association-analysis algorithm.
 2012.11.23  BIG DATA & Early Join
 (Cloud Group) Data Storage for Big Data: SQL, NoSQL or NewSQL? 
Data stores in the era of Big Data face new challenges. Which is suitable for big data: SQL, NoSQL or NewSQL? The report gives a simple overview of this problem and introduces Bigtable and Spanner.
 (Cloud Group) Early Join: Non-blocking Join Algorithms 
Essential to the success of online aggregation (OLA) is a good non-blocking join algorithm that enables both high early result rates with statistical guarantees and fast end-to-end query times. The existing non-blocking join algorithms can be categorized into two classes. The first class aims to generate early representative results for OLA; it includes Ripple Join, Hash Ripple Join, Sort-Merge-Shrink Join, and DBO. The second class aims to generate fast early results while ignoring the statistical properties of the results; it includes XJoin, Hash Merge Join, and RPJ.
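To make "non-blocking" concrete, here is a minimal sketch of a symmetric hash join, a simple in-memory join that emits results as tuples arrive. It is not an implementation of any algorithm named above and offers no statistical guarantees; the stream format `(side, key, value)` and the function name are assumptions.

```python
from collections import defaultdict


def symmetric_hash_join(stream):
    """Join two interleaved inputs, yielding matches as soon as they exist.

    `stream` yields (side, key, value) tuples with side "L" or "R".
    Results are (key, left_value, right_value).
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, value in stream:
        other = "R" if side == "L" else "L"
        # Insert into this side's hash table, then probe the other side's.
        tables[side][key].append(value)
        for match in tables[other].get(key, []):
            if side == "L":
                yield (key, value, match)
            else:
                yield (key, match, value)
```

A blocking hash join would have to consume one input completely before producing anything; here the first result appears as soon as one matching pair has been seen.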
 2012.11.09  Topic: Subgraph & Report of CIKM2012
 (XML Group) Query-dependent reachability labeling scheme on Subgraph 
Reachability is a fundamental operator on directed graphs. It answers whether a vertex u can reach another vertex v using a simple path. Computing reachability has been studied in a wide range of computer science disciplines, including software engineering, programming languages, and distributed computing. Although many reachability labeling schemes exist, existing works do not consider the locality of queries. In this work, we propose a query-dependent reachability labeling scheme.
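For reference, the reachability operator itself can be answered without any labels by a plain breadth-first search, as in the baseline sketch below. Labeling schemes exist to avoid paying this traversal cost on every query. The adjacency-dict representation and the function name are assumptions.

```python
from collections import deque


def reachable(graph, u, v):
    """Return True if v can be reached from u in a directed graph.

    `graph` maps each vertex to an iterable of its successors.
    """
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor == v:
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False
```

Each query costs time linear in the size of the explored subgraph, which is the cost a labeling scheme trades against index size and construction time.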
 (Cloud Group) Report of CIKM2012 
CIKM2012 was held in Maui, USA. This year's CIKM featured three keynotes, and several well-known computer scientists were invited to give talks. Attendees came from all over the world, which reflects CIKM's significant influence on the field of computer science. The industry sessions were popular and included some interesting talks.
 2012.11.02  Topic: Join Query & HBase Coprocessor
 (Cloud Group) Join Query Processing Using MapReduce 
Join query processing is an important basic operation, but joins on massive, complex datasets are costly. The MapReduce paradigm is good at large-scale data processing and data-intensive computing, but it does not directly support complex join operations, and this flaw limits its application in many other fields. In this report we give a simple survey of existing research on joins using MapReduce and a detailed analysis of similarity joins using MapReduce. We also introduce the primary idea of high-dimensional data similarity joins using MapReduce. Lastly, we point out some challenges of join processing using MapReduce.
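As a point of reference for the survey, the simplest way to express an equi-join in the MapReduce model is a reduce-side (repartition) join. The sketch below simulates the three phases in a single process; it is an illustration under that assumption, not code for Hadoop, and the function name and input format are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Equi-join two lists of (key, value) pairs, MapReduce style."""
    # Map phase: tag every record with the relation it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase: group all tagged records by join key.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: per key, pair every left value with every right value.
    output = []
    for key, tagged_values in groups.items():
        lefts = [v for tag, v in tagged_values if tag == "L"]
        rights = [v for tag, v in tagged_values if tag == "R"]
        output.extend((key, l, r) for l in lefts for r in rights)
    return output
```

Every record crosses the shuffle, and a skewed key sends all of its records to one reducer, which are two of the costs that more specialized join algorithms try to reduce.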
 (Cloud Group) An Introduction to HBase Coprocessor 
HBase, a distributed and scalable big data store, has included an important functional component, Coprocessor, since version 0.92. HBase Coprocessor allows users to write their own code without modifying the HBase source code and run it on the server side of HBase, so that users can enhance or shield the original functions of HBase. This report mainly introduces the concept, implementation and some typical applications of HBase Coprocessor.

 2012.05.18  Topic: Postgresql & Parallelism of SSD
 (Flash Group) Memory Management in PostgreSQL 
PostgreSQL (PG) is an object-relational database management system that is widely used in industry. Summer is drawing near, and the database that WAMDM will develop is also based on PG. Memory management in PG is complicated. This presentation covers four aspects of memory management: MemoryContext, Cache, buffer pool management and IPC. We will spend more time discussing MemoryContext and Cache.
 (Flash Group) Exploit Internal Parallelism of Flash-based SSD 
With the extensive application of flash-based SSDs in personal computers and enterprise servers, SSDs have attracted more and more attention from academia and industry. In addition to the excellent characteristics of flash memory, SSDs have a wealth of internal parallelism. Traditional database systems are designed around the properties of hard disks, such as mechanical behavior and symmetric read/write performance, so they cannot exploit the internal parallelism of SSDs when built on them. First, we detect the internal parallelism of SSDs by treating them as a black box. Based on that, we propose a parallel SSD-aware model to take full advantage of the internal parallelism of SSDs.
 2012.05.11  Topic: New Progress in Cloud
 (Cloud Group) Join Using MapReduce [pptx]
The MapReduce paradigm is good at large-scale data processing and data-intensive computing, but it does not directly support complex join operations, and this flaw limits its application in many other fields. Researchers have done some work to solve this problem. In this report, we give a simple survey of existing research on joins using MapReduce, and we introduce "Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT'2012]" and "Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD'2010]" in more detail.
 (Cloud Group) Some Data Store and Openstack 
Recently, a number of open-source data store systems have appeared. Some are oriented to K-V storage, and some aim to solve the scalability problems of RDBMSs. All these data stores are designed to store large amounts of data efficiently. This talk introduces some stores of this kind.
 2012.05.04  Topic: OrientX
 (XML Group) Triple code: On Efficient Processing of Multiway Queries for Large XML Data 
Labeling schemes lie at the core of query processing for the XML data that is flooding the web. Although a variety of labeling schemes, such as prefix-based labeling, interval-based labeling and prime-based labeling, as well as their variants, are available for encoding static and dynamic trees, these labeling schemes usually show weakness in one aspect or another. In this work, we propose a new triple labeling scheme, which is very simple but efficient.
 (XML Group) An Introduction to C++ Program Linking Process and Related Technology [pptx]
An introduction to C++ program linking process and related technology.
 (XML Group) XML Database Test Platform Intro & Share 
In recent years, jointly promoted by academia and industry, XML database technology has made rapid progress, giving birth to a large number of XML database prototype systems and commercial products. However, there is no comprehensive evaluation benchmark or benchmark test platform to measure the functionality and performance of these databases. Building a comprehensive XML database benchmark test platform is therefore a real need.

 2012.04.20  Topic: DASFAA Report

DASFAA Report  
The DASFAA participants, Jinzeng Zhang, Yingjie Shi, Zheng Huo and Qingling Cao, each share their experience and report on the sessions they attended.

 2012.04.13  Topic: PCM
 (Flash Group) Phase Change Memory Aware Data Management and Application [pptx]
Phase change memory (PCM) is an emerging memory technology that combines attractive features of both storage and memory. Integrating PCM into the memory/storage hierarchy is highly effective for data management and applications. We discussed two ways to improve the performance of a DBMS: using PCM as main memory and using it as auxiliary memory. Because of the inherent characteristics of phase change memory, which include asymmetric read/write latency and limited write endurance, these strategies provide PCM-friendly data structures and algorithms to enhance the availability and reliability of PCM.
 (Flash Group) Storage Class Memory: Technology Overview and System Impact [pdf]
Storage Class Memory (SCM) is IBM's term for a new class of data storage and memory devices. SCM has several distinctive features: it is solid state, has a short access time (within an order of magnitude of DRAM), has a low cost per bit (disk-like), and is non-volatile (~10 years). SCM blurs the distinction between main memory and storage, and hence has a huge impact on the design of database systems. This report gives an overview of SCM technology and an introduction to phase change memory, a typical SCM device. Furthermore, it discusses how database system design should be reconsidered for SCM.
 2012.04.06  Topic: DASFAA Pre-Report

DASFAA Pre-Report  
The DASFAA participants, Jinzeng Zhang, Yingjie Shi, Zheng Huo and Qingling Cao, each present their pre-report.

 2012.03.30  Topic: Flash & Architecture
 (Flash Group) Flash Devices Aware RAID 
Researchers and industry have explored more and more properties of solid state disks (SSDs), such as internal parallelism, but problems remain in SSDs and in the applications built on them. This report presents the integration of RAID and SSD from three angles: 1) intra-SSD RAID; 2) inter-SSD RAID; 3) inter-SSD&HDD RAID.
 (Flash Group) Flash Memory Aware Software Architectures and Applications 
Flash memory has been widely used in laptops and enterprise applications. In these settings, most systems need storage that provides high throughput and low latency, so flash memory becomes the best choice as a non-volatile cache between RAM and hard disk. In these slides, we present two system designs called FlashStore and SkimpyStash.
 2012.03.23  Topic: Cloud & RDF
 (Cloud Group) Scalable RDF Store Based on HBase and MapReduce [pptx]
As RDF datasets grow, they become too large to store in a traditional RDBMS, and conventional RDF storage structures cannot satisfy storage and query needs. There is therefore an urgent need for a highly efficient storage schema and query processing method.
 (Cloud Group) Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store 
Traditionally, RDF triples are stored on a single machine. However, with the emergence of Big Data, scalability has become one of the most important features of RDF storage. In this paper, the author introduces Jena-HBase, an efficient and scalable RDF triple store, to solve this problem.
 2012.03.16  Topic: Introduction to WSDM2012
 (Web Group) An Overview of WSDM2012 
This talk analyses current hot research issues based on the accepted papers of WSDM2012 and introduces three papers related to social networks.
 (Web Group) An Overview of WSDM2012 II
This talk introduces two papers from WSDM2012 related to social networks.
 2012.03.11  Topic: Introduction to XLDB2011
 (Cloud Group) Introduction to XLDB 
A brief introduction to XLDB and focus on XLDB 2011.
 (Cloud Group) Facebook Data Freeway [pptx]
We introduced the system architecture of Facebook's Data Freeway, which is used for log analysis. Facebook uses Scribe for log collection. Calligraphus labels the category of the logs and stores them in HDFS. Puma reads log lines from the storage system with Ptail, performs aggregation, and periodically flushes the aggregated results into HBase.
 2012.03.02  Topic: Introduction to Linked Data
 (Web Group) Linked Data - The Story So Far 
We introduced linked data and its research issues in this talk, including the foundational concepts of linked data, guidelines for publishing linked data on the Web, and some applications based on linked data. We also presented the Linking Open Data Project, which is a grassroots effort to publish openly licensed data on the web as linked data. We concluded the presentation with some research directions for linked data.
 (Web Group) Introduction to RDF--Resource Description Framework 
RDF is the data format for linked data. RDF is a general data format and provides a resource description framework, so it can be used to describe anything in the world. In this report, we introduce RDF from six aspects: RDF's background, what RDF is, RDF's syntax, RDF's schema, RDF's applications and its query language.

 2012.01.08  Topic: Inside and Outside SSD
 (FlashGroup) Trading Flash Translation Layer For Performance and Lifetime [pptx]
The Flash Translation Layer is a software layer built on raw flash memory that carries out address mapping, garbage collection and wear leveling. Address mapping performs the virtual-to-physical address translation and hides the erase-before-write characteristic of flash. Wear leveling methods can even out wear and improve the lifespan of flash memory.
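The address-mapping role can be illustrated with a toy page-level mapping table that always writes out of place. This is a minimal sketch under simplifying assumptions (no blocks, no garbage collection, no wear leveling), and the class and attribute names are hypothetical rather than taken from the talk.

```python
class PageMappingFTL:
    """Toy page-level FTL: logical pages are remapped on every write."""

    def __init__(self, num_physical_pages):
        self.mapping = {}                       # logical page -> physical page
        self.flash = [None] * num_physical_pages
        self.invalid = set()                    # stale pages awaiting erase
        self.next_free = 0                      # next never-written page

    def write(self, logical_page, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; garbage collection needed")
        old = self.mapping.get(logical_page)
        if old is not None:
            # Flash cannot overwrite in place: mark the old copy stale.
            self.invalid.add(old)
        self.flash[self.next_free] = data
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.flash[self.mapping[logical_page]]
```

The host keeps addressing the same logical page while the physical location changes underneath, which is how erase-before-write is hidden. A real FTL would reclaim the pages in `invalid` through garbage collection and choose victims with wear in mind.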
 (Flash Group) Performance of SSD 
We know some characteristics of SSDs from many papers, but we had not observed them in testing. We therefore conducted experiments on six SSDs and collected IOPS, MB/s, and average response time. From the analysis we identified some common characteristics of SSDs, and we also found some different and unexpected results.
 (Web Group) TextDigger: Recovering Themes of Textual Documents 
This report introduces a new method for keyphrase extraction. This method is graph-based and can overcome the vocabulary gap problem.

 2011.12.31  Topic: Primary Exploring of Differential Privacy II
 (Web Group) Graphical Query Optimization of Degree Sequence under Differential Privacy 
Many algorithms for preserving the privacy of degree sequences have been proposed for social networks and graph-structured datasets. However, those works all focus on specific attacks and cannot provide rigorous protection. In this paper, a new problem of protecting degree sequences based on differential privacy is proposed. Differential privacy can strongly prevent the disclosure of a degree sequence while still answering analysts' queries. However, the noise added to the real answer makes the query error large and the utility low. To balance privacy and utility, an effective graphical inference technique is proposed. Based on this inference technique, an efficient algorithm, GQODS, is presented for the new problem. The inference technique and the proposed algorithm have been theoretically proven correct.
 (Web Group) Data Mining under Differential Privacy 
Differential privacy is a new and powerful privacy requirement. If an algorithm satisfies differential privacy, it ensures that the adversary cannot learn information about any individual. I introduced two papers on data mining under differential privacy.
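As background for both talks, the sketch below shows the standard Laplace mechanism applied to a counting query, which is the basic building block of many differentially private algorithms. It is not taken from the papers presented; the function name and parameters are assumptions.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Answer a counting query with epsilon-differential privacy."""
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

A smaller `epsilon` gives stronger privacy and noisier answers; this is the privacy and utility trade-off that the degree-sequence work above tries to improve through inference.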
 2011.12.24  Topic: Series Reports on Flying Elephant in the Cloud I
 (Cloud Group) Update Efficient Indexing of Massive IoT Data in the Cloud 
Because of the high update frequency and large volume of IoT data, traditional DBMS techniques run into scalability problems and cannot handle high insert throughput, so we want to explore how to manage IoT data efficiently in the cloud environment. In this report, we analysed the characteristics of IoT data, the shortcomings of existing cloud data management systems and the corresponding index solutions, and we proposed a new index framework for the cloud environment that supports high insert throughput and efficient multi-dimensional range queries.
 (Cloud Group) Hadoop in SIGMOD 2011 [ppt]
In order to show the state of the art in Hadoop, we introduce some papers from SIGMOD 2011.
 2011.12.17  Topic: Series Reports on Flying Elephant in the Cloud: the Amazing MapReduce World
 (Cloud Group) Online Aggregation over MapReduce 
With the development of cloud computing, online aggregation (OLA), which was introduced in 1997, has regained interest. In this report, we discussed the challenges of implementing OLA in the cloud and tried to propose an initial solution.
 (Cloud Group) Introduction and Application of MapReduce 
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers. Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). Nowadays, more and more applications dealing with big data are starting to use MapReduce to solve problems.
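The programming model can be shown with the usual word-count example. The sketch below simulates the map, shuffle and reduce phases in a single process, so it illustrates the model rather than a distributed implementation; the function names are assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Emit (word, 1) for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real deployment each call to `map_phase` and `reduce_phase` can run on a different machine, because neither depends on any state beyond its own input.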
 2011.12.10  Topic: Series Reports on Mobile Computing and Social Network: Spears vs. Shields
 (Mobile Group) Location Privacy in Geo-Social Networks 
With the booming of social networks and smartphones, Geo-Social networks have been drawing more and more attention of the public. However, Geo-information presents new challenges for privacy preservation. This report made a close analysis of location privacy in Geo-Social networks and introduced possible solutions.
 (Mobile Group) Feel Free to Check-in: Privacy-preserving against Hidden Location Inference Attack in Geo-Social Networks 
With the development of geo-social and mobile social networks, location privacy has become one of users' greatest concerns. We analyze the characteristics of Geo-SNs and hidden location inference attacks, and then present a basic method for preserving location privacy against such attacks.
 2011.12.03  Topic: Series Reports on Mobile Computing and Social Network: When New Meets Old
 (Mobile Group) Privacy-Preserving Spatial Keyword Search over Encrypted Cloud Data 
With the development of cloud computing, many companies outsource their databases to the cloud in order to cut the financial and technical cost of data management. The cloud manages those databases and provides services to query users. However, the cloud is a potential attacker, so it is important to address the leakage of data privacy and query privacy. Our work is to encrypt databases as well as queries in order to protect their privacy, and to design a proper query processing technique so that the cloud can correctly process spatial keyword queries without decrypting databases and queries.
 (Mobile Group) Virtual towards Reality-Exploration and Analysis of Geo-social Network 
A geo-social network is a type of social network in which geographic capabilities are used to enable additional social dynamics. It bridges the gap between the virtual and physical worlds. This talk has three parts. First, we give an introduction to geo-social networks. Next, existing research is analyzed from the following perspectives: mining and recommendation of locations and friends, friend locators, and trajectory queries. Finally, the challenging work for the next step is presented.

 2011.11.26  Topic: XML Database
 (XML Group) New Version Of OrientX 
Recently, many IT developers at home and abroad have become keen on XML databases. Hundreds of companies are busy studying commercial non-structured databases; meanwhile, it is of great importance to develop our own XML database. OrientX, developed by WAMDM, is a representative of...
 (XML Group) Labeling Schemes in XML Databases 
When ID/IDREF is considered, XML data must be modeled as a graph rather than a tree. As a result, when we process a query in an XML database, it is more difficult to judge the ancestor-descendant (AD) relationship between nodes. To deal with this problem, many labeling schemes have been proposed for XML data. In this presentation, I introduce some labeling schemes for graph-structured XML data.
 (XML Group) XML Database Testing [pptx]
We used about 1000 cases to test XML databases; by analysing the resulting data, we can compare the performance of different XML databases.
 2011.11.19  Topic: Topic Detection and Tracking(TDT)
 (Web Group) Event Detection in Microblog 
An event refers to something that happened at a specific time and location. The real-time, distributed nature of microblog posts not only makes event detection possible but also brings many challenges. This report introduces the challenges of event detection in microblogs, related work, and some ideas for improvement.
 (Web Group) Topic Detection and Tracking - Review and Challenges 
Topic Detection and Tracking research mainly focuses on discovering and threading together topically related materials in streams of data such as newswire and radio transcripts. We introduced tasks and research directions in Topic Detection and Tracking and presented some works on New Event Detection and Topic Tracking tasks. Finally, we proposed unsolved problems and challenges for Topic Detection and Tracking.
 2011.11.12  Topic: the Perfect Match: Log-Structure & SSD
 (Flash Group) Some key-value stores using log-structure [pptx]
The concept of log structure was first introduced in the log-structured file system, a file system design first proposed in 1988 by John K. Ousterhout and Fred Douglis. Nowadays, several log-structured key-value stores, including Riak, RethinkDB and LevelDB, have emerged in industrial applications, each with a different log-structure implementation.
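As a rough illustration of the log-structured idea, the following minimal sketch keeps an append-only log file plus an in-memory index from each key to the offset of its newest record. It is a toy under stated assumptions, not the design of Riak, RethinkDB or LevelDB: the class name `LogStructuredStore`, the tab-separated record format, and the restriction that keys and values contain no tabs or newlines are all hypothetical choices made for brevity.

```python
import os
import tempfile


class LogStructuredStore:
    """Toy append-only key-value store (illustrative only).

    Every write is appended to the end of a log file; nothing is updated
    in place. An in-memory index maps each key to the byte offset of its
    most recent record. Keys and values must not contain tabs or newlines.
    """

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the newest record
        open(self.path, "ab").close()  # create the log if it is missing

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:
            offset = log.seek(0, os.SEEK_END)  # position where this record starts
            log.write(record)
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            line = log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split("\t", 1)
        return value


if __name__ == "__main__":
    store = LogStructuredStore(os.path.join(tempfile.mkdtemp(), "kv.log"))
    store.put("a", "1")
    store.put("a", "2")  # appended; the first record is left untouched
    print(store.get("a"))
```

Because updates are only ever appended, stale records accumulate in the log; real systems reclaim that space with some form of compaction or garbage collection, which this sketch omits.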
 (Flash Group) Flash and SSD 
Flash, with its excellent characteristics, has been widely used in the mobile and embedded fields. This report mainly describes flash memory and SSDs (Solid State Disks), including the classification, performance, limitations and trends of flash memory, as well as SSD architecture and interface types. In addition, the report introduces some recent test results on our SSDs.
 (Flash Group) Optimizations of Column-Store and Adaption for SSD 
Column-stores usually rely on three main optimizations: compression, block iteration and late materialization. Of these, compression plays the most important role; it can improve the performance of a column-store by an order of magnitude. Given these features, a column-store can gain much more from flash. However, flash has its own unique characteristics, so column storage must be adapted to fit flash.

 2011.10.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Internet of Things and Cloud Computing [ppt]
Since IBM introduced the concept of the "Smarter Planet" in 2008, the Internet of Things (IoT) has been getting more and more attention. In general, the basic structure of the IoT is divided into three layers: RFID and sensor networks compose the perception layer; the Internet, WiFi, 3G and other networks form the network layer; and applications for various social needs make up the application layer. Cloud computing, a key technology in the IoT chain, will be an important cornerstone of IoT development.
 (Cloud Group) Introduction to linux 
This talk mainly covered some basic, frequently used commands and software, along with skills and experience in using them for testing.
 2011.10.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Personalized Privacy Protection in Social Networks [pptx]
Due to the popularity of social networks, many approaches have been proposed to protect privacy in these networks. All of these works assume that attackers use the same background knowledge. In practice, however, different users have different privacy protection requirements. Assuming attacks with the same background knowledge therefore fails to meet personalized privacy requirements, and it also loses the chance to achieve better utility by taking advantage of the differences in users' privacy requirements. In this paper, we introduce a framework that provides privacy-preserving services based on each user's personal privacy requests.
 2011.10.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) Flash-based Storage System Supporting Range Query 
Because of the differences between hard disks and SSDs, especially in random write performance, SSDs adopt out-of-place updates while hard disks use in-place updates. Flash-based storage models include PAX, IPL and Append-only. PAX offers high query performance but does not consider update operations. IPL and Append-only offer high update performance but do not consider query processing, especially range queries. We therefore proposed block-page storage management and an in-memory B+-tree index.

 2011.09.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Index for Cloud Data Management [ppt]
Cloud data management systems have attracted more and more attention because of their high scalability and high availability. Up to now, however, they only provide efficient queries on the rowkey and cannot support efficient queries on non-rowkey attributes or multi-dimensional queries. In this report we surveyed index techniques for cloud data management, analysed their pros and cons, and finally pointed out future work.
 (Web Group) An Introduction of Big Data [ppt]
Recently, many enterprises and research domains have begun to focus on Big data. This seminar introduces Big data from the perspectives of definition, framework, application, and challenges. Since Big data differs from large-scale data (massive data), new computing models, algorithms and storage strategies must be designed and provided. In this seminar, we mainly present three models for computing on Big data: the random sampling model, the data streaming model, and the sketching model.

 2011.06.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Privacy Scores Computing and Trust Predicating in Online Social Network 
Recently, privacy and trust problems in social networks have attracted a lot of attention. This seminar mainly introduces two questions. One is computing privacy scores from individuals' profiles; the MLE and EM methods are illustrated in this part. The other is how to predict trust between entities by using balance theory and status theory.
 2011.06.17  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) An Efficient Tag-based Spatial Collaborative Search on Geo-social Networking 
The proliferation of geo-social networking enables users to generate large amounts of location information and corresponding descriptive tags, as well as to find and connect with other users via mobile devices. In this context, users often have similar interests when planning one or more social activities collaboratively. This report introduces a novel type of query, called the Tag-based top-k Spatial Collaborative (TkSCo) query. To answer TkSCo queries efficiently, we propose two algorithms. Experimental results validate the efficiency of the proposed algorithms.
 (Mobile Group) You Can Walk Alone: Trajectory Privacy-Preserving through Stay Points Protection 
Stay points on trajectories contain more sensitive information than ordinary location samples, so we propose a novel method to protect trajectory privacy through stay points protection, which will sharply reduce information loss.
 2011.06.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Research & Demo on Flickr 
Recently, more and more research has been based on Flickr. Facebook, Twitter and Flickr not only offer us excellent platforms to use, but also help researchers in their studies. With the Flickr API we can download a great deal of information, such as tags, titles and pictures. Current Flickr-based research includes Flickr distance, tourism recommendation, using Flickr to predict information, image retrieval, and so on...
 (XML Group) GILX: A compressed interval labeling for graph-structured XML 
Once the ID/IDREF relationship is taken into account, XML documents are no longer modeled as trees but as graphs, and many new problems arise. Reachability queries on graphs are fundamental to XML databases. In this report, we introduce a novel compressed interval labeling scheme to support reachability queries.
 2011.06.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Finding the Bias and Prestige of Nodes in Networks Based on Trust Scores 
This paper proposed an algorithm to compute the bias and prestige of nodes in networks where the edge weight denotes the trust score.

 2011.05.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Join Algorithms Using MapReduce [ppt]
MapReduce is a useful parallel programming framework that enables easy development of scalable parallel applications for processing vast amounts of data on large clusters of commodity machines. However, it cannot directly support processing multiple related heterogeneous datasets, as in join query processing.
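One commonly described way to express a join in the MapReduce model is the repartition (reduce-side) join. The sketch below simulates its three phases in plain Python on a single machine; it is only an illustration and is not taken from the talk. The dataset shapes, the tags `"U"` and `"O"`, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join_key, tagged_value) pairs; the tag records the source dataset."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, item in orders:
        yield user_id, ("O", item)


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "U"]
        items = [v for tag, v in groups[key] if tag == "O"]
        for name in names:
            for item in items:
                yield key, name, item


def reduce_side_join(users, orders):
    return list(reduce_phase(shuffle(map_phase(users, orders))))


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    print(reduce_side_join(users, orders))
```

Tagging each value with its source is what lets a single reducer tell the two heterogeneous inputs apart; keys that appear in only one dataset produce no output, so this sketch behaves as an inner join.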
 (Mobile Group) Privacy-Preserving Query Processing in Cloud Computing [ppt]
With the development of cloud computing, DaaS in the cloud has become a trend. However, this service can leak the privacy of both query content and data. Two papers published in ICDE 2011 and DASFAA 2011 give two different frameworks for preserving privacy in cloud computing. The first framework is based on privacy homomorphism, where clients lead query processing so as to protect query privacy and data privacy. The second framework is based on a secret sharing scheme. Before outsourcing, the data is divided into n shares by a secret sharing function and stored in n DSPs. In this way, data privacy is protected.
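To make the idea of dividing data into n shares concrete, here is a minimal sketch of additive secret sharing over a prime field. This is an assumption chosen for brevity: the abstract does not say which secret sharing function the paper uses, and a threshold scheme such as Shamir's would work differently. The modulus `PRIME` and the function names are hypothetical.

```python
import secrets

# Hypothetical modulus: any prime larger than the values being shared works.
PRIME = 2**61 - 1


def split_secret(value, n):
    """Split value into n additive shares; all n are needed to reconstruct."""
    if n < 2:
        raise ValueError("need at least two shares")
    if not 0 <= value < PRIME:
        raise ValueError("value out of range for the chosen modulus")
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    last = (value - sum(shares)) % PRIME  # forces the shares to sum to value
    return shares + [last]


def reconstruct(shares):
    """Recombine the shares held by the n providers."""
    return sum(shares) % PRIME


if __name__ == "__main__":
    shares = split_secret(12345, 5)  # one share per data service provider
    print(reconstruct(shares))
```

In this additive variant any n-1 shares are uniformly random and reveal nothing about the value, and providers can add their shares of two values locally to obtain shares of the sum.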
 2011.05.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Semantic Rules Optimization and Data Cleaning on Knowledge Base 
Human language is very difficult to handle, so when we build a knowledge base we need to optimize semantic rules and also clean the data.
 (Flash Group) Query Processing and Optimizing on SSDs [ppt]
A survey on query processing and optimizing for SSDs.
 2011.05.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) Considering Transaction on Append-Only Storage 
Recently, flash-based DBMSs have been proposed to exploit the advantages of flash and to reduce random writes, because the random write performance of flash is very poor. There are three kinds of flash-based storage strategies: PAX, log-based and append-only. Each has its advantages and shortcomings. Append-only storage was first proposed for key-value data management systems. If we migrate the append-only storage method into a DBMS, many problems arise, such as indexing and transactions. Rollback and recovery are important components of transaction processing, so we propose improved flash-based rollback and recovery methods to speed up both.
 (Web Group) Topical Semantics of Twitter Links 
This report introduces a paper on analysing link semantics in Twitter, which was published in WSDM 2011. Moreover, I present experimental results on the Sina data set.
 2011.05.06  Venue: FL1, Meeting Room, Wing Building for Science Complex

DASFAA Report
DASFAA participants Teacher Cao, Yulei Wang, Zhichao Liang, and Xiaoying Qi shared their experiences and each reported on the sessions they attended.

 2011.04.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Introduction to Redis, a key-value memory store 
Redis is an in-memory key-value store. Because Redis holds and processes data in memory, it can reach high performance. Due to the limited capacity and volatility of memory, Redis also supports virtual memory management and data persistence. This talk covers how Redis processes data and a naive idea for improving its virtual memory management.
 (Flash Group) Logging in Flash-based DBMS 
Flash memory, as a new kind of data storage medium, is expected to replace disk as the main storage device in the next generation. We analyze the logging design issues in flash-memory-based databases and put forward some new solutions. The first method, HV-Logging, makes use of the history versions of data that naturally emerge in flash memory due to out-of-place updates. The second is a novel logging method called LB-Logging, which uses a list structure, instead of the sequential structure of traditional databases, to store log records.
 2011.04.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Online Aggregation [ppt]
Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and eventually the final answer is returned. In this paper, the authors propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly.
 (Mobile Group) A Collaborative Location Privacy-preserving Method without Cloak Region 
Serious location privacy problems arise with the extensive application of location-based services. Nowadays, location k-anonymity is one of the most popular location privacy-preserving methods, but it requires a trusted third party as an anonymity server, which has proved to be both a performance bottleneck and a target of attacks. This lecture introduced a collaborative location privacy-preserving method that needs neither an anonymity server nor a cloaking region.
 2011.04.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Special Issue Theme: Interplay between Architecting a Software System And the Hardware Especially Evident in Data Management [ppts]
This theme introduces hardware such as tape, disk, Flash/SSD, and Storage Class Memory, and then analyzes the relationship and interplay between DBMSs and this hardware. The theme includes 7 reports. One details the road map of magnetic tape, magnetic disk, and a host of solid-state technologies. Three other papers are about data management on NAND flash. Two papers discuss the software consequences of technology beyond flash. One paper investigates the energy efficiency of current SSDs.
 2011.04.02  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Computing Group) Estimating the Progress of Queries on the Cloud 
There are many challenges in estimating the progress of queries on the cloud, such as task parallelism, variable execution speed, concurrent workloads, task failure, data skew, etc. In this report, we introduce how existing methods solve the problem, and then we propose our initial idea about progress estimation.
 (Cloud Group) System Performance Test Report of Cassandra and Hbase 
A series of test cases for Cassandra and HBase, including data extension, multi-client, multi-table, consistency, and so on.

 2011.03.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Mobile Apps Project Report 
Following the great success of Apple's App Store, more and more handset manufacturers, carriers and ISPs are launching their own app stores. However, finding the desired app among so many is becoming a nightmare for users, so mobile app search and recommendation techniques deserve study. The author introduced the project's background, motivation, proposed solutions and completed work, and raised some open questions at the end.
 (Web Group) What is Twitter, a Social Network or a News Media 
Twitter is an application that is extremely popular all over the world. So what is Twitter? This report explores some high-level characteristics of Twitter based on the paper "What is Twitter, a Social Network or a News Media" in WWW 2010.
 2011.03.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Topical Authorities Identification and Search in Twitter 
In this report, we introduced two papers on topical authority identification in Twitter, from WSDM 2010 and WSDM 2011. TwitterRank is a graph-based approach to ranking twitterers, while the WSDM 2011 paper uses Gaussian Mixture Model clustering to choose authority candidates. In addition, we reported a detailed comparison between microblog search and web search.
 (Web Group) Information Cascades on Twitter 
Twitter is a fast-growing microblogging service. In this report, we focused on information diffusion on Twitter. We introduced two papers from the WSDM 2011 conference. In the first paper, the authors studied correcting for missing data in information cascades. In the second paper, the authors were concerned with quantifying influence on Twitter. Through the two papers, we learned about some issues of information diffusion on Twitter.

 2011.01.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Introduction to UDT 
UDT performs much better than traditional network protocols such as TCP, although in some cases, when network latency is large, some parameters need to be tuned.
 (XML Group) XML Database Test Report 
A summary of test reports on four XML databases.
 (Cloud Group) Metadata Management 
In recent years, cluster storage has become more and more popular as a way to meet the need for large-scale data storage. How to provide high access performance with such a huge number of files and such large directories is a big challenge for cluster file systems. Research on metadata management aims to solve this problem. This report mainly introduces some existing methods in metadata management research and some possible research directions in TaijiDB.
 2011.01.07  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) XML Keyword Query Refinement 
In this report, we discussed the problem of query refinement in XML keyword search. First, I classified the existing methods for XML keyword search refinement. Then we discussed my new method for XML keyword search, which transforms a keyword query into a structured query automatically. The main part covered the task and methods of XML keyword query refinement. We classify the keywords into structure terms and content terms according to the XML data, and we abstract the relationship graph of the structure terms, which is a weighted digraph. We compute the best and the k best rooted spanning trees of the relationship graph, and take them as the best and top-k refined structured queries.
 (Cloud Group) Internship Report in Nokia Siemens Networks [ppt]
A summary of an internship at Nokia Siemens Networks. The main content of the report is the performance testing of the UDT transfer protocol. First, UDT is introduced as a massive data transfer protocol oriented to high-speed WANs. Second, each part of the test script is described in detail.
 (Mobile Group) Trajectory Privacy Protection for Mobile Users 
Most existing work on trajectory privacy protection has focused on trajectory k-anonymity, but k-anonymity alone does not put us on the safe side. Although an individual is hidden in a group, if the group does not have enough diversity in its sensitive attributes, an attacker can still associate an individual with sensitive information. We are therefore trying to find a way to offer strong privacy protection for trajectory data.
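The weakness described above can be illustrated with a small check that reports, for each anonymity group, whether it has at least k members and at least l distinct sensitive values. This is only a sketch: the abstract does not name a specific diversity measure, so the distinct-value count used here (often called distinct l-diversity) is an assumption, and the record layout and function name are hypothetical.

```python
from collections import defaultdict


def check_group_privacy(records, k, l):
    """records: iterable of (group_id, sensitive_value) pairs.

    Returns, per group, whether it is large enough (k) and whether its
    sensitive values are varied enough (l distinct values).
    """
    groups = defaultdict(list)
    for group_id, sensitive in records:
        groups[group_id].append(sensitive)
    return {
        group_id: {
            "k_anonymous": len(values) >= k,
            "l_diverse": len(set(values)) >= l,
        }
        for group_id, values in groups.items()
    }


if __name__ == "__main__":
    records = [
        ("g1", "hospital"), ("g1", "hospital"), ("g1", "hospital"),
        ("g2", "hospital"), ("g2", "school"), ("g2", "mall"),
    ]
    print(check_group_privacy(records, k=3, l=2))
```

In the usage example, group `g1` satisfies 3-anonymity but every member shares the same sensitive value, so an attacker who knows a target is in `g1` learns that value without identifying the individual.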

 2010.12.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Continuous Density Query 
This talk gives a brief introduction to the continuous density query and describes the result loss problem in previous work. It then presents an advanced TPR-Tree-based approach to solve the problem. Moreover, the new approach returns all density regions with higher accuracy.
 (Mobile Group) Research Review and Discussion 
A summary of the research process over four years as a PhD candidate, along with some lessons from experience.
 2010.12.17  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Opinion Retrieval in Blogs 
Web opinion monitoring is becoming a main focus of opinion monitoring tasks as a result of the huge amount of user-generated content such as blogs and forums. Opinion retrieval in blogs has long been studied by researchers in text retrieval. We briefly reported the goal, framework and approaches of blog opinion retrieval in papers from recent years.
 (Web Group) Privacy Preserving on the Searchable Internet 
The Internet is the largest repository of information. With the advent of Web 2.0, the amount of personal information on the Internet has increased sharply. Malicious attackers may collect a user's information scattered across the Web via search engines and obtain privacy-sensitive information. We have thus observed a new privacy problem on the Web: the Privacy Mining Attack via Search Engines. In this report, we extend an existing method proposed by our graduate Jing Ai. We propose a clustering method on bipartite graphs to resolve this problem.
 2010.12.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) A novel method to extend flash memory lifetime in flash-based DBMS 
As its capacity increases and its price gradually drops, flash memory is becoming a promising replacement for disk, even in enterprise applications. However, flash memory suffers from both erase-before-write and limited write-erase cycles, which means that excessive writes, especially small random writes, will wear out a flash block quickly. We analyze free space management in traditional DBMSs and point out its disadvantages when used on flash devices. In addition, we propose a new solution involving free space management and buffer management to extend the lifetime of flash memory by reducing the number of write I/Os.
 (Flash Group) An Operation Aware Flash Translation Layer for Enterprise-class SSDs 
The flash translation layer (FTL) is an important piece of firmware in flash-based devices, and it critically affects their performance. When SSDs are used in enterprise-class environments, the FTL should therefore be redesigned to improve overall performance. In this report, we introduce an operation-aware flash translation layer for enterprise-class SSDs.
 2010.12.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) A Structured Approach to Query Recommendation With Social Annotation Data [ppt]
Query recommendation has been recognized as an important means to help users search and to improve the usability of search engines.
 (Web Group) Introduction to OpenScholar 
OpenScholar is a web system for building scholars' homepages automatically. Its features for searching scholars' information and dynamic maintenance help users build their homepages easily and quickly.

 2010.11.26  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Research of query optimization in the cloud 
In cloud data management systems, data is partitioned into blocks and replicated. It is necessary to transfer some data blocks when processing certain types of queries, so we did some research on how to complete a query at low cost.
 (Web Group) Record Linkage with Uniqueness Constraints and Erroneous Values [ppt]
This paper presents some challenges of record linkage and data fusion in heterogeneous data sources with uniqueness constraints and erroneous values, models those records by utilizing K-partite graph, and proposes clustering algorithm and matching algorithm to cope with duplicates and conflicting data.
 2010.11.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Evaluating Entity Resolution Results [ppt]
Entity resolution (ER) is an important technique in data integration. Similar to clustering and partitioning, ER tries to identify records that refer to the same entity among masses of records. This report focuses on GMD, a measure for ER results.
 (Cloud Group) Research on Query Processing 
Query processing is a difficult problem in both parallel databases and cloud-based databases. We briefly introduce the basic query processing steps in centralized and parallel databases, and discuss web-scale query processing, including the MapReduce debates, MapReduce-based join algorithms, etc. Finally, we introduce the main idea of our work and some future work.
 2010.11.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) Diversification for Keyword Search on Graph Data 
Keyword search is the de facto information retrieval mechanism for data on the World Wide Web. It also proves to be an effective mechanism for querying semi-structured and structured data, because of its user-friendly query interface. Recently, query processing over graph-structured data has attracted increasing attention. In this report, we focus on the semantic diversification of results from keyword search on graphs.
 (Flash Group) Enterprise Application of SSD [ppt]
SSDs are becoming more and more popular in the enterprise. But there is a question: is the platform ready for SSD? This report addressed that question and also introduced SSD RAID.
 2010.11.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) CIKM2010 Story 
In this talk, I presented some papers and one panel related to Cloud Data Management in CIKM 2010. Then I gave a summary of CIKM 2010.
 (Cloud Group) RHP: a new partitioner to improve the efficiency of range query in Cassandra 
The conflicting goals of balancing data-access load and processing range queries efficiently mean that Cassandra cannot support range queries very well. How to trade them off is therefore the key point.

 2010.10.30  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Spatial-temporal sequence views query demo [ppt]
We collected information about views on Flickr to analyse how to traverse these views from a realistic perspective. If a user wants to traverse the views in a limited time, there may be several solutions, but which one is the most valuable? Based on our ideas, we give three solutions to this problem and show them in our demo.
 (Cloud Group) Survey of Object-based Storage [ppt]
Object-based Storage, a new approach to storage technology, is a subject of academic research and development in the storage industry. This survey describes the main points of object-based storage technology from five aspects: why the concept of object-based storage was introduced, what it is, how to take advantage of it, what its status is in both industry and academic research, and what we can do about it.
 (Mobile Group) Android Development tutorial [ppt]
Android, released by Google on Nov. 5th, 2007, is a Linux kernel-based operating system designed for smartphones. In the past three years, Android has achieved a large market share, and this share is still increasing. Meanwhile, Android has been attracting more and more developers, who have contributed more than 100,000 applications to the second largest online app store, called Android Market. This tutorial introduces application development on the Android platform as well as the mechanisms of Android.
 2010.10.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Flash-based Multi-Version Data Storage 
Because of the characteristics of flash memory and the data storage of PostgreSQL, many update operations and small random writes are issued to flash memory. These operations degrade the performance of the DBMS and the lifetime of the flash memory. Flash-based Multi-Version Data Storage (FMVDS) is proposed to reduce update and write operations and ultimately reduce the number of erases. In FMVDS, transaction table items with timestamps and data records with a pointer to the older version of the data enable high-concurrency control and quick recovery.
 (MSRA) Context-Aware Search 
An introduction to research on context-aware search at MSRA.

 2010.09.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Entity Resolution with Evolving Rules [ppt]
Entity resolution (ER) identifies database records that refer to the same real world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic “evolves” frequently. A naive approach that re-runs ER from scratch may not be tolerable for resolving large datasets. This paper investigates when and how we can instead exploit previous “materialized” ER results to save redundant work with evolved logic. We introduce algorithm properties that facilitate evolution, and we propose efficient rule evolution techniques for two clustering ER models: match-based clustering and distance-based clustering. Using real data sets, we illustrate the cost of materializations and the potential gains over the naive approach.
 (Mobile Group) VLDB paper report 
This report has two parts. The first covers retrieving top-k prestige-based relevant spatial web objects; this method proposes the concept of prestige-based relevance, and the top-k spatial web objects are ranked according to both prestige-based relevance and location proximity. The second part introduces how to mine significant semantic locations from GPS data; this method models the relationships between locations, and between locations and users, with a two-layered graph. Based on this, the paper proposes a new ranking model that assigns significance to locations.
 (Web Group) Paper Summary of VLDB2010 
Papers of VLDB2010 about cloud are classified into four aspects: Cloud Data Management Systems, Benchmark, Query Processing and open questions. This report introduces the motivation, key technology and inspiration to our research work.
 2010.09.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Graduate) New Experience in MSRA 
An introduction to personal life and impressions at MSRA.
 (Graduate) Introduction to Cloud and Flash Memory Management
Share new findings and thoughts about cloud computing and flash memory management.

 2010.06.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Privacy-preserving of Trajectory Data: A Survey [ppt]
This survey discussed trajectory data privacy preservation techniques in 4 motivating applications. For online trajectory data privacy preservation, the service is central and the trade-off is between QoS and privacy preservation; for offline trajectory data privacy preservation, the data is central and the trade-off is between data quality and privacy preservation.
 (XML Group) XML Keyword Query Refinement [ppt]
In this report, we discussed the problem of query refinement in traditional IR and in novel XML keyword search. The main part covered the task and methods of XML keyword query refinement. In addition, we classified the existing work on XML keyword query refinement and presented my own work on it.
 2010.06.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Credibility on the Web: A Survey 
This survey discussed credibility on the web from three kinds of entities
 (Web Group) Information Quality and Trustworthiness in Wikipedia 
In this talk we discussed the problem of information quality and trustworthiness in Wikipedia and introduced some research topics. In addition, we gave a brief overview of current research papers on this topic from WWW, WICOW, etc.
 2010.06.05  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Group) Index for cloud data management 
This report mainly introduces why we build indexes for cloud data management, some related work on indexing for cloud data management, and our progress on index research.
 (Cloud Computing Group) NoSQL Overview [ppt]
This report briefly introduced NoSQL: four reasons why the NoSQL concept was introduced, its history and definition, three fundamental theories of NoSQL, and the categories of NoSQL databases.

 2010.05.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) Keyword search on Graph 
In this report, I introduce methods that perform keyword search on graph data. Keyword search provides a simple but user-friendly interface to retrieve information from complicated data structures. In this discussion, I focus on three major challenges of keyword search on graphs: first, what qualifies as an answer to a keyword search on graphs; second, what constitutes a good answer, or how to rank the answers; third, how to perform keyword search efficiently.
 (XML Group) The Integration of Telecommunications Networks, Cable TV Networks and The Internet [ppt]
This report first introduces the concept of the integration of telecommunications networks, cable TV networks and the Internet, then presents its development process and its advantages. Finally, I describe the current state of the integration of the three kinds of networks abroad.
 2010.05.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Elementary Structure-based Graph Matching 
Past graph matching techniques are vertex-based, which means they first find a candidate set for each node in the query and then run a search algorithm to find a match. This approach costs too much, since there may be too many candidates for each node and these candidates form a large search space. To reduce the search space, it is profitable to raise the granularity of the matching algorithm.
 (XML Group) Data deduplication 
This report introduces some methods of data deduplication, such as hash-based algorithms and delta algorithms.
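As a rough illustration of the hash-based family mentioned above, the following minimal sketch splits data into fixed-size chunks and stores each distinct chunk once, keyed by its SHA-256 digest. The function names, the fixed 4096-byte chunk size, and the in-memory dictionary are assumptions made for this example; the report itself does not specify them.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Hypothetical helper for illustration only; chunk_size is an assumption.
    """
    store = {}   # digest -> chunk bytes (each unique chunk stored once)
    recipe = []  # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only the first copy is kept
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Identical chunks map to the same digest, so repeated content costs only one stored copy plus one entry in the recipe.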
 2010.05.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Benchmark results and analysis 
This report introduces the results of benchmark tests on cloud-based DBMSs and analyzes them.
 (Cloud Computing group) Architecture and Design of Distributed Database Systems [ppt]
This report introduces several kinds of architectures for distributed database systems based on the relational data model. It also introduces two horizontal fragmentation methods, one vertical fragmentation method, and the allocation model for a DDBMS.

 2010.04.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
Xuan Zhou (CSIRO, Australia) Integrating User Interfaces of DB and IR Systems 
In contrast to classical databases and IR systems, real-world information systems increasingly have to deal with very vague and diverse data structures. While current object-relational database systems require clear and unified data schemas, IR systems usually ignore the structured information completely. Malleable schemas, as recently introduced, provide a novel way to deal with vagueness, ambiguity and diversity by incorporating imprecise and overlapping definitions of data structures. In this talk, I will introduce a novel query relaxation scheme that enables users to find the best matching information by exploiting malleable schemas. Our scheme utilizes duplicates to discover the correlations within a malleable schema, and then uses these correlations to appropriately relax users' queries. Then, it ranks the results of the relaxed queries according to their respective probability of satisfying the original query’s intent. Our experiments with real-world data confirmed its performance and practicality.
 2010.04.17  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) Hush-Tell You Something Novel About Flash Memory ! 
This report introduces some work of the Non-volatile Systems Laboratory at UCSD, in which many tests on flash memory were done. Based on the test results, several applications were devised, including a variation-aware FTL called Mango, a flash-aware data encoding, and a system architecture for data-centric applications named Gordon.
 (Mobile Group) Existing DBMS on SSD 
By analyzing the IOPS of HDDs and SSDs, we compare the two kinds of devices. By analyzing TPC-C results for MySQL and PG on SSD and HDD, we compare the performance of existing DBMSs on SSD with that on HDD. We then propose some ideas.
 2010.04.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Web Pages Extraction Technologies in the Opinion Monitoring System 
This report introduces two web page extraction technologies used in our opinion monitoring system, and some popular tools for system development.
 (Mobile Group) An Introduction to Flex [ppt]
Flex is now very popular for developing Rich Internet Applications. This report introduces what Flex is and its history, and also discusses its mechanism, advantages, applications, and differences from other RIA techniques.
 (Web Group) System Environment and MapReduce Framework 
This report introduces the construction of our cloud data management platform and briefly discusses the MapReduce framework.
 (Flash Group) An Introduction to the Source Insight [ppt]
This report introduces Source Insight, a project-oriented program editor and code browser that parses your source code, maintains its own database of symbolic information dynamically while you work, and automatically presents useful contextual information.

 2010.03.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) IO3:Interval-based Out-of-order Event Processing in Pervasive Computing 
In pervasive computing environments, complex event processing has become increasingly important in modern applications. A key aspect of complex event processing is to extract patterns from event streams to make informed decisions in real time. However, network latencies and machine failures may cause events to arrive out of order. In addition, the existing literature assumes that events do not have any duration, but events in many real-world applications have durations, and the relationships among these events are often complex. In this work, we first analyze the preliminaries of time semantics and propose a model of it. We also introduce a hybrid solution that uses time intervals to handle out-of-order events and can switch from one level of output correctness to another based on real time. The experimental study demonstrates the effectiveness of our approach.
 (Cloud Group) ICDE2010 Keynote - what's new in the cloud [ppt]
This report discusses why we should do cloud computing, how to do it, and what to do.
 (Web Group) Survey of ICDE2010 and SIGMOD2010 
Based on the accepted papers, this presentation surveyed the recent international database conferences ICDE2010 and SIGMOD2010 and analyzed the research focuses of the database area.
 2010.03.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) RWConvertor: Random Write Optimization for SSD 
With the development of electronic technologies, Solid State Drives (SSDs) have emerged as new data storage media with low power consumption, high shock resistance and a lightweight form. Beyond these, their most attractive characteristic is high random read speed, owing to the absence of mechanical latency. SSDs have therefore been widely used in laptops, desktops, and data servers in place of hard disks during the past few years. However, poor random write performance has become the bottleneck for wider application. Random write is almost two orders of magnitude slower than both random read and sequential access, so write-intensive applications perform very poorly on SSDs. In this paper, we propose for the first time to insert unmodified data into a random write sequence in order to convert random writes into sequential writes, so that the data sequence can be flushed at the speed of sequential write. Further, we improve write performance with the Optimum Converted Write Sequence (OCWS). A strict mathematical proof determines the location and number of inserted data items in the course of obtaining the OCWS. We also optimized our method for the throughput of OCWS, which is determined by gain and granularity, when applied to data streams.
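The basic conversion idea, padding a scattered set of dirty pages with unmodified pages so that the whole range can be written in one sequential pass, can be sketched as follows. This is only the simplest possible form under stated assumptions: it fills every gap between the first and last dirty page, whereas the OCWS described in the abstract decides where and how many items to insert. The function name `convert_to_sequential` and the `read_page` callback are hypothetical names introduced for this example.

```python
def convert_to_sequential(dirty_pages, read_page):
    """Turn scattered page writes into one contiguous run of pages.

    dirty_pages: dict mapping page number -> new page contents.
    read_page:   hypothetical callback returning the current (unmodified)
                 contents of a page, used to fill the gaps.
    """
    if not dirty_pages:
        return []
    first, last = min(dirty_pages), max(dirty_pages)
    run = []
    for page_no in range(first, last + 1):
        if page_no in dirty_pages:
            run.append((page_no, dirty_pages[page_no]))  # modified page
        else:
            run.append((page_no, read_page(page_no)))    # unmodified filler
    return run
```

Filling every gap is only worthwhile when the dirty pages are close together; choosing which gaps to fill is exactly the trade-off that an optimized sequence has to resolve.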
 2010.03.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) Approaches to internet of things 
As the next generation of information technology, the Internet of Things has drawn public attention. It enables the Internet to reach out into the real world of physical objects. This report first gives the concept of the Internet of Things, then introduces its system architecture and key techniques and gives three applications. Finally, I put forward future directions.
 (Mobile Group) Related Work about Internet of Things [ppt]
This report gives an overview of related and future work on the Internet of Things and focuses on the RFID Ecosystem Experience conducted by the University of Washington.
 2010.03.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Open Source Cloud-based DBMS Experiments 
This report introduces existing experimental benchmarks for cloud-based DBMSs. We describe the testbed of our experiment and show the tasks and results.
 (Web Group) System Architecture Design and Implementation of Cloud-based Database System 
The Cloud-based Database project at WAMDM aims to research new storage and database systems that can support the next generation of data storage and management and be applied to mobile communications. This report introduced the architecture design and implementation of our cloud-based database system.

 2010.01.09  Venue: FL1, Meeting Room, Wing Building for Science Complex
(Invited Talk)
Time series and Interactive media 
Time series and interactive media have wide applications in areas such as computer games. One of the most important problems for pattern detection in streaming time series is how to define an effective distance metric. We propose a novel warping distance and an efficient approach for continuous pattern detection. For the interactive media database, the work focuses on the index and storage structure for smart media objects, similarity metrics, and query processing on multimedia data.
 (Flash Group) FTL Algorithms and Native Flash Experiments 
This report introduces five flash translation layer algorithms, including BAST, FAST, LAST, and DFTL. We describe the main ideas of these algorithms and their implementation. Then we introduce the native flash experiments.

 2009.12.26  Venue: FL1, Meeting Room, Wing Building for Science Complex
Dr. Rui Zhang 
(Invited Talk)
Continuous Intersection Joins Over Moving Objects 
The continuous intersection join query is computationally expensive yet important for various applications on moving objects. No previous study has specifically addressed this query type. We can adopt a naive algorithm or extend an existing technique (TP-Join) to process the query. However, they compute the answer for either too long or too short a time interval, which results in either a very large computation cost per object update or too frequent answer updates, respectively. This motivates us to optimize the query processing in the time dimension. In this study, we achieve this optimization by introducing the new concept of time-constrained (TC) processing. Further, TC processing enables a set of effective improvement techniques on traditional intersection join algorithms. With a thorough experimental study, we show that our algorithm outperforms the best adapted existing solution by several orders of magnitude.
Dr. Jinchuan Chen 
(Invited Talk)
Uncertain Data Management 
Dr. Jinchuan Chen gave a brief introduction to the research frontier in uncertain data management and some typical methods for handling data uncertainty. He also proposed some research topics in uncertain data management.
 2009.12.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Computing Group) Cassandra and SIGMOD Contest [ppt]
Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. The task of the SIGMOD Programming Contest 2010 is to implement a simple distributed query executor built on top of last year's main-memory index.
 (Mobile Group) Hammer & Nail 
"Research is actually a process of hammers (methods) hammering nails (problems)." This report first presents three hammers, i.e., three kinds of hash functions: signatures, OPMPHF (Order Preserving Minimal Perfect Hash Function) and LSH (Locality-Sensitive Hashing). It then introduces a nail that uses the hammers above, called Reverse k Spatial and Textual Nearest Neighbor (RkSTNN).
 2009.12.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Survey on Data Management in the Cloud 
With the development of computer and communication technology, large volumes of data are produced. Cloud-based databases are one solution for efficiently storing and analyzing these data. In this talk, we present some cloud-based databases and summarize them from different aspects.
 (Cloud Computing Group) Hive – A Warehousing Solution Over a MapReduce Framework [ppt]
Introduces a system built on top of Hadoop that supports managing and querying structured data, along with its query language.
 2009.12.05  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Trust Metric on Social Network 
This report introduces five trust metric mechanisms on social network, such as
 (Web Group) Data Fusion-Resolve Data Conflicts in Integration 
In this talk we gave a brief introduction to data fusion, including data conflict types, conflict resolution strategies, the role played by data fusion in integration programs and current approaches to data fusion. Then we addressed some challenges and open problems in data fusion research. Finally we presented a brief summary of this talk.

 2009.11.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) ACR: an Adaptive Cost-Aware Buffer Replacement Algorithm for Flash Storage Devices 
In this talk, we propose ACR, an adaptive cost-aware buffer replacement algorithm that adapts to various access patterns on flash disks.
 (Mobile Group) Multi-version Concurrency Control of Database Based on Flash Memory 
Data may have multiple versions because of the out-of-place update feature and the in-page logging storage mechanism of flash memory. Multi-version concurrency control has to be implemented based on serialization theory, and it includes MV2PL (multi-version 2PL), MVTO (multi-version TO), MVSGT (multi-version SGT), TW (time warp) and ROMV (read-only multi-version). We evaluated the performance of these algorithms through experiments on existing DBMSs such as MS SQL Server, MySQL and Postgres. Finally, we proposed some future work on multi-version concurrency control.
 2009.11.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) Efficient String Similarity Search Using Synonyms 
This report introduces gram-based string matching functions and the new similarity function.
 (XML Group) Reachability Queries on Large Directed Acyclic Graphs 
Graph reachability has attracted a lot of research attention, as reachability queries are not only common on graph databases but also serve as fundamental operations for many other graph queries. In this report, I introduce my new graph labeling scheme to speed up the processing of reachability queries on DAGs; its index is small and can be constructed easily.
 (XML Group) Information Retrieval Model and Relevance Feedback 
This report first introduces four classic information retrieval models. Based on those models, we present two methods of improving retrieval results
 2009.11.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Review our studies on dataspace 
Reviewed our work on dataspace research and introduced a project currently in progress.
 (Web Group) Dataspace Research Report 
Introduced research and system implementation progress on Dataspace research.
 (Web Group) Leveraging Feature Context to Facilitate Sub-graph Query in Graph Database 
Previous techniques focus on feature selection strategies to filter out as many false graphs as possible. This approach has hit a bottleneck: even as features become more and more complicated, precision remains low. We therefore propose to investigate how feature context could help improve pruning power in sub-graph queries.
 2009.11.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) About CIKM2009 Story 
Gives a short summary of CIKM 2009 based on my impressions of the conference, with a focus on the three keynotes.
 (Flash Group) Review of CIKM 2009 [ppt]
CIKM is a high level international conference. There are three tracks
 (Web Group) Summary of CIKM2009 
In this talk, I presented three papers and one tutorial from CIKM2009 related to Web data management and click log mining. I then gave a summary of CIKM2009.
 (Web Group) IR is Interesting-CIKM 2009 Report 
In this presentation, I gave a brief summary of and introduction to the CIKM 2009 conference and shared some of my own experience at the conference.

 2009.10.31  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) An Efficient Multi-Dimensional Index for Cloud Data Management [ppt]
In this presentation, I introduced our work on a multi-dimensional index structure for Cloud Computing platforms.
 (Web Group) Supporting Context-based Query in Personal DataSpace [poster]
Many users need to refer to content in existing files (pictures, tables, emails, web pages, etc.) when they write documents (programs, presentations, proposals, etc.), and often need to revisit these referenced files for review, revision or reconfirmation. In this paper, we propose an efficient solution for this problem. We first define a new personal data relationship
 (Flash Group) Pre-Report for CIKM 2009 [poster]
Solid State Drives (SSDs), emerging as new data storage media with high random read speed, have been widely used in laptops, desktops, and data servers to replace hard disks during the past few years. However, poor random write performance has become the bottleneck in practice. In this paper, we propose to insert unmodified data into a random write sequence in order to convert random writes into sequential writes, so that the data sequence can be flushed at the speed of sequential write.
 2009.10.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web&Mobile Group) Overview of Talks in NDBC 2009 
Dr. Xiangye Xiao gave a brief review of the invited talks at NDBC 2009, whose speakers included Dr. Xin Dong from AT&T, Prof. Weiyi Meng from Binghamton Univ., Haixun Wang from MSRA and Lei Chen from HKUST.
 (Web Group) Report on SKG2009 
Gives an introduction to SKG2009, focusing on the two keynotes of the conference.
 (Mobile Group) A new topic: queries with geo-information [ppt]
Discovering users' specific and implicit geographic intention in web search can greatly help satisfy users' information needs. Research on queries with geo-information has become a hot topic in recent years. There are several kinds of methods. The first kind is based on training data and needs large volumes of query logs; another kind is spatial and textual information retrieval, which can only deal with local geo-information. The challenge is how to discover users' implicit geo-information in queries.
 (Web Group) Trajectory Pattern Mining 
The pervasiveness of mobile devices and location-based services is leading to an increasing volume of mobility data. This side effect provides the opportunity to analyze movement behaviors. Against this background, trajectory pattern mining has become a popular topic. This report mainly introduces some representative work on this topic and points out some of its shortcomings.
 2009.10.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) C-Rank -- A Credibility Evaluation Method for Deep Web Records 
How to identify and evaluate information credibility has become an increasingly important problem. To address this issue, an effective credibility evaluation method called C-Rank is proposed to compute trust values of records in Deep Web databases; it constructs an S-R Credibility Graph for each record.
 (Mobile Group) Privacy Preserving towards Continuous Query in Location-based Services 
With advances in wireless communication and mobile positioning technologies, location-based mobile services have been gaining increasing popularity in recent years. Privacy preservation, including location privacy and query privacy, has recently received considerable attention for location-based mobile services. Many location cloaking approaches have been proposed for protecting the location privacy of mobile users. However, they mostly focus on anonymizing snapshot queries based on the proximity of locations at query issue time, so most of them are ill-suited for continuous queries. In view of the privacy disclosure (including location and query privacy) and poor quality of service under continuous query anonymization, a δp-privacy model and a δq-distortion model are proposed to balance the tradeoff between privacy preservation and quality of service. Meanwhile, a temporal distortion model is proposed to measure location information loss during a time interval, and it is mapped to a temporal similar distance between two queries. Finally, a greedy cloaking algorithm (GCA) is proposed, which is applicable to anonymizing both snapshot queries and continuous queries. Average cloaking success rate, cloaking time, processing time and anonymization cost for successful requests are evaluated with increasing privacy level (k). Experimental results validate the efficiency and effectiveness of the proposed algorithm.
 (XML Group) Algebra-based Transform query optimization strategy 
XQuery/Update defines a special Transform query, which is similar to a hypothetical query in relational databases and can be expressed as “Q when {U}”. In other words, the results of query Q are the same as they would be after executing the hypothetical update {U} on the original database, without actually updating the database. A Transform query copies nodes in the XML database and then updates the copies, so it does not affect the database. However, Transform queries usually copy and update many nodes that are useless for query Q, which results in high cost. Reducing the number of copied nodes and update operations is therefore critical for query optimization. In this paper, we propose a set of rules for Transform query optimization based on OrientXA, which are implemented in OrientX3.0.
 (Mobile Group) HF-Tree--An Update-Efficient Index for Flash Memory 
Due to the expensive write cost of flash memory, traditional disk-based indexes have poor update performance when directly applied to flash drives. In this talk, Da Zhou proposed a novel index called HF-Tree, which integrates BF-tree with Tri-hash, to improve update performance on flash memory.
 (Mobile Group) Sub-Join--A Query Optimization Algorithm for Flash-based Database 
Compared with hard disk drives (HDDs), SSDs have many advantages, such as high random read performance, low power consumption and a lightweight form, and they are therefore envisioned as the next generation of data storage in place of HDDs. However, the improvement in query performance for flash-based databases does not match the I/O performance ratio of SSD to HDD, because existing databases, which are designed for HDDs, cannot take full advantage of the high I/O performance of SSDs. In this paper, a new join algorithm, Sub-Join, is proposed. Sub-Join first projects the join column and primary key into a Sub-Table, and then executes the join on the Sub-Tables. Finally, the results are fetched from the original tables according to the result of the join on the Sub-Tables. Comparative experiments with Oracle Berkeley DB show that Sub-Join outperforms the original indexed nested-loop join by about 40% to 100%, which demonstrates the efficiency of this method.
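The three steps described for Sub-Join (project, join the projections, fetch full rows) can be sketched roughly as below. This is an in-memory illustration under stated assumptions, not the authors' implementation: tables are lists of dicts, the join column has the same name in both tables, and a hash join is used on the sub-tables because the abstract does not say which join method is applied there. The name `sub_join` and its parameters are hypothetical.

```python
def sub_join(left, right, left_pk, right_pk, join_col):
    """Join two tables (lists of dicts) via narrow (pk, join value) sub-tables."""
    # Step 1: project each table to a sub-table of (primary key, join column).
    left_sub = [(row[left_pk], row[join_col]) for row in left]
    right_sub = [(row[right_pk], row[join_col]) for row in right]

    # Step 2: join the sub-tables. A hash join is an assumption of this sketch.
    pks_by_value = {}
    for pk, value in right_sub:
        pks_by_value.setdefault(value, []).append(pk)
    matched = [
        (lpk, rpk)
        for lpk, value in left_sub
        for rpk in pks_by_value.get(value, [])
    ]

    # Step 3: fetch the full rows from the original tables by primary key.
    left_rows = {row[left_pk]: row for row in left}
    right_rows = {row[right_pk]: row for row in right}
    return [(left_rows[lpk], right_rows[rpk]) for lpk, rpk in matched]
```

The point of the layout is that the join itself touches only narrow sub-tables, and the wide rows are read afterwards by key, which is the kind of random read access that flash storage handles well.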

 2009.09.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (AT&T Research) Data Integration with Uncertainty 
Dr. Xin (Luna) Dong from the Data Management Department at AT&T Research visited the Web And Mobile Data Management (WAMDM) lab and gave an invited talk about Data Integration with Uncertainty. Her talk mainly focused on some important and valuable topics in uncertain data integration.
 2009.09.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web&Mobile Group) Efficient Co-Location Pattern Discovery 
Dr. Xiangye Xiao gave a brief talk about her research topics as a Ph.D. candidate at the Hong Kong University of Science and Technology. Her talk covered efficient co-location pattern discovery and Web browsing on mobile devices. In addition, Dr. Xiangye Xiao proposed some ideas for future research.
 (XML Group) Keyword Search Techniques in Mobile Web 
Dr. Jiaheng Lu received a funding award for "keyword search in mobile web" from the National Natural Science Foundation of China (NSFC). He gave a detailed presentation of the project and proposed some possible topics.

 2009.07.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) OrientX4.0 - Supporting Keyword Search 
With the development of XML technology, more and more people are using XML data. Traditionally, we use the standard query language XQuery to find the data we need, but this requires learning XQuery and knowing the structure and content of the XML document, which is a great challenge for naive users. For this purpose, the new edition, OrientX4.0, supports XML keyword search, which solves the problems encountered with XQuery and makes XML easier for people to use.
 (XML Group) OrientX4.0 System Development Report [ppt]
The implementation of XML keyword search.
 2009.07.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Probabilistic kNN Query in Road Network 
Queries for moving objects in road networks, especially kNN (k Nearest Neighbor) queries, are very important and have received considerable attention. This talk discusses how to model uncertain data and process kNN queries in road networks.
 (Mobile Group) Report on Privacy Protection Demo Application Development 
In order to apply the current privacy protection algorithms and integrate them into the 863 Pervasive Computing project, we decided to develop a demonstration application. This report introduced the technical and functional characteristics of the application as well as the development plan.
 (Mobile Group) Query Processing over Interval-based Out-of-order Event Streams 
Complex event processing has become increasingly important in modern applications, ranging from supply chain management for RFID tracking to real-time intrusion detection. A key aspect of complex event processing is to extract patterns from event streams to make informed decisions in real time. However, network latencies and machine failures may cause events to arrive out of order at the event processing engine. In addition, existing temporal pattern mining assumes that events do not have any duration. However, events in many real-world applications have durations, and the relationships among these events are often complex. In this work, we propose a solution to process both sequence and parallel pattern queries on out-of-order event streams. First, we analyze the preliminaries and the problems caused by out-of-order data arrival. We then propose a method to detect out-of-order event patterns. A new solution that uses time intervals to handle out-of-order problems is also introduced. Lastly, we conduct an experimental study demonstrating the effectiveness of our approach.
 2009.07.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Flash Group) System Development Report of Flash Group [ppt]
Our target is to develop a special flash-based DBMS, and we decided to do so by modifying an existing open source DBMS. However, there are many open source systems. Which one is the best choice? After a detailed analysis, we believe MySQL, which contains Berkeley DB as one of its storage engines, is the answer to our problem.
 2009.07.04  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) SIGMOD2009 Overview [ppt]
Analyzes the current hot research issues based on the accepted papers of SIGMOD2009 and introduces two papers from the conference.
 (Mobile Group) Flash Research Report [ppt]
Research on flash-based database systems is becoming increasingly popular. In SIGMOD2009 and VLDB2009, we are glad to see some papers about indexing, query processing and transaction processing. This report gives a coarse overview of the motivations and ideas of these papers.
 (XML Group) XML Labeling and Query Optimization in Sigmod09 [ppt]
Optimization of complex XQueries combining many XPath steps and joins is currently hindered by the absence of good cardinality estimation and cost models for XQuery. Labeling schemes lie at the core of query processing for many XML database management systems. Designing labeling schemes for dynamic XML documents is an important problem that has received a lot of research attention. This presentation introduces a new labeling scheme, DDE, and a new runtime optimization approach, ROX, from SIGMOD09.

 2009.06.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Logging in Flash-based Database Systems [ppt]
Synchronous transactional logging is the central mechanism for ensuring data persistency and recoverability in database systems. In this report, we discussed solutions that exploit different kinds of flash drives for synchronous logging and the related recovery processing technologies.
 (Web Group) Location-based Database Selection 
Location-based database selection is a new topic. This report mainly gives an introduction to this topic, including why we chose it, what the problem is, some related work and how to solve the problem.
 (Web Group) Snippet of Structured Data 
It is expected that more and more people will search the web when they are on the move. However, there are many limitations when browsing web pages on mobile devices, especially the small screen. A database record usually contains a lot of information, much of which is not useful to the user and is too much for a small screen. We therefore try to extract the most useful attributes to return to the user.
 2009.06.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) XML Keyword-Search engine 
XML has already become the de facto standard for data exchange, so how to query XML data is becoming very important. We can use the query languages XQuery and XPath, the standard XML query languages recommended by the W3C, to get what we need. However, users must first be familiar with the query languages and know the content and structure of the XML data in order to write accurate queries. This is not easy for most users, and it motivates the study of XML keyword search, with which we need neither learn an XML query language nor know the content and structure of the XML data, making queries easier. The main feature of the next edition of OrientX (edition 4.0) is support for keyword search. In the presentation, Qingsong Guo analyzed the existing XML keyword search engines, compared them, and identified their common features; based on this, we defined the main features of OrientX 4.0. Wei Wang analyzed the key technologies of XML keyword search, such as the principles and algorithms for computing SLCA and the ranking of query results.
 2009.06.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML) Query Processing over Graph-structured XML Data 
When XML documents are modeled as graphs, many research issues arise. In particular, there are many new challenges in query processing on graph-structured XML documents because traditional query processing techniques for tree-structured XML documents cannot be directly applied.
 (Mobile Group) MVCC on Flash Memory [ppt]
First, flash has the characteristic of out-of-place updating, which leads to multiple versions of data on flash. Second, I introduce the basic principles and some protocols of MVCC, such as MVSR, MVCR, MVTO, MV2PL and so on. Finally, I present some information about transactions in BDB and PG.
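As an illustration of one of the protocols named above, the following is a minimal textbook-style sketch of multiversion timestamp ordering (MVTO) for a single data item. It is not taken from the talk and is not flash-specific; the class name `MVTOItem`, the in-memory version list, and the assumption that transaction timestamps are positive integers are all hypothetical choices made for this example.

```python
import bisect


class MVTOItem:
    """One data item under multiversion timestamp ordering (illustrative sketch)."""

    def __init__(self, initial_value):
        # Each version is [write_ts, value, max_read_ts], kept sorted by write_ts.
        # The initial version has write timestamp 0; transactions use timestamps > 0.
        self.versions = [[0, initial_value, 0]]

    def _visible_index(self, ts):
        # Index of the version with the largest write_ts that is <= ts.
        write_timestamps = [v[0] for v in self.versions]
        return bisect.bisect_right(write_timestamps, ts) - 1

    def read(self, ts):
        # Reads never block or abort: they pick the visible version
        # and record that a transaction with timestamp ts has read it.
        version = self.versions[self._visible_index(ts)]
        version[2] = max(version[2], ts)
        return version[1]

    def write(self, ts, value):
        # A write is rejected if a younger transaction has already read
        # the version that this write would have superseded.
        idx = self._visible_index(ts)
        version = self.versions[idx]
        if version[2] > ts:
            return False  # the writer must abort
        if version[0] == ts:
            version[1] = value  # same transaction overwrites its own version
        else:
            self.versions.insert(idx + 1, [ts, value, ts])
        return True


item = MVTOItem("v0")
item.write(5, "v5")
print(item.read(3), item.read(7), item.write(6, "v6"))
```

Each write that succeeds creates a new version rather than updating in place, which is the property that makes multiversion schemes a natural fit for storage with out-of-place updates.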
 2009.06.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Location, Location, Location 
This talk discusses the keynote given by Christian S. Jensen at MDM 2009.
 (Web Group) C-Query: Context-based Query in Personal DataSpace 
Many users need to refer to content in existing files (pictures, tables, emails, web pages, etc.) when they write documents (programs, presentations, proposals, etc.), and often need to revisit the referenced files for review, revision or reconfirmation. In this paper, we propose an efficient method for users to revisit these referenced files by identifying a context-based reference relationship.

 2009.05.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) OrientX system development report [ppt]
The main features of OrientX version 3.5 and its implementation.
 2009.05.16  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Random Write Optimization for SSD 
Random writes on SSDs have low I/O performance compared with sequential writes and with sequential or random reads. This paper proposes a novel method to avoid the low performance of random writes.
 (Mobile Group) Buffer Management Policy [ppt]
In this talk, I introduced several interesting buffer management algorithms, including some that work well on disk-based DBMSs and others designed for flash-based DBMSs.

 2009.04.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) An Indexing Framework for Efficient Retrieval on the Cloud [ppt]
The emergence of the Cloud system has simplified the deployment of large-scale distributed systems for software vendors. The Cloud system provides a simple and unified interface between vendor and user, allowing vendors to focus more on the software itself rather than the underlying framework. Existing Cloud systems seek to improve performance by increasing parallelism. This paper explores an alternative solution, proposing an indexing framework for the Cloud system based on the structured overlay. Its indexing framework reduces the amount of data transferred inside the Cloud and facilitates the deployment of database back-end applications.
 (Web Group) Data Management in the Cloud - Limitations and Opportunities [ppt]
This talk analysed which data management applications are suitable for moving to the cloud platform and discussed the remaining challenges of such a move.
 2009.04.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) MCN: A New Semantics Towards Effective XML Keyword Search [ppt]
In this talk, we propose a new XML keyword search semantics that aims to capture meaningful results while avoiding meaningless ones. This contribution is based on the observation that, when it comes to relationships between data elements, the user's query intention is always based on the relationships between real-world entities.
 (Web Group) Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration [ppt]
In Deep Web data integration, some Web database interfaces express exclusive predicates, which permit only one predicate to be selected at a time. Accurately and efficiently estimating the selectivity of each exclusive query (Qe) is of critical importance to optimal query translation. In this paper, we mainly focus on selectivity estimation on infinite-value attributes, which is more difficult than that on key attributes and categorical attributes. We start with two observations.
 2009.04.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Summary of ICDE2009 keynotes [ppt]
These slides give a summary of three keynotes at ICDE 2009.
 (Mobile Group) ICDE 2009 Introduction 
ICDE is a very important international conference on data management. At this conference, there were many works related to flash-based databases. Transactions have become an important topic in this field.
 (Flash Group) Demo in ICDE 2009 Conference [ppt]
WEST (Web Entity Search Technologies), instead of returning web pages related to any person who happens to have the queried name, outputs a set of clusters of web pages, one cluster per distinct person. Fa is a new system for automated diagnosis of system failures that is designed to address SLO violations. UQLIPS is a Web-based integrated platform that performs online detection of near-duplicate occurrences over continuous video streams, as well as retrieval of near-duplicate clips from segmented video collections.
 2009.04.04  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Mobile Group) Distortion-based Anonymity towards Continuous Query in Mobile Services 
Privacy preservation has recently received considerable attention for location-based mobile services. Many location cloaking approaches have been proposed for protecting the location privacy of mobile users. In this paper, we show the privacy disclosure and the degraded QoS that can result from anonymizing continuous queries.
 (Mobile Group) Complex Event Detection in Pervasive Computing 
In pervasive computing environments, wide deployment of sensor devices has generated an unprecedented volume of atomic events. However, most applications, such as healthcare, surveillance, facility management, and environmental monitoring, require such events to be filtered and correlated for complex event detection. Therefore, how to extract interesting, useful and complex events from low-level atomic events is becoming more and more important in daily life. Due to the increasing importance of complex event detection, this paper proposes a framework of Complex Event Detection and Operation (CEDO) in pervasive computing. It gives an event model and extends current detection by incorporating temporal and spatial settings of events and different levels of granularity for event representation. We first show the research issues, related work, and main research problems in this area. Then our current research work and the preliminary results are introduced. Finally, the research plan of my PhD project is presented for discussion.

 2009.03.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Deep Web Integration: Querying Structured Data on the Deep Web [ppt]
In this report, I will introduce the background of the Deep Web, the key technologies of Deep Web data integration, and the active research groups. Then I will compare the MetaQuerier with metasearch engines. Finally, I will present research problems for the future.
 (Web Group) Database Selection 
Database selection is an important topic. This report gives an introduction to database selection and then introduces our new problem.
 2009.03.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) CoreSpace: A personal dataspace framework based on user activity 
Presents a new personal dataspace framework that highlights the relationship between users and average objects, which provides more effective approaches to querying personal dataspaces.
 (Web Group) An Efficient Method to Identify Personal Tasks 
Presents a new method to identify personal tasks based on user access activity.
 2009.03.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Cloud Computing) Research Report on Map/Reduce Framework Based on Hadoop [ppt]
Map/Reduce is the core programming model of Hadoop. It is a simple but powerful model for solving problems over massive data. In this report, I will introduce the concepts of Hadoop and Map/Reduce, and then detail how the Map/Reduce framework executes jobs.
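To make the job flow concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example job. It is plain Python and does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this illustration, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    # Map: emit one (word, 1) pair per word in the input record.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big", "data stream"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can in principle run in parallel on separate partitions of the data.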
 (Web Group) Introduction to HBase [ppt]
As a sub-project of Hadoop, HBase focuses on providing storage for the Hadoop distributed computing environment. HBase is a column-oriented table store. Its three-layer file system provides a feasible scheme for distributed data storage, while its three-layer architecture solves the problems of region assignment and region location. To give an intuitive understanding of HBase, a comparison with MySQL was made in the test.
 (Web Group) The Progress of C-DBLP's Development and Future Plans 
The development team of the C-DBLP system has added some attractive functions and features to the site since its release, based on user feedback and research demands. In addition, we are working on some interesting problems such as name disambiguation and mining relations among authors. This report presented the progress of C-DBLP's development and showed intuitive approaches to the research problems in C-DBLP. We also made a detailed plan for future work on C-DBLP.
 2009.03.07  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Study on Fast Approximate Membership Checking 
Introduces ISH for approximate membership checking and analyzes its disadvantages. We propose a new index and a corresponding algorithm; experiments indicate that the new method is more efficient than ISH.
 (XML Group) String Similarity 
This report introduces methods for computing string similarity, including edit distance and gram-based similarity.
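The two families of measures can be sketched as follows. This is a minimal illustration, not the report's own code: `edit_distance` is the standard Levenshtein dynamic program, while `gram_similarity` assumes one common gram-based definition (Jaccard overlap of padded q-gram sets). The padding characters `#` and `$` and the default `q = 2` are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of q-grams of s, padded so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def gram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(edit_distance("kitten", "sitting"))
print(gram_similarity("database", "databases"))
```

Edit distance counts character-level operations exactly, whereas gram-based measures only compare sets of substrings, which makes them cheaper to index for approximate matching at the cost of being less precise.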

 2009.02.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (Web Group) Faceted Search [ppt]
An introduction to faceted search, including the evolution of faceted search, the differences between faceted search, navigational search, and direct search, and the differences between clusters, tags, and facets.
 (Web Group) Automatic Construction of Facet Hierarchies 
Facet hierarchies are the main form of data organization in faceted search systems. They are used to support facet-based navigation and to refine search results through different facets. The construction of facet hierarchies is one of the most important research topics in faceted search. Since most facet hierarchies in current systems are built manually, automatic construction methods are in great need. This presentation covered W. Dakka and P. G. Ipeirotis's research progress on the automatic construction of facet hierarchies.

 2009.01.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
 (XML Group) Survey of XML Database Technology [ppt]
In this talk, I cover the main topics in XML databases and explain the existing solutions using simple examples.
 (XML Group) Graph Databases 
This presentation introduces some research hotspots in graph databases, including index construction, containment query processing, and reachability query answering.