WAMDM Seminar  
  • 2017-03-31 Processing of large-scale spatiotemporal data by Wenmei Wu
  • 2017-03-31 GWAC data real-time processing and interval query by Chen Yang

    2017
     2017.03.31  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Processing of large-scale spatiotemporal data 
    Abstract:
    Secondo, an extensible system, provides a variety of data types and operators to effectively represent and process spatiotemporal data. However, with the widespread use of navigation and mobile devices, the volume of spatiotemporal data has exploded. The stand-alone version of Secondo cannot meet the actual needs of spatiotemporal data processing. This report describes parallel and distributed Secondo systems.
     (Cloud Group) GWAC data real-time processing and interval query 
    Abstract:
    The emergence of very large astronomical observations not only allows researchers to observe new astronomical phenomena but can also be used to verify the correctness of existing physical models. At present, the GWAC astronomical telescope data processing project, which involves institutions such as the Observatory and Renmin University of China, has the following distinctive features: (1) the data source produces data at a fixed frequency in the form of a stream; (2) the data arrive in blocks; (3) the data of the current observation night must be queryable with low latency. The Observatory's solution uses the MonetDB database as the underlying store and loads the star-related data into one logical table. Although this solution is simple, MonetDB's load time jumps to about 10 seconds every few dozen files, and this instability may cause data storage to lag. Renmin University's solution uses a Redis cluster as the underlying store, with each star's data forming a KEY-LIST structure, but this structure incurs high network latency when storing data and a large memory overhead for data management. To address these problems, we improved the design: the data of each abnormal star are stored in a KEY-LIST structure of their own, while the remaining data are stored block by block in KEY-LIST structures. The advantage of this design is that it trades off storage efficiency against query efficiency. However, it reduces the efficiency of special queries such as interval queries, so we plan to introduce a special inverted index and a segment tree as an index to improve overall query speed.
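    The abstract does not say how the planned segment tree would be built. Purely as an illustration of the data structure, the sketch below answers a range-maximum query over a list of per-block values (for example, a peak measurement per time block). The class name `SegmentTree`, the method `query`, and the choice of maximum as the aggregate are assumptions made for this example, not details of the GWAC system.

```python
class SegmentTree:
    """Iterative segment tree answering max over a half-open index range."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n..2n-1; node i aggregates nodes 2i and 2i+1.
        self.tree = [float("-inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Return max(values[lo:hi]); -inf when the range is empty."""
        best = float("-inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:            # lo is a right child: take it and move right
                best = max(best, self.tree[lo])
                lo += 1
            if hi & 1:            # hi is exclusive: step left and take that node
                hi -= 1
                best = max(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


blocks = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical per-block values
tree = SegmentTree(blocks)
peak = tree.query(3, 5)             # maximum over blocks 3 and 4
```

    Each query touches O(log n) nodes instead of scanning every block in the interval, which is the property that makes the structure attractive for interval queries.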
     2017.03.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Skewed Data Stream Join in Distributed Data Stream Management Systems 
    Abstract:
    Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of the state maintained and the number of messages communicated. As in conventional database processing, online theta-joins over data streams are computationally expensive; moreover, being memory-based, they impose high memory requirements on the system. The Join-Biclique model has three characteristics: memory efficiency, elasticity and scalability. However, the existing Join-Biclique model cannot allocate query nodes dynamically and requires the grouping parameters to be set manually. More seriously, the effect of data skew is worse under full-history join queries. In this talk, in order to ensure the consistency of the query statement, we introduce a greedy algorithm to deal with data stream skew.
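    The abstract does not describe the greedy algorithm itself. As a generic illustration of greedy skew handling (not the speaker's method), the sketch below assigns join keys to processing nodes heaviest-first, always choosing the currently least-loaded node. The function name `greedy_assign` and the per-key tuple counts are hypothetical.

```python
import heapq


def greedy_assign(key_counts, num_nodes):
    """Assign keys to nodes, heaviest key first, always to the least-loaded node."""
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; break ties by key so the result is deterministic.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[key] = node
        heapq.heappush(heap, (load + count, node))
    loads = [0] * num_nodes
    for key, node in assignment.items():
        loads[node] += key_counts[key]
    return assignment, loads


# Hypothetical skewed workload: key "a" dominates the stream.
counts = {"a": 90, "b": 40, "c": 30, "d": 20, "e": 20}
assignment, loads = greedy_assign(counts, 2)
```

    A plain hash partitioner could place several heavy keys on the same node, whereas the greedy pass keeps the heaviest key apart from the next-heaviest ones. It cannot split a single hot key, which is where stream-specific techniques are needed.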
     (Cloud Group) Spark Core Programming and Core Architecture Depth Analysis 
    Abstract:
    This paper focuses on Spark's features, core programming principles, operator case studies, and kernel architecture analysis.
     2017.03.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) KB Completion Based on PRA 
    Abstract:
    Though knowledge bases are becoming much larger, they are still far from complete. KB completion models can be classified into three kinds: graph feature models, latent feature models and Markov Random Fields. This report presents the Path Ranking Algorithm (PRA), a graph feature model, and two optimizations of it. The first optimization extends a large knowledge base by reading relational information from a large Web text corpus. The other proposes a novel multi-task learning framework for PRA, referred to as Coupled PRA (CPRA). Do these optimizations still apply to latent feature models? Can we simply combine the two kinds of model into one? The report ends with a simple comparison between graph and latent feature models.
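    PRA uses, as features for a candidate pair of entities, the probability that a random walk starting at the source entity and following a fixed sequence of relations ends at the target entity. The sketch below computes that path-constrained random-walk distribution on a toy graph. The function name `path_probability` and the example triples are made up for illustration.

```python
from collections import defaultdict


def path_probability(edges, source, path):
    """Distribution over end nodes of a random walk from `source` along `path`."""
    # edges: iterable of (head, relation, tail) triples.
    out = defaultdict(list)
    for head, rel, tail in edges:
        out[(head, rel)].append(tail)
    dist = {source: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            tails = out.get((node, rel), [])
            # Spread the node's probability mass uniformly over its neighbours.
            for tail in tails:
                nxt[tail] += p / len(tails)
        dist = dict(nxt)
    return dist


# Toy knowledge base (hypothetical entities and relations).
kb = [
    ("alice", "worksAt", "acme"),
    ("alice", "worksAt", "globex"),
    ("acme", "locatedIn", "paris"),
    ("globex", "locatedIn", "paris"),
    ("globex", "locatedIn", "rome"),
]
feature = path_probability(kb, "alice", ["worksAt", "locatedIn"])
```

    In PRA, one such value per relation path becomes one feature, and a classifier trained per target relation combines the features to score candidate facts.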
     (Web Group) Knowledge Base Construction with DeepDive 
    Abstract:
    A knowledge base, which consists of entities and relationships, describes the abstract concepts of the world. It is widely used in commercial search engines, question answering systems, electronic commerce sites and social network sites. DeepDive, developed by Stanford University, is an open source knowledge base building tool. This report first introduces DeepDive's development background and architecture, and then uses an example (spouse relationship construction) to present DeepDive's application development process. The final part of the report introduces the difficulties of using DeepDive and the Web Group's future work.
     2017.02.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Knowledge Fusion) Automatic Knowledge Base Construction by NELL, EntityCube, Watson, or DeepDive 
    Abstract:
    Large knowledge bases (KBs) about entities, their properties, and the relationships between entities have become an important asset for semantic search, analytics, and smart recommendations over Web contents and other kinds of Big Data. Knowledge base construction (KBC) is the process of populating a knowledge base, i.e., a relational database storing factual information, from unstructured inputs. One key challenge in building a high-quality KBC system is that developers must often deal with data that are both diverse in type and large in size. Further complicating the scenario is that these data need to be manipulated by both relational operations and state-of-the-art machine-learning techniques. The implementation and development of KBC technology can be illustrated through several actual systems.
     (Web Group) Constructing an Interactive Natural Language Interface for Relational Databases 
    Abstract:
    Natural language has been the holy grail of query interface designers, but has generally been considered too hard to work with, except in limited specific circumstances. In this report, I describe the architecture of an interactive natural language query interface for relational databases. Through a carefully limited interaction with the user, we are able to correctly interpret complex natural language queries, in a generic manner across a range of domains. By these means, a logically complex English language sentence is correctly translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and can be evaluated against an RDBMS. We have constructed a system, NaLIR (Natural Language Interface for Relational Databases), embodying these ideas. Our experimental assessment, through user studies, demonstrates that NaLIR is good enough to be usable in practice: even naive users are able to specify quite complex ad-hoc queries.
     2017.01.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Discussion on Warehousing Procedure 
    Abstract:
    We warehouse the data accumulated in Redis to HDFS through Spark. This report introduces how to implement the warehousing program in detail and explains how we solve the problems encountered in practice.
     (Web Group) Based on Big data Small Footprint and Sirius for understanding benchmark 
    Abstract:
    Sensors on mobile phones and wearables, and IoT (Internet of Things) sensors in general, bring a couple of new challenges to big data research. First, the power consumption for analyzing sensor data must be low, since most wearables and portable devices are power-strapped. Second, the velocity of analyzing big data on these devices must be high, otherwise the limited local storage may overflow. The other paper observes that as user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. This report compares the two papers and draws some conclusions about how to write a benchmark paper.

    2016
     2016.12.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Knowledge-based Question Answering With Deep Learning 
    Abstract:
    Deep learning has made great progress in the fields of image and speech. Some common tasks of natural language understanding, such as part of speech tagging, word segmentation, named entity recognition, entity extraction, relation classification and classification, have achieved good results with deep learning. This report focuses on KB-QA (knowledge-based Question Answering). Some common methods will be introduced. This report will also present some progress of Web Group in this field and the future work of Web Group.
     (Cloud Group) Data Management Challenges and Real-time Processing Technologies in Astronomy 
    Abstract:
    In recent years, many large telescopes that can produce petabytes or exabytes of data have come into operation. These telescopes help not only to find new astronomical phenomena but also to confirm existing astrophysical models. However, the star tables they produce are so large that a single database cannot manage them efficiently. Take GWAC, which has 40 cameras and was designed in China, as an example: it takes a high-resolution photo every 15 s, and its database must make the star tables queryable within 15 s. Moreover, the database has to process multi-camera data, find abnormal stars in real time, query their recent historical data very quickly, and persist and query star tables offline as fast as possible. To address these problems, we first design a distributed data generator to simulate the GWAC working process. Second, we propose a two-level cache architecture that can not only process multi-camera data and find abnormal stars in local memory but also query star tables in a distributed memory system. Third, we propose a storage format named the star cluster, which stores a group of stars in one physical file to trade off persistence efficiency against query efficiency. Last, our query engine, based on an index table, can query both the second-level cache and the star cluster format. The experimental results show that our distributed system prototype can satisfy the demands of GWAC on our server cluster.
     2016.12.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Energy Conservation Techniques for Disk 
    Abstract:
    Today, with the explosive growth of data, we need not only to store large amounts of data but also to process massive amounts of data. This brings high energy consumption. Moreover, data center energy consumption is still growing rapidly year by year. A large portion of the energy consumption in the data center is caused by disks. Storage systems currently account for 37% of the energy consumption of the entire IT center. At the same time, storage energy consumption is also increasing very quickly. This report summarizes disk-based energy-saving techniques and discusses energy-saving ideas for specific applications.
     (Web Group) Knowledge Base Completion via Search-Based Question Answering 
    Abstract:
    Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. In the paper "Knowledge base completion via search-based question answering" (WWW 2014), the authors propose a way to leverage existing Web-search–based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, they learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. The paper also discusses how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute.
     2016.12.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Exploring Relation Paths between Entities in Knowledge Graphs 
    Abstract:
    Exploring relation paths between entities is a common need in many fields. For example, social networking services suggest friends based on known associations between people. Security agents are interested in associations between suspected terrorists. Biologists discover the relations among genes, proteins and diseases to develop drugs. In recent years, the increasing amount of graph-structured data on the Web, like RDF data, has made association finding easier than extracting from Web text. This report compares some systems of relation-path finding, and proposes some problems in the biomedical domain.
     2016.12.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Constructing ARM TCM platform and optimizing energy consumption of SQLite 
    Abstract:
    The previous report found that data migration accounts for about 60% of the energy consumption of database applications, and that the L1 cache alone accounts for about 90% of that share. This time we put forward a corresponding improvement. The general idea is to use software-controlled ARM TCM to partially replace the traditional hardware-controlled L1 cache. We first describe the selection and construction of the hardware and software environments and the implementation of the user interface for TCM. We then analyze the implementation of SQLite and present preliminary designs and implementations of optimizations for its hot data structures, B-tree and basic operations.
     (Privacy Group) Privacy-Preserving Data Publishing 
    Abstract:
    Privacy is an important issue when one wants to make use of data that involves individuals' sensitive information. Research on protecting the privacy of individuals and the confidentiality of data has received contributions from many fields, including computer science, statistics, economics, and social science. The report primarily focuses on research work in privacy-preserving data publishing. This is an area that attempts to answer the problem of how an organization, such as a hospital, government agency, or insurance company, can release data to the public without violating the confidentiality of personal information.
     2016.12.01  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) An Initial Survey on KB-based Relationship Extraction 
    Abstract:
    We live in an information age full of data, among which knowledge is the most special kind. Knowledge is usually expressed as relationships. On the one hand, humans can understand relationships easily; on the other hand, because the relationships among objects are intricate, analysing them automatically is difficult. Drawing on previous literature and typical systems, this report shares some relationship extraction techniques based on knowledge bases, which can benefit decision-making and scientific research.
     (Privacy Group) Mobile Privacy Survey--Evaluating App Privacy and User Privacy Solution 
    Abstract:
    Privacy has become a key concern for smartphone users, as many apps tend to access and share sensitive data. Research has taken three main approaches to surveying the status of sensitive data collection on mobile phones: permission analysis, static code analysis and dynamic analysis. As a mobile privacy violation was later defined as collecting sensitive data without the user's consent, permission-based and privacy-policy-based analysis methods were proposed to evaluate privacy leakage. At the same time, a few privacy-preserving techniques have been offered to prevent the data collection process or to anonymize sensitive information. More recently, the local differential privacy method applies differential privacy on small mobile devices to protect user privacy.
     2016.11.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Text mining-BioNLP2016 
    Abstract:
    With the rapid growth of biomedical information, relying only on manual reading to obtain and understand the required knowledge has become extremely difficult. How to integrate existing knowledge and mine new knowledge from the mass of biomedical literature has become a hotspot of current research. Text mining can help people extract implicit, previously unknown, but potentially valuable information and knowledge from large amounts of unstructured and semi-structured biomedical text, and it is now widely used in biomedical research. Conferences such as BioNLP propose biomedical text mining tasks, which are explored and practiced with different methods to promote the development of the field. This report mainly introduces the central topics of previous BioNLP meetings and elaborates on two papers as examples. Finally, the speaker puts forward some ideas of their own.
     (Cloud Group) Untwisting The Rope: A Resource Decoupling Approach Revisiting Performance bottlenecks of Big Data Systems 
    Abstract:
    Big data systems are complex, and it is difficult to analyze their performance bottlenecks. Existing research proposes many model-based methods that identify performance bottlenecks from a single observation, but these can quantify the bottlenecks of only some components and are error-prone. In this paper, we present a resource decoupling approach to systematically quantify the bottlenecks of the major components. To conduct a detailed analysis, we do the following work: (1) we present four quantitative methods for the CPU, memory, disk and network bottlenecks; (2) we propose an ideal speedup to quantify the minimum acceleration potential of non-CPU components; (3) we develop a tool that monitors performance events to cross-validate the ranking of performance bottlenecks and find fine-grained causes; (4) we use Spark as an example of a big data system and evaluate its performance with two SQL benchmarks.
     2016.11.17  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Report of CIKM2016 
    Abstract:
    The 25th ACM International Conference on Information and Knowledge Management (CIKM2016) was held in Indianapolis, USA, October 24-28, 2016. CIKM2016 received 935 paper submissions for the research track; 160 were accepted as full papers (acceptance rate 22.8%) and 55 as short papers (acceptance rate 23.5%). The conference program included 3 keynotes, 7 tutorials, 4 industry talks, and 50 paper sessions.
     (Web Group) Deep learning and NLP 
    Abstract:
    Natural language processing has been developing for more than 50 years and was dominated by rule-based approaches in its early days. Truly effective language processing, however, dates from around 2000, mainly because of the rise of statistical natural language processing. After more than 10 years of development, and with the emergence of big data technology, mass data acquisition is no longer a problem. Deep machine learning methods first made breakthroughs in speech and image processing as a new approach to AI, and NLP has naturally joined the transformation brought by deep learning. Deep learning has been applied to many NLP problems, such as word representation, emotion classification, entity recognition, reading comprehension, relation extraction and visual QA, and it outperforms statistical methods on many of them. This report discusses and studies some problems selected from the above with deep learning.
     2016.11.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) URMDA : A System for Diagnosing Spark's Performance Bottlenecks 
    Abstract:
    This paper demonstrates URMDA, a system for diagnosing Spark's performance bottlenecks. We implement the resource decoupling approach to quantify the bottlenecks of the major components, which include CPU, disk, network and memory, and build a fine-grained monitor that, combined with several analysis functions, performs an in-depth analysis of Spark's performance bottlenecks. We demonstrate URMDA using two SQL benchmarks and draw the following conclusions. (1) The network is likely to be the bottleneck, especially when the bandwidth is 100Mbps. (2) The CPU is always the major bottleneck. (3) In-memory Spark is not as fast as officially claimed because of its weak cache operation.
     (Web Group) Data visualization technology application and research 
    Abstract:
    Visual analytics is an important method used in big data analysis. The aim of big data visual analytics is to take advantage of humans' cognitive abilities in visualizing information while utilizing computers' capability in automatic analysis. By combining the advantages of both humans and computers, along with interactive analysis methods and interaction techniques, big data visual analytics can help people to understand the information, knowledge and wisdom behind big data directly and effectively.
     2016.10.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) differential privacy demo system demonstration 
    Abstract:
    Under differential privacy, the result of a computation is not sensitive to changes in any specific record: whether or not a single record is in the data set, the impact on the calculation result is negligible. As a result, the risk of privacy loss is kept within an acceptable range, and an attacker cannot obtain accurate information about individuals by observing the results. This report demonstrates the system.
     (Privacy Group) Data Mining with Differential Privacy 
    Abstract:
    We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual's record, thereby restricting data leaks through the results. The privacy preserving interface ensures unconditionally safe access to the data and does not require from the data miner any expertise in privacy. However, as we show in the paper, a naive utilization of the interface to construct privacy preserving data mining algorithms could lead to inferior data mining results. We address this problem by considering the privacy and the algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods chosen by the data miner. We demonstrate that this choice could make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.
     2016.10.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Plan Circular Embeddings of Knowledge Graphs 
    Abstract:
    Embedding representations make machine learning on knowledge graphs (KGs) more convenient by encoding entities and relations into continuous vector spaces and then constructing triples. However, KG embedding models are sensitive to infrequent and uncertain objects. Furthermore, there is a trade-off between learning ability and learning cost. To this end, we propose circular embeddings (CirE) to learn representations of an entire KG. CirE can accurately model various objects, saves storage space, speeds up calculation, is easy to train, and scales to very large datasets. We make the following contributions: (1) We improve the accuracy of modeling and learning for various objects by combining holographic projection and projection degree. (2) We reduce parameters and storage by adopting a circulant matrix as the projection matrix from the entity space to the relation space. (3) We accelerate convergence and reduce training time with an adaptive parameter update algorithm that dynamically changes the learning time for various objects. (4) We speed up the computation and enhance scalability with the fast Fourier transform (FFT). Extensive experiments show that CirE outperforms state-of-the-art baselines in link prediction and entity classification, demonstrating the efficiency and scalability of CirE.
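    Contributions (2) and (4) rest on a standard property of circulant matrices: the matrix is fully determined by one column (n parameters instead of n squared), and multiplying it by a vector is a circular convolution, which the FFT computes in O(n log n). The sketch below checks this on a small example using only the standard library. It illustrates the general identity, not the CirE model, and all function names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def circulant_matvec_direct(c, x):
    """O(n^2) product of the circulant matrix with first column c and vector x."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fft(c, x):
    """Same product in O(n log n): multiply elementwise in the Fourier domain."""
    n = len(c)
    spectrum = [fc * fx for fc, fx in zip(fft(c), fft(x))]
    # The inverse transform above is unnormalised, so divide by n here.
    return [v.real / n for v in fft(spectrum, invert=True)]


c = [1.0, 2.0, 3.0, 4.0]    # first column of a 4x4 circulant projection matrix
x = [1.0, 0.0, -1.0, 2.0]   # vector to project
direct = circulant_matvec_direct(c, x)
fast = circulant_matvec_fft(c, x)
```

    Both functions return the same vector up to floating-point error. Only `c` needs to be stored and trained, which is the source of the parameter and storage savings the abstract mentions.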
     (Cloud Group) OrientStream: A Framework for Dynamic Resource Allocation in Distributed Data Stream Management Systems 
    Abstract:
    Distributed data stream management systems (DDSMS) are usually composed of upper-layer relational query systems (RQS) and lower-layer stream processing systems (SPS). When users submit new queries to the RQS, the query plan needs to be converted into a directed acyclic graph (DAG) of tasks that run on the SPS. Depending on the query requests and data stream properties, the SPS needs to be configured with different deployment strategies. However, dynamically predicting SPS deployment configurations that ensure processing throughput with low resource usage is a great challenge. This article presents OrientStream, a framework for dynamic resource allocation in DDSMS using incremental machine learning techniques. By introducing a four-level feature extraction mechanism (data level, query-plan level, operator level and cluster level), we first use different query workloads as training sets to predict the resource usage of the DDSMS and then select the optimal resource configuration from candidate settings based on the current query requests and stream properties. Finally, we validate our approach on the open source SPS Storm. Experiments show that OrientStream can reduce CPU usage by 8%-15% and memory usage by 38%-48%.
     2016.10.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Smart Storage 
    Abstract:
    With the development of the Internet of Things and social networks, a huge amount of data will be produced. How to store and process this data is an urgent problem, and today's customers need real-time feedback. The traditional CPU-DRAM-Disk architecture cannot meet the needs of data storage and processing, so we need a new computer architecture. This report introduces an architecture that moves computation to storage and makes storage intelligent. It also introduces some related work.
     (Web Group) Strategies for Training Large Scale NNLM 
    Abstract:
    Because of their breakthrough performance in image and audio processing, neural networks are also widely used in natural language processing, for example in the neural network language model (NNLM). To reach high precision, an NNLM must be trained on a very large corpus. We also have to tune parameters repeatedly for different situations, so the training process is very time-consuming. Drawing on my experience with RNNLM, this report shares some strategies for training large-scale NNLMs from several aspects, including the corpus, the number of iterations, the vocabulary, the hidden layer and so on.
     2016.09.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Astro data analysis frame based on two-level cache 
    Abstract:
    This report describes the design framework of a prototype system for processing GWAC astronomical data. Unlike the two-layer analysis framework of the first version, the new framework has three layers to meet the new performance requirements. The first layer is a local cache memory that detects mutations in milliseconds. The second layer is a distributed memory system that detects transient sources in seconds. The third layer is a distributed database that performs off-line analysis and long-term storage in minutes.
     (Web Group) Single-relation Question Answering based on Knowledge Base 
    Abstract:
    Single-relation questions are the most common form of question in search logs and community question answering websites. A knowledge base (KB) such as Freebase or DBpedia can help answer such questions after they are reformulated as queries. However, automatically mapping a natural language question to its corresponding KB query remains a challenging task. This report will present some progress in this field.
     2016.09.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Data movement in database application 
    Abstract:
    To our knowledge, moving data from DRAM to the CPU consumes about 200 times more energy than computation. As the development gap between CPU and DRAM widens, the energy gap between them will become more serious. In big data applications, especially database applications, data is a key resource, so data movement will become the bottleneck of energy efficiency. In this report, we first introduce the methodology we use to quantify the energy of moving a single cache line between caches. Second, we analyse the energy consumption of the TPC-H benchmark in PostgreSQL. To find the energy features of database applications, we compare the results with the CPU2006 benchmark and find that the L1 cache contributes the majority of the energy consumed by the database.
     (Privacy Group) Publishing the Column Counts under Differential Privacy 
    Abstract:
    We primarily consider the problem of publishing column counts for datasets. These statistics are useful in a wide and important range of applications, including transactional, traffic and medical data analysis. The key challenge is that, because the sensitivity is high, high-magnitude noise needs to be added to satisfy differential privacy. GS is presented first; it pre-processes the counts by carefully grouping them and smoothing them via averaging. The grouping strategy is dictated by a sampling mechanism, which minimizes the smoothing perturbation. DPSense and DPSense-s are the state-of-the-art approaches for publishing column counts for high-dimensional datasets; their key idea is to reduce the sensitivity by setting a limit on the contribution of each record.
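    The link between a per-record contribution limit and the amount of noise can be shown with the plain Laplace mechanism. If every record is truncated to at most `limit` columns, adding or removing one record changes the vector of column counts by at most `limit` in L1 norm, so Laplace noise of scale `limit / epsilon` per column suffices. The sketch below is a minimal illustration of that idea only. It is not GS, DPSense or DPSense-s, and the function names and the truncation rule (keep the lowest-numbered columns) are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_column_counts(records, num_columns, epsilon, limit, rng=None):
    """Publish column counts after capping each record at `limit` columns."""
    rng = rng or random.Random()
    counts = [0] * num_columns
    for record in records:
        # Truncation depends only on the record itself, never on other records.
        for col in sorted(set(record))[:limit]:
            counts[col] += 1
    scale = limit / epsilon  # L1 sensitivity after capping is `limit`
    return [c + laplace_noise(scale, rng) for c in counts]


# Hypothetical data: each record is the set of columns it contributes to.
data = [{0, 1, 2}, {1}, {1, 3}]
published = noisy_column_counts(data, num_columns=4, epsilon=1.0, limit=2)
```

    A smaller `limit` means less noise per column but discards more of each long record, which is the trade-off that the choice of limit has to balance.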
     2016.06.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Introduction to MonetDB 
    Abstract:
    MonetDB is an open-source column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is widely used in OLAP, GIS, and data mining. This report first introduces its background, architecture, and BAT algebra. It then presents typical techniques, such as late materialization, database cracking, and hardware-conscious query processing, to give a deeper understanding of the system.
     (Web Group) Predict on Public Big Data by Google BigQuery and TensorFlow 
    Abstract:
    With Google BigQuery and TensorFlow, we can build models for specific business applications and predict user demand: BigQuery's public data sets provide training and test data, and the open-source TensorFlow library provides the machine learning models. This report briefly describes BigQuery, TensorFlow, and some application cases, mainly to share information and provide learning materials.
     2016.06.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Develop a PINQ Demo System 
    Abstract:
    Privacy Integrated Queries (PINQ) is a LINQ-like API for computing on privacy-sensitive data sets while providing guarantees of differential privacy for the underlying records. Based on this approach, a PINQ demo system was designed and implemented during this term.
     (Mobile Group) Reports about ICDE2016&XLDB2016 [ppt]
    Abstract:
    Reports about ICDE2016 & XLDB2016.
     2016.6.3  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Automatic Enforcement of Data Use Policies with DataLawyer 
    Abstract:
    Data has value and is increasingly being exchanged for commercial and research purposes. Data, however, is typically accompanied by terms of use, which limit how it can be used. To date, there are only a few ad hoc methods to enforce these terms. This report introduces DataLawyer, a new method to formally specify usage policies and check them automatically at query runtime in a relational database management system (DBMS).
     2016.5.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Guessing the extreme values in a data set: a Bayesian method and its applications 
    Abstract:
    For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, Top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.
     (Cloud Group) A Data stream partitioning method for multiple group-by query 
    Abstract:
    In real-time querying and analysis of data streams, data summarization is essential for users. Multiple group-by queries are widely used in distributed data stream management systems. In contrast to existing data partitioning methods, this report aims for an efficient data stream partitioning strategy by combining compile-time and runtime query optimization. At compile time, we construct a query cost model based on partition keys; at runtime, we design dynamic adjustment strategies based on the distribution of the data stream. Together these form a complete data stream partitioning method.
     2016.5.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The I/O Performance of Big Data Computing Framework 
    Abstract:
    With the continuous development of the Internet and other information technologies, data is accumulating at an alarming rate. Taking Spark as a representative mainstream distributed computing framework, performance analysis can proceed incrementally from the logs, the source code, and the Java virtual machine. Starting from an architectural perspective, this study focuses on cluster abstraction, quantifying CPU-bound behavior, the I/O performance model, and related issues, and seeks to provide a reference for researchers in this area.
     (Privacy Group) Blockchain: The State of the Art and Future Trends [ppt]
    Abstract:
    Blockchain is an emerging decentralized architecture and distributed computing paradigm underlying Bitcoin and other cryptocurrencies, and has recently attracted intensive attention from governments, financial institutions, high-tech enterprises, and the capital markets. Blockchain's key advantages include decentralization, time-series data, collective maintenance, programmability, and security, which make it particularly suitable for constructing a programmable monetary system, financial system, and even the macroscopic societal system.
     2016.05.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Privacy Data Release 
    Abstract:
    Privacy-preserving data publishing is an important problem that has been the focus of extensive study. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. To address the deficiency of the existing methods, this paper presents PRIVBAYES, a differentially private method for releasing high-dimensional data. Intuitively, PRIVBAYES circumvents the curse of dimensionality. Private construction of Bayesian networks turns out to be significantly challenging, and this paper introduces a novel approach that uses a surrogate function for mutual information to build the model more accurately.
     (Mobile Group) FGMP 
    Abstract:
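    In recent years, big data management systems have developed in three main directions: batch processing systems represented by Hadoop and MapReduce; stream processing systems, represented by Storm, developed for various specific applications; and the recently emerging hybrid computing model of the Spark system. These distributed big data management systems give us the ability to process massive data at high speed, and how to improve their performance has become a widely discussed topic. To monitor the performance of distributed big data management systems, UC Berkeley developed the open-source tool Ganglia. However, it provides only very coarse-grained monitoring (for example, CPU utilization) and cannot meet our requirements. How to monitor a large number of compute nodes at fine granularity, and thereby discover system performance bottlenecks, has become a pressing problem. To this end, in the second part of this paper we build FGMP, a monitoring platform for distributed big data management systems, which offers users the following conveniences: (1) convenient deployment of big data management systems on a large number of nodes; (2) adaptive adjustment of the monitoring scheme according to the cluster's hardware resources; (3) adjustment of the CPU frequency of each node; (4) remote submission of jobs to the big data management system through a web service; (5) fine-grained (process-level) monitoring of system performance.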
     2016.05.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Introduction to scientific data management 
    Abstract:
    With the development of cloud computing, business and government data can be processed in less time. In science, however, large amounts of data are produced, often more than in business. How, then, should scientific data be managed? This report introduces the challenges of scientific data management and the weaknesses of applying cloud computing to it. It also introduces Jim Gray's ideas on scientific data management.
     (Cloud Group) The integration and optimization of deep learning processor based on Caffe 
    Abstract:
    Deep learning has dramatically improved the state of the art in image classification, speech recognition, natural language understanding, and many other domains. First, this report introduces the basic concepts of deep learning and Caffe, one of the most popular deep learning frameworks. Second, it presents the integration and optimization of Caffe with Cambrian, a deep learning processor designed by Chen Yunji's group at ICT. Finally, future work is discussed.
     2016.04.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Introduction of Computational Social Science 
    Abstract:
    In this presentation, we introduce computational social science (CSS), an emerging field that leverages the capacity to collect and analyze data at unprecedented breadth, depth, and scale. CSS may reveal patterns of individual and group behavior. Like other nascent interdisciplinary fields, it needs to develop a new paradigm for training new scholars. Initially, computational social science needs to be the work of teams of social and computer scientists. In the long run, the question will be whether academia should nurture computational social scientists, or teams of computationally literate social scientists and socially literate computer scientists.
     2016.4.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) The Price of Free
    Abstract:
    In-app advertising is an essential part of the ecosystem of free mobile applications. On the surface, this creates a win-win situation where app developers can profit from their work without charging the users. However, as in the case of web advertising, ad-networks behind in-app advertising employ personalization to improve the effectiveness/profitability of their ad-placement. This need for serving personalized advertisements in turn motivates ad-networks to collect data about users and profile them. As such, “free” apps are only free in monetary terms; they come with the price of potential privacy concerns. The question is, how much data are users giving away to pay for “free apps”?
     2016.4.8  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Analysis of OpenStack's architecture and revolution 
    Abstract:
    OpenStack is a free and open-source software project, jointly developed and launched by NASA and Rackspace and released under the Apache license. It consists of several major components that together carry out the work. It supports almost all types of cloud environments; the project's goal is to provide a cloud computing platform that is simple to implement, massively scalable, feature-rich, and managed under unified standards. It delivers an infrastructure-as-a-service (IaaS) solution through a variety of complementary services, each of which provides an API for integration. It is an open-source project that provides software for building and managing public and private clouds. Its community includes more than 130 companies and 1,350 developers, who use it as a general front end for IaaS resources. The project's first priority is to simplify cloud deployment and provide good scalability.
     (Web Group) An Axiomatic Approach to Link Prediction 
    Abstract:
    The evaluation of link prediction functions has mostly been based on experimental work, which has shown that the quality of a link prediction function varies significantly depending on the input domain. There is currently very little understanding of why and how a specific link prediction function works well for a particular domain. The underlying foundations of a link prediction function are often left informal—each function contains implicit assumptions about the dynamics of link formation, and about structural properties that result from these dynamics. So the paper presents an axiomatic basis for link prediction. This approach seeks to deconstruct each function into basic axioms, or properties, that make explicit its underlying assumptions. This framework uses “property templates”.
     2016.03.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Privacy and Human Behavior 
    Abstract:
    We mainly focus on "The End of Privacy", a special issue of Science. It uses three themes to connect insights from the social and behavioral sciences: people’s uncertainty about the consequences of privacy-related behaviors and their own preferences over those consequences; the context-dependence of people’s concern, or lack thereof, about privacy; and the degree to which privacy concerns are malleable, that is, manipulable by commercial and governmental interests. Large-scale data sets of human behavior are potentially very valuable. Their metadata, however, contain sensitive information, and understanding the privacy of these data sets is key to their broad use. A study of credit card records for 1.1 million people shows that four spatiotemporal points are enough to uniquely reidentify 90% of individuals, and that knowing the price of a transaction increases the risk of reidentification by 22% on average.
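    The reidentification figure can be read as a "unicity" measurement: the share of people whose trace is the only one containing a handful of known points. The following is a minimal sketch under assumed data structures, not the cited study's code; the `traces` layout and the `unicity` name are hypothetical.

```python
import random

def unicity(traces, p, seed=0):
    """Fraction of users uniquely identified by p points drawn from their own trace.

    `traces` maps a user id to a set of (place, time) points (hypothetical layout).
    Users with fewer than p points are skipped.
    """
    rng = random.Random(seed)
    unique = 0
    eligible = 0
    for user, points in traces.items():
        if len(points) < p:
            continue
        eligible += 1
        known = set(rng.sample(sorted(points), p))  # what an attacker knows
        matches = [u for u, pts in traces.items() if known <= pts]
        if matches == [user]:
            unique += 1
    return unique / eligible if eligible else 0.0

traces = {
    "a": {("shop1", 1), ("shop2", 2)},
    "b": {("shop1", 1), ("shop3", 2)},
}
print(unicity(traces, p=2))
```

    With both points known, each toy trace above is matched by exactly one user; with fewer known points the share of unique matches can only be estimated by sampling, which is why the seed is a parameter.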
     (Cloud Group) Performance analysis of distributed data stream processing systems 
    Abstract:
    In the era of big data, with the gradual rise of open computing platforms, distributed data stream processing systems are used to process distributed and continuously growing stream data. Given the query tasks submitted by users, a stream processing platform often converts the query plan into a DAG for decomposition and processing. Using Storm as the processing platform and several types of benchmarks, this report analyzes Storm's resource usage, processing latency, throughput, and other indicators under different data rates and degrees of parallelism. This lays the foundation for further fine-grained analysis of Storm's scheduling mechanism and system bottlenecks.
     2016.03.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) A Geometric Representation of Online Collective Attention Flows in USA and China 
    Abstract:
    With the rapid development of the Internet and the WWW, information overload has become an overwhelming problem, and the collective attention of online users plays an increasingly important role. Knowing how collective attention is distributed and flows among websites is the first step toward understanding the underlying dynamics of attention on the WWW. In this presentation, we introduce a method to embed a large number of websites into a high-dimensional Euclidean space according to the novel concept of flow distance, which considers both the connection topology between sites and the collective click behavior of users. With this geometric representation, we visualize the attention flow in clickstream data sets from Indiana University and from Chinese online users over one day.
     (Cloud Computing) AdaStorm: Resource Efficient Storm with Adaptive Configuration 
    Abstract:
    As a distributed and scalable real-time processing system, Storm has been widely used in various scenarios, including real-time analytics, continuous computing, and alerting, etc. However, since the Storm configuration (e.g., the number of workers, spout parallelism, and bolt parallelism, etc.) is predetermined before the deployment of the execution topology (i.e., a graph consisting of tasks), the system cannot adapt to fluctuant data stream properties (e.g., arrival rate and value distribution), leading to either significant resource waste or limited processing throughput. To address this problem, we present AdaStorm, a resource-efficient system to dynamically adjust the Storm configuration according to current data stream properties. AdaStorm is designed to minimize the resource usage while still ensuring the same or even better real-time response. This is achieved by first incrementally training machine learning models for predicting resource usage with accumulated system behaviors, and then deploying the most resource-efficient configuration derived from the models. We demonstrate three scenarios on the effectiveness of AdaStorm, i.e., resource efficiency, data rate tolerance, and online model update.
     2016.3.4  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Human-level concept learning through probabilistic program induction 
    Abstract:
    People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms, for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. The main focus is in reference [1].
     (Web Group) Combining Language Model with Conceptualization for Definition Ranking 
    Abstract:
    Question answering is a hot trend in the search engine field, and "what" queries are among the most common in Q&A systems. To ensure coverage of answers, we crawl definition sentences from the web for these queries. There is still much room for improvement in telling good answers from bad ones and ranking all candidate answers sensibly. Traditional methods use an SVM to rank definition sentences, but their features are all syntactic. Even when we strengthen the semantic features through language models, some defects remain. We therefore combine the implicit and explicit models by adding a conceptualization process to RNNLM. This captures the semantic relation between terms and their definitions, improving both precision and recall.

    2015
     2015.12.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Mobile Group) Privacy Integrated Queries---An Extensible Platform for Privacy-Preserving Data Analysis 
    Abstract:
    Privacy Integrated Queries (PINQ) is a platform for privacy-preserving data analysis. PINQ is built atop C# Language Integrated Queries (LINQ), a recent language extension to the .NET framework that integrates declarative access to data streams (using a language very much like SQL) into arbitrary C# programs. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of the PINQ analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform.
     (Web Group) Knowledge Base and Matrix Factorization 
    Abstract:
    With the development of the Semantic Web, the automatic construction of large-scale knowledge bases (KBs) has received increasing attention in recent years. Although these KBs are very large, they are often still incomplete. Many existing approaches to KB completion perform inference over a single KB and suffer from feature sparsity. Moreover, traditional KB completion methods ignore the complementarity that exists implicitly among different KBs. We go through the basic ideas and mathematics of matrix factorization, and then present popular techniques for KB completion based on matrix factorization or tensor decomposition. The report focuses on embedding techniques for KB completion. Several methods are shown in the slides, including RESCAL, TRESCAL, and MF with similarity and negative data.
     2015.12.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Differentially private high-dimensional data publication via sampling-based inference 
    Abstract:
    Releasing high-dimensional data enables a wide spectrum of data mining tasks. Yet, individual privacy has been a major obstacle to data sharing. In the paper, the problem of releasing high-dimensional data with differential privacy guarantees is considered and a novel solution to preserve the joint distribution of a high-dimensional dataset is proposed. It first develops a robust sampling-based framework to build a dependency graph, and then identifies a set of marginal tables from the dependency graph to approximate the joint distribution based on the junction tree algorithm while minimizing the resultant error.
     (Privacy Group) GUPT: Privacy Preserving Data Analysis Made Easy 
    Abstract:
    GUPT uses a new model of data sensitivity that degrades privacy of data over time. This enables efficient allocation of different levels of privacy for different user applications while guaranteeing an overall constant level of privacy and maximizing the utility of each application. GUPT also introduces techniques that improve the accuracy of output while achieving the same level of privacy. These approaches enable GUPT to easily execute a wide variety of data analysis programs while providing both utility and privacy.
     2015.12.04  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Common Patterns of Online Collective Attention Flow 
    Abstract:
    If we see the Web as a virtual living organism, then according to metabolic theory it must absorb “energy” to grow and evolve. We want to know: (1) where does the “energy” of the Web come from? (2) what are the common patterns of this “energy” flow? We conjecture that the survival and development of websites rely heavily on this energy, which is the online collective attention flow. We analyze empirical data obtained from CNNIC and find a number of interesting common patterns: the allometric scaling laws, the dissipation laws, the gravity law, and Heaps' law. These common patterns will play an increasingly important role in quantifying Web evolution and predicting online collective behavior.
     (Cloud Group) Resource estimation and performance analysis for Storm and Spark 
    Abstract:
    In the era of big data, there are a number of distributed data stream processing systems for coping with different data processing models. This report focuses on the stream processing system Storm and the hybrid processing system Spark. First, Storm has some drawbacks, such as the limitations of rebalance and the dynamically changing load of the data stream, so we design a prediction model using the MOA framework and achieve dynamic optimization of the configuration parameters. Second, we construct a performance analysis platform based on Spark and describe the bottlenecks of Spark quantitatively. Finally, we summarize and look ahead to future research.
     2015.11.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Mobile Group) SST: Privacy Preserving For Semantic Trajectories 
    Abstract:
    To preserve privacy in trajectory data, most existing approaches adapt cloaking techniques to protect individual location points, or clustering and perturbation techniques to protect entire trajectories. To conform to the k-anonymity model, they first group locations/trajectories and then modify location points to ensure that a cluster of k location points/trajectories are close to each other. However, when k is large or the time span of a trajectory is long, cluster-based k-anonymity approaches suffer from great distortion and lead to misleading analysis results. Observing that it is unnecessary to indiscriminately provide the same level of privacy protection to all locations, we analyze the visiting status of the semantic place at which a point is situated, as well as the distribution of neighboring semantic places, and infer four privacy risk levels based on the risk of privacy breach. We then propose the semantic space translation algorithm, which strikes a good balance between privacy preservation and data utility.
     (Mobile Group) WISE2015 Report 
    Abstract:
    WISE2015 participant Lu Wang shares her experience and reports on the sessions she attended.
     (Cloud Group) Unified platform of resource management and scheduler 
    Abstract:
    Apache Hadoop began as one of many open-source implementations of MapReduce [12], focused on tackling the unprecedented scale required to index web crawls. Its execution architecture was tuned for this use case, focusing on strong fault tolerance for massive, data-intensive computations. In many large web companies and startups, Hadoop clusters are the common place where operational data are stored and processed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler.
     2015.11.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Bayesian Differential Privacy on Correlated Data 
    Abstract:
    Differential privacy provides a rigorous standard for evaluating the privacy of perturbation algorithms. It has been widely held that differential privacy is a universal definition that handles both independent and correlated data, and that a differentially private algorithm can protect privacy against arbitrary adversaries. However, recent research indicates that differential privacy may not guarantee privacy against arbitrary adversaries if the data are correlated. The paper focuses on private perturbation algorithms on correlated data and proposes Bayesian differential privacy, by which the privacy level of a probabilistic perturbation algorithm can be evaluated even when the data are correlated and the prior knowledge is incomplete.
     (Cloud Group) Key Concept of Spark-RDD (Resilient Distributed Dataset) 
    Abstract:
    Spark is a fast and general engine for large-scale data processing. Its key concept is the RDD (Resilient Distributed Dataset). To demonstrate the roots of Spark’s advantages, this report first introduces the origin and an overview of RDDs. It then presents the lineage, fault tolerance, and generality of RDDs. Finally, we summarize and discuss our future work.
     2015.11.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) How to reduce system power using flash 
    Abstract:
    With the development of cloud computing, clusters have grown larger than ever and consume more power. How can the power consumption of a cluster be reduced? A cluster is composed of many nodes, so reducing the power of a single node reduces the power of the whole cluster. This report briefly introduces some general methods for reducing the power of a single node, and then introduces methods from two papers that use flash to reduce system power.
     (knowledge fusion) Entity Linking in Biomedical Domain 
    Abstract:
    Although the entity linking (EL) task is well studied in news and social media, it has not received much attention in the life science domain. The first paper examines a key task in biomedical text processing, the normalization of disorder mentions. It presents a multi-pass sieve approach to this task, which has the advantage of simplicity and modularity. The approach is evaluated on two datasets, one comprising clinical reports and the other comprising biomedical abstracts, achieving state-of-the-art results. Since existing supervised approaches require a large amount of manually labeled training data, which is currently unavailable for the life science domain, the second paper proposes a novel unsupervised collective inference approach to link entities from unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structures in ontologies for similarity computation and entity ranking. It also provides in-depth analysis and discussion of both the challenges and opportunities of automatic knowledge enrichment for scientific literature.
     2015.11.6  Topic: Open-domain question answering
     (Web Group) Semantic Parsing for Single-Relation Question Answering 
    Abstract:
    Open-domain question answering (QA) is an important yet challenging problem that remains largely unsolved. In this paper, the authors propose a semantic parsing framework based on semantic similarity for open-domain QA. Using convolutional neural network models, the paper measures the similarity of entity mentions to entities in the knowledge base (KB) and the similarity of relation patterns to relations in the KB. Deep learning is now very popular, and these new techniques are worth investigating in KB-related fields.
     2015.10.30  Topic: Unified platform of resource management and scheduler
     (Cloud Group) Spark and MapReduce performance comparison 
    Abstract:
    This report summarizes what I have learned about the Spark and MapReduce platforms since joining the lab. Its main content is an introduction to a paper comparing the performance of Spark and MapReduce. The paper points out that, for different tasks, one should choose the right architecture, either MapReduce or Spark.
     (Cloud Group) Unified platform of resource management and scheduler 
    Abstract:
    Large-scale compute clusters are expensive, so it is important to use them well. Utilization and efficiency can be increased by running a mix of workloads on the same machines: CPU- and memory-intensive jobs, small and large ones, and a mix of batch and low-latency jobs – ones that serve end user requests or provide infrastructure services such as storage, naming or locking. This consolidation reduces the amount of hardware required for a workload, but it makes the scheduling problem (assigning jobs to machines) more complicated: a wider range of requirements and policies have to be taken into account. Meanwhile, clusters and their workloads keep growing, and since the scheduler’s workload is roughly proportional to the cluster size, the scheduler is at risk of becoming a scalability bottleneck.
     2015.10.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Work on Storm 
    Abstract:
    This report introduces the features of the Storm system, focusing mainly on how it differs from Hadoop and Spark, which offer batch processing. To evaluate Storm's performance in terms of CPU and memory, we implemented the Linear Road Benchmark in Storm and deployed Ganglia to monitor the Storm cluster. From the results, we found that parameter tuning in Storm is essential for CPU and memory usage. We then introduce the DRS model proposed at ICDCS 2015, followed by our machine learning proposal, and finally discuss feature selection and techniques for collecting samples.
     (Web Group) Joint RNNLM for Definition Mining 
    Abstract:
    Question answering, which provides direct answers instead of 10 blue links for users’ queries, has become a hot trend in web search. To answer "what" questions, one of the biggest segments in Q&A systems, we need to mine definitions from the web. Traditional approaches tend to use an SVM to rank alternative definitions, but the features are usually syntactic. Even after adding word embedding features to capture the semantics of definitions, some important relations, such as the is-a relation between words, are still ignored. We therefore propose a joint RNNLM model for definition mining. It combines the explicit language model with the implicit model, namely conceptualization and word embedding, to capture the semantic relation between terms and their definitions.
     2015.10.16  Topic: Privacy Protection
     (Web Group) WISE Pre-Report 
    Abstract:
    WISE participant: Lu Wang shares her pre-report.
     (Mobile Group) Provenance and Privacy
    Abstract:
    Provenance in scientific workflows is a double-edged sword. On the one hand, recording information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions, enables transparency and reproducibility of results. On the other hand, a scientific workflow often contains private or confidential data and uses proprietary modules. Hence, providing exact answers to provenance queries over all executions of the workflow may reveal private information.
     2015.10.9  Venue: FL1, Meeting Room, Information Building
     (Cloud Group) Approximate Medians and other Quantiles in One Pass and with Limited Memory 
    Abstract:
    This report presents new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit, and apply for arbitrary value distributions and arrival distributions of the dataset. The main memory requirements are smaller than those reported earlier by an order of magnitude. The report also discusses methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are explicit but probabilistic, i.e. they apply with respect to a (user-controlled) confidence parameter. Finally, it presents the algorithms, their theoretical analysis, and simulation results on different datasets.
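    To make the one-pass, limited-memory setting concrete, here is a minimal sketch that estimates a quantile from a fixed-size uniform sample of the stream. It uses plain reservoir sampling, a simpler stand-in rather than the algorithms of the report, and gives only probabilistic accuracy; the function names and the default reservoir size `k` are assumptions.

```python
import random

def reservoir_sample(stream, k, rng):
    """Uniform random sample of up to k items, using one pass and O(k) memory."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = x      # item i replaces a slot with probability k / (i + 1)
    return sample

def approx_quantile(stream, q, k=1000, seed=None):
    """Estimate the q-quantile (0 <= q <= 1) from a size-k reservoir."""
    rng = random.Random(seed)
    sample = sorted(reservoir_sample(stream, k, rng))
    if not sample:
        raise ValueError("empty stream")
    index = min(len(sample) - 1, int(q * len(sample)))
    return sample[index]

# The stream is consumed once; only k values are ever held in memory.
print(approx_quantile(iter(range(100000)), 0.5, k=1000, seed=1))
```

    A larger `k` tightens the estimate at the cost of memory, which is the trade-off the confidence parameter in the abstract controls.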
     (Cloud Group) The parameter optimizing of Hadoop based on machine learning 
    Abstract:
    Hadoop parameters affect system performance. When the hardware configuration is fixed, different parameter configurations result in different run times. Thus, we can learn the relationship between parameter choice and system performance with machine learning algorithms. We configure the Hadoop system with many feature vectors consisting of benchmarks and different hardware and software configurations, and obtain the run time corresponding to each feature vector. The feature vectors plus their corresponding run times form the feature dataset, from which classification algorithms can learn a model of system behavior. Using the model, we can predict the run time of a new feature vector, detect anomalies and estimate the smallest scale of the feature dataset when upgrading system hardware.
     2015.10.2  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Accelerating File System Access with Nonvolatile Memory 
    Abstract:
    With the development of social networks and e-commerce, large amounts of data are generated in our daily life, and everyone wants to store these data with high performance. However, storage depends on the traditional CPU-Memory-Disk system architecture, in which the speed of the disk does not match that of the CPU. New non-volatile memory offers fast access speed. How can we improve the speed of disk access with the new non-volatile memory? This report introduces two related papers whose methods can improve the speed of file system access and disk access.
     (Web Group) Using Encyclopedic Knowledge to Understand Queries 
    Abstract:
    Query understanding is a challenging but beneficial task. In this paper, we propose a context-aware method that uses encyclopedic knowledge to aid in query understanding. Given a query, we first use a dictionary constructed from the encyclopedic knowledge bases to detect the possible entities and their associated categories. Then, we use a topic-based method to derive semantic information from the query. By comparing the topical similarity between various candidate phrases, we get the most likely entities and their related categories.
     (Web Group) Question Answering over Linked Data Using First-order Logic 
    Abstract:
    Question answering over linked data (QALD) aims to evaluate question answering systems over structured data. Its key objective is to translate questions posed in natural language into structured queries. This report introduces a novel method that uses a Markov Logic Network to resolve ambiguities and a holonomic framework to complete query translation.
     2015.09.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Plan Bouquets: Query Processing without Selectivity Estimation 
    Abstract:
    Selectivity estimates for optimizing OLAP queries often differ significantly from those actually encountered during query execution, leading to poor plan choices. The article proposes a new approach to address this problem, wherein the compile-time estimation process is completely eschewed for error-prone selectivities. Instead, a small “bouquet” of plans is identified from the set of optimal plans in the query’s selectivity error space, such that at least one among this subset is near-optimal at each location in the space. Then, at run time, the actual selectivities of the query are incrementally “discovered” through a sequence of partial executions of bouquet plans, eventually identifying the appropriate bouquet plan to execute. The duration and switching of the partial executions are controlled by a graded progression of isocost surfaces projected onto the optimal performance profile.
     (Cloud Group) Head First DDSMS 
    Abstract:
    In the era of big data, distributed data stream processing systems have emerged to process distributed and ever-growing stream data. To build a complete distributed data stream management system (DDSMS), the query system must be easy to use and query processing must be improved. First, this report introduces the background and development of DDSMS from the perspectives of stream processing and stream query. Then, we summarize the hot research fields of DDSMS (such as query languages, system performance improvement and building new architectures). Finally, we point out ongoing research work and future research directions.
     2015.09.18  Topic: Online users' behaviors evolution dynamics
     (Web Group) Online users' behaviors evolution dynamics: a collective attention flow perspective 
    Abstract:
    If we see the Web as a virtual living organism, then according to metabolic theory, websites must absorb “energy” to grow, reproduce and develop. We are interested in the following two questions: (1) where does the “energy” come from? (2) will the websites generate macro influence on the whole Web based on the “energy”? We conjecture that websites grow at the cost of collective users’ attention flow as “energy” and produce influence on the whole Web. In other words, the more attention flow a website receives from online collective users, the more influence it creates. The results of data analysis in our experiments confirm this conjecture. Furthermore, we study collective surfing behavioral data from the perspective of network science and find that the evolution of the Web is governed by the rules that also lead to the evolution of living organisms.
     2015.6.26  Theme: Emerging Programming Languages
     (Cloud Group) Introduction to Emerging Programming Languages 
    Abstract:
    With the rapid development of mobile internet technology, new programming languages suited to different application platforms have appeared. This report starts by introducing the programming languages used in new data management systems and describes the features and application scenarios of the popular ones. It focuses on two newly released programming languages: Swift, released by Apple, and Go, released by Google. A detailed analysis and comparison of the two languages is carried out in the following aspects: historical origin, language features, compiling framework and performance. Finally, we summarize the hardware platforms and feature classification of the new programming languages.
     2015.6.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Spark Actual Combat 
    Abstract:
    In the post-Hadoop era, Spark, a rapidly developing next-generation core technology for big data, is regarded as an alternative to Hadoop because of its advantages. I will introduce Spark cluster construction, architecture, core analysis, Shark, Spark on YARN and JobServer through practical examples.
     (Web Group) Horizon- detail-height 
    Abstract:
    This report focuses on two articles about clustering. The two articles share a similar core idea, but differ considerably in implementation details, focus and writing techniques, and were published in different venues. The purpose is to study the articles by comparison and to accumulate writing methods.
     2015.6.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Introducing SSD to Hadoop 
    Abstract:
    Hadoop is a distributed data processing system. It depends on the classical computer system architecture, which is composed of disk, DRAM and CPU. However, the speed of the disk cannot match that of the CPU. SSDs are faster than disks, so we introduce SSDs to Hadoop. Combining SSDs with Hadoop raises many problems. This report introduces two papers related to these problems.
     (Mobile Group) Practical knn queries with location privacy
    Abstract:
    In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. This report presents a solution that lets the mobile user preserve location privacy in kNN queries. The proposed solution is built on the Paillier public-key cryptosystem and can provide both location privacy and data privacy.
     2015.5.30  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The Phd grind [ppt]
    Abstract:
    This report will share some interesting books that I read during my doctoral study years.
     (Web Group) Short text clustering 
    Abstract:
    Short texts are more and more popular. The classification and/or clustering of short text is a beneficial yet challenging problem due to the difficulty of setting the number of clusters, high dimensionality, interpretability of results, scalability to large datasets, and sparsity. This seminar discusses several existing methods of short text classification and/or clustering, including topic model based methods, Dirichlet multinomial mixture model based methods and concept based methods.
     2015.05.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Live PPT 
    Abstract:
    As a popular office application, PowerPoint plays an important role in presentations. Making a good PowerPoint presentation is not difficult, but it does require some forethought. The more attractive your presentation is, the more likely your audience will be to understand and remember the information you present. This report shows how to make a lively PowerPoint presentation.
     (Web Group) semantic similarity measure of sentence and document 
    Abstract:
    Semantic similarity measurement is one of the most important topics in natural language processing. It is widely used in search engines, text mining, recommendation systems and so on. This report first introduces the traditional approach of measuring sentence and document similarity based on word similarity, and then introduces the Word2Vec and Doc2Vec models, which use deep learning architectures to measure similarity and outperform other models.
     2015.05.16  Theme: Network Science and Complexity Science
     (Web Group) Special Lecture:Complex World,Simple Rules----From Network Science to Complexity Science 
    Abstract:
    The explosive growth of the World Wide Web during the past two decades has presented scientists with an important complex artificial system in which to unveil the universal patterns and principles of its organization. In this talk, we start with a brief introduction to network science, drawing on the work of Prof. Barabási et al., which describes how the tools of network science can help us understand the Web's structure, development and weaknesses. Additionally, we give a brief survey of the various research fields in complexity science. We present key concepts and analyze the state of the art of the field. Finally, some new works and ideas by our team are introduced, including the study of online attention flow networks and the allometric scaling law in website growth.
     2015.5.8  Theme: DASFAA Report
     (Mobile Group) DASFAA Report 
    Abstract:
    DASFAA participants Jiangtao Wang, Lu Wang and Fengming Wang share their experiences and each report on the sessions they attended.
     2015.04.25  Theme: Bootstrap
     (Cloud Group) Bootstrap for AQP and OLA systems 
    Abstract:
    As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates. The bootstrap provides perhaps the most promising step in this direction. In this report, we survey the related work of recent years in the approximate query processing (AQP) community, and introduce the main ideas of the paper "Knowing when you're wrong: building fast and reliable approximate query processing systems". In addition, we introduce the first work integrating the bootstrap into on-line aggregation (OLA), "G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data".
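    As a rough illustration of how a bootstrap can attach an error estimate to an approximate answer, the sketch below computes a percentile-bootstrap confidence interval for a sample mean. It is a minimal example under stated assumptions, not the method of either cited paper: the function name `bootstrap_mean_ci` and its parameters are hypothetical, and the "query" is just an average over an in-memory sample.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `sample`.

    Hypothetical helper for illustration only; returns (estimate, low, high).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and recompute the estimate.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles of the estimates.
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(sample), low, high


if __name__ == "__main__":
    data = list(range(1, 101))  # stand-in for a sample drawn from a large table
    estimate, low, high = bootstrap_mean_ci(data)
    print(f"mean ~ {estimate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

    The width of the interval gives the analyst a sense of how far the sampled answer may be from the exact one, which is the kind of quality assessment the abstract refers to.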
     2015.4.17  Theme: DASFAA Pre-Report
    DASFAA Participants DASFAA Pre-Report 
    Abstract:
    DASFAA participants: Jiangtao Wang, Lu Wang share their pre-report separately.
     2015.4.10  Theme: Cloud Data Management
     (Cloud Group) Real-time query processing for distributed data streams 
    Abstract:
    We can design a directed acyclic graph (DAG) processing model for data-intensive computing in a distributed environment. The optimization strategy for the DAG is the key research point. This report analyzes two processing strategies based on a Storm testbed. One dynamically sets the degree of parallelism (dop) and the processing batch size (bs) for continuous queries. The other designs an adaptive join operator for intra-operator adaptivity. Finally, we summarize the two optimization strategies and plan future work and study.
     (Cloud Group) Some issues in Online Aggregation 
    Abstract:
    The appearance of big data has brought great challenges to traditional data management technology. With the rapid growth of data volumes, research in the field of relational databases has focused on the problem of aggregation queries over massive data, and has proposed the online aggregation approach to solve it.
     2015.04.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Can SSDs help improve MapReduce performance? 
    Abstract:
    MapReduce is a programming model for parallel computation over large data sets. This report analyzes the I/O characteristics of the Map, Shuffle and Reduce phases and then examines the viability of improving MapReduce performance by adopting SSDs.
     (Web Group) Semantic Similar Words Extraction from Large Corpus 
    Abstract:
    Extracting semantically similar words plays an important role in natural language processing research. The distributional hypothesis states that words with similar meanings tend to appear in similar contexts. Based on it, we construct a word vector for each word and measure word similarity by vector similarity. We then filter and rerank the results from the word vectors using machine learning models to obtain the final list of synonyms. This report introduces the distributional hypothesis and the rerank model in turn.
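    The first step described above, building context vectors and comparing them, can be sketched as follows. This is a minimal illustration assuming simple co-occurrence counts within a fixed window and cosine similarity; the function names, the window size and the toy corpus are hypothetical, and the rerank model from the talk is not shown.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]  # neighbors, excluding the word itself
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    corpus = ["the cat drinks milk", "the dog drinks milk", "the car needs fuel"]
    vecs = context_vectors(corpus)
    print(cosine(vecs["cat"], vecs["dog"]))  # identical contexts in this toy corpus
    print(cosine(vecs["cat"], vecs["car"]))  # only "the" is shared
```

    In this toy corpus "cat" and "dog" occur in identical contexts and therefore score higher than "cat" and "car", which is the intuition behind the distributional hypothesis.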
     2015.03.27  Theme: Memory Management
     (Cloud Group) The efficient management of hybrid PCM and DRAM memory 
    Abstract:
    New applications and multi-core processors need more main memory. Increasing memory capacity can lead to increasing energy consumption. How can we reduce the energy consumption of computers? Hybrid PCM and DRAM memory can reduce it. This report introduces two papers on this topic.
     (Cloud Group) Deep into Spark 
    Abstract:
    Spark is a distributed framework for parallel in-memory computing. It has attracted wide public attention and is developing fast. This report introduces the Spark ecosystem, the principles and usage of Spark, and programming on Spark. Last but not least, we compare Spark with MapReduce and discuss its advantages and disadvantages for big data processing.
     2015.03.20  Theme: Web Data Management
     (Web Group) How Websites Influence Growth? 
    Abstract:
    The availability of big data, such as human online surfing records, makes it possible to probe into and quantify the regular patterns of users' long-range, complex interactions between websites. We try to apply the approaches developed in the study of complex weighted networks and flow networks to the clickstream network. If we see the user's attention flow as energy flow, it is reasonable to conjecture that the patterns found in other weighted networks should also hold for weighted clickstream networks. By analyzing the circulation of the collective attention flow, we discover the scaling relationship between the impact of websites and their attention flow traffic. We find that website influence grows linearly with attention flow size. We also examine the relationship between a website's collective total time online and its influence, and find that influence scales sub-linearly with time. This result is not consistent with the common-sense belief that “the more user time a website takes up, the greater its influence”.
     (Web Group) Named entity recognition and conceptualization on short text 
    Abstract:
    Named entity recognition (NER) is very important for many applications such as information integration, knowledge base population, question answering and so on. Though plenty of work has been devoted to this task, the existing methods cannot perform well on short text due to its sparsity, noise and lack of syntactic structure. This seminar focuses on the NER task on short text such as search engine queries.
     2015.03.13  Theme: MapReduce
     (Cloud Group) MRSimJoin: MapReduce-based Similarity Join for Metric Spaces 
    Abstract:
    Similarity join is one of the most popular techniques in the domain of data analysis and cleaning. Recently, much work has proposed efficient solutions for textual similarity join, but not much for distance-based similarity join. This presentation introduces an approach that uses grid partitioning and MapReduce for efficient distance-based similarity join processing.
     (Cloud Group) SpongeFiles: Mitigating Data Skew in MapReduce Using Distributed Memory 
    Abstract:
    Data skew is a major problem for data processing platforms like MapReduce. Skew causes worker tasks to spill to disk what they cannot fit in memory, which slows down the task and the overall job. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. Spilled data goes to SpongeFiles, which route it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort).
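    The routing rule in the last sentence can be sketched as a simple preference-ordered search. This is only an illustration of the ordering stated in the abstract, not the SpongeFiles implementation: the `Location` class, the `route_spill` function and the byte counts are hypothetical.

```python
from dataclasses import dataclass

# Preference order taken from the abstract: local memory first, remote disk last.
PREFERENCE = ["local memory", "remote memory", "local disk", "remote disk"]


@dataclass
class Location:
    kind: str        # one of the entries in PREFERENCE
    free_bytes: int  # remaining capacity at this location


def route_spill(size_bytes, locations):
    """Place a spill in the most preferred location that can hold it."""
    for loc in sorted(locations, key=lambda l: PREFERENCE.index(l.kind)):
        if loc.free_bytes >= size_bytes:
            loc.free_bytes -= size_bytes  # reserve the space
            return loc.kind
    raise RuntimeError("no location has enough capacity")


if __name__ == "__main__":
    locations = [
        Location("remote disk", 10**9),
        Location("local memory", 100),
        Location("remote memory", 1000),
    ]
    for size in (50, 80, 5000):
        print(size, "->", route_spill(size, locations))
```

    In the usage example, the second spill no longer fits in local memory and falls through to remote memory, and the largest one ends up on remote disk as the last resort.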
     2015.01.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Random Sampling in Online Aggregation 
    Abstract:
    OLA algorithms must rely on randomness to achieve statistical guarantees on the accuracy of estimates. In this report, we introduce some classic probability sampling methods, including simple random sampling, stratified sampling, cluster sampling and systematic sampling. We also introduce some efficient methods for answering random sampling queries over relational databases. In addition, two sampling techniques targeted at single-table and multiple-table OLA are discussed. Finally, we describe two sampling algorithms for processing OLA on skewed data.
     (Cloud Group) MRSimJoin: Partition-based Textual Similarity Join 
    Abstract:
    Textual similarity join is an important part of spatial-textual similarity join. Recently, many kinds of solutions for textual similarity join have been proposed, such as prefix filtering, edit-distance based partitioning, etc. This presentation introduces an efficient partition-based similarity join method, including its main idea and its distributed implementation using MapReduce.
     2015.01.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The future of solid-state memory 
    Abstract:
    High-density memory will bring us challenges. How can we respond to them? This report presents the five schemes for a hypothetical system proposed by Microsoft Research.
     (Cloud Group) Sampling for group-by queries 
    Abstract:
    Sampling is an important technique for approximate query processing. For group-by queries, low selectivity can cause the small group problem, and random sampling becomes ineffective when small groups exist. This report introduces the cause of small groups and presents several approaches to the problem.
     (Cloud Group) Core Technology Analysis of Storm 
    Abstract:
    MapReduce, Hadoop and related technologies allow us to process more data than we could handle before. However, these technologies are not designed for real-time computation, and there is no easy way to turn Hadoop into a real-time computing system, because real-time and batch data processing systems differ fundamentally in their requirements. The lack of a "real-time version of Hadoop" has become a major gap in data processing, and Storm fills this gap.
     2015.1.6  Theme: Spatio-Textual Query
     (Cloud Group) Quadtree Spatial Index for Spatio-Textual Query 
    Abstract:
    Spatio-textual data is data that contains not only textual information but also spatial location information. Such data is usually produced and used by LBS applications. At present, query processing over this kind of data typically uses a hybrid of spatial and textual query techniques, and the choice of these techniques is the key to improving query performance. Academic communities have paid attention to this kind of query for a long time and have published a large body of work. Drawing on two recently published papers, this report introduces how to use a quadtree as the spatial component of spatio-textual query processing.

    Seminars(2009-2014)

    Seminars(2006-2008)