WAMDM Seminar  

     2018.06.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Security and Privacy of Machine Learning Models 
    Artificial Intelligence has been widely used in many practical applications, such as autonomous vehicles, facial recognition systems, and network security. As an essential part of Artificial Intelligence, Machine Learning (ML) is booming and has received extensive attention. It is not surprising that some ML models are more accurate than humans in many scenarios. However, many security threats also exist in these "amazing" algorithms. In this report, I will introduce the security and privacy threats against ML models, including a taxonomy of the threats and three representative attack models together with their defenses. Furthermore, I will also introduce the privacy problems in user profiling, which is known as one of the main application fields of ML nowadays.
     (Cloud Group) Data Management on Persistent Memory 
    In the current computer storage hierarchy, memory is volatile and DRAM is the main memory technology. However, DRAM is unlikely to keep scaling because of its energy consumption and limited extensibility, so it is necessary to introduce non-volatile memory, represented by PCM. Due to its non-volatile property, this type of memory will have a profound impact on file systems, data management, and data analysis systems. This report describes persistent memory technology and its impact on data management systems.
     2018.06.14  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) AstroSpark: A Unified Astronomical Big Data Processing Engine over Spark 
    The next decade promises to be an exciting time for astronomers. Large volumes of astronomical data are continuously collected from highly productive space missions, and this data has to be efficiently stored and analyzed so that astronomers maximize their scientific return from these missions. In this talk, we present AstroSpark, a distributed data server for astronomical data. AstroSpark introduces effective methods for efficient astronomical query execution on Spark through data partitioning with HEALPix and a customized optimizer. AstroSpark offers a simple, expressive, and unified interface through ADQL, a standard language for querying databases in astronomy. Experiments have shown that AstroSpark is effective and scalable in processing astronomical data and outperforms the state of the art.
     (Web Group) ParaGraphE: A Library for Parallel Knowledge Graph Embedding 
    Knowledge graph embedding aims to translate a knowledge graph into numerical representations by transforming its entities and relations into continuous low-dimensional vectors, but existing single-thread implementations are time-consuming for large-scale knowledge graphs. This paper designs a unified lock-free parallel framework to parallelize these methods, which achieves a significant time reduction without influencing accuracy.
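    Translation-based embedding methods such as TransE (a representative of the models such parallel frameworks typically target; the toy vectors below are invented for illustration, not learned parameters) score a triple (h, r, t) by how close h + r is to t. A minimal sketch:

```python
def transe_score(h, r, t):
    """TransE-style plausibility score (L1 norm of h + r - t); lower is more likely."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

# Toy 4-dimensional embeddings; in practice these vectors are learned.
head = [0.1, 0.3, -0.2, 0.5]
rel = [0.2, -0.1, 0.4, 0.0]
tail = [0.3, 0.2, 0.2, 0.5]          # consistent with head + rel
wrong_tail = [1.0, -1.0, 1.0, -1.0]

# A correct tail scores (much) lower than a corrupted one.
assert transe_score(head, rel, tail) < transe_score(head, rel, wrong_tail)
```

    Training adjusts the vectors so true triples score low and corrupted ones score high; parallelization matters because each update touches only a few vectors, so many updates can proceed lock-free.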
     2018.05.31  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Relation Extraction Based on Deep Learning 
    Relation extraction methods based on deep learning achieve the best results on open datasets. We will introduce these methods from different perspectives, including the pipeline method, the end-to-end method, and the distant supervision method.
     2018.05.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Data Transparency and Blockchain 
    Big data has become the core resource of the information society. At the same time, it also brings problems such as privacy leakage, data manipulation, data abuse, and algorithmic "black boxes". The fundamental way to solve these problems is to improve data transparency throughout the big data lifecycle, so as to strengthen supervision of how data is used, which involves cross-disciplinary research between law and computer science. Blockchain-based data transparency technology allows every step of data recording, sharing, analysis, and deletion to be recorded on the blockchain, facilitating the realization of big data value and data accountability. This report covers data transparency, including transparent data recording, transparent data sharing, transparent data analysis, and transparent data deletion; summarizes and analyzes research progress; and finally outlines future directions for data transparency technology.
     (Cloud Group) Jupyter Notebook Introduction 
    Jupyter Notebook is a web-based application for interactive computation. It can cover the whole computation workflow: developing, documenting, and running code, and displaying results. A tool that integrates programming and result display offers a new working experience. This report will start with several examples and demonstrate the skills of using Jupyter Notebook live. I hope this demonstration will let you learn the basics of using Jupyter Notebook.
     2018.05.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Replication-based State Management in Distributed Stream Processing Systems 
    Storm's state management is achieved by a checkpointing framework, which commits states regularly and recovers lost states from the latest checkpoint. However, this method involves a remote data store for state preservation and access, resulting in significant overheads for error-free execution. This report introduces E-Storm, a replication-based state management system that actively maintains multiple state backups on different worker nodes.
     (Privacy Group) Attacks on Machine Learning Model 
    Machine learning (ML) models may be deemed confidential due to their sensitive training data, commercial value, or use in security applications. Increasingly often, confidential ML models are being deployed with publicly accessible query interfaces. The tension between model confidentiality and public access motivates the investigation of model inversion and extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to steal the training data or directly duplicate the functionality of (i.e., "steal") the model. In this report, these attacks will be demonstrated and some basic countermeasures will be presented.
     2018.05.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Report of XLDB2018 
    The 11th Extremely Large Databases Conference (XLDB2018) was held in California, USA, April 30 - May 2, 2018. This year, special consideration was given to large-scale data management for Machine Learning and Artificial Intelligence in production use cases. "KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building", a research project of our laboratory, was presented as a lightning talk at the conference.
     2018.04.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) A Hybrid NVM-DRAM Storage Engine for Fast Data Recovery 
    Many applications need to respond to customers quickly and therefore require an efficient storage system. In-memory key-value stores are used by many applications. However, due to the limitations of DRAM itself, it is impossible to scale DRAM up indefinitely, so new memory technologies must be introduced. This report explains how to build a hybrid NVM-DRAM storage engine for fast data recovery.
     (Web Group) The Methods of Knowledge Graph Embedding Based on Convolution 
    Among knowledge graph embedding methods, deep models can capture more features than translation or bilinear models. This report presents the ConvE and ConvKB methods, both from the latest papers of 2018 and both based on convolutional neural networks. In ConvE, the knowledge graph is modeled by 2D convolutional embeddings and multi-layer non-linear features, and it uses the following techniques to improve performance: 1-N fast scoring, multi-layer non-linear features, batch normalization, and dropout. However, ConvE is a very simple convolutional model that can only capture local relations. To this end, ConvKB uses a convolutional neural network to capture the global relations and transformation features between entities and relations in the knowledge graph.
     2018.04.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building 
    In scientific domains, cutting-edge creations and discoveries are often openly accessible as texts on the Web, in papers, and in other carriers. Decentralized and unordered, they are hard to follow. Through information extraction and reorganization, a knowledge graph can help professionals follow the latest facts and discover unknown scientific facts more effectively. We therefore take microbiology as an example to show the process of large-scale scientific domain knowledge graph building with KGBuilder. Taking texts as input, KGBuilder creates a domain knowledge graph in three steps: identifying entities, relations, and new links respectively by named entity recognition, relation extraction, and link prediction. By combining BiLSTM, CRF, and probabilistic methods, domain knowledge is added to named entity recognition in a simple and extensible way. Its relation extraction can automatically generate a large amount of annotated data and extract features via distant supervision and neural networks, reducing annotation cost. Inspired by question answering and the rectilinear propagation of light, KGBuilder puts forward TransMT to deal with the head-tail imbalance problem in link prediction, performing better than traditional models, especially translation-based ones.
     (Privacy Group) Case Analysis of Facebook's Data Abuse and A State-of-the-Art Framework for Anti-Profiling 
    This report covers two main topics. First, I will give a brief introduction to the Facebook-Cambridge Analytica data analysis and political advertising uproar, and then show the main steps of the process from data to personalized advertising. Experiments show that a wide variety of people's personal attributes can be automatically and accurately inferred from their Facebook Likes, and computer-based models may know you better than your friends do. Second, I will introduce a state-of-the-art framework for anti-profiling. This framework can not only reconcile privacy and user utility but also control their trade-off. The approach makes the service provider see only an intermediate layer consisting of many Mediator Accounts (MAs) instead of the real users, so that users' personal information is protected and accurate profiles cannot be produced.
     2018.04.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Application of scientific workflow in software defined 
    The Aserver+ system introduces the concept of Software Defined, which aims to combine the data dependencies between complex applications and programs, control each part under constraints of time, space, and resources, and provide a management platform for scientific data management, analysis, visualization, and automation. This report gives a brief explanation of the overall design of the Software Defined system and of each module, focusing on an analysis of the front-end scheduling system based on the BPEL language.
     (Web Group) Latent Dirichlet Allocation Topic Model 
    LDA (Latent Dirichlet Allocation) is an unsupervised machine learning model that uses the bag-of-words representation, in which an article is turned into a word-count vector. LDA is a document generation model: it assumes that an article has multiple topics, and that each topic corresponds to a different distribution over words. To construct a passage, first choose a topic with a certain probability, then choose a word under that topic with a certain probability, thus generating the first word of the article; repeating this process generates the entire article. Of course, there is no order among the words.
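    The generative process just described can be sketched directly; the topics, words, and probabilities below are made-up toy values, not parameters learned by LDA:

```python
import random

random.seed(0)

# Made-up topic mixture for one document and per-topic word
# distributions; in real LDA these are learned from a corpus.
doc_topics = {"astronomy": 0.7, "privacy": 0.3}
topic_words = {
    "astronomy": {"star": 0.5, "telescope": 0.3, "survey": 0.2},
    "privacy": {"noise": 0.6, "consent": 0.4},
}

def sample(dist):
    """Draw one key from a {key: probability} mapping."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_document(n_words):
    # For each position: pick a topic, then pick a word from that topic.
    return [sample(topic_words[sample(doc_topics)]) for _ in range(n_words)]

print(generate_document(6))
```

    Fitting LDA is the inverse problem: given only the documents, recover the topic mixtures and word distributions that most plausibly generated them.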
     2018.04.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Mobile Privacy Survey—Evaluating App Privacy and User Privacy Solution 
    Privacy has become a key concern for smartphone users, as many apps tend to access and share sensitive data. Research uses three main approaches to survey the collection of sensitive data on mobile phones: permission analysis, static code analysis, and dynamic analysis. Since a mobile privacy violation is defined as collecting sensitive data without the user's consent, permission-based and privacy-policy-based analysis methods have been proposed to evaluate privacy leakage. At the same time, a few privacy-preserving techniques have been offered to prevent the data collection process or anonymize sensitive information. More recently, local differential privacy methods apply differential privacy on small mobile devices to protect user privacy.
     (Web Group) Techniques of Man-Machine Dialogue 
    Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components, and typically this involves either a large amount of handcrafting or acquiring labelled datasets. In this talk, I will first explain the common components of task-oriented dialogue. Then, I will introduce a neural network-based, text-in text-out, end-to-end trainable dialogue system. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. Last, I will show a demo using a man-machine dialogue training platform and a speech recognition system.
     2018.03.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Traversing Knowledge Graph in Vector Space without Symbolic Space Guidance 
    Recently, inferring missing facts from the observed facts in a knowledge base has shown the importance of learning multi-step relations in vector space. This report covers the Compositionalization model and the Implicit ReasoNets (IRNs) model for knowledge base completion, and a comparison between the two models.
     (Cloud Group) An Approach to Quantify The Resource Impact in Big Data Systems 
    The performance of big data systems is always affected simultaneously by the CPU, memory, disk, and network, so quantifying the impact of each resource is important for bottleneck analysis. However, existing approaches do not produce comparable quantitative impacts across the four major resources; although some work on specific resources, their results are error-prone. In this talk, we present an approach that addresses this problem by isolating each resource's impact when observing the performance variation. Our approach is general-purpose because it requires no knowledge of the execution frameworks. We have developed two high-level end-to-end performance models to build new performance metrics that normalize performance variation into resource impact. A general performance model captures the performance of big data systems, ensuring that our methodology is general-purpose; the other model uses the speedups obtained by the system to evaluate the impact factors of the four major resources.
     2018.03.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) A Novel Security Framework for Managing Android Permissions Using Blockchain Technology 
    The Android system still occupies the dominant position in the market, largely thanks to being open source. The number of apps on Google Play reached 27 million in 2016. Its popularity has also made it the target of many malicious software attacks. Google has made a lot of efforts, from the Linux security model to letting users manage their own permissions in Android 6.0, but loopholes remain. This report presents a brand-new framework that uses blockchain's decentralized, self-controlled, tamper-proof, open, and transparent features to manage Android system permissions better and more effectively.
     (Web Group) Emotion Dictionary Applied in Text and Word Embedding for Sentiment Analysis 
    The report mainly covers the following: (1) sentiment analysis with an emotion dictionary on Shakespeare's dramas; (2) sentiment analysis applied to world-famous literary works; (3) sentiment analysis with an emotion dictionary on Sina Weibo datasets; (4) the application of emotion dictionaries in word-embedding learning.
     2018.03.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Graph Data Release with Local Differential Privacy 
    A large amount of valuable information resides in decentralized social graphs, where no entity has access to the complete graph structure. Instead, each user maintains locally a limited view of the graph. In this report, we mainly investigate techniques to ensure local differential privacy (LDP) of individuals while collecting structural information and generating representative synthetic social graphs. We first demonstrate the importance of calibration in LDP methods, and then we present the details of the existing BTER model. To overcome the drawback of the existing solution based on BTER, i.e., LDPGen, we propose to combine the perturbed node degrees and neighbor list to generate a more accurate synthetic graph based on BTER.
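    One common LDP primitive for collecting neighbor lists is randomized response applied to each bit of a user's adjacency vector. The sketch below shows only the generic perturbation step; privacy-budget composition across bits and the BTER/LDPGen-specific aggregation are omitted:

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if random.random() < p else 1 - bit

def perturb_neighbor_list(bits, epsilon):
    # Perturb every entry of the user's adjacency bit vector independently.
    return [randomized_response(b, epsilon) for b in bits]

random.seed(42)
true_neighbors = [1, 0, 0, 1, 0]   # this user's local view of the graph
print(perturb_neighbor_list(true_neighbors, epsilon=2.0))
```

    The collector sees only perturbed vectors, and calibration then corrects the aggregate statistics (such as degree distributions) for the known flip probability before a synthetic graph is generated.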
     (Privacy Group) Blockchain and decentralized data storage 
    Right now, our data, including web data, Internet of Things data, and our files, are collected and stored by third-party services, which requires us to trust those services. At the same time, we have lost ownership of our data, and there are single points of failure and data silos. Decentralized storage based on blockchain allows individuals to control their own data and addresses single points of failure and data silos. This presentation mainly introduces several decentralized storage and sharing systems.
     2017.12.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Semi-stream Join 
    As memory continues to grow and powerful cloud computing platforms emerge, considerable computational resources can be devoted to stream-based joins. However, it may still be necessary to operate with limited resources: first, the data may be too large for the resources allocated to the stream join, requiring a better algorithm; second, on mobile and embedded devices there may be a need for low-resource-consumption methods. Stream-based joins are an important operation in modern system architectures where data arrives continuously. This report discusses a class of semi-stream joins that can be applied to real-time data warehouses, where the slowly changing tables are usually master-data tables and the streams contain incoming real-time data.
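    A minimal sketch of the idea, with an in-memory cache fronting the slowly changing master table (the names and the dict-based "disk" are illustrative stand-ins, not part of any particular algorithm from the talk):

```python
# Slowly changing master table; the dict stands in for disk-resident data.
master_on_disk = {1: "electronics", 2: "groceries", 3: "clothing"}
cache = {}  # in-memory portion of the master table

def disk_lookup(key):
    return master_on_disk.get(key)  # stands in for an expensive disk read

def semi_stream_join(stream):
    """Join each incoming stream tuple against the master table."""
    for key, amount in stream:      # each stream tuple carries a foreign key
        if key not in cache:        # cache miss: load the master row once
            cache[key] = disk_lookup(key)
        yield (key, amount, cache[key])

sales_stream = [(1, 9.99), (3, 4.50), (1, 2.00)]
print(list(semi_stream_join(sales_stream)))
# → [(1, 9.99, 'electronics'), (3, 4.5, 'clothing'), (1, 2.0, 'electronics')]
```

    Real semi-stream join algorithms add eviction policies and batched disk access so the join keeps up with the stream under a fixed memory budget.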
     (Cloud Group) Software Defined 
    The goal of Software Defined is to use network technology to integrate geographically dispersed computing facilities and storage devices and to establish a common basic support environment for network services, so as to effectively aggregate and widely share computing resources, data resources, and service resources on the Internet, thereby creating a virtual scientific experimental environment capable of regional or global cooperation and supporting scientific activities characterized by large-scale computation and data processing.
     2017.12.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Cloud based Real-Time and Low Latency Scientific Event Analysis 
    Short-timescale, large-field-of-view sky surveys can lead to grand scientific discoveries because such novel scientific infrastructure can quickly capture different kinds of optical transient sources. It also brings challenges in real-time, low-latency scientific event analysis: all new survey data must be handled before the next survey cycle, and the alerts that trigger follow-up observations should be issued as soon as possible. This paper proposes a cloud-based method, implemented as a highly efficient system, Aserv. A set of compact data store and index structures is proposed to describe the scientific events, and a typical analysis pattern is formulated as a set of query operations. Domain-aware filtering, accuracy-aware data partitioning, highly efficient indexing, and frequently used statistical data designs are four key methods to optimize the performance of Aserv. Experimental results in a typical cloud environment show that the presented optimization mechanisms can meet the low-latency demands of both large data insertion and scientific event analysis. For GWAC (Ground-based Wide Angle Camera), which generates about 3.5 million rows of survey data every 15 seconds, Aserv can perform the data insertion in 3 seconds and execute the heaviest query also in 3 seconds. Furthermore, a performance model is given to help Aserv choose the right cloud resource setup to meet guaranteed real-time performance requirements.
     (Cloud Group) Performance prediction of large data systems 
    In computer science, performance prediction is a method of estimating the execution time or other performance factors of a program on a given computer. In the big data context, however, computation is carried out in a distributed environment, which makes accurate performance prediction more difficult. This report introduces a new big data performance prediction method that predicts the running time on large datasets by running part of the dataset on a small cluster, and proposes an optimized experimental design that greatly reduces experiment running time and cost while improving the accuracy of the model's predictions.
     (Web Group) ScholarSpace Ranking System Development 
    The ScholarSpace ranking system aims to improve ScholarSpace's original ranking: the ranking of each scholar and school depends not only on the number of papers but also takes a series of ranking algorithms into account, making the results a better reference; the reference ranking algorithm is the Computer Science Rankings system. This report mainly introduces the development process of the ScholarSpace ranking system: it first introduces the algorithm, then discusses data synthesis and system implementation, and finally gives a brief summary and introduces the next steps of the work.
     2017.12.16  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Frequent Itemset Mining under Differential Privacy 
    Frequent itemset mining is one of the important tasks in association rule mining. On the one hand, it enables data analysts to discover statistical patterns from sensitive data; on the other hand, an untrusted third party can infer individual-level information with high confidence from the released results. Differential privacy is a strong and rigorous standard for privacy protection that can trade off these two concerns well and has been applied in well-known software systems. This report focuses on mining top-k frequent itemsets over sensitive data under differential privacy. The goal is to release a randomized version of the top-k frequent itemsets with high utility to the user under ε-differential privacy constraints. It presents a novel solution named PrivSuper, which contains both a new algorithm, PrivSuperDFS, and a new differential privacy mechanism, SEM. Thanks to these two improvements, PrivSuper achieves significantly higher result utility than previous solutions.
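    As a baseline illustration of the general idea (this is the standard Laplace mechanism applied to support counts, not PrivSuper's SEM mechanism, and the supports below are invented toy values):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_top_k(supports, k, epsilon, sensitivity=1.0):
    """Rank itemsets by Laplace-perturbed support counts and keep the top k."""
    noisy = {s: c + laplace_noise(sensitivity / epsilon) for s, c in supports.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

random.seed(7)
supports = {("milk",): 120, ("bread",): 95, ("milk", "bread"): 60, ("eggs",): 20}
print(noisy_top_k(supports, k=2, epsilon=1.0))
```

    Specialized mechanisms such as SEM aim to spend the privacy budget more carefully than this naive per-itemset noising, which is why they achieve higher utility at the same ε.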
     2017.11.30  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Data transparency 
    Data transparency is the ability of a subject to effectively gain access to all information related to data used in processes and decisions that affect the subject, including record transparency, use transparency, disclosure and data provisioning transparency, algorithm transparency, and law and policy transparency. The report covers the concepts of data transparency and the application of blockchain to achieve record transparency.
     (Privacy Group) OrientAP system and mobile user privacy leakage data acquisition method 
    In the big data era, large-scale privacy leaks have become a prominent issue, and leakage of mobile users' privacy accounts for a large proportion of them. Visualizing the risk of mobile users' privacy leakage is becoming more and more important for monitoring purposes. This report starts with the system itself, briefly introducing and demonstrating the OrientAP system; it then details how to crawl private data while a mobile user operates an app, and finally introduces future directions for the system.
     2017.11.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Named Entity Recognition towards Specific Domains 
    The construction of knowledge graphs for specific domains has become a common focus of academia and industry, and extracting entity relations from text is the main problem. It is usually divided into two steps: first perform named entity recognition (NER), then extract relations between entities. There are two kinds of NER methods: rule-based matching and machine learning. The former usually has lower recall; the latter relies heavily on the training corpus, and most corpora are open-domain, so these methods perform badly in a specific domain. This report takes microbiology and habitat entities as an example to introduce how to integrate domain knowledge into neural networks to improve the precision and recall of NER, and analyzes in detail the impact of different strategies.
     (Cloud Group) Aserv persistence and offline query engine design 
    The content of this report is divided into two parts. 1. Aserv system persistence and offline query engine design. First, to separate hot and cold data, we designed a two-tier storage scheme. The first tier caches hot data using a Spark + Cassandra-based management solution, and we propose a segment-tree-based indexing technique for efficient queries. In the second tier, we persist the catalog data of all night observations in the distributed file system HDFS and implement a management scheme based on logical stratification: a star-cluster structure aggregates and stores the entire star-table data, and an index-table-based query engine, designed according to astronomical requirements, can query catalog data at small cost from the cache and the catalog cluster. 2. Software- and hardware-level acceleration of the Spark + HDFS-based persistence and query engines. Because the engines we designed are built on Spark, essentially all queries and persistence operations are applications running on Spark. During actual operation of the Aserv system, we found that cluster resource utilization was low, so we tried to optimize it at both the software and hardware levels to improve overall system throughput. At the software level, we implemented a parallel execution framework at the Spark application layer that allows persistence and offline query applications to execute in parallel, dramatically improving efficiency. At the hardware level, we built the D-Spark system to diagnose cluster bottlenecks by quantifying the performance bottlenecks of major hardware components and to make targeted upgrades, substantially increasing the running speed of the persistence and query engines on the cluster.
     2017.11.16  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Graph Analysis under Local Differential Privacy 
    A large amount of valuable information resides in decentralized social graphs, where no entity has access to the complete graph structure. Instead, each user locally maintains a limited view of the graph. In this report, we mainly investigate techniques to ensure local differential privacy for individuals while collecting structural information and generating representative synthetic social graphs. In addition, another interesting topic we will focus on in this report is the relationship between differential privacy and the problem of over-fitting in machine learning.
     (Web Group) Relationship discovery and ScholarSpace Report 
    Relationship discovery uses existing knowledge in a knowledge graph to infer unknown knowledge. This report mainly summarizes relationship discovery; describes the relationship discovery systems RelFinder and RECAP, including their working principles and visualization processes; and reports progress on the laboratory project ScholarExplorer.
     2017.11.09  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) SSD-based in-memory key-value storage 
    Many applications need to respond to customers quickly and therefore require an efficient storage system. In-memory key-value stores are used by many applications. However, due to the limitations of DRAM itself, it is impossible to scale DRAM up indefinitely. This report explains how to introduce SSDs into an in-memory key-value system.
     (Web Group) Analysis of the Inherent Defects of the Norm Family and Exploration of Corresponding Strategies 
    The Norm family, the Combination family, and the Neuron family are three genres of knowledge graph embedding (KGE); the Norm family is known for its simplicity and efficiency. However, its models are also limited. This report mainly analyzes the inherent defects of each model in the Norm family and gives corresponding solutions.
     2017.11.04  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) A Report on DegreeTree Completion 
    DegreeTree can represent scholars' relationships with their advisors and students and therefore plays a crucial role in ranking scholars and recommending reviewers. However, lack of data has become the most serious bottleneck, so this report focuses on the completion work from three aspects: first the outcomes; then the problems occurring during data processing and the corresponding solutions; and finally, from the perspective of ontology, an analysis of the feasibility of multiple applications for ScholarSpace, including ranking scholars and organizations and recommending experts, reviewers, and literature.
     (Cloud Group) Report of XLDB2017 
    The 10th Extremely Large Databases Conference (XLDB2017) was held in Clermont-Ferrand, France, October 10th-12th, 2017. The conference program included 4 sessions, 17 lightning talks, 1 hackathon, and 1 demo. "AstroServer - A Framework for Real-time Analysis in Large-scale Astronomical Data", a research project of our laboratory, was presented at the conference.
     2017.10.19  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Collecting and Analyzing Data from Mobile Devices with Local Differential Privacy 
    Privacy can be defined using a four-dimensional taxonomy in the sense of increasing privacy exposure, and information theory can be used to quantify privacy violations. There is a potential privacy threat in current smartphone platforms: user data collected from smartphones, such as installed apps, can be used to infer various user attributes (age, gender, race and income). Local differential privacy (LDP) addresses this problem by collecting only randomized answers from each user, with guarantees of plausible deniability. Besides, we also investigate building a large class of machine learning models for demographic prediction under ε-LDP to protect user privacy.
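    The core LDP building block behind this kind of collection is randomized response. The following is a minimal Python sketch of the idea, not the talk's actual protocol; function names and the bit-valued data model are illustrative assumptions:

```python
import math
import random

def randomized_response(bit, eps, rng=random):
    """Report the true bit with probability e^eps / (e^eps + 1),
    otherwise report its flip -- this satisfies eps-LDP."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_frequency(reports, eps):
    """Unbiased estimate of the true fraction of 1s from noisy reports.
    Inverts E[observed] = (2p - 1) * f + (1 - p)."""
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)
```

    Each user perturbs locally before sending anything, so the server never sees raw data, yet the aggregate frequency can still be estimated with accuracy that improves as ε or the population grows.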
     (Cloud Group) Deep Neural Networks and Applications on the China Secondary Market 
    Deep learning is another peak in the field of machine learning and has made revolutionary progress in image, speech, natural language processing and other tasks. This report describes the basic neural network structure and the currently popular neural networks, including convolutional neural networks, recurrent neural networks and LSTM, as well as reinforcement learning and the Deep Q-Network used by AlphaGo. I will introduce how these neural networks are used for the China secondary market.
     2017.10.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Blockchain: principle, technology and value 
    Blockchain is the most advanced technology in the field of financial technology and has attracted the attention of many government departments, financial institutions and investors. This report starts with Bitcoin, a mature application of blockchain technology, and introduces the principle, characteristics and applications of blockchain in the financial industry. From the technical and application levels, it describes the characteristics of blockchains, gives a classification of blockchains, and puts forward the requirements and difficulties of applying blockchain technology.
     2017.09.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) NVM-based in-memory key-value storage 
    Many applications need to respond to customers quickly and therefore require an efficient storage system. In-memory key-value storage is used by many applications. However, due to the limitations of DRAM itself, DRAM is unlikely to scale up further, so new memory must be introduced. This report describes how to introduce new non-volatile memory into an in-memory key-value system.
     (Web Group) EAE: Enzyme Knowledge Graph Adaptive Embedding 
    Recent years have seen a drastic rise in constructing web-scale knowledge graphs (KGs) and in solving practical problems by falling back on KGs. Embedding learning of entities and relations has become a popular method for performing machine learning on relational data such as KGs. Based on embedding representations, knowledge analysis, inference, fusion, completion and even decision-making can be promoted. Constructing and embedding open-domain knowledge graphs (OKGs) has mushroomed, which greatly promotes the intelligentization of big data in the open domain. Meanwhile, specific-domain knowledge graphs (SKGs) have become an important resource for smart applications in specific domains. However, SKG embedding is still in an embryonic stage, mainly because the data distributions of OKGs and SKGs are quite different. More specifically: (1) in an OKG, such as WordNet or Freebase, the sparsity of head and tail entities is nearly equal, but in an SKG, such as the Enzyme KG or NCI-PID, inhomogeneity is more common; for example, tail entities outnumber head entities by about 1000 times in the enzyme KG of the microbiology area. (2) Head and tail entities can be commuted in an OKG, but they are non-commuting in an SKG because most relations are attributes; for example, the entity 'Obama' can be a head or a tail entity, but the entity 'enzyme' is always in the head position in the enzyme KG. (3) The breadth of relations has a small skew in an OKG but is imbalanced in an SKG; for example, a single enzyme entity can link to 31,809 x-gene entities in the enzyme KG. Based on these observations, we propose a novel approach, EAE, to deal with the three issues. We evaluate our approach on link prediction and triple classification tasks. Experimental results show that EAE outperforms the Trans family (TransE, TransH, TransR, TransD and TranSparse) significantly and achieves state-of-the-art performance.
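    For context, the translation-based baselines mentioned above (TransE and its variants) score a triple (h, r, t) by the norm of h + r - t and rank candidates by that score during link prediction. A minimal sketch of that scoring idea, not of the EAE method itself (names are illustrative):

```python
import math

def transe_score(h, r, t):
    """TransE-style plausibility score: L2 norm of (h + r - t).
    Lower scores mean the triple is more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def rank_tails(h, r, tails):
    """Link prediction: rank candidate tail entities (name -> vector)
    from most to least plausible."""
    return sorted(tails, key=lambda name: transe_score(h, r, tails[name]))
```

    A perfectly translated triple (t = h + r) scores 0; the skew issues listed in the abstract arise because a single head may need to translate to thousands of tails under one relation.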
     2017.09.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Collecting key-value data under local differential privacy 
    So far, existing work under local differential privacy is mostly limited to simple data types, such as categorical, numerical or primitive set-valued data. To the best of our knowledge, no existing work focuses on key-value data in the local setting. In this work, we initiate the study of collecting key-value data under LDP. We propose PrivKV, which is essentially a multi-round framework (MRF) aiming to estimate frequency and mean value simultaneously. The key idea is to iterate the results through the MRF and gradually approximate the real results. We design a local perturbation protocol (LPP) as a building block and further present a more practical multi-round framework (PMRF) that considers the communication overhead. Finally, we also present an optimization strategy, which gives a comprehensive account of the preserving framework. We confirm the correctness and effectiveness of PrivKV through theoretical analysis and extensive experimental results.
     (Cloud Group) AstroServer: an astronomy analysis system 
    GWAC, the Ground-based Wide-Angle Camera array, collects high-density astronomical sources with a high cadence. WAMDM designed a system named AstroServer to solve the problems of how to store the data that GWAC captures and how to process them. We show how AstroServer processes GWAC's catalogs, models astronomical time-series data and queries transients. Further, we discuss how to optimize AstroServer to ensure real-time analysis, as developing optical telescope technologies cause ever more astronomical sources to be collected.
     2017.06.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) GWAC visualization of starry sky monitoring data-AstroDB 
    With the emergence of a variety of the latest observation technologies, the astronomical field has ushered in an era of information explosion, and the corresponding visualization of large volumes of data has become particularly important for astronomical information monitoring. This report describes the overall framework of astronomical information monitoring visualization, its technical details and difficult breakthroughs, along with a demonstration of the Astro system and future directions for the system.
     (Cloud Group) GWAC query implementation v.2 
    The main content of this report is the improvement of the second edition of the GWAC astronomical big data system. In the first version of the system, one night of data consumed more than 3TB of memory, which the existing environment could not meet; this motivated the v2.0 work. The v2.0 system uses a new data structure whose memory footprint is half that of v1.0.
     2017.06.16  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Real-time anomaly astronomical data management based on digested information 
    The emergence of very large astronomical observations not only allows researchers to observe new astronomical phenomena, but also to verify the correctness of existing physical models. At present, the GWAC telescope is characterized by: (1) low-delay continuous sampling; (2) multiple cameras working in parallel; (3) a large field of view for a single camera. Based on these characteristics, GWAC can take low-delay photographs throughout the night, which is conducive to observing short time-scale anomalous astronomical phenomena. The accompanying data management system needs to quickly complete the storage and querying of high-value astronomical data, in order to find special astronomical phenomena quickly and enhance scientific data support. Since no other astronomical telescope in the world makes low-latency observations of specific fields, there is little research on management systems for real-time anomalous astronomical data. Based on the above characteristics of GWAC, this talk presents a real-time anomalous astronomical data management system based on digested information. It designs digested information for time-index, space-index and counting requirements, and optimizes four typical anomaly data queries.
     2017.06.02  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Linear algebra on database systems 
    Data analytics, including machine learning and large-scale statistical processing, is an important application domain, and such computations often require linear algebra. Not only traditional relational databases but also array database systems support linear algebra. This report describes how to use relational databases and array database systems to implement linear algebra.
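    The standard relational encoding treats a sparse matrix as a table of (row, col, value) triples, so matrix multiplication becomes a join on the inner index followed by a group-by sum. A small Python sketch of that query plan (an illustration of the general technique, not of any specific system from the talk):

```python
from collections import defaultdict

def relational_matmul(A, B):
    """A, B: dicts mapping (row, col) -> value, i.e. sparse (i, j, v) relations.
    Computes C = A * B as: JOIN on A.col = B.row, multiply values,
    then GROUP BY (A.row, B.col) with SUM."""
    # Hash-index B by its row index so the join is a hash join.
    b_by_row = defaultdict(list)
    for (k, j), v in B.items():
        b_by_row[k].append((j, v))
    C = defaultdict(float)
    for (i, k), a in A.items():
        for j, b in b_by_row[k]:
            C[(i, j)] += a * b  # SUM(A.value * B.value)
    return dict(C)
```

    The same plan is expressible directly in SQL as a self-join plus GROUP BY, which is why relational engines can run linear algebra without a dedicated array type.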
     (Web Group) A comparative analysis on PRA and SFE 
    PRA is a typical KB completion method based on topological structure. Its main idea is to obtain path feature information by random walking. Though random walking reduces computation cost, it also makes experimental results unstable. Matt Gardner put forward a simpler and more efficient method, SFE (Subgraph Feature Extraction). How did he make it work? This report explains Matt Gardner's idea in detail and presents the experimental results. Besides, in order to ensure correctness and make full use of the methods, I also carried out PRA and SFE experiments on Freebase data.
     2017.05.26  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Mobile Group) Word and Document Embeddings based on Neural Network Approaches 
    In natural language processing, the most widely used feature representation is the bag-of-words model. This model suffers from data sparsity and cannot keep word order information. The report summarizes and analyzes semantic representation technology for text at the word and document levels, as follows. First, for generating word embeddings, we make comprehensive comparisons among existing word embedding models. Second, for Chinese character or word representation, we find that existing work always uses the word embedding models directly. Third, for document representation, we analyze the existing document representation models, including recursive neural networks, recurrent neural networks and convolutional neural networks. Fourth, a summary and outlook.
     2017.05.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Knowledge Fusion) Introduction to Generative Adversarial Networks 
    This introduction includes 5 parts as follows: (1) why generative modeling is a topic worth studying; (2) how generative models work, and how GANs compare to other generative models; (3) the details of how GANs work; (4) research frontiers in GANs; and (5) state-of-the-art image models that combine GANs with other methods.
     (Privacy Group) System and Application Based on Differential Privacy Protection 
    In recent years, with the arrival of the big data era, data privacy protection has attracted more and more attention. As a result, how to protect data effectively during release, storage and analysis has become a hot issue. Many traditional privacy protection technologies, such as k-anonymity, rely on specific background knowledge and become invalid without it. Therefore, differential privacy protection technology has appeared in recent years. It is an emerging data privacy protection method that does not rely on specific background knowledge about the data sets, which makes it a robust privacy protection strategy. At present, research on differential privacy is mostly at the theoretical level, and related principle-demonstration systems are very few. Therefore, this research has developed a principle demonstration and verification system based on differential privacy, which protects car coordinate privacy and achieves good results.
     2017.04.28  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Bacteria-Biotope Event Extraction with Neural Network 
    Bacteria-Biotope Event Extraction automatically extracts the relations between microorganisms and habitats from biological documents. This is not only a key point in constructing a comprehensive, understandable database of microorganism-habitat relations, but also promotes the development and practical application of microbiology, health science and the food industry. At present, the main methods of Bacteria-Biotope Event Extraction are divided into rule-based methods and machine-learning-based methods. Both need manually designed rules and features and a manually selected classifier, and cannot use large amounts of unannotated corpora, while neural-network-based methods can learn features automatically, avoid manual intervention, and make use of unannotated domain-specific corpora. This report focuses on the progress and future work of our team on Bacteria-Biotope Event Extraction using neural networks.
     (Cloud Group) ICDE2017 Participant Reports 
    This report introduces two research papers and two demos from ICDE 2017, which was held in San Diego.
     2017.04.21  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Rappor: LDP technology used in the Google Chrome browser 
    Differential privacy allows data collectors to analyze statistical information while guaranteeing user privacy, but a privacy risk remains because collectors still hold users' raw data. Local differential privacy solves this problem: in the local model, each individual user randomizes his data himself before sending it to an untrusted server. Google has been using it in its Rappor project for the last few years. Randomized Aggregatable Privacy-Preserving Ordinal Response, or Rappor, is a technology for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees.
     (Cloud Group) GWAC persistence and query implementation v.1 
    Design and implementation of the persistence and query system for the GWAC astronomical big data system. 1. Persistence means that during the day GWAC needs, within a limited period of time, to read the night-buffer data out of Redis through Spark, establish the table structure, and finally write it into HDFS. 2. We introduce the design of our query engine for real-time queries over large astronomical data and for the required offline queries.
     2017.04.07  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) AdaStorm: Resource Efficient Storm with Adaptive Configuration (ICDE2017 Pre-Report) 
    Storm is a popular real-time processing system. However, our earlier experiment shows that the fixed configuration of Storm would lead to either significant resource waste or limited processing throughput. In this demonstration, we present AdaStorm, a system to dynamically adjust the Storm configuration according to current data stream properties. AdaStorm is designed to minimize the resource usage while still ensuring the same or even better real-time response. We will demonstrate that AdaStorm can achieve resource efficiency as well as data rate tolerance, compared to Storm system with fixed configuration. Video: https://youtu.be/YFPBFNdMbXM.
     (Web Group) Relation exploration and interaction analysis based on microbiology data 
    With the continuous improvement of sequencing, mass spectrometry and other means, along with the development of science and technology, the efficiency of data generation has greatly improved, and comprehensive analysis of various kinds of large microbial data has become a key issue. How to store microbiological data, how to extract the key information in the data, and finally how to carry out interactive visual display have all become challenges of microbial data analysis in the big data age.
     2017.03.31  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Processing of large-scale spatiotemporal data 
    Secondo, which is an extensible system, can provide a variety of data types and operators to effectively represent and process spatiotemporal data. However, due to the use of navigation and mobile devices, spatiotemporal data has exploded, and the stand-alone version of Secondo cannot meet the actual needs of spatiotemporal data processing. This report describes parallel and distributed Secondo systems.
     (Cloud Group) GWAC data real-time processing and interval query 
    The emergence of very large astronomical observations not only allows researchers to observe new astronomical phenomena, but also can be used to verify the correctness of existing physical models. At present, the GWAC telescope data processing project, involving units such as the Observatory and Renmin University of China, has the following distinctive features: (1) the data source produces data at a fixed frequency in the form of streams; (2) the data arrive in blocks; (3) the current observation night's data can be queried with low latency. The Observatory's scheme uses the MonetDB database as the underlying support and loads star-related data into a logical table; although this scheme is simple, MonetDB's load time jumps to about 10 seconds every few dozen files, and this instability may cause data storage to lag. Renmin University's scheme uses a Redis cluster as the underlying support, forming a KEY-LIST structure for each star's data, but this structure suffers from high network latency on storage and high memory overhead for data management. Facing these problems, we improved the scheme: each anomalous star's data is stored in its own KEY-LIST structure, while the remaining data is stored in blocks via KEY-LIST. The advantage is that we can trade off storage efficiency against query efficiency, but special queries such as interval queries become less efficient, so we plan to introduce an inverted index and a segment tree to improve the overall query speed.
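    The interval query being optimized can be sketched as a sorted-timestamp index that answers [t1, t2] range queries by binary search. This is an illustrative simplification with assumed names; the planned segment tree or inverted index would replace the linear-time inserts shown here:

```python
import bisect

class IntervalIndex:
    """Time-interval queries over per-star samples, kept sorted by timestamp."""

    def __init__(self):
        self.times = []   # sorted timestamps
        self.rows = []    # sample payloads, aligned with self.times

    def insert(self, t, row):
        # O(n) list insert -- a sketch; a segment tree would avoid this cost.
        i = bisect.bisect_left(self.times, t)
        self.times.insert(i, t)
        self.rows.insert(i, row)

    def query(self, t1, t2):
        """Return all rows with timestamp in the closed interval [t1, t2]."""
        lo = bisect.bisect_left(self.times, t1)
        hi = bisect.bisect_right(self.times, t2)
        return self.rows[lo:hi]
```

    The trade-off described in the abstract appears here directly: block storage makes writes cheap, but answering an interval query then requires an auxiliary index like this one.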
     2017.03.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Skewed Data Stream Join in Distributed Data Stream Management Systems 
    Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and the number of messages communicated. As in conventional database processing, online theta-joins over data streams are computationally expensive and, being memory-based, impose high memory requirements on the system. The Join-Biclique model has three characteristics: memory efficiency, elasticity and scalability. However, the existing Join-Biclique model is unable to allocate query nodes dynamically and requires grouping parameters to be set manually. What is more serious, the effect of data skew is worse under full-history join queries. In this talk, in order to ensure the consistency of the query statement, we introduce a greedy algorithm to deal with data stream skew.
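    A greedy approach to skew of this kind can be sketched as assigning the heaviest keys first, each to the currently least-loaded worker (the classic longest-processing-time heuristic). This is an illustrative simplification under assumed names, not the talk's actual algorithm:

```python
import heapq

def greedy_partition(key_loads, n_workers):
    """Assign stream keys to workers greedily: heaviest keys first,
    each to the currently least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for key, load in sorted(key_loads.items(), key=lambda kv: -kv[1]):
        cur, w = heapq.heappop(heap)   # least-loaded worker so far
        assignment[key] = w
        heapq.heappush(heap, (cur + load, w))
    return assignment
```

    With hash partitioning, two hot keys can collide on one worker; the greedy pass keeps per-worker load balanced even when a few keys dominate the stream.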
     (Cloud Group) Spark Core Programming and Core Architecture Depth Analysis 
    This report focuses on Spark features, core programming principles, operator case studies, and kernel architecture analysis.
     2017.03.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) KB Completion Based on PRA 
    Though knowledge bases are becoming much larger, they are still far from complete. KB completion models can be classified into three kinds: graph feature models, latent feature models and Markov random fields. This report will share the Path Ranking Algorithm (PRA), a graph feature model, and two optimizations of it. The first optimization extends a large knowledge base by reading relational information from a large Web text corpus. The other proposes a novel multi-task learning framework for PRA, referred to as Coupled PRA (CPRA). Do these optimizations still apply to latent feature models? Can we combine the two kinds of models into one? At the end of this report, there will be a simple comparison between graph and latent feature models.
     (Web Group) Knowledge Base Construction with Deepdive 
    A knowledge base, which consists of entities and relationships, describes abstract concepts of the world. It is widely used in commercial search engines, question answering systems, e-commerce sites and social network sites. DeepDive, developed by Stanford University, is an open-source knowledge base construction tool. This report first introduces DeepDive's development background and architecture, and then presents DeepDive's application development process through an example (spouse relationship construction). The final part of the report introduces the difficulties of using DeepDive and the web group's future work.
     2017.02.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Knowledge Fusion) Automatic Knowledge Base Construction by NELL, EntityCube, Watson, or DeepDive 
    Large knowledge bases (KBs) about entities, their properties, and the relationships between entities have become an important asset for semantic search, analytics, and smart recommendations over Web content and other kinds of big data. Knowledge base construction (KBC) is the process of populating a knowledge base, i.e., a relational database storing factual information, from unstructured inputs. One key challenge in building a high-quality KBC system is that developers must often deal with data that are both diverse in type and large in size. Further complicating the scenario, these data need to be manipulated by both relational operations and state-of-the-art machine-learning techniques. The technology implementation and development of KBC can be shown through several actual systems.
     (Web Group) Constructing an Interactive Natural Language Interface for Relational Databases 
    Natural language has been the holy grail of query interface designers, but has generally been considered too hard to work with, except in limited specific circumstances. In this report, I describe the architecture of an interactive natural language query interface for relational databases. Through a carefully limited interaction with the user, we are able to correctly interpret complex natural language queries, in a generic manner across a range of domains. By these means, a logically complex English language sentence is correctly translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and can be evaluated against an RDBMS. We have constructed a system, NaLIR (Natural Language Interface for Relational Databases), embodying these ideas. Our experimental assessment, through user studies, demonstrates that NaLIR is good enough to be usable in practice: even naive users are able to specify quite complex ad-hoc queries.
     2017.01.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Discussion on Warehousing Procedure 
    We warehouse the data accumulated in Redis to HDFS through Spark. This report introduces how to implement the warehousing program in detail and explains how we solved the problems encountered in practice.
     (Web Group) Understanding benchmarks through Big Data Small Footprint and Sirius 
    Sensors on mobile phones and wearables, and in general sensors on the IoT (Internet of Things), bring forth a couple of new challenges to big data research. First, the power consumption for analyzing sensor data must be low, since most wearables and portable devices are power-strapped. Second, the velocity of analyzing big data on these devices must be high, otherwise the limited local storage may overflow. The other paper discusses how, as user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. This report compares these two papers and draws some conclusions about how to write a benchmark paper.

     2016.12.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Knowledge-based Question Answering With Deep Learning 
    Deep learning has made great progress in the fields of image and speech. Some common tasks of natural language understanding, such as part-of-speech tagging, word segmentation, named entity recognition, entity extraction, relation classification and text classification, have achieved good results with deep learning. This report focuses on KB-QA (knowledge-based question answering), and some common methods will be introduced. The report will also present some progress of the Web Group in this field and its future work.
     (Cloud Group) Data Management Challenges and Real-time Processing Technologies in Astronomy 
    In recent years, many large telescopes that can produce petabytes or exabytes of data have come out. These telescopes are beneficial not only to the discovery of new astronomical phenomena, but also to the confirmation of existing astronomical physical models. However, the produced star catalogs are so large that a single database cannot manage them efficiently. Taking GWAC as an example, a telescope array designed by China with 40 cameras, it takes high-resolution photos every 15 s, and its database must make each star catalog queryable within 15 s. Moreover, the database has to process multi-camera data, find abnormal stars in real time, query their recent historical data very fast, and persist and offline-query star catalogs as fast as possible. Based on these problems, firstly, we design a distributed data generator to simulate the GWAC working process. Secondly, we propose a two-level cache architecture that can not only process multi-camera data and find abnormal stars in local memory, but also query star catalogs in a distributed memory system. Thirdly, we propose a storage format named the star cluster, which stores several stars in one physical file to trade off the efficiency of persistence against that of queries. Last, our query engine, based on an index table, can query from the second cache and the star-cluster format. The experimental results show that our distributed system prototype can satisfy the demands of GWAC on our server cluster.
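    The star-cluster format can be sketched as grouping star time series into fixed-size physical files, with an index table mapping each star to its file. This is an illustrative simplification with assumed names, not the system's actual layout:

```python
def star_clusters(star_ids, cluster_size):
    """Group star ids into fixed-size clusters; each cluster is persisted
    as one physical file, trading per-star read cost for write throughput."""
    ordered = sorted(star_ids)
    return [ordered[i:i + cluster_size]
            for i in range(0, len(ordered), cluster_size)]

def cluster_of(star_id, clusters):
    """Index-table lookup: which physical file holds this star's light curve?"""
    for file_no, members in enumerate(clusters):
        if star_id in members:
            return file_no
    raise KeyError(star_id)
```

    Larger clusters mean fewer files and faster persistence but more irrelevant data read per single-star query, which is exactly the trade-off the abstract describes.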
     2016.12.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Energy Conservation Techniques for Disk 
    Today, with the explosive growth of data, we need not only to store large amounts of data but also to process massive amounts of data, which brings high energy consumption. Moreover, data center energy consumption is still growing rapidly year by year. A large portion of the energy consumption in the data center is caused by disks: storage systems currently account for 37% of the energy consumption of the entire IT center, and storage energy consumption is increasing at a very high speed. This report summarizes disk-based energy-saving techniques and discusses energy-saving ideas for specific applications.
     (Web Group) Knowledge Base Completion via Search-Based Question Answering 
    Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. In the paper "Knowledge base completion via search-based question answering" (WWW 2014), the authors propose a way to leverage existing Web-search-based question-answering technology to fill the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, they learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. The paper also discusses how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values of each attribute.
     2016.12.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Exploring Relation Paths between Entities in Knowledge Graphs 
    Exploring relation paths between entities is a common need in many fields. For example, social networking services suggest friends based on known associations between people. Security agents are interested in associations between suspected terrorists. Biologists discover the relations among genes, proteins and diseases to develop drugs. In recent years, the increasing amount of graph-structured data on the Web, like RDF data, has made association finding easier than extracting from Web text. This report compares some systems of relation-path finding, and proposes some problems in the biomedical domain.
     2016.12.08  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Constructing ARM TCM platform and optimizing energy consumption of SQLite 
    In the previous report, it was found that data migration accounts for about 60% of the energy consumption of database applications, and the L1 cache alone accounts for 90% of that. This time we put forward a corresponding improvement: the general idea is to use software-controlled ARM TCM to partially replace the traditional hardware-controlled L1 cache. We first describe the selection and construction of the hardware and software environments, and the user-interface implementation of TCM. Second, the implementation of SQLite is analyzed, and optimizations of hot data structures, the B-tree and basic operations are preliminarily constructed and realized.
     (Privacy Group) Privacy-Preserving Data Publishing 
    Privacy is an important issue when one wants to make use of data that involves individuals' sensitive information. Research on protecting the privacy of individuals and the confidentiality of data has received contributions from many fields, including computer science, statistics, economics, and social science. The report primarily focuses on research work in privacy-preserving data publishing. This is an area that attempts to answer the problem of how an organization, such as a hospital, government agency, or insurance company, can release data to the public without violating the confidentiality of personal information.
     2016.12.01  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) An Initial Survey on KB-based Relationship Extraction 
    We live in an information age full of data, among which knowledge is the most distinctive kind. Knowledge usually presents itself as relationships. On the one hand, humans can understand relationships easily; on the other hand, because the relationships among objects are intricate, analyzing them automatically is difficult. Therefore, drawing on previous literature and typical systems, this report will mainly share some knowledge-base-based relationship extraction technologies, which can benefit decision-making and scientific research.
     (Privacy Group) Mobile Privacy Survey--Evaluating App Privacy and User Privacy Solution 
    Privacy has become a key concern for smartphone users, as many apps tend to access and share sensitive data. Research on surveying sensitive data collection on mobile phones follows three main approaches: permission analysis, static code analysis, and dynamic analysis. Since a mobile privacy violation is defined as collecting sensitive data without the user's consent, permission-based and privacy-policy-based analysis methods have been proposed to evaluate privacy leakage. At the same time, a few privacy-preserving techniques are offered to prevent the data collection process or to anonymize sensitive information. More recently, local differential privacy has been applied on small mobile devices to protect user privacy.
     2016.11.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Text mining-BioNLP2016 
    With the rapid growth of biomedical information, obtaining and understanding the required knowledge through manual reading alone has become extremely difficult, and how to mine new knowledge from the mass of biomedical literature and integrate it with existing knowledge has become a research hotspot. Text mining helps people extract implicit, previously unknown, yet potentially valuable information and knowledge from large amounts of unstructured and semi-structured biomedical text, and is now widely used in biomedical research. Conferences such as BioNLP propose biomedical text mining tasks and promote the development of the field through the exploration and practice of different methods. This report mainly introduces the previous BioNLP shared tasks, elaborates on two papers as examples, and finally puts forward some of our own ideas.
     (Cloud Group) Untwisting The Rope: A Resource Decoupling Approach Revisiting Performance bottlenecks of Big Data Systems 
    Big data systems are complex, and it is difficult to analyze their performance bottlenecks. Existing research presents many modeling methods that identify bottlenecks from a single observation, but these can only quantify the bottlenecks of some components and are error-prone. In this paper, we present a resource decoupling approach to systematically quantify the bottlenecks of the major components. To conduct a detailed analysis, we do the following: (1) we present four quantitative methods addressing CPU, memory, disk, and network bottlenecks; (2) we define an ideal speedup to quantify the minimum acceleration potential of non-CPU components; (3) we develop a tool that monitors performance events to cross-validate the ranking of performance bottlenecks and find fine-grained causes; (4) we use Spark as an example big data system and evaluate its performance with two SQL benchmarks.
     2016.11.17  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Report of CIKM2016 
    The 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) was held in Indianapolis, USA, October 24-28, 2016. CIKM 2016 received 935 paper submissions for the research track; 160 were accepted as full papers (acceptance rate 22.8%) and 55 as short papers (acceptance rate 23.5%). The conference program included 3 keynotes, 7 tutorials, 4 industry talks, and 50 paper sessions.
     (Web Group) Deep learning and NLP 
    Natural language processing (NLP) has developed for more than 50 years and was dominated by the rule-based approach in its early days. Really effective language processing dates from around 2000, mainly due to the rise of statistical NLP. After more than a decade of further development, acquiring massive data is no longer a problem thanks to big data technology. Deep learning, a new approach to AI, first made breakthroughs in speech and image processing, and NLP naturally joined the transformation it brought. Deep learning has been applied to many NLP problems, such as word representation, sentiment classification, entity recognition, reading comprehension, relation extraction, and visual QA, outperforming statistical methods on many of them. This report discusses and studies several problems selected from the above with deep learning.
     2016.11.10  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) URMDA: A System for Diagnosing Spark's Performance Bottlenecks 
    This paper demonstrates URMDA, a system for diagnosing Spark's performance bottlenecks. We implement the resource decoupling approach to quantify the bottlenecks of the major components, namely CPU, disk, network, and memory, and build a fine-grained monitor, combined with several analysis functions, for in-depth analysis of Spark's performance bottlenecks. We demonstrate URMDA using two SQL benchmarks and draw the following conclusions: (1) the network is likely to be the bottleneck, especially at 100 Mbps bandwidth; (2) the CPU is always the major bottleneck; (3) Spark in memory is not as fast as officially advertised because of its weak cache operation.
     (Web Group) Data visualization technology: applications and research 
    Visual analytics is an important method in big data analysis. The aim of big data visual analytics is to take advantage of humans' cognitive abilities in visualizing information while utilizing computers' capability for automatic analysis. By combining the advantages of both humans and computers, along with interactive analysis methods and interaction techniques, big data visual analytics can help people understand the information, knowledge, and wisdom behind big data directly and effectively.
     2016.10.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Differential privacy demo system demonstration 
    Differential privacy ensures that the output is insensitive to changes in any specific record of the data set: whether or not a single record is present, the impact on the computation result remains nearly constant. As a result, the risk of privacy loss is controlled within an acceptable range, and an attacker cannot obtain accurate individual information by observing the results. This report demonstrates the system.
     (Privacy Group) Data Mining with Differential Privacy 
    We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual's record, thereby restricting data leaks through the results. The privacy-preserving interface ensures unconditionally safe access to the data and does not require any privacy expertise from the data miner. However, as we show in the paper, naive use of the interface to construct privacy-preserving data mining algorithms can lead to inferior data mining results. We address this problem by considering the privacy and the algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods chosen by the data miner; we demonstrate that this choice can make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.
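The Laplace mechanism that typically underlies such an interface can be sketched in a few lines of Python. This is a minimal illustration of epsilon-differential privacy for counting queries, not the interface used in the paper; the function names are my own.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # A Laplace(0, scale) variate is the difference of two
    # i.i.d. exponential variates with the same scale.
    e1 = -scale * math.log(1.0 - rng.random())
    e2 = -scale * math.log(1.0 - rng.random())
    return e1 - e2

def private_count(records, predicate, epsilon, rng=random):
    """Answer a counting query under epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one record changes
    it by at most 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

The data miner sees only the noisy answer; a smaller epsilon means more noise, which is exactly why naive algorithm design on top of such an interface can waste the privacy budget.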
     2016.10.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Plan Circular Embeddings of Knowledge Graphs 
    The embedding representation technology provides convenience for machine learning on knowledge graphs (KGs): it encodes entities and relations into continuous vector spaces and then reconstructs triples. However, KG embedding models are sensitive to infrequent and uncertain objects, and there is a tension between learning ability and learning cost. To this end, we propose circular embeddings (CirE) to learn representations of an entire KG; CirE can accurately model various objects, save storage space, speed up computation, and is easy to train and scalable to very large datasets. We make the following contributions: (1) we improve the accuracy of modeling and learning for various objects by combining holographic projection and projection degree; (2) we reduce parameters and storage by adopting a circulant matrix as the projection matrix from the entity space to the relation space; (3) we accelerate convergence and reduce training time by an adaptive parameter update algorithm that dynamically adjusts the learning time for various objects; (4) we speed up computation and enhance scalability via the fast Fourier transform (FFT). Extensive experiments show that CirE outperforms state-of-the-art baselines in link prediction and entity classification, justifying its efficiency and scalability.
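The circulant trick behind contributions (2) and (4) can be illustrated concretely: a circulant projection matrix is fully determined by its first column (n parameters instead of n*n), and multiplying it by a vector is a circular convolution, which the FFT computes in O(n log n). A minimal NumPy sketch (the function names are mine, not CirE's):

```python
import numpy as np

def circulant_matvec(c, v):
    """Multiply the circulant matrix whose first column is c by v.

    C @ v equals the circular convolution of c and v, so it can be
    computed via FFT in O(n log n) without materializing C.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(v)))

def circulant_matrix(c):
    # Explicit O(n^2) construction, for cross-checking only:
    # C[i, j] = c[(i - j) mod n].
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
```

The fast path never builds the n-by-n matrix, which is where both the storage saving and the speedup claimed in the abstract come from.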
     (Cloud Group) OrientStream: A Framework for Dynamic Resource Allocation in Distributed Data Stream Management Systems 
    Distributed data stream management systems (DDSMS) are usually composed of an upper-layer relational query system (RQS) and a lower-layer stream processing system (SPS). When users submit new queries to the RQS, the query plan needs to be converted into a directed acyclic graph (DAG) of tasks that run on the SPS. Based on different query requests and data stream properties, the SPS needs different deployment configurations. However, dynamically predicting deployment configurations of the SPS that ensure high processing throughput with low resource usage is a great challenge. This article presents OrientStream, a framework for dynamic resource allocation in DDSMS using incremental machine learning techniques. Introducing a four-level feature extraction mechanism (data level, query plan level, operator level, and cluster level), we first use different query workloads as training sets to predict the resource usage of the DDSMS, and then select the optimal resource configuration from candidate settings based on the current query requests and stream properties. Finally, we validate our approach on the open-source SPS Storm. Experiments show that OrientStream can reduce CPU usage by 8%-15% and memory usage by 38%-48%.
     2016.10.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Smart Storage 
    With the development of the Internet of Things and social networks, a huge amount of data is being produced, and how to store and process it is an urgent problem. Today's customers also need real-time feedback. The traditional CPU-DRAM-disk architecture cannot meet these data storage and processing needs, so a new computer architecture is required. This report introduces an architecture that moves compute to storage to make storage intelligent, and also introduces some related work.
     (Web Group) Strategies for Training Large Scale NNLM 
    Because of its breakthrough performance in image and audio processing, the neural network language model (NNLM) is also widely used in natural language processing. To attain high precision, it must be trained on a very large corpus, and parameters have to be tuned repeatedly for different situations, so the training process is very time-consuming. Combined with my experience with RNNLM, this report shares some strategies for training large-scale NNLMs from several aspects, including the corpus, the number of iterations, the vocabulary, the hidden layer, and so on.
     2016.9.29  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Astro data analysis framework based on two-level cache 
    This report describes a prototype system design framework for processing GWAC astronomical data. Different from the two-layer analysis framework of the first version, the three layers of the new framework meet the new performance requirements. The first layer is a local cache memory to detect mutations in milliseconds. The second layer is a distributed memory system to detect transient sources in seconds. The third layer is a distributed database for off-line analysis and long-term storage in minutes.
     (Web Group) Single-relation Question Answering based on Knowledge Base 
    Single-relation questions are the most common form of questions in search logs and community question answering websites. A knowledge base (KB) such as Freebase or DBpedia can help answer such questions once they are reformulated as queries. However, automatically mapping a natural language question to its corresponding KB query remains a challenging task. This report will present some progress in this field.
     2016.09.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Data movement in database application 
    To our knowledge, moving data from DRAM to the CPU costs about 200 times more energy than computing on it, and as the development gap between CPU and DRAM widens, this energy gap will become more serious. In big data applications, especially database applications, data is the key resource, so data movement is likely to become the bottleneck of energy efficiency. In this report, we first introduce the methodology we use to quantify the energy of moving a single cache line between caches. We then analyze the energy consumption of the TPC-H benchmark on PostgreSQL. To find the energy features of database applications, we compare the results with the SPEC CPU2006 benchmark and find that the L1 cache contributes the majority of the energy of the database.
     (Privacy Group) Publishing the Column Counts under Differential Privacy 
    We primarily consider the problem of publishing column counts for datasets. These statistics are useful in a wide and important range of applications, including transactional, traffic, and medical data analysis. The key challenge is that the sensitivity is high, so high-magnitude noise must be added to satisfy differential privacy. GS, presented first, pre-processes the counts by carefully grouping them and smoothing each group by averaging; the grouping strategy is dictated by a sampling mechanism that minimizes the smoothing perturbation. DPSense and DPSense-S are the state-of-the-art approaches for publishing column counts of high-dimensional datasets; their key idea is to reduce the sensitivity by limiting the contribution of each record.
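The contribution-limiting idea can be sketched as follows. This is a simplified illustration and not the DPSense algorithm itself: capping each record's contribution at `limit` columns bounds the L1 sensitivity of the entire count vector by `limit`, so Laplace noise of scale `limit / epsilon` on every count suffices for the whole release.

```python
import math
import random

def clipped_column_counts(records, num_columns, limit, epsilon, rng=random):
    """Publish all column counts under epsilon-differential privacy.

    Each record is an iterable of column ids. Keeping at most `limit`
    of them bounds the L1 sensitivity of the count vector by `limit`.
    (Simplified illustration of contribution limiting, not DPSense.)
    """
    counts = [0] * num_columns
    for cols in records:
        for col in list(cols)[:limit]:   # cap this record's contribution
            counts[col] += 1
    scale = limit / epsilon
    def laplace():
        # Difference of two i.i.d. exponentials is Laplace-distributed.
        return (-scale * math.log(1.0 - rng.random())
                + scale * math.log(1.0 - rng.random()))
    return [c + laplace() for c in counts]
```

The trade-off studied in these papers is exactly here: a small `limit` means less noise but more counts lost to clipping, while a large `limit` keeps the counts intact but drowns them in noise.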
     2016.06.24  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Introduction to MonetDB 
    MonetDB is an open-source column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is widely used in OLAP, GIS, and data mining. This report first introduces its background, architecture, and BAT algebra. Its typical technologies, such as late materialization, database cracking, and hardware-conscious query processing, are also presented to give a deep understanding of the system.
     (Web Group) Predict on Public Big Data by Google BigQuery and TensorFlow 
    We can build a model for a specific business application and predict user demand via Google BigQuery and TensorFlow: the Google BigQuery public data sets provide ready-made training and test data, and the TensorFlow open-source software library provides machine learning models. This report briefly describes BigQuery and TensorFlow, together with related knowledge and application cases, primarily to provide information and learning material.
     2016.06.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Develop a PINQ Demo System 
    Privacy Integrated Queries (PINQ) is a LINQ-like API for computing on privacy-sensitive data sets while providing guarantees of differential privacy for the underlying records. Based on this method, a PINQ demo system was designed and implemented during this term.
     (Mobile Group) Reports about ICDE2016&XLDB2016 [ppt]
    Reports about ICDE2016 & XLDB2016.
     2016.6.3  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Automatic Enforcement of Data Use Policies with DataLawyer 
    Data has value and is increasingly being exchanged for commercial and research purposes. Data, however, is typically accompanied by terms of use, which limit how it can be used. To date, there are only a few ad-hoc methods to enforce these terms. This report presents DataLawyer, a new system to formally specify usage policies and check them automatically at query runtime in a relational database management system (DBMS).
     2016.5.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Guessing the extreme values in a data set: a Bayesian method and its applications 
    For a large number of data management problems, it would be very useful to obtain a few samples from a data set and use them to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance joins are just a few possible applications. This paper details a statistically rigorous Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.
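As a toy illustration of how a few samples can support a principled guess at the maximum, consider the classic estimator for uniformly distributed data (the "German tank" setting). This is only a caricature under a strong parametric assumption; the talk's Bayesian method is far more general and yields rigorous credible statements rather than a point estimate.

```python
def estimate_max_uniform(sample):
    """Guess the population maximum from a small sample, assuming the
    values are drawn uniformly on [0, theta]. The sample maximum m
    underestimates theta, and m * (n + 1) / n corrects the bias."""
    n = len(sample)
    m = max(sample)
    return m * (n + 1) / n
```

The key intuition carries over: the observed extreme plus a model of how samples fall short of the true extreme yields a defensible guess, and more samples shrink the correction.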
     (Cloud Group) A Data stream partitioning method for multiple group-by query 
    In real-time query and analysis over data streams, data summarization is essential for users, and multiple group-by queries are widely used in distributed data stream management systems. Compared with existing data partitioning methods, this report attempts to achieve an efficient data stream partitioning strategy by combining compile-time and runtime query optimization. At compile time, we construct a query cost model based on partition keys; at runtime, we design dynamic adjustment strategies based on the distribution of the data stream. Together these constitute a complete data stream partitioning method.
     2016.5.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The I/O Performance of Big Data Computing Frameworks 
    With the continuous development of the Internet and other information technologies, data is accumulating at an alarming rate. Spark, representative of mainstream distributed computing frameworks, can be analyzed incrementally through its logs, its source code, and the Java virtual machine. Starting from an architectural perspective, this study focuses on cluster abstraction, CPU-bound quantification, an I/O performance model, and related issues, and seeks to provide a reference for researchers in the field.
     (Privacy Group) Blockchain: The State of the Art and Future Trends [ppt]
    Blockchain is an emerging decentralized architecture and distributed computing paradigm underlying Bitcoin and other cryptocurrencies, and has recently attracted intensive attention from governments, financial institutions, high-tech enterprises, and the capital markets. Blockchain's key advantages include decentralization, time-series data, collective maintenance, programmability, and security, and it is thus particularly suitable for constructing a programmable monetary system, a financial system, and even a macroscopic societal system.
     2016.05.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Privacy Data Release 
    Privacy-preserving data publishing is an important problem that has been the focus of extensive study. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. To address this deficiency, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Intuitively, PrivBayes circumvents the curse of dimensionality by modeling the data with a Bayesian network built from low-dimensional marginals. Private construction of Bayesian networks turns out to be significantly challenging, and the paper introduces a novel approach that uses a surrogate function for mutual information to build the model more accurately.
     (Mobile Group) FGMP 
    In recent years, big data management systems have developed along three main directions: batch processing systems represented by Hadoop and MapReduce; stream processing systems developed for various specific applications, represented by Storm; and the recently emerging hybrid computation model represented by Spark. These distributed big data management systems give us the ability to process massive data at high speed, and how to improve the performance of these platforms has become a widely discussed topic. To monitor the performance of distributed big data management systems, UC Berkeley developed the open-source tool Ganglia, but it provides only very coarse-grained monitoring (e.g., CPU utilization), which cannot meet our requirements. How to monitor a large number of compute nodes at fine granularity so as to find system performance bottlenecks has become an urgent problem. To this end, in the second part of this talk, we build FGMP, a monitoring platform for distributed big data management systems, which offers users the following conveniences: (1) conveniently deploying big data management systems on a large number of nodes; (2) adaptively adjusting the monitoring scheme according to cluster hardware resources; (3) adjusting the CPU frequency of each node; (4) remotely submitting jobs to the big data management system through a web service; and (5) fine-grained (process-level) monitoring of system performance.
     2016.05.06  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Introduction to scientific data management 
    With the development of cloud computing, business and government data can be processed in less time. In science, however, vast amounts of data are produced, even more than in business, so how should scientific data be managed? This report introduces the challenges of scientific data management and the weaknesses of applying cloud computing to it. It also introduces Jim Gray's ideas on scientific data management.
     (Cloud Group) The integration and optimization of deep learning processor based on Caffe 
    Deep learning has dramatically improved the state of the art in image classification, speech recognition, natural language understanding, and many other domains. In this report we first introduce the basic concepts of deep learning and Caffe, one of the most popular deep learning frameworks. Second, we present the integration and optimization of Caffe with Cambricon, a deep learning processor designed by Chen Yunji's group at ICT. Finally, we report on future work.
     2016.04.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Introduction to Computational Social Science 
    In this presentation, we introduce the new field of computational social science (CSS), which leverages the capacity to collect and analyze data at unprecedented breadth, depth, and scale, and may reveal patterns of individual and group behavior. The emergence of computational social science shares with other nascent interdisciplinary fields the need to develop a new paradigm for training new scholars. Initially, computational social science needs to be the work of teams of social and computer scientists. In the long run, the question will be whether academia should nurture computational social scientists, or teams of computationally literate social scientists and socially literate computer scientists.
     2016.4.15  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) The Price of Free
    In-app advertising is an essential part of the ecosystem of free mobile applications. On the surface, this creates a win-win situation where app developers can profit from their work without charging the users. However, as in the case of web advertising, ad-networks behind in-app advertising employ personalization to improve the effectiveness/profitability of their ad-placement. This need for serving personalized advertisements in turn motivates ad-networks to collect data about users and profile them. As such, “free” apps are only free in monetary terms; they come with the price of potential privacy concerns. The question is, how much data are users giving away to pay for “free apps”?
     2016.4.8  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Analysis of OpenStack's architecture and revolution 
    OpenStack is a free and open-source software project jointly developed and launched by NASA and Rackspace under the Apache license. Several major components work together to complete specific tasks. It supports almost all types of clouds; the project's goal is to provide a cloud computing platform that is simple to implement, massively scalable, feature-rich, and managed through unified standards. It delivers an infrastructure-as-a-service (IaaS) solution through a variety of complementary services, each providing an API for integration, and serves as an open-source project for building and managing public and private clouds. Its community includes more than 130 companies and 1350 developers, and these organizations and individuals treat it as a universal front end for IaaS resources. The project's first priorities are simplifying the cloud deployment process and providing good scalability.
     (Web Group) An Axiomatic Approach to Link Prediction 
    The evaluation of link prediction functions has mostly been based on experimental work, which has shown that the quality of a link prediction function varies significantly depending on the input domain. There is currently very little understanding of why and how a specific link prediction function works well for a particular domain. The underlying foundations of a link prediction function are often left informal—each function contains implicit assumptions about the dynamics of link formation, and about structural properties that result from these dynamics. So the paper presents an axiomatic basis for link prediction. This approach seeks to deconstruct each function into basic axioms, or properties, that make explicit its underlying assumptions. This framework uses “property templates”.
     2016.03.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Privacy and Human Behavior 
    We mainly focus on "The End of Privacy", a special issue of Science. It uses three themes to connect insights from the social and behavioral sciences: people's uncertainty about the consequences of privacy-related behaviors and their own preferences over those consequences; the context-dependence of people's concern, or lack thereof, about privacy; and the degree to which privacy concerns are malleable, that is, manipulable by commercial and governmental interests. Large-scale data sets of human behavior have great potential value; metadata, however, contain sensitive information, and understanding the privacy of these data sets is key to their broad use. A study of credit card records for 1.1 million people shows that four spatiotemporal points are enough to uniquely reidentify 90% of individuals, and that knowing the price of a transaction increases the risk of reidentification by 22% on average.
     (Cloud Group) Performance analysis of distributed data stream processing systems 
    In the era of big data, with the gradual rise of open computing platforms, distributed data stream processing systems are used to process distributed, continuously arriving stream data. According to the query tasks submitted by users, the stream processing platform converts the query plan into a DAG for decomposition and processing. Using Storm as the processing platform, this report analyzes, across different types of benchmarks, Storm's resource usage, processing latency, throughput, and other metrics under different data flow rates and degrees of parallelism, laying the foundation for further fine-grained analysis of Storm's scheduling mechanism and system bottlenecks.
     2016.03.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) A Geometric Representation of Online Collective Attention Flows in USA and China 
    With the fast development of the Internet and the WWW, information overload has become an overwhelming problem, and the collective attention of online users plays an increasingly important role. Knowing how collective attention distributes and flows among different websites is the first step to understanding the underlying dynamics of attention on the WWW. In this presentation, we introduce a method to embed a large number of websites into a high-dimensional Euclidean space according to the novel concept of flow distance, which considers both the connection topology between sites and the collective click behaviors of users. With this geometric representation, we visualize the attention flow in a data set of Indiana University and Chinese online users' clickstreams over one day.
     (Cloud Group) AdaStorm: Resource Efficient Storm with Adaptive Configuration 
    As a distributed and scalable real-time processing system, Storm has been widely used in various scenarios, including real-time analytics, continuous computing, and alerting, etc. However, since the Storm configuration (e.g., the number of workers, spout parallelism, and bolt parallelism, etc.) is predetermined before the deployment of the execution topology (i.e., a graph consisting of tasks), the system cannot adapt to fluctuant data stream properties (e.g., arrival rate and value distribution), leading to either significant resource waste or limited processing throughput. To address this problem, we present AdaStorm, a resource-efficient system to dynamically adjust the Storm configuration according to current data stream properties. AdaStorm is designed to minimize the resource usage while still ensuring the same or even better real-time response. This is achieved by first incrementally training machine learning models for predicting resource usage with accumulated system behaviors, and then deploying the most resource-efficient configuration derived from the models. We demonstrate three scenarios on the effectiveness of AdaStorm, i.e., resource efficiency, data rate tolerance, and online model update.
     2016.3.4  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Human-level concept learning through probabilistic program induction 
    People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to reach similar accuracy. People can also use learned concepts in richer ways than conventional algorithms: for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain the observed examples under a Bayesian criterion. The main focus is in reference [1].
     (Web Group) Combining Language Model with Conceptualization for Definition Ranking 
    Question answering is a hot trend in the search engine field, and "what"-type queries are among the most common in Q&A systems. To ensure the coverage of answers, we crawl definition sentences from the web for these queries; how to tell good answers from bad ones and rank all candidate answers reasonably leaves much room for improvement. The traditional method uses an SVM to rank definition sentences, but its features are all syntactic, and even strengthening the semantic features through language models still leaves some defects. Thus, we combine the implicit and explicit models by adding a conceptualization process to RNNLM. The semantic relations between terms and their definitions are obtained, so that we improve both precision and recall.

     2015.12.18  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Mobile Group) Privacy Integrated Queries---An Extensible Platform for Privacy-Preserving Data Analysis 
    PINQ (Privacy Integrated Queries) is a platform for privacy-preserving data analysis, built atop C# Language Integrated Queries (LINQ). LINQ is a recent language extension to the .NET framework for integrating declarative access to data streams (using a language very much like SQL) into arbitrary C# programs. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of the PINQ analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform.
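The differential-privacy mechanism that PINQ applies to aggregates can be illustrated with a noisy count. PINQ itself is a C#/LINQ platform; this Python fragment only shows the Laplace mechanism idea, with made-up toy data.

```python
import math
import random

# Illustrative sketch of the mechanism behind a PINQ-style NoisyCount:
# add Laplace noise calibrated to the query's sensitivity.

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverse-transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def noisy_count(records, predicate, epsilon, rng):
    """A count has sensitivity 1, so Laplace(1/epsilon) noise yields
    epsilon-differential privacy for this single query."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 38]          # toy "unscrubbed" data
estimate = noisy_count(ages, lambda a: a >= 30, epsilon=0.5, rng=rng)
```

Smaller epsilon means larger noise and stronger privacy; PINQ additionally tracks the privacy budget consumed across all queries an analyst issues.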
     (Web Group) Knowledge Base and Matrix Factorization 
    With the development of the Semantic Web, the automatic construction of large-scale knowledge bases (KBs) has been receiving increasing attention in recent years. Although these KBs are very large, they are still often incomplete. Many existing approaches to KB completion focus on performing inference over a single KB and suffer from the feature sparsity problem. Moreover, traditional KB completion methods ignore the complementarity that exists implicitly across various KBs. We will go through the basic ideas and the mathematics of matrix factorization, and then present popular KB completion techniques based on matrix factorization or tensor decomposition. This report focuses on embedding technology for KB completion. Several methods are shown in the slides, including RESCAL, TRESCAL, and MF with similarity and negative data.
     2015.12.11  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Differentially private high-dimensional data publication via sampling-based inference 
    Releasing high-dimensional data enables a wide spectrum of data mining tasks, yet individual privacy has been a major obstacle to data sharing. This paper considers the problem of releasing high-dimensional data with differential privacy guarantees and proposes a novel solution that preserves the joint distribution of a high-dimensional dataset. It first develops a robust sampling-based framework to build a dependency graph, and then identifies a set of marginal tables from the dependency graph to approximate the joint distribution based on the junction tree algorithm while minimizing the resultant error.
     (Privacy Group) GUPT: Privacy Preserving Data Analysis Made Easy 
    GUPT uses a new model of data sensitivity that degrades privacy of data over time. This enables efficient allocation of different levels of privacy for different user applications while guaranteeing an overall constant level of privacy and maximizing the utility of each application. GUPT also introduces techniques that improve the accuracy of output while achieving the same level of privacy. These approaches enable GUPT to easily execute a wide variety of data analysis programs while providing both utility and privacy.
     2015.12.04  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Common Patterns of Online Collective Attention Flow 
    If we see the Web as a virtual living organism, according to the metabolic theory, it must absorb “energy” to grow and evolve. We want to know: (1) where does the “energy” of the Web come from? (2) what are the common patterns of this “energy” flow? We conjecture that websites' survival and development rely heavily on this energy, which is online collective attention flow. We analyze empirical data obtained from CNNIC and find a number of interesting common patterns: the allometric scaling laws, the dissipation laws, the gravity law, and Heaps' law. These common patterns will play an important role in quantifying Web evolution and predicting online collective behaviors.
     (Cloud Group) Resource estimation and performance analysis for Storm and Spark 
    In the era of big data, there are a number of distributed data stream processing systems for coping with different data processing models. This report focuses on the stream processing system Storm and the hybrid processing system Spark. First, Storm has some drawbacks, such as the limitation of rebalance under dynamically changing data stream load, so we design a prediction model using the MOA framework and achieve dynamic optimization of the configuration parameters. Second, we construct a performance analysis platform based on Spark and describe Spark's bottlenecks quantitatively. Finally, we summarize and look forward to future research work.
     2015.11.27  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Mobile Group) SST: Privacy Preserving For Semantic Trajectories 
    To preserve privacy in trajectory data, most existing approaches adapt cloaking techniques to protect individual location points, or clustering and perturbation techniques to protect entire trajectories. To conform to the k-anonymity model, they first group locations/trajectories and then modify location points to ensure that a cluster of k location points/trajectories are close to each other. However, when k is large or the time span of a trajectory is long, cluster-based k-anonymity approaches suffer from great distortion and lead to misleading analysis results. Observing that it is unnecessary to provide the same level of privacy protection to all locations, we analyze the visiting status of the semantic place at which a point is situated, as well as the distribution of neighboring semantic places, and infer four privacy risk levels based on the risk of privacy breach. Then, we propose the semantic space translation algorithm, which strikes a good balance between privacy preservation and data utility.
     (Mobile Group) WISE2015 Report 
    WISE2015 participant: Lu Wang shares her experience and report involved sessions.
     (Cloud Group) Unified platform of resource management and scheduler 
    Apache Hadoop began as one of many open-source implementations of MapReduce [12], focused on tackling the unprecedented scale required to index web crawls. Its execution architecture was tuned for this use case, focusing on strong fault tolerance for massive, data-intensive computations. In many large web companies and startups, Hadoop clusters are the common place where operational data are stored and processed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler.
     2015.11.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Privacy Group) Bayesian Differential Privacy on Correlated Data 
    Differential privacy provides a rigorous standard for evaluating the privacy of perturbation algorithms. It has widely been regarded that differential privacy is a universal definition that deals with both independent and correlated data, and that a differentially private algorithm can protect privacy against arbitrary adversaries. However, recent research indicates that differential privacy may not guarantee privacy against arbitrary adversaries if the data are correlated. The paper focuses on private perturbation algorithms on correlated data and proposes Bayesian differential privacy, by which the privacy level of a probabilistic perturbation algorithm can be evaluated even when the data are correlated and the prior knowledge is incomplete.
     (Cloud Group) Key Concept of Spark-RDD (Resilient Distributed Dataset) 
    Spark is a fast and general engine for large-scale data processing. Its key concept is the RDD (Resilient Distributed Dataset). In order to demonstrate the roots of Spark's advantages, this report first introduces the origin and overview of RDDs. Then the lineage, fault tolerance, and generality of RDDs are presented. Finally, we summarize and talk about our future work.
     2015.11.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) How to reduce system power using flash 
    With the development of cloud computing, clusters have become bigger than ever and consume more power. How can we reduce the power of a cluster? A cluster is composed of many nodes, so if we can reduce the power of a single node, we can reduce the power of the cluster. This report briefly introduces some general methods to reduce the power of a single node. Moreover, it introduces methods from two papers that use flash to reduce system power.
     (Knowledge Fusion) Entity Linking in Biomedical Domain 
    Although the Entity Linking (EL) task is well studied in news and social media, it has not received much attention in the life science domain. The first paper examines a key task in biomedical text processing: normalization of disorder mentions. It presents a multi-pass sieve approach to this task, which has the advantage of simplicity and modularity. This approach is evaluated on two datasets, one comprising clinical reports and the other comprising biomedical abstracts, achieving state-of-the-art results. Since existing supervised approaches require a large amount of manually labeled training data, which is currently unavailable for the life science domain, the second paper proposes a novel unsupervised collective inference approach to link entities from unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structures in ontologies for similarity computation and entity ranking. It also provides an in-depth analysis and discussion of both the challenges and opportunities of automatic knowledge enrichment for scientific literature.
     2015.11.6  Topic: Open-domain question answering
     (Web Group) Semantic Parsing for Single-Relation Question Answering 
    Open-domain question answering (QA) is an important yet challenging problem that remains largely unsolved. In this paper, the authors propose a semantic parsing framework based on semantic similarity for open-domain QA. Using convolutional neural network models, the paper measures the similarity of entity mentions with entities in the knowledge base (KB), and the similarity of relation patterns with relations in the KB. Deep learning is very popular nowadays, and these new techniques are worth investigating in knowledge base related fields.
     2015.10.30  Topic: Unified platform of resource management and scheduler
     (Cloud Group) Spark and MapReduce performance comparison 
    This is the speaker's first report since joining the lab, presenting an initial understanding of the Spark and MapReduce platforms. The main content is an introductory paper comparing the performance of Spark and MapReduce. The paper points out that, for different tasks, you should use the right framework, whether MapReduce or Spark.
     (Cloud Group) Unified platform of resource management and scheduler 
    Large-scale compute clusters are expensive, so it is important to use them well. Utilization and efficiency can be increased by running a mix of workloads on the same machines: CPU- and memory-intensive jobs, small and large ones, and a mix of batch and low-latency jobs, i.e., ones that serve end user requests or provide infrastructure services such as storage, naming or locking. This consolidation reduces the amount of hardware required for a workload, but it makes the scheduling problem (assigning jobs to machines) more complicated: a wider range of requirements and policies have to be taken into account. Meanwhile, clusters and their workloads keep growing, and since the scheduler's workload is roughly proportional to the cluster size, the scheduler is at risk of becoming a scalability bottleneck.
     2015.10.23  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Work on Storm 
    This report introduces features of the Storm system, mainly focusing on its differences from Hadoop and Spark, which offer batch processing. To evaluate Storm's performance, namely CPU and memory usage, we implemented the Linear Road Benchmark in Storm and deployed the Ganglia tool to monitor the Storm cluster. From the results, we found that parameter tuning in Storm is essential for CPU and memory usage. We will introduce the DRS model proposed at ICDCS 2015 before our machine learning proposal, followed by feature selection and techniques for sample collection.
     (Web Group) Joint RNNLM for Definition Mining 
    Question answering, which provides direct answers instead of ten blue links for users' queries, has become a hot trend in web search. To answer "what" questions, one of the biggest segments in Q&A systems, we need to mine definitions from the web. Traditional approaches tend to use an SVM to rank alternative definitions, but the features are usually syntactic. Even after adding a word embedding feature to comprehend definitions semantically, we still ignore some important relations for definitions, such as the is-a relation between words. Therefore, we propose a joint RNNLM model for definition mining. It combines the explicit language model with the implicit models, namely conceptualization and word embedding, to capture the semantic relations between terms and their definitions.
     2015.10.16  Topic: Privacy Protection
     (Web Group) WISE Pre-Report 
    WISE participant: Lu Wang shares her pre-report.
     (Mobile Group) Provenance and Privacy
    Provenance in scientific workflows is a double-edged sword. On the one hand, recording information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions, enables transparency and reproducibility of results. On the other hand, a scientific workflow often contains private or confidential data and uses proprietary modules. Hence, providing exact answers to provenance queries over all executions of the workflow may reveal private information.
     2015.10.9  Venue: FL1, Meeting Room, Information Building
     (Cloud Group) Approximate Medians and other Quantiles in One Pass and with Limited Memory 
    This report presents some new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit, and apply for arbitrary value distributions and arrival distributions of the dataset. The main-memory requirements are smaller than those reported earlier by an order of magnitude. It also discusses methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are still explicit but probabilistic, i.e., they apply with respect to a (user-controlled) confidence parameter. Finally, the report presents the algorithms, their theoretical analysis, and simulation results on different datasets.
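The sampling-coupled variant mentioned in the abstract can be illustrated with a one-pass approximate median over bounded memory. This sketch uses reservoir sampling, which gives only probabilistic guarantees; the paper's deterministic summaries are stronger.

```python
import random

# One-pass, bounded-memory approximate quantile via reservoir sampling
# (an illustration of the sampling idea, not the paper's algorithm).

def reservoir_quantile(stream, q, k=500, seed=0):
    """Return an approximate q-quantile of the stream, keeping only k values."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if len(sample) < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)      # keep each item with prob k/(i+1)
            if j < k:
                sample[j] = x
    sample.sort()
    return sample[min(int(q * len(sample)), len(sample) - 1)]

# Median of 0..99999, estimated while storing only 500 of the 100000 values.
approx = reservoir_quantile(range(100000), 0.5)
```

The error shrinks with the sample size k, matching the paper's observation that sampling trades memory for a confidence parameter.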
     (Cloud Group) The parameter optimizing of Hadoop based on machine learning 
    Parameters in Hadoop are relevant to system performance. When the hardware configuration is fixed, different parameter configurations result in different run times. Thus, we can learn the relationship between parameter choices and system performance with machine learning algorithms. We run the Hadoop system under many feature vectors consisting of benchmarks and different hardware and software configurations, and obtain the run time corresponding to each feature vector. The feature vectors plus their corresponding run times form a dataset from which learning algorithms can build a model of system behavior. Using the model, we can predict the run time for a new feature vector, detect anomalies, and estimate the smallest scale of the feature dataset when upgrading system hardware.
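The prediction step described above can be sketched in a few lines. The parameter names and timings below are invented for illustration, and a k-nearest-neighbor regressor stands in for whatever learner the report actually uses.

```python
# Sketch: predict Hadoop job run time from a configuration feature vector
# using k-NN regression over past (feature_vector, run_time) observations.
# All features and timings are hypothetical.

def predict_runtime(dataset, features, k=2):
    """Average the run times of the k nearest recorded feature vectors."""
    def dist(row):
        vec, _ = row
        return sum((a - b) ** 2 for a, b in zip(vec, features))
    nearest = sorted(dataset, key=dist)[:k]
    return sum(t for _, t in nearest) / k

# (map_slots, reduce_slots, io_sort_mb, heap_gb) -> run time in seconds
runs = [
    ((4, 2, 100, 1.0), 620.0),
    ((8, 4, 200, 2.0), 410.0),
    ((16, 8, 400, 4.0), 305.0),
    ((16, 8, 100, 4.0), 360.0),
]
t = predict_runtime(runs, (8, 4, 200, 2.0), k=1)
```

With more recorded runs, the same model supports the anomaly-detection use in the abstract: a job whose actual run time is far from the prediction is flagged.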
     2015.10.2  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Accelerating File System Access with Nonvolatile Memory 
    With the development of social networks and electronic business, there is a great deal of data in our daily life, and everyone wants to store it with high performance. However, this depends on the traditional CPU-memory-disk system architecture, in which the speed of the CPU and the speed of the disk do not match. The new non-volatile memory has the advantage of fast access speed. How can we improve the speed of disk access with the new non-volatile memory? This report introduces two related papers whose methods improve the speed of file system access and of disk access.
     (Web Group) Using Encyclopedic Knowledge to Understand Queries 
    Query understanding is a challenging but beneficial task. In this paper, we propose a context-aware method that uses encyclopedic knowledge to aid query understanding. Given a query, we first use a dictionary constructed from encyclopedic knowledge bases to detect the possible entities and their associated categories. Then, we use a topic-based method to derive semantic information from the query. By comparing the topical similarity between various candidate phrases, we get the most likely entities and their related categories.
     (Web Group) Question Answering over Linked Data Using First-order Logic 
    Question answering over linked data (QALD) aims to evaluate answering systems over structured data. Its key objective is to translate questions posed in natural language into structured queries. This report introduces a novel method that uses a Markov Logic Network to resolve the ambiguities, and a holistic framework to complete the query translation.
     2015.09.25  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Web Group) Plan Bouquets: Query Processing without Selectivity Estimation 
    Selectivity estimates for optimizing OLAP queries often differ significantly from those actually encountered during query execution, leading to poor plan choices. The article proposes a new approach to address this problem, wherein the compile-time estimation process is completely eschewed for error-prone selectivities. Instead, a small “bouquet” of plans is identified from the set of optimal plans in the query's selectivity error space, such that at least one among this subset is near-optimal at each location in the space. Then, at run time, the actual selectivities of the query are incrementally “discovered” through a sequence of partial executions of bouquet plans, eventually identifying the appropriate bouquet plan to execute. The duration and switching of the partial executions is controlled by a graded progression of isocost surfaces projected onto the optimal performance profile.
     (Cloud Group) Head First DDSMS 
    In the era of big data, distributed data stream processing systems have emerged to process distributed and ever-growing stream data. To build a complete distributed data stream management system (DDSMS), the query system must be easy to use and the query processing system must be improved. First, this report introduces the background and development of DDSMSs, covering both stream processing and stream querying. Then, it summarizes the hot research fields of DDSMSs (such as query languages, system performance improvements, and building new architectures). Finally, we point out the ongoing research work and future research directions.
     2015.09.18  Topic: Online users' behaviors evolution dynamics
     (Web Group) Online users'behaviors evolution dynamics: a collective attention flow perspective 
    If we see the Web as a virtual living organism, according to the metabolic theory, websites must absorb “energy” to grow, reproduce and develop. We are interested in the following two questions: (1) where does the “energy” come from? (2) will the websites generate macro-level influence on the whole Web based on this “energy”? We conjecture that websites grow at the cost of collective users' attention flow as “energy” and produce influence on the whole Web. In other words, the more attention flow a website draws from online collective users, the more influence it creates. The results of data analysis in our experiments confirm this conjecture. Furthermore, we study collective surfing behavioral data from a network science perspective and find that the evolution of the Web is governed by the same rules that govern the evolution of living organisms.
     2015.6.26  Theme: Emerging Programming Languages
     (Cloud Group) Introduction to Emerging Programming Languages 
    With the rapid development of mobile internet technology, new programming languages have appeared that are suitable for different application platforms. This report starts by introducing the programming languages used in new data management systems, and describes the features and application scenarios of popular programming languages. It focuses on two newly released programming languages: Swift, released by Apple, and Go, released by Google. A detailed analysis and comparison of the two languages is carried out in the following aspects: historical origin, language features, compilation framework, and performance. Finally, we summarize the hardware platforms and feature classification of the new programming languages.
     2015.6.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Spark Actual Combat 
    In the post-Hadoop era, Spark, the rapidly developing next-generation core technology for big data, is regarded with its incomparable advantages as an alternative to Hadoop for cloud computing. I will introduce Spark cluster construction, architecture, core analysis, Shark, Spark on YARN, and JobServer through practical examples.
     (Web Group) Horizon- detail-height 
    This report focuses on two articles about clustering. The two articles share a similar core idea, but differ considerably in implementation details, focus, and writing technique, and were published in different venues. The purpose is to study the articles through comparison, and to accumulate writing methods.
     2015.6.12  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Introducing SSD to Hadoop 
    Hadoop is a distributed data processing system. It depends on the classical computer system architecture, which is composed of disk, DRAM, and CPU. However, the speed of disk cannot match that of the CPU, and SSDs are faster than disks, so we introduce SSDs to Hadoop. Many problems arise between SSDs and Hadoop, and this report introduces two papers related to them.
     (Mobile Group) Practical knn queries with location privacy
    In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. This report presents a solution for the mobile user to preserve his location privacy in kNN queries. The proposed solution is built on the Paillier public-key cryptosystem and can provide both location privacy and data privacy.
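The property of the Paillier cryptosystem that such protocols rely on is additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The toy implementation below uses tiny fixed primes purely for demonstration; real deployments need large random primes and a vetted cryptographic library.

```python
import math

# Toy Paillier cryptosystem illustrating the additive homomorphism
# (demonstration only: fixed small primes and a fixed nonce are insecure).

def keygen(p=293, q=433):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m, r=17):
    """Encrypt m with randomizer r (must be coprime to n)."""
    n, g = pub
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

pub, priv = keygen()
# Homomorphic addition: E(12) * E(30) decrypts to 42.
c = (encrypt(pub, 12, r=17) * encrypt(pub, 30, r=23)) % (pub[0] ** 2)
total = decrypt(pub, priv, c)
```

In the kNN protocol, this lets a server combine encrypted coordinates into encrypted distances without ever learning the user's location.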
     2015.5.30  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The PhD Grind [ppt]
    This report shares some interesting books that I read during my doctoral years.
     (Web Group) Short text clustering 
    Short texts are more and more popular. The classification and clustering of short texts is a beneficial yet challenging problem due to the difficulty of setting the number of clusters, high dimensionality, interpretability of results, scalability to large datasets, and sparsity. This seminar discusses several existing methods for short text classification and clustering, including topic model based methods, Dirichlet multinomial mixture model based methods, and concept-based methods.
     2015.05.22  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Live PPT 
    As a popular office application, PowerPoint plays an important role in presentations. Making a good PowerPoint presentation is not difficult, but it does require some forethought. The more attractive your presentation is, the more likely your audience will be to understand and remember the information you present. This report shows how to make a lively PowerPoint.
     (Web Group) Semantic similarity measure of sentence and document 
    Semantic similarity measurement is one of the most important topics in natural language processing. It is widely used in search engines, text mining, recommendation systems, and so on. This report first introduces the traditional word-similarity-based approach to measuring sentence and document similarity, and then introduces the Word2Vec and Doc2Vec models, which use deep learning architectures to measure similarity and outperform other models.
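A common baseline that connects word vectors to sentence similarity is to average the word vectors and compare the averages by cosine similarity. The tiny 3-dimensional vectors below are made up for illustration; real Word2Vec/Doc2Vec embeddings are learned from large corpora.

```python
import math

# Sketch: sentence similarity from word vectors (toy vectors; real
# embeddings come from trained Word2Vec/Doc2Vec models).

vectors = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.1],
    "car":  [0.0, 0.9, 0.4],
    "fast": [0.1, 0.8, 0.5],
}

def sentence_vector(words):
    """Average the word vectors, a common Word2Vec sentence baseline."""
    dims = len(next(iter(vectors.values())))
    acc = [0.0] * dims
    for w in words:
        for i, v in enumerate(vectors[w]):
            acc[i] += v
    return [a / len(words) for a in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

sim_animal = cosine(sentence_vector(["cat"]), sentence_vector(["dog"]))
sim_mixed  = cosine(sentence_vector(["cat"]), sentence_vector(["car", "fast"]))
```

Related words end up with nearby vectors, so "cat" scores higher against "dog" than against the unrelated "car fast".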
     2015.05.16  Theme: Network Science and Complexity Science
     (Web Group) Special Lecture:Complex World,Simple Rules----From Network Science to Complexity Science 
    The explosive growth of the World Wide Web during the past two decades presents an important complex artificial system for scientists seeking to unveil the universal patterns and principles of its organization. In this talk, we start with a brief introduction to network science, including the work by Prof. Barabási et al. describing how the tools of network science can help understand the Web's structure, development, and weaknesses. Additionally, we give a brief survey of the various research fields in complexity science, presenting key concepts and analyzing the state of the art of the field. Finally, some new work and ideas by our team are introduced, including the study of online attention flow networks and the allometric scaling law in website growth.
     2015.5.8  Theme: DASFAA Report
     (Mobile Group) DASFAA Report 
    DASFAA participants: Jiangtao Wang, Lu Wang, Fengming Wang share their experience and report involved sessions separately.
     2015.04.25  Theme: Bootstrap
     (Cloud Group) Bootstrap for AQP and OLA systems 
    As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates, and the bootstrap provides perhaps the most promising step in this direction. In this report, we survey related work in the approximate query processing (AQP) community in recent years, and introduce the main ideas of the paper "Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems". Besides, we introduce the first work integrating the bootstrap into online aggregation (OLA), "G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data".
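The bootstrap idea those systems build on can be sketched in a few lines: resample the sample with replacement, recompute the aggregate each time, and read an error interval off the spread. The data below are toy values; production systems such as G-OLA do this incrementally inside the query engine.

```python
import random

# Sketch of a percentile bootstrap confidence interval for an aggregate,
# the mechanism AQP systems use to assess estimate quality.

def bootstrap_ci(sample, stat, trials=1000, alpha=0.05, seed=0):
    """Return an approximate (1 - alpha) confidence interval for stat."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(sample) for _ in sample]) for _ in range(trials)
    )
    lo = estimates[int(trials * alpha / 2)]
    hi = estimates[int(trials * (1 - alpha / 2)) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

sample = [12, 15, 9, 22, 18, 14, 11, 25, 16, 13]   # toy query sample
lo, hi = bootstrap_ci(sample, mean)
```

The same routine works for any plug-in statistic (sums, ratios, quantiles), which is exactly why the bootstrap generalizes beyond the closed-form error formulas of classical OLA.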
     2015.4.17  Theme: DASFAA Pre-Report
     (DASFAA Participants) DASFAA Pre-Report 
    DASFAA participants: Jiangtao Wang, Lu Wang share their pre-report separately.
     2015.4.10  Theme: Cloud Data Management
     (Cloud Group) Real-time query processing for distributed data streams 
    We can design a directed acyclic graph (DAG) processing model for data-intensive computing in distributed environments, where the optimization strategy for the DAG is the key research point. This report analyzes two processing strategies on a Storm testbed. One is the dynamic construction of the degree of parallelism (dop) and the processing batch size (bs) for continuous queries. The other is an adaptive join operator designed for intra-operator adaptivity. Finally, we summarize the two optimization strategies and plan future work and study.
     (Cloud Group) Some issues in Online Aggregation 
    The appearance of big data has brought great challenges to traditional data management technology. With the rapid growth of data volumes, research in the field of relational databases has focused on the problem of aggregation queries over massive data, and has proposed the online aggregation approach to handle such aggregates.
     2015.04.03  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Can SSDs help improve MapReduce performance? 
    MapReduce is a programming model for parallel computation over large datasets. This report analyzes the I/O characteristics of the Map, Shuffle, and Reduce phases, and then investigates the viability of improving MapReduce performance by adopting SSDs.
     (Web Group) Semantic Similar Words Extraction from Large Corpus 
    Semantic similar word extraction plays an important role in natural language processing research. The distributional hypothesis states that words with similar meanings tend to appear in similar contexts, so we construct a word vector for each word and measure word similarity by vector similarity. We then filter and rerank the results from the word vectors using machine learning models to obtain the final list of synonyms. This report introduces the distributional hypothesis and the reranking model respectively.
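The distributional-hypothesis step can be sketched directly: represent each word by the counts of its context words, then compare words by cosine over those context vectors. The four-sentence corpus is a toy; real systems use large corpora plus the reranking model the report describes.

```python
import math
from collections import Counter

# Sketch of the distributional hypothesis: words that share contexts
# ("car"/"truck" both co-occur with "needs fuel") come out similar.

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the car needs fuel",
    "the truck needs fuel",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within the window."""
    vecs = {}
    for s in sentences:
        toks = s.split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv)

vecs = context_vectors(corpus)
```

On this corpus, cosine(vecs["car"], vecs["truck"]) exceeds cosine(vecs["car"], vecs["mouse"]), which is the signal the reranking model then refines.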
     2015.03.27  Theme: Memory Management
     (Cloud Group) The efficient management of hybrid PCM and DRAM memory 
    New applications and multi-core processors need more main memory, but increasing memory capacity leads to increasing energy consumption. How can we reduce the energy consumption of computers? Hybrid PCM and DRAM memory can reduce energy consumption, and this report introduces two papers about it.
     (Cloud Group) Deep into Spark 
    Spark is a distributed framework for parallel in-memory computing. It has attracted wide public attention and is developing fast. This report introduces the Spark ecosystem, the theory and usage of Spark, and programming on Spark. Last but not least, we discuss the advantages and disadvantages of Spark for big-data processing compared with MapReduce.
     2015.03.20  Theme: Web Data Management
     (Web Group) How Websites Influence Growth? 
    The availability of big data, such as human online surfing records, makes it possible to probe into and quantify the regular patterns of users' long-range, complex interactions with websites. We try to apply the approaches developed in complex weighted network and flow network studies to the clickstream network. If we see users' attention flow as energy flow, it is reasonable to conjecture that the patterns found in other weighted networks should also suit weighted clickstream networks. By analyzing the circulation of collective attention flow, we discover a scaling relationship between the impact of websites and their attention flow traffic: website influence grows linearly with attention flow size. We also examine the collective total time spent online at a website against website influence, and it turns out that influence scales sub-linearly with time. This result is not consistent with the common-sense belief that “the more of users' time a website takes up, the greater its influence”.
     (Web Group) Named entity recognition and conceptualization on short text 
    Named entity recognition (NER) is very important for many applications such as information integration, knowledge base population, question answering, and so on. Though plenty of work has been devoted to this task, the existing methods cannot perform well on short text due to its sparsity, noise, and lack of syntactic structure. This seminar focuses on the NER task on short text such as search engine queries.
     2015.03.13  Theme: MapReduce
     (Cloud Group) MRSimJoin: MapReduce-based Similarity Join for Metric Spaces 
    Similarity join is one of the most popular techniques in the domain of data analysis and cleaning. Recently, much work has proposed efficient solutions for textual similarity join, but not much for distance-based similarity join. This presentation introduces an approach that uses grid partitioning and MapReduce for efficient distance-based similarity join processing.
     (Cloud Group) SpongeFiles: Mitigating Data Skew in MapReduce Using Distributed Memory 
    Data skew is a major problem for data processing platforms such as MapReduce. Skew causes worker tasks to spill to disk whatever they cannot fit in memory, which slows down both the task and the overall job. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. Spilled data goes to SpongeFiles, which routes it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort).
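    The routing policy can be pictured as a simple tier preference list: try storage locations in order of proximity and take the first with enough free capacity. The function below is a hypothetical sketch of that idea; the tier names and capacities are invented and are not SpongeFiles' actual API:

```python
# Hypothetical preference order, nearest first.
SPILL_TIERS = ["local_memory", "remote_memory", "local_disk", "remote_disk"]

def route_spill(size_mb, free_mb):
    """Pick the nearest tier that can absorb a spill of size_mb.

    free_mb: map of tier name -> free capacity in MB (illustrative).
    """
    for tier in SPILL_TIERS:
        if free_mb.get(tier, 0) >= size_mb:
            return tier
    raise RuntimeError("no tier can absorb the spill")

print(route_spill(64, {"local_memory": 10, "remote_memory": 128,
                       "local_disk": 500, "remote_disk": 10000}))
# → remote_memory (local memory is too full, so spill one hop away)
```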
     2015.01.20  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) Random Sampling in Online Aggregation 
    Online aggregation (OLA) algorithms must rely on randomness to achieve statistical guarantees on the accuracy of their estimates. In this report, we introduce some classic probability sampling methods, including simple random sampling, stratified sampling, cluster sampling, and systematic sampling, as well as efficient methods for answering random sampling queries over relational databases. Besides, two sampling techniques targeted at single-table and multiple-table OLA are discussed. Finally, we describe two sampling algorithms for processing OLA on skewed data.
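    The difference between simple random and stratified sampling can be sketched in a few lines with Python's standard library; the toy population and strata below are assumptions for illustration only:

```python
import random

random.seed(42)
population = list(range(100))  # toy population of 100 record ids

def simple_random_sample(pop, n):
    """Every size-n subset is equally likely (without replacement)."""
    return random.sample(pop, n)

def stratified_sample(pop, strata_key, per_stratum):
    """Draw a fixed number of records from each stratum, so that every
    stratum is represented even if a plain random sample would miss it."""
    strata = {}
    for rec in pop:
        strata.setdefault(strata_key(rec), []).append(rec)
    out = []
    for members in strata.values():
        out.extend(random.sample(members, min(per_stratum, len(members))))
    return out

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, lambda r: r % 5, 2)  # 5 strata, 2 each
print(len(srs), len(strat))
```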
     (Cloud Group) MRSimJoin: Partition-based Textual Similarity Join 
    Textual similarity join is an important part of spatio-textual similarity join. There are many kinds of solutions for textual similarity join, such as prefix filtering and edit-distance-based partitioning. This presentation introduces an efficient partition-based similarity join method, covering its main idea and its distributed implementation using MapReduce.
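    Prefix filtering, one of the techniques mentioned above, rests on the observation that for a Jaccard threshold t, two records can only be similar enough if the first |r| − ⌈t·|r|⌉ + 1 tokens of their canonically ordered token sets share an element. A minimal single-machine sketch (using lexicographic token order instead of the usual increasing-global-frequency order):

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(records, t):
    """Return id pairs of records with Jaccard similarity >= t."""
    index = defaultdict(set)   # token -> ids whose prefix contains it
    candidates = set()
    sorted_recs = [sorted(set(r)) for r in records]
    for rid, toks in enumerate(sorted_recs):
        # Prefix length: if two records share no prefix token, their
        # overlap is provably too small to reach threshold t.
        plen = len(toks) - math.ceil(t * len(toks)) + 1
        for tok in toks[:plen]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].add(rid)
    # Verification step: compute the real similarity of each candidate.
    return [(i, j) for i, j in candidates
            if jaccard(sorted_recs[i], sorted_recs[j]) >= t]

recs = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z", "w"]]
print(prefix_filter_join(recs, 0.6))  # only records 0 and 1 qualify
```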
     2015.01.13  Venue: FL1, Meeting Room, Wing Building for Science Complex
     (Cloud Group) The future of solid-state memory 
    The growing density of solid-state memory will bring us new challenges. How can we respond to them? This report presents five schemes for a hypothetical system proposed by Microsoft Research.
     (Cloud Group) Sampling for group-by queries 
    Sampling is an important technique for approximate query processing. For group-by queries, low-selectivity predicates can produce small groups, and uniform random sampling becomes unreliable when such small groups exist, since it may miss them entirely. This report explains how small groups arise and presents several approaches to handle them.
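    The small-group problem can be demonstrated directly: on a skewed table, a small uniform sample often misses a rare group entirely, while a per-group (stratified) sample always covers every group. The table below is synthetic and only serves to illustrate the effect:

```python
import random

random.seed(7)
# Skewed table: group 'big' has 9900 rows, group 'rare' only 100 (1%).
rows = ([("big", random.random()) for _ in range(9900)] +
        [("rare", random.random()) for _ in range(100)])

def groups_in_uniform_sample(rows, n):
    """Groups that survive a uniform random sample of n rows."""
    return {g for g, _ in random.sample(rows, n)}

def groups_in_per_group_sample(rows, per_group):
    """Groups that survive when sampling per_group rows from each group."""
    by_group = {}
    for g, v in rows:
        by_group.setdefault(g, []).append((g, v))
    sampled = []
    for members in by_group.values():
        sampled.extend(random.sample(members, min(per_group, len(members))))
    return {g for g, _ in sampled}

# Count how often a 20-row uniform sample loses the 'rare' group.
misses = sum("rare" not in groups_in_uniform_sample(rows, 20)
             for _ in range(100))
print(misses, groups_in_per_group_sample(rows, 20))
```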
     (Cloud Group) Core Technology Analysis of Storm 
    MapReduce, Hadoop, and related technologies allow us to handle far more data than before. However, they are not real-time data processing systems and were not designed for real-time computation; there is no easy way to turn Hadoop into one, because real-time and batch data processing systems differ fundamentally in their requirements. The lack of a "real-time Hadoop" has been a huge gap in data processing, and Storm fills that gap.
     2015.01.06  Theme: Spatio-Textual Query
     (Cloud Group) Quadtree Spatial Index for Spatio-Textual Query 
    Spatio-textual data contains not only textual information but also spatial location information, and it is typically produced and consumed by location-based service (LBS) applications. Query processing over such data usually combines spatial and textual query techniques, and the choice of those techniques is the key to improving query performance. The academic community has paid attention to this kind of query for a long time and has published a large body of work. Drawing on two recently published papers, this report introduces how to use a quadtree as the spatial component in spatio-textual query processing.
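    The spatial side of such an index can be sketched as a minimal point quadtree with node splitting and range-query pruning; a spatio-textual index would additionally attach textual information (e.g. per-node inverted lists of terms) to each node. All parameters below are illustrative:

```python
class QuadTree:
    """Minimal point quadtree: a leaf splits into four children once it
    exceeds `capacity` points; range queries prune whole subtrees whose
    square cell does not intersect the query rectangle."""

    def __init__(self, x, y, size, capacity=2):
        self.x, self.y, self.size = x, y, size  # cell [x, x+size) x [y, y+size)
        self.capacity, self.points, self.children = capacity, [], None

    def insert(self, px, py):
        if self.children is not None:
            self._child(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity and self.size > 1e-9:
            half = self.size / 2
            self.children = [QuadTree(self.x + dx * half, self.y + dy * half,
                                      half, self.capacity)
                             for dy in (0, 1) for dx in (0, 1)]
            pts, self.points = self.points, []
            for qx, qy in pts:               # push points down on split
                self._child(qx, qy).insert(qx, qy)

    def _child(self, px, py):
        half = self.size / 2
        i = (2 if py >= self.y + half else 0) + (1 if px >= self.x + half else 0)
        return self.children[i]

    def range(self, x1, y1, x2, y2):
        if (x2 < self.x or x1 >= self.x + self.size or
                y2 < self.y or y1 >= self.y + self.size):
            return []                        # cell disjoint from query: prune
        if self.children is not None:
            return [p for c in self.children for p in c.range(x1, y1, x2, y2)]
        return [(px, py) for px, py in self.points
                if x1 <= px <= x2 and y1 <= py <= y2]

qt = QuadTree(0, 0, 8)
for p in [(1, 1), (2, 3), (6, 6), (7, 1), (3, 7)]:
    qt.insert(*p)
print(sorted(qt.range(0, 0, 4, 4)))  # only the two lower-left points
```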