Seminars: Renmin University of China Database Research Group (WAMDM)

Seminars 2018

2019.01.18 FL1, Wing Building for Science Complex
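张祎 (Web Group)	Parallel Knowledge Graph Representation Learning Based on Triple Interaction Abstract: The challenges of large-scale embedding lie mainly in two areas: model quality and running time. Lowering running time only by reducing the model's time complexity yields very limited gains. Model accuracy must also be considered, and balancing the two is harder still. This report therefore analyzes these challenges from two angles: the model and parallel SGD. On the model side, to reduce the model's dependence on the dataset, we model the interactions among head entity, tail entity, and relation. Next, to further reduce running time on large-scale datasets, the report surveys the literature on parallel SGD and, considering technical maturity, selects Hogwild!, a lock-free parallel SGD framework. Finally, building on the triple-interaction representation learning model, the report runs experiments with the Hogwild! framework, which show that the triple-interaction model performs well on special datasets and that Hogwild! incurs low error under multi-threaded parallelism.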
2019.01.11 FL1, Wing Building for Science Complex
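邵玉杰 (Web Group)	Research on Scholar Profiling Abstract: A scholar profile is a kind of user persona: an effective tool for sketching target users and connecting user needs with design direction, linking users' attributes, behavior, and expectations. The purpose of studying scholar profiles is to mine scholar data further and to tag scholar information so that data integration is easier. This report introduces scholar profiling research in detail from several aspects: scholar information, scholars' research interests, and scholars' academic research. It also demonstrates ScholarRanking, a scholar ranking system we developed.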
2019.01.04 FL1, Wing Building for Science Complex
陈珂锐	Methods for Interpreting and Understanding Deep Neural Networks Abstract: In recent years, many deep neural networks have been constructed as black boxes, that is, as systems that hide their internal logic from the user. This lack of explanation constitutes both a practical and an ethical issue. I will introduce some recently proposed interpretation techniques, along with theory, tricks, and recommendations, to make the most efficient use of these techniques in practical applications. This discussion provides an entry point to the problem of interpreting a deep neural network model and explaining its predictions.
杨晨 (Cloud Group)	Introduction to AstroServ1.0 Abstract: This report introduces AstroServ1.0, the official release of a system the Cloud Group developed as part of a National Key R&D Program project. After long-term development and continuous optimization, spanning two and a half years and 10 version iterations, the system can now connect to real scientific observation projects for scientific big data management and analysis.

2018.12.28 FL1, Wing Building for Science Complex
杜永杰 (Cloud Group)	Data Exploration with SQL using Machine Learning Techniques Abstract: Nowadays data scientists have access to gigantic data, many of them being accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases having a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique to help data scientists formulate SQL queries, to rapidly and intuitively explore their big data, while keeping user input at a minimum, with no manual tuple specification or labeling. For a user specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts, and negative examples to those they do not want. The initial query is reformulated using machine learning techniques and a new query, more efficient and diverse, is obtained.
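朱敏杰 (Privacy Group)	Evaluation Studies of Mobile Applications and the Design and Implementation of the AppPrivacy System Abstract: An index is a way of reflecting how a complex phenomenon compares and changes across dimensions, and indexes are now used in every industry. In privacy and security, Gemalto, a global leader in digital security, has published its "Breach Level Index" since 2013, revealing the severity of data breach incidents worldwide. For data leakage in the use of mobile applications (Apps), the China Consumers Association ran an evaluation of App personal information collection and protection from August to October 2018, counting and scoring the personal information collection stated in each App's privacy policy. Given the serious privacy problems caused by user data leakage in today's mobile environment, and building on previously proposed privacy risk quantification models for data collectors (developers) and data owners (users), we designed AppPrivacy, a quantitative privacy risk assessment system for mobile applications. The system aims to reveal and assess the degree of user privacy leakage in mobile application scenarios. It analyzes the data produced or obtained during App use by three parties (data owners, i.e. users; data collectors, i.e. developers; and Apps) and then assesses the privacy risks they face or cause.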
2018.12.21 FL1, Wing Building for Science Complex
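吴新乐 (Web Group)	Development and Current State of Pre-training Techniques in NLP Abstract: Pre-trained models are widely used in image and video tasks, but in NLP they long served only as an auxiliary means of boosting task-specific models, as with the Word2Vec tool. Only the recent succession of models such as ELMo, GPT, and BERT has fully demonstrated the value of pre-training in NLP. My report introduces some representative work in this area to show how pre-training in NLP has developed and where it stands today.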
2018.12.14 FL1, Wing Building for Science Complex
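马超红 (Cloud Group)	A Summary of Machine-Learning-Based Databases Abstract: In recent years, the success of data-driven machine learning applications has prompted the database community to integrate machine learning techniques into the design of database systems and applications. This success brings research opportunities to the database field, and challenges as well. Since the 1970s the database field has worked on system optimization and large-scale data-driven applications, so databases and machine learning are naturally close, and combining the two fields will greatly advance big-data-driven applications. Traditional database problems such as indexing, transactions, and storage management correspond to mappings between keys and positions, and database management systems also face problems such as tuning and workload prediction. All of these create opportunities for applying machine learning, especially deep learning, in database research. This report summarizes recent research on machine-learning-based databases in five parts: 1) learned index structures; 2) learned query optimization; 3) learned database configuration; 4) automated database management systems; 5) others.
刘立新 (Privacy Group)	A Summary of Data Sharing Transparency Abstract: Data sharing is a key step in realizing the value of big data. However, the current opacity of the data sharing process seriously affects that value. On the one hand, big data may contain large amounts of personal private information, and when privacy leaks, tracing and accountability are difficult. On the other hand, big data is the basis of data-driven decisions, and sharing among multiple parties casts doubt on data trustworthiness. Making data sharing transparent, so that tracing, accountability, and data provenance are possible when needed, has become an urgent problem. The decentralization and immutability of blockchain offer a new approach to it. This paper first analyzes attack models in data sharing, then proposes a data sharing transparency model and analyzes and summarizes the current state of research on it, and finally summarizes existing work.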
2018.12.07 FL1, Wing Building for Science Complex
艾山 (Web Group)	Adversarial training for joint entity and relation extraction Abstract: Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data. We show how to use AT for the tasks of entity recognition and relation extraction. In particular, we demonstrate that applying AT to a general purpose baseline model for jointly extracting entities and relations allows improving the state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data). Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing, POS tagging, relation extraction, translation, and joint tasks. However, Szegedy et al. (2014) observed that intentional small scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence). Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model. Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017)), this paper, to the best of our knowledge, is the first attempt at investigating regularization effects of AT in a joint setting for two related tasks.
吴永泰 (Web Group)	Progress Report on the ScholarFinding System Abstract: Relation discovery uses existing knowledge in a knowledge graph to infer unknown knowledge. Data is organized in formats such as RDF and stored in a database; through a relation discovery system, users run relation discovery on entities of interest and explore unknown relations. Existing relation discovery systems such as RelFinder and RECAP are fairly mature and in effective use. ScholarFinding is a relation discovery system covering scholars, universities and institutions, journals, and academic papers in computer science in China. The system is scholar-centered and built on the existing ScholarSpace dataset. It traverses the existing data to obtain known relations between scholars and other scholars, universities and institutions, academic papers, and journals, and presents them through front-end visualization and interaction so that users can obtain relation information about the scholars they are interested in.
2018.11.17 FL1, Wing Building for Science Complex
王硕 (Web Group)		Graph Convolutional Network Abstract: Many scientific fields study data with an underlying structure that is a non-Euclidean space. Some examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. Geometric deep learning is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds. Based on a paper on graph convolutional networks, this presentation gives a preliminary introduction to and summary of this kind of technique.
2018.11.10 FL1, Wing Building for Science Complex
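马超红 (Cloud Group)		Machine-Learning-Based Database Query Optimization Abstract: Query optimization is one of the most important and most thoroughly studied problems in databases. A relational database must be optimized well to achieve acceptable performance. Traditional query optimization relies on carefully tuned, complex heuristics built from years of database developers' experience, and these heuristics usually need an expert DBA to tune them on each individual DBMS to improve query performance. Traditional optimizers are also “fire and forget”: future optimization does not use the observed performance of plans that have already been executed, so the query optimizer cannot systematically “learn from its mistakes”. Machine-learning-based query optimization combines deep learning with query optimization: it trains a deep neural network to imitate a traditional optimizer and tunes the model automatically by learning from previously executed query plans. Each subquery is given a reward, and through interaction between an agent and the environment, low-cost actions are selected to generate the query execution plan.
艾山 (Web Group)		Fine-Grained Sentiment Analysis of User Reviews Abstract: Fine-grained sentiment analysis of online reviews is of vital value for deeply understanding merchants and users and for mining user sentiment, and it is very widely used in the Internet industry, mainly for personalized recommendation, intelligent search, product feedback, and business security. The dataset covers sentiment orientation for 20 fine-grained elements in 6 major categories. The task is to build an algorithm from the labeled sentiment orientation of the fine-grained elements, train a model, and then make predictions. This is a text classification problem, so this work studies related classification methods and identifies the following: 1. sklearn SVM 2. OneVsRestClassifier 3. XGBoost 4. machine learning + sentiment dictionary. Models are built and trained with these methods and the results are compared.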
2018.11.02 FL1, Wing Building for Science Complex
杜永杰 (Cloud Group)		Real-Time Query Enabled by Variable Precision in Astronomy Abstract: As sky survey projects are coming out, petabytes and exabytes of astronomical data are continuously collected from highly productive space missions. Especially in time-domain astronomy, Short-Timescale and Large Field-of-view (STLF) sky surveys not only require real-time analysis on short-time data, but also need precise astronomical data for special phenomena. Additionally, it is important to find a partition method and build an index based on it for effective storage and query. However, the existing methods cannot simultaneously support real-time and variable-precision query in astronomy. In this paper, we propose a novel astronomical real-time and variable-precision query method based on data partitioning with Hierarchical Equal Area isoLatitude Pixelation (HEALPix for short). Our method estimates query time with a model and predicts precision with machine learning, so it can accurately predict the HEALPix partition level, which effectively reduces the time cost of querying layer by layer. The method can meet the user's requirements for real-time and variable-precision query. The experimental results show that our method can optimize previous query strategies and reach better performance.
刘立新 (Privacy Group)		A Survey of Data Transparency Technology Abstract: Big data holds enormous value and has become a core resource of the information society. Yet realizing that value has also brought problems such as privacy leakage, data manipulation, data abuse, and algorithmic “black boxes”. The root cause is the opacity of the process by which big data value is realized and the regulatory difficulty that arises from its characteristics. People urgently want this process to be transparent and verifiable. Blockchain is open, transparent, and tamper-resistant, and is gradually being applied at every stage of the big data life cycle to make value realization more transparent and to promote accountable use of big data. This paper summarizes and compares research progress on blockchain in the data acquisition, sharing, analysis, and deletion stages, and finally discusses future directions for data transparency technology.
2018.10.26 FL1, Wing Building for Science Complex
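Xin Yang (Privacy Group)		Overview of Blockchain Technology and Implementation of a Voting System Abstract: Blockchain is a ledger technology maintained jointly by multiple parties that uses cryptography to secure transmission and access and achieves consistent storage, tamper resistance, and non-repudiation of data. It is also called Distributed Ledger Technology. As a new computing paradigm and collaboration model for establishing trust at low cost in untrusted, competitive environments, blockchain, with its unique trust-building mechanism, is changing the application scenarios and operating rules of many industries, and is one of the indispensable technologies for developing the digital economy and building a new trust system. To understand blockchain better, I developed a voting system based on Ethereum, studying blockchain in depth from the smart contract perspective and looking for research topics along the way.
Yi Zhang (Privacy Group)		Large-Scale Embedding via Improved Optimization Algorithms Abstract: Embedding is the bridge between multimodal data (text, images, audio, video, and so on) and machine learning algorithms. In the big data era, data scale challenges existing embedding methods. From the initial embedding, to static batch updates, to dynamic and then high-frequency dynamic updates, each case faces challenges to some degree. Existing solutions for large-scale embedding mostly work from the angles of data partitioning and merging similar data. But every embedding process depends on an iterative procedure driven by an optimization algorithm. This report therefore first illustrates data partitioning and similar-data merging with examples, then analyzes large-scale embedding from the angle of improving the optimization algorithm, based on my own thinking. It also reports on the existing Weighted SGD and Diffused SGD methods. Finally, it summarizes the gaps, similarities, and differences between my ideas and existing models, laying the groundwork for the next step.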
2018.10.19 Wing Building for Science Complex
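Xinle Wu (Web Group)		Implementation and Automation of Relation Extraction Models Abstract: The pipeline approach is a simple and effective way to implement a relation extraction model, but its performance on public datasets is often unsatisfactory. This report covers two problems encountered when implementing the relation classification sub-model, class imbalance and long-distance dependencies within sentences, and the focal loss function and self-attention mechanism used to address them respectively. Finally, I will share some survey work on turning the relation extraction model into a system, along with my goals and plans.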
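The abstract above names focal loss as the remedy for class imbalance but does not give its form. As a rough illustration only, the sketch below uses the commonly cited per-example form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The function names and the defaults gamma = 2.0 and alpha = 1.0 are assumptions for this sketch, not details from the talk.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy for one example, given the predicted
    probability of its true class."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example (hypothetical helper for illustration).

    The factor (1 - p)^gamma shrinks the loss of well-classified examples,
    so frequent, easy classes contribute less to the total loss.
    gamma and alpha defaults are assumptions, not values from the talk.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, cross_entropy(p), focal_loss(p))
```

With gamma set to 0 the function reduces to plain cross-entropy, which makes it easy to compare the two losses on the same predictions.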
2018.10.12 Wing Building for Science Complex
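Chen Yang (Cloud Group)		Energy Analysis and Optimization of Data Movement in Databases Abstract: In contemporary computer architectures, multi-level caches ease the memory wall caused by the widening speed gap between compute and storage components, but they cause data to move frequently between cache levels, which consumes a lot of energy. In databases, queries are the core operation and a typical data-intensive computing task, so the energy cost of data movement is especially large. This wastes energy and also limits complex query operations in embedded environments. However, no prior work has evaluated in depth the energy consumption characteristics of database query operations. To address this, this paper proposes an energy characterization method that quantifies the energy consumed by query operations at each cache level. It has three steps: basic energy measurement, unit energy quantification and verification, and actual energy characterization. Specifically, these comprise a complete feature-vector representation and quantitative model for the actual energy consumption of query operations, measurement of basic energy data using benchmark sets that each access a single cache level, and a quantification and verification method that converts those data into the energy of a unit data movement operation. Through experiments, the paper reveals a valuable finding: reads and writes to the L1D cache (Level-one Data Cache) are the main energy bottleneck of data movement in database query operations, accounting for 47.5% of total energy. This pattern differs from general computing tasks, is distinctive to database applications, holds for the vast majority of query tasks, and offers high optimization potential.
Junxu Liu (Privacy Group)		Security and Privacy Issues in Machine Learning Abstract: Today, machine learning is the foundation of many mainstream computing technologies, including natural language processing and image and speech recognition. Many large Internet companies in China and abroad have deployed machine learning models and their interfaces on their cloud platforms, allowing users to train machine learning models and run queries using cloud resources. This is “machine learning as a service”, as in Microsoft Azure Machine Learning (Azure ML), Amazon Machine Learning (Amazon ML), and Google Prediction API. Depending on how open a company makes its models, “machine learning as a service” falls into white-box and black-box modes: in white-box mode, users can download a trained model and deploy it locally; in black-box mode, users know nothing about the model's structure or parameters and can only run queries by calling an interface. Out of concern for data security, privacy, and commercial interests, most companies offer black-box services, yet attacks on the security and privacy of machine learning models keep emerging. As “machine learning as a service” spreads, the harm to personal data privacy and corporate interests could be devastating if left unchecked. This report systematically reviews, from the privacy and security perspectives in turn, the major machine learning attack models and algorithms that have drawn attention in recent years, and compares their characteristics.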
2018.09.30 Wing Building for Science Complex
Minjie Zhu (Privacy Group)		The privacy problem of mobile application and compliance practice for GDPR Abstract: Service providers of mobile applications collect large amounts of user data through App permissions. How to quantify these privacy risks is a major challenge nowadays. Based on previous related studies, this paper proposes a permission-based quantification model of App privacy risk under each category. By defining the importance, sensitivity, and usage rate of each permission, we calculate the privacy risk value of each App, which can also be used to distinguish normal Apps from malicious Apps. The experimental results indicate the availability and effectiveness of the quantitative model. The General Data Protection Regulation (GDPR) came into force on 2018-05-25, which has a huge impact on the protection of user privacy. According to our studies, the compliance improvement in mobile application development after GDPR is mainly reflected in three aspects: privacy policy, privacy dashboard, and data transparency of third-party developers.
Zhiqiang Duan (Cloud Group)		Continuous Cross Match in Large-scale Dynamic Astronomical Data Flow Abstract: In modern astronomy, Short-Timescale and Large Field-of-view (STLF) sky surveys produce large volumes of data and face a great challenge in cross identification. Furthermore, transient survey projects are required to select candidates quickly from large volumes of data. However, traditional cross identification methods didn't satisfy the observation needs of transient surveys. We present a fast and efficient cross identification system for large-scale astronomical data streams. By receiving a high-frequency star catalog and maintaining a local star catalog, the system partitions the star catalog and cross-identifies it with the object catalog. Additionally, our system shows good performance in low-latency processing of large-volume astronomical data.
2018.09.21 Wing Building for Science Complex
Qingqing Ye (Privacy Group)		Graph Analysis with Local Differential Privacy Abstract: A large amount of valuable information resides in decentralized graphs, where no entity has access to the complete graph structure. Instead, each node maintains locally a limited view of the graph. In this report, we first propose a graph perturbation mechanism to ensure local differential privacy (LDP) of individuals, based on which we design two use cases to mine useful information from graphs. First, we propose an algorithm to estimate the local clustering coefficient, which is an important metric in graph data. This can effectively improve the usability of the method proposed in CCS'17. Second, we design a solution for privacy-preserving community detection, which does not have to iteratively access the original data, thus improving the accuracy.
Shuo Wang (Web Group)		ParaGraphE: A Library for Parallel Knowledge Graph Embedding Abstract: Knowledge graph embedding aims at translating the knowledge graph into numerical representations by transforming the entities and relations into continuous low-dimensional vectors, but existing single-thread implementations are time-consuming for large-scale knowledge graphs. This paper designs a unified lock-free parallel framework to parallelize these methods, which achieves a significant time reduction without influencing the accuracy.
2018.06.14 Wing Building for Science Complex
Yongjie Du (Cloud Group)		AstroSpark: A Unified Astronomical Big Data Processing Engine over Spark Abstract: The next decade promises to be an exciting time for astronomers. Large volumes of astronomical data are continuously collected from highly productive space missions. This data has to be efficiently stored and analyzed in such a way that astronomers maximize their scientific return from these missions. In this talk, we present AstroSpark, a distributed data server for astronomical data. AstroSpark introduces effective methods for efficient astronomical query execution on Spark through data partitioning with HEALPix and a customized optimizer. AstroSpark offers a simple, expressive, and unified interface through ADQL, a standard language for querying databases in astronomy. Experiments have shown that AstroSpark is effective in processing astronomical data, scalable, and outperforms the state of the art.
Shuo Wang (Web Group)		ParaGraphE: A Library for Parallel Knowledge Graph Embedding Abstract: Knowledge graph embedding aims at translating the knowledge graph into numerical representations by transforming the entities and relations into continuous low-dimensional vectors, but existing single-thread implementations are time-consuming for large-scale knowledge graphs. This paper designs a unified lock-free parallel framework to parallelize these methods, which achieves a significant time reduction without influencing the accuracy.
2018.05.31 Wing Building for Science Complex
Xinle Wu (Web Group)		Relation extraction based on deep learning Abstract: Relation extraction methods based on deep learning achieve the best results on open datasets. We will introduce these methods from different perspectives, including the pipeline method, the end-to-end method, and the distant supervision method.
2018.05.24 Wing Building for Science Complex
Liu Lixin (Privacy Group)		Data Transparency and Blockchain Abstract: Big data has become the core resource of the information society. At the same time, it also brings problems such as privacy leakage, data manipulation, data abuse, and algorithmic "black boxes". The fundamental way to solve these problems is to improve data transparency in big data processing, so as to strengthen the supervision of data use, which involves cross-disciplinary research in law and computer science. Blockchain-based data transparency technology allows data to be recorded on the blockchain at every step of data recording, sharing, analysis, and deletion, facilitating the realization of big data value and data accountability. This paper proposes data transparency, including transparent data recording, transparent data sharing, transparent data analysis, and transparent data deletion, and summarizes and analyzes research progress in each. Finally, it outlines the future development direction of data transparency technology.
Rihui Xin (Cloud Group)		Jupyter Notebook Introduction Abstract: Jupyter Notebook is a web-based application for interactive computation. It can be applied to the whole computation process: development, documentation, running code, and displaying results. A tool that integrates programming and result display offers a new working experience for everyone. This report will start with several examples and demonstrate Jupyter Notebook skills live. I hope this demonstration will enable you to learn the basics of using Jupyter Notebook.

2018.05.18 Wing Building for Science Complex
Zhiqiang Duan (Cloud Group)	Replication-based State Management in Distributed Stream Processing Systems Abstract: Storm's state management is achieved by a checkpointing framework, which commits states regularly and recovers lost states from the latest checkpoint. However, this method involves a remote data store for state preservation and access, resulting in significant overheads to the performance of error-free execution. This talk covers E-Storm, a replication-based state management system that actively maintains multiple state backups on different worker nodes.
Qingqing Ye (Privacy Group)	Attacks on Machine Learning Model Abstract: Machine learning (ML) models may be deemed confidential due to their sensitive training data, commercial value, or use in security applications. Increasingly often, confidential ML models are being deployed with publicly accessible query interfaces. The tension between model confidentiality and public access motivates the investigation of model inversion and extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to steal the training data or directly duplicate the functionality of (i.e., "steal") the model. In this report, these attacks will be demonstrated and some basic countermeasures will be presented.
2018.05.10 Wing Building for Science Complex
Yi Zhang (Web Group)	Report of XLDB2018 Abstract: The 11th Extremely Large Databases Conference (XLDB2018) was held in California, USA, April 30 - May 2, 2018. This year, special consideration was given to large-scale data management for Machine Learning and Artificial Intelligence in production use cases. "KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building", a research project of our laboratory, was presented as a lightning talk at the conference.

2018.04.28 Wing Building for Science Complex
Wenmei Wu (Cloud Group)	A Hybrid NVM-DRAM Storage Engine for fast data Recovery Abstract: Many applications need to respond to customers quickly, which requires an efficient storage system. In-memory key-value storage is used by many applications. However, due to the limitations of DRAM itself, DRAM cannot be deployed on a large scale. Thus we must introduce new memory. This report explains how to build a hybrid NVM-DRAM storage engine for fast data recovery.
Du Zhi Juan (Web Group)	The Methods of Knowledge Graph Embedding Based on Convolution Abstract: For knowledge graph embedding, a deep model can capture more features than a translation or bilinear model. This report presents the ConvE and ConvKB methods, both from recent 2018 papers and both based on convolutional neural networks. ConvE models the knowledge graph with 2D convolutional embeddings and multi-layer non-linear features. It uses the following techniques to improve performance: 1-N fast scoring, multi-layer non-linear features, batch normalization, and dropout. However, it is a very simple convolutional model that can only capture local relations. To this end, ConvKB uses a convolutional neural network to capture the global relations and transformation features between entities and relations in the knowledge graph.
2018.04.19 Wing Building for Science Complex
Yi Zhang (Web Group)	KGBuilder: A System for Large-Scale Scientific Domain Knowledge Graph Building Abstract: In the scientific domain, cutting-edge creations and discoveries are often openly accessible as texts on the Web, in papers, and in other carriers. Decentralized and unordered, they are hard to follow. Through information extraction and reorganization, a knowledge graph can help professionals follow the latest scientific facts and discover unknown ones more effectively. Therefore, we take microbiology as an example to show the process of large-scale scientific domain knowledge graph building with KGBuilder. Taking texts as input, KGBuilder creates a domain knowledge graph in 3 steps: identifying entities, relations, and new links by named entity recognition, relation extraction, and link prediction respectively. By combining BiLSTM, CRF, and probabilistic methods, domain knowledge is added to named entity recognition in a simple and extensible way. Its relation extraction can automatically generate large amounts of annotated data and extract features via distant supervision and neural networks, reducing annotation cost. Inspired by Question Answering and the rectilinear propagation of light, KGBuilder puts forward TransMT to deal with the head-tail unbalanced problem in link prediction, performing better than traditional models, especially translation ones.
Junxu Liu (Privacy Group)	Case Analysis of Facebook's Data Abuse and a State-of-the-Art Framework for Anti-profiling Abstract: This report covers two main topics. First, I will briefly introduce the Facebook-Cambridge Analytica data analysis and political advertising uproar, and then show the main steps of the process from data to personalized advertising. Experiments show that a wide variety of people's personal attributes can be automatically and accurately inferred from their Facebook Likes, and computer-based models may know you better than your friends do. Second, I will introduce a state-of-the-art framework for anti-profiling. The framework can not only reconcile privacy and user utility but also control their trade-off. With this approach, the service provider sees only an intermediate layer consisting of many Mediator Accounts (MAs) instead of the real users, so that users' personal information is protected and accurate profiles cannot be produced.
2018.04.12 Wing Building for Science Complex
Qing Tang (Cloud Group)	Application of Scientific Workflow in Software Defined Abstract: The Aserver+ system introduces the concept of Software Defined, which aims to combine the data dependencies between complex applications and programs, to control each part under constraints of time, space, and resources, and to provide an automated management platform for scientific data management, analysis, and visualization. This report briefly explains the overall design of Software Defined and the design of each module, focusing on the front-end scheduling system based on the BPEL language.
Yujie Shao (Web Group)	Latent Dirichlet Allocation Topic Model Abstract: LDA (Latent Dirichlet Allocation) is an unsupervised machine learning model that uses the bag-of-words model, in which each article is represented as a word vector. LDA is a document generation model. It assumes that an article has multiple topics and that each topic corresponds to different words. To construct an article, first choose a topic with a certain probability, then choose a word under that topic with a certain probability; this generates the first word of the article. The process is repeated until the entire article is generated. There is no ordering among the words.
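The generative process described in the LDA abstract above (pick a topic, then pick a word under that topic, and repeat) can be sketched as follows. This is a simplified illustration under stated assumptions: full LDA draws each document's topic proportions from a Dirichlet prior, whereas this sketch takes them as fixed inputs. The function name `generate_document`, its parameters, and the toy topics are all hypothetical.

```python
import random


def generate_document(topic_probs, topic_word_probs, length, seed=0):
    """Generate a bag of words by repeated topic-then-word sampling.

    topic_probs:      {topic: probability} for this document (assumed given;
                      full LDA would sample these from a Dirichlet prior).
    topic_word_probs: {topic: {word: probability}} for each topic.
    length:           number of words to generate.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    topics = list(topic_probs)
    words = []
    for _ in range(length):
        # Step 1: choose a topic with a certain probability.
        topic = rng.choices(topics, weights=[topic_probs[t] for t in topics])[0]
        # Step 2: choose a word under that topic with a certain probability.
        vocab = list(topic_word_probs[topic])
        weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    # Word order carries no meaning in the bag-of-words view.
    return words


if __name__ == "__main__":
    topic_probs = {"databases": 0.7, "privacy": 0.3}
    topic_word_probs = {
        "databases": {"index": 0.5, "query": 0.5},
        "privacy": {"leak": 0.6, "consent": 0.4},
    }
    print(generate_document(topic_probs, topic_word_probs, length=8))
```

Inference in LDA runs this story in reverse: given only the words, it estimates the topic proportions and per-topic word distributions that most plausibly produced them.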
2018.04.08 Wing Building for Science Complex
Zhu Minjie (Privacy Group)	Mobile Privacy Survey: Evaluating App Privacy and User Privacy Solution Abstract: Privacy has become a key concern for smartphone users as many apps tend to access and share sensitive data. Research uses three main approaches to survey the status of sensitive data collection on mobile phones: permission analysis, static code analysis, and dynamic analysis. Since a mobile privacy violation was later defined as collecting sensitive data without the user's consent, permission-based and privacy-policy-based analysis methods have been proposed to evaluate privacy leakage. At the same time, a few privacy-preserving techniques are offered to prevent the data collection process or anonymize sensitive information. More recently, the Local Differential Privacy method applies differential privacy on small mobile devices to protect user privacy.
Zehui Hao (Web Group)	Techniques of Man-Machine Dialogue Abstract: Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components, and typically this involves either a large amount of handcrafting or acquiring labelled datasets. In this talk, I will first explain the common components of task-oriented dialogue. Then, I will introduce a neural network-based text-in, text-out end-to-end trainable dialogue system. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. Last, I will show a demo using a man-machine dialogue training platform and a speech recognition system.
2018.03.29 Wing Building for Science Complex
Wu Yongtai (Web Group)	Traversing Knowledge Graph in Vector Space without Symbolic Space Guidance Abstract: Recent work on finding missing facts from the observed facts in a knowledge base shows the importance of learning multi-step relations in vector space. The report covers the Compositionalization model and the Implicit ReasoNets (IRNs) model for knowledge base completion, and compares the two models.
Chen Yang (Cloud Group)	An Approach to Quantify The Resource Impact in Big Data Systems Abstract: The performance of big data systems is always affected simultaneously by the CPU, memory, disk, and network. Quantifying these resource impacts is important for bottleneck analysis. However, existing approaches do not provide comparable quantitative impacts for the four major resources. Although some work on specific resources, the results are error-prone. In this talk, we present an approach that addresses this problem by isolating the resource impact when observing the performance variation. Our approach is general-purpose because it requires no knowledge of execution frameworks. We have developed two high-level end-to-end performance models to build new performance metrics, which can normalize the performance variation into the resource impact. The first is a general performance model built to capture the performance of big data systems, which ensures that our methodology is general-purpose. The other uses the speedups obtained by the system to evaluate the impact factors of the four major resources.
2018.03.22 Wing Building for Science Complex
Xin Yang (Privacy Group)	A Novel Security Framework for Managing Android Permissions Using Blockchain Technology Abstract: The Android system still holds the dominant position in the market, which is largely attributed to its being open source. The number of Google Apps reached 27 million in 2016. Its popularity has also made it the target of many malicious software attacks. Google has made many efforts in response, from the Linux security model to letting users manage their own permissions in Android 6.0, but loopholes remain. This report presents a brand-new framework that uses the decentralized, self-controlled, non-destructive, open, and transparent features of blockchain to manage Android system permissions better and more effectively.
Shuo Wang (Web Group)	Emotion Dictionary Applied in Text and Word Embedding for Sentiment Analysis Abstract: The report mainly covers the following: (1) emotional analysis of Shakespeare's dramas with an emotion dictionary; (2) emotional analysis applied to world-famous works; (3) sentiment analysis of Sina Weibo datasets with an emotion dictionary; (4) the application of an emotion dictionary in word embedding learning.
2018.03.15 Wing Building for Science Complex
Qingqing Ye (Privacy Group)	Graph Data Release with Local Differential Privacy Abstract: A large amount of valuable information resides in decentralized social graphs, where no entity has access to the complete graph structure. Instead, each user maintains locally a limited view of the graph. In this report, we mainly investigate techniques to ensure local differential privacy (LDP) of individuals while collecting structural information and generating representative synthetic social graphs. We first demonstrate the importance of calibration in LDP methods, and then we present the details of the existing BTER model. To overcome the drawback of the existing solution based on BTER, i.e., LDPGen, we propose to combine the perturbed node degrees and neighbor list to generate a more accurate synthetic graph based on BTER.
Liu Lixin (Privacy Group)	Blockchain and decentralized data storage Abstract: Right now, our data, including web data, Internet of Things data, and our files, are collected and stored by third-party services. This rests on the assumption that we have to trust those third-party services. At the same time, we have lost ownership of our data, and there are single points of failure and data silos. Decentralized storage based on blockchain allows individuals to control their own data and addresses single points of failure and data silos. This presentation mainly introduces several decentralized storage and sharing systems.

Maintained by WAMDM Administrator
