2018.12.28 FL1, Wing Building for Science Complex |
杜永杰 (Cloud Group) |
Data Exploration with SQL using Machine Learning Techniques Abstract: Nowadays data scientists have access to gigantic amounts of data, much of it accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases having a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique to help data scientists formulate SQL queries and explore their big data rapidly and intuitively, while keeping user input at a minimum, with no manual tuple specification or labeling. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts, and negative examples to those they do not want. The initial query is reformulated using machine learning techniques, and a new query, more efficient and diverse, is obtained. |
朱敏杰 (Privacy Group) |
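Research on Mobile Application Evaluation and the Design and Implementation of the AppPrivacy System Abstract: As a way of reflecting how complex phenomena compare and change across dimensions, indices are now used in every industry. In the privacy and security field, Gemalto, a global leader in digital security, has published the "Breach Level Index" since 2013, revealing the severity of data breach incidents worldwide. To address data leakage in the use of mobile applications (Apps), the China Consumers Association carried out an evaluation of personal information collection and protection in Apps from August to October 2018, counting and scoring the personal information collection described in each App's privacy policy. Given the severity of the privacy problems caused by user data leakage in the mobile environment, and building on previously proposed privacy risk quantification models for data collectors (developers) and data owners (users), we design AppPrivacy, a quantitative privacy risk assessment system for mobile applications. The system aims to reveal and assess the extent of user privacy leakage in mobile application scenarios. It analyzes the data generated or obtained during App use by three kinds of objects, namely data owners (users), data collectors (developers), and Apps, and then assesses the privacy risks they face or cause. |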
2018.12.21 FL1, Wing Building for Science Complex |
吴新乐 (Web Group) |
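The Development and Current State of Pre-training Techniques in NLP Abstract: Pre-trained models are already widely used in the image and video domains, but in NLP they long served only as an auxiliary means of improving task-specific models, as with the Word2Vec tool. Only with the recent successive release of models such as ELMo, GPT, and BERT has the value of pre-training in NLP been fully demonstrated. My talk will introduce some representative work in this area to illustrate the development and current state of pre-training techniques in NLP. |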
2018.12.14 FL1, Wing Building for Science Complex |
马超红 (Cloud Group) |
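A Summary of Machine-Learning-Based Databases Abstract: In recent years, the success of data-driven machine learning applications has prompted the database community to integrate machine learning techniques into the design of database systems and applications. This success brings research opportunities to the database field, and also challenges for its development. Since the 1970s the database community has been devoted to system optimization and large-scale data-driven applications, so databases and machine learning are naturally closely related, and combining the two fields will greatly advance big-data-driven applications. Traditional database problems such as indexing, transactions, and storage management correspond to mappings between keys and positions, and database management systems also face problems such as tuning and workload prediction; all of these create opportunities for applying machine learning, especially deep learning, in database research. This talk summarizes recent research on machine-learning-based databases in five parts: 1) machine-learning-based index structures; 2) machine-learning-based query optimization; 3) machine-learning-based database configuration; 4) automated database management systems; 5) others. |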
刘立新 (Privacy Group) |
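A Summary of Data Sharing Transparency Abstract: Data sharing is a key step in realizing the value of big data. However, the current lack of transparency in the data sharing process significantly affects the realization of that value. On the one hand, big data may contain a large amount of personal private information, and when privacy is leaked, tracing and accountability are difficult. On the other hand, big data is the basis of data-driven decision-making, and sharing data among multiple parties calls its trustworthiness into question. Achieving transparent data sharing, so that accountability tracing and data provenance are possible when necessary, has become an urgent problem, and the decentralization and tamper resistance of blockchain offer a new approach to it. This paper first analyzes attack models in data sharing. It then proposes a data sharing transparency model and analyzes and summarizes the current state of research on it. Finally, it summarizes existing work. |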
2018.12.07 FL1, Wing Building for Science Complex |
艾山 (Web Group) |
Adversarial training for joint entity and relation extraction Abstract: Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations to the training data. We show how to use AT for the tasks of entity recognition and relation extraction. In particular, we demonstrate that applying AT to a general-purpose baseline model for jointly extracting entities and relations improves the state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data). Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing, POS tagging, relation extraction, translation, and joint tasks. However, Szegedy et al. (2014) observed that intentional small-scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence). Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model. Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017)), this paper is, to the best of our knowledge, the first attempt to investigate the regularization effects of AT in a joint setting for two related tasks. |
吴永泰 (Web Group) |
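ScholarFinding System Progress Report Abstract: Relation discovery uses existing knowledge in a knowledge graph to infer unknown knowledge. Data is organized into formats such as RDF and stored in a database; through a relation discovery system, users discover relations among entities of interest and explore unknown relations. Existing relation discovery systems such as RelFinder and RECAP are relatively mature and in effective use. ScholarFinding is a relation discovery system built on scholars, universities and institutions, journals, and academic papers in the computer science field in China. The system is scholar-centered and based on the existing ScholarSpace dataset. By traversing the existing data, it obtains the known relations between scholars and scholars, scholars and institutions, scholars and papers, and scholars and journals, and presents them through front-end visualization and interaction, so that users can obtain relation information about the scholars they are interested in. |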
2018.11.17 FL1, Wing Building for Science Complex |
王硕 (Web Group) |
Graph Convolutional Network Abstract: Many scientific fields study data whose underlying structure is a non-Euclidean space. Examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. Geometric deep learning is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds. Based on a paper on graph convolutional networks, this talk gives a preliminary introduction to and summary of this family of techniques. |
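The abstract does not name the paper it is based on, so the following is only a minimal sketch of one graph convolution layer, assuming the widely used propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, D the degree matrix of A + I, H the node features, and W a weight matrix. The function name `gcn_layer` and the plain-list matrix representation are illustrative choices, not taken from the talk.

```python
import math


def gcn_layer(adj, features, weights):
    """One graph convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj      -- n x n adjacency matrix (list of lists of floats)
    features -- n x f node feature matrix H
    weights  -- f x k weight matrix W (assumed given; no training here)
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Degree of each node in the self-loop graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: entry (i, j) divided by sqrt(d_i * d_j).
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))]
                for i in range(len(x))]

    hidden = matmul(matmul(norm, features), weights)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in hidden]


if __name__ == "__main__":
    # Two connected nodes with one feature each and an identity weight.
    print(gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[1.0]]))
```

Each output row is a degree-normalized average over a node and its neighbors, which is why stacking such layers lets information spread along the graph.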
2018.11.10 FL1, Wing Building for Science Complex |
马超红 (Cloud Group) |
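Machine-Learning-Based Database Query Optimization Abstract: Query optimization is one of the most important and well-studied problems in the database field; a relational database must be well optimized to achieve acceptable performance. Traditional query optimization relies on heuristics that have been carefully tuned and intricately designed from years of database developers' experience, and these heuristics usually require a professional DBA to tune them on each individual DBMS to improve query performance. Traditional optimizers are also "fire and forget": future optimization does not make use of the observed performance of previously executed plans, so the query optimizer cannot systematically "learn from its mistakes". Machine-learning-based query optimization combines deep learning with query optimization: it trains a deep neural network to imitate a traditional optimizer, tunes the model automatically by learning from previously executed query plans, assigns a reward to each subquery, and, through the interaction between agent and environment, chooses low-cost actions to generate the query execution plan. |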
艾山 (Web Group) |
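Fine-Grained Sentiment Analysis of User Reviews Abstract: Fine-grained sentiment analysis of online reviews is of great value for deeply understanding merchants and users and for mining user sentiment, and it is very widely applied in the Internet industry, mainly for personalized recommendation, intelligent search, product feedback, and business security. The dataset covers sentiment polarity for 20 fine-grained aspects in 6 major categories. An algorithm must be built from the labeled sentiment polarity of the fine-grained aspects, a model trained, and predictions made. This is a text classification problem, so this work studies the relevant classification methods and identifies the following: 1. sklearn SVM; 2. OneVsRestClassifier; 3. XGBoost; 4. machine learning combined with a sentiment lexicon. Models are built and trained with these methods and their results compared. |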
2018.11.02 FL1, Wing Building for Science Complex |
杜永杰 (Cloud Group) |
Real-Time Query Enabled by Variable Precision in Astronomy Abstract: As sky survey projects come online, petabytes and exabytes of astronomical data are continuously collected from highly productive space missions. In time-domain astronomy especially, Short-Timescale and Large Field-of-view (STLF) sky surveys not only require real-time analysis of short-time data, but also need precise astronomical data for special phenomena. Additionally, it is important to find a partition method and build an index on it for effective storage and query. However, existing methods cannot simultaneously support real-time and variable-precision queries in astronomy. In this paper, we propose a novel astronomical real-time and variable-precision query method based on data partitioning with Hierarchical Equal Area isoLatitude Pixelation (HEALPix for short). Our method estimates query time with a model and predicts precision with machine learning, so that it can accurately predict the HEALPix partition level and effectively reduce query time by querying layer by layer. The method can meet users' requirements for real-time and variable-precision queries. The experimental results show that our method improves on previous query strategies and achieves better performance. |
刘立新 (Privacy Group) |
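A Survey of Data Transparency Techniques Abstract: Big data holds enormous value and has become a core resource of the information society, but realizing that value has also brought problems such as privacy leakage, data manipulation, data abuse, and algorithmic "black boxes". The root cause of these problems is the opacity of the process by which the value of big data is realized, together with the regulatory difficulty that results from its characteristics; people urgently want this process to be transparent and verifiable. Blockchain is open, transparent, and tamper-resistant, and it has gradually been applied in every stage of the big data life cycle to make the value realization process more transparent and to promote accountable use of big data. This paper summarizes and compares research progress on blockchain in the data acquisition, sharing, analysis, and deletion stages, and finally discusses future directions for data transparency techniques. |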
2018.10.26 FL1, Wing Building for Science Complex |
Xin Yang (Privacy Group) |
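An Overview of Blockchain Technology and the Implementation of a Voting System Abstract: Blockchain is a bookkeeping technology, also known as Distributed Ledger Technology, that is jointly maintained by multiple parties, uses cryptography to secure transmission and access, and provides consistent storage, tamper resistance, and non-repudiation of data. As a new computing paradigm and collaboration model for building trust at low cost in untrusted, competitive environments, blockchain, with its unique trust-building mechanism, is changing the application scenarios and operating rules of many industries, and is one of the indispensable technologies for developing the digital economy and building a new trust system. To understand blockchain further, an Ethereum-based voting system was developed, exploring the technology in depth from the smart contract perspective and identifying research topics along the way. |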
Yi Zhang (Privacy Group) |
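Large-Scale Embedding Improved Through Optimization Algorithms Abstract: Embedding is the bridge between multimodal data (including text, images, audio, and video) and machine learning algorithms. The scale of data in the big data era challenges existing embedding methods. From the initial embedding to static batch updates, then to dynamic updates and high-frequency dynamic updates, each case faces challenges to some degree. Existing solutions for large-scale embedding mostly approach the problem through data partitioning and the merging of similar data. However, every embedding process depends on an iterative procedure based on an optimization algorithm. This talk therefore first illustrates the data partitioning and similar-data merging methods with examples; then, based on personal reflection, analyzes the large-scale embedding problem from the angle of improving the optimization algorithm; next, reports on the existing Weighted SGD and Diffused SGD methods; and finally summarizes the gaps, similarities, and differences between these ideas and existing models, laying the groundwork for future work. |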
2018.10.19 Wing Building for Science Complex |
Xinle Wu (Web Group) |
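Implementation and Automation of Relation Extraction Models Abstract: The pipeline approach is a simple and effective way to implement a relation extraction model, but its performance on public datasets is often unsatisfactory. This talk covers the class imbalance problem and the long-distance dependency problem within sentences encountered when implementing the relation classification sub-model, and the focal loss function and self-attention mechanism used to address them, respectively. Finally, I will share some survey work on turning relation extraction models into a system, along with my goals and plans. |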
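The abstract names focal loss as the remedy for class imbalance without giving details. As a minimal sketch, assuming the standard definition FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class, the snippet below shows how the loss shrinks the contribution of easy, well-classified examples. The function names and the default gamma = 2.0 are illustrative assumptions, not settings reported in the talk.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy for one example; p_true is the predicted
    probability of the correct class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example: -alpha * (1 - p_true)**gamma * log(p_true).

    gamma = 0 recovers (alpha-weighted) cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    # Clamp to avoid log(0) on degenerate predictions.
    p_true = min(max(p_true, 1e-12), 1.0)
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, cross_entropy(p), focal_loss(p))
```

With gamma = 2, an example predicted at 0.9 keeps only (1 - 0.9)^2 = 1% of its cross-entropy loss, while one predicted at 0.1 keeps 81%, so training gradients are dominated by hard examples, which in an imbalanced relation classification setting tend to come from the rare classes.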
2018.10.12 Wing Building for Science Complex |
Chen Yang (Cloud Group) |
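Energy Analysis and Optimization of Data Movement in Databases Abstract: In contemporary computer architectures, multi-level caches alleviate the memory wall caused by the widening speed gap between compute and storage components, but they cause data to move frequently between cache levels, resulting in high energy consumption. In databases, queries, as the core operation, are a typical data-intensive computing task, and the energy cost of data movement is especially large; this not only wastes energy but also limits the implementation of complex query operations in embedded environments. However, no existing work has evaluated in depth the energy consumption characteristics of query operations in databases. To address this, this paper proposes an energy characterization method that quantifies the energy consumption of query operations at different cache levels. It consists of three steps: basic energy measurement, unit energy quantification and verification, and actual energy characterization. Specifically, these comprise a complete feature-vector representation and quantification model for the actual energy consumption of query operations, the measurement of basic energy data with a benchmark set in which each benchmark accesses a single cache level, and a quantification and verification method that converts those measurements into the energy of a unit data movement operation. Through experiments, this paper reveals a valuable finding: reads and writes to the L1D cache (Level-one Data Cache) are the main energy bottleneck of data movement in database query operations, accounting for 47.5% of total energy consumption. This pattern differs from general computing tasks; it is specific to database applications, holds for the vast majority of query tasks, and offers high optimization potential. |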
Junxu Liu (Privacy Group) |
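Security and Privacy Issues in Machine Learning Abstract: Today, machine learning is the cornerstone of many mainstream computing technologies such as natural language processing and image and speech recognition. Many large Internet companies in China and abroad have deployed machine learning models and their interfaces on their own cloud platforms, allowing users to train machine learning models and run queries with cloud resources, that is, "machine learning as a service"; examples include Microsoft Azure Machine Learning (Azure ML), Amazon Machine Learning (Amazon ML), and Google Prediction API. Depending on how open a company makes its models, "machine learning as a service" can be divided into white-box and black-box modes: in white-box mode, users can download the trained model and deploy it locally; in black-box mode, users know nothing about the model's structure or parameters and can only run queries through the interface. Out of concern for data security, privacy, and commercial interests, companies mostly provide services in black-box mode, yet attack models targeting the security and privacy of machine learning models keep emerging. As "machine learning as a service" becomes ever more widespread, the damage to personal data privacy and corporate interests could be devastating if left unchecked. This talk gives a fairly comprehensive and systematic review, from the perspectives of privacy and of security, of several machine learning attack models and their algorithms that have attracted attention in recent years, and compares the characteristics of the different attack models. |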
2018.09.30 Wing Building for Science Complex |
Minjie Zhu (Privacy Group) |
The privacy problem of mobile application and compliance practice for GDPR Abstract: Service providers of mobile applications collect large amounts of user data through App permissions. How to quantify the resulting privacy risks is a major challenge. Based on previous related studies, this paper proposes a permission-based quantification model of App privacy risk for each App category. By defining the importance, sensitivity, and usage rate of each permission, we calculate the privacy risk value of each App, which can also be used to distinguish normal Apps from malicious ones. The experimental results indicate the availability and effectiveness of the quantitative model. The General Data Protection Regulation (GDPR) came into force on 2018-05-25 and has a huge impact on the protection of user privacy. According to our studies, the compliance improvement in mobile application development after GDPR is mainly reflected in three aspects: privacy policy, privacy dashboard, and data transparency of third-party developers. |
Zhiqiang Duan (Cloud Group) |
Continuous Cross Match in Large-scale Dynamic Astronomical Data Flow Abstract: In modern astronomy, Short-Timescale and Large Field-of-view (STLF) sky surveys produce large volumes of data and face a great challenge in cross identification. Furthermore, transient survey projects are required to select candidates quickly from large volumes of data. However, traditional cross identification methods do not satisfy the observation requirements of transient surveys. We present a fast and efficient cross identification system for large-scale astronomical data streams. By receiving a high-frequency star catalog and maintaining a local star catalog, the system partitions the star catalog and cross-identifies it against the object catalog. Additionally, our system shows good performance in low-latency processing of large-volume astronomical data. |
2018.09.21 Wing Building for Science Complex |
Qingqing Ye (Privacy Group) |
Graph Analysis with Local Differential Privacy Abstract: A large amount of valuable information resides in decentralized graphs, where no entity has access to the complete graph structure. Instead, each node locally maintains a limited view of the graph. In this report, we first propose a graph perturbation mechanism to ensure local differential privacy (LDP) for individuals, based on which we design two use cases for mining useful information from the graph. First, we propose an algorithm to estimate the local clustering coefficient, an important metric in graph data. This can effectively improve the usability of the method proposed in CCS'17. Second, we design a solution for privacy-preserving community detection that does not have to iteratively access the original data, thus improving accuracy. |
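The abstract does not specify its graph perturbation mechanism. As an illustration of the general idea only, the sketch below assumes a common LDP building block: each node applies randomized response to its own neighbor bit vector, keeping each bit with probability e^eps / (1 + e^eps), and the collector debiases aggregate counts. The names `perturb_neighbor_bits` and `estimate_degree` are hypothetical and this is not claimed to be the mechanism of the talk.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting a bit truthfully under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def perturb_neighbor_bits(bits, epsilon, rng=None):
    """Flip each bit of a node's neighbor list independently.

    bits    -- list of 0/1 flags, one per potential neighbor
    epsilon -- privacy budget (must be > 0); smaller means more noise
    rng     -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    p = keep_probability(epsilon)
    return [b if rng.random() < p else 1 - b for b in bits]


def estimate_degree(noisy_bits, epsilon):
    """Unbiased estimate of the true number of 1-bits.

    E[sum(noisy)] = d * p + (n - d) * (1 - p), solved for d.
    """
    p = keep_probability(epsilon)
    n = len(noisy_bits)
    return (sum(noisy_bits) - n * (1.0 - p)) / (2.0 * p - 1.0)


if __name__ == "__main__":
    true_bits = [1, 0, 1, 1, 0, 0, 0, 1]
    noisy = perturb_neighbor_bits(true_bits, epsilon=1.0, rng=random.Random(7))
    print(noisy, estimate_degree(noisy, epsilon=1.0))
```

Because the perturbation happens on each node before anything leaves it, the collector never sees a true neighbor list; statistics such as degree, and by extension metrics built on edge counts, have to be estimated from the debiased noisy reports.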
Shuo Wang (Web Group) |
ParaGraphE: A Library for Parallel Knowledge Graph Embedding Abstract: Knowledge graph embedding aims to translate a knowledge graph into numerical representations by transforming its entities and relations into continuous low-dimensional vectors, but existing single-threaded implementations of these methods are time-consuming for large-scale knowledge graphs. This paper designs a unified lock-free parallel framework to parallelize these methods, which achieves a significant time reduction without influencing the accuracy. |
2018.06.14 Wing Building for Science Complex |
Yongjie Du (Cloud Group) |
AstroSpark: A Unified Astronomical Big Data Processing Engine over Spark Abstract: The next decade promises to be an exciting time for astronomers. Large volumes of astronomical data are continuously collected from highly productive space missions. This data has to be efficiently stored and analyzed in such a way that astronomers maximize their scientific return from these missions. In this talk, we present AstroSpark, a distributed data server for astronomical data. AstroSpark introduces effective methods for efficient astronomical query execution on Spark through data partitioning with HEALPix and a customized optimizer. AstroSpark offers a simple, expressive, and unified interface through ADQL, a standard language for querying databases in astronomy. Experiments have shown that AstroSpark is effective in processing astronomical data, is scalable, and outperforms the state of the art. |
Shuo Wang (Web Group) |
ParaGraphE: A Library for Parallel Knowledge Graph Embedding Abstract: Knowledge graph embedding aims to translate a knowledge graph into numerical representations by transforming its entities and relations into continuous low-dimensional vectors, but existing single-threaded implementations of these methods are time-consuming for large-scale knowledge graphs. This paper designs a unified lock-free parallel framework to parallelize these methods, which achieves a significant time reduction without influencing the accuracy. |
2018.05.31 Wing Building for Science Complex |
Xinle Wu (Web Group) |
Relation extraction based on deep learning Abstract: Relation extraction methods based on deep learning achieve the best results on open datasets. We will introduce these methods from different perspectives, including the pipeline method, the end-to-end method, and the distant supervision method. |
2018.05.24 Wing Building for Science Complex |
Liu Lixin (Privacy Group) |
Data transparency and blockchain Abstract: Big data has become the core resource of the information society. At the same time, it also brings problems such as privacy leakage, data manipulation, data abuse, and algorithmic "black boxes". The fundamental way to solve these problems is to improve data transparency throughout the big data process, so as to strengthen supervision of the use of data, which involves cross-disciplinary research in law and computer science. Blockchain-based data transparency technology allows data to be recorded on the blockchain at every step of data recording, sharing, analysis, and deletion, facilitating the realization of big data value and data accountability. This paper proposes a notion of data transparency that includes transparent data recording, transparent data sharing, transparent data analysis, and transparent data deletion, and summarizes and analyzes research progress in each. Finally, it outlines future development directions for data transparency technology. |
Rihui Xin (Cloud Group) |
Jupyter Notebook Introduction Abstract: Jupyter Notebook is a web-based application for interactive computing. It can be applied to the whole computing workflow: development, documentation, running code, and displaying results. A tool that integrates program development with the display of results offers a new way of working. This report will start with several examples and demonstrate Jupyter Notebook techniques live. I hope this demonstration will enable you to learn the basics of using Jupyter Notebook. |