
2019.12.27 FL1, Wing Building for Science Complex |
|
段志强 (Cloud Group) |
Title: Forecasting Wavelet Transformed Time Series with Attentive Neural Networks Abstract: A time series is a sequence of data points listed in time order. Many real-life time series are driven by multiple latent components that occur at different frequencies. Existing solutions to time series forecasting fail to identify and discriminate these frequency-domain components. Inspired by recent signal processing and speech recognition techniques that decompose a time series signal into its time-frequency representation, a scalogram (or spectrogram), this paper proposes to explicitly expose frequency-domain information from a univariate time series using the wavelet transform, with the aim of improving forecasting accuracy. Based on the transformed data, we leverage different neural networks to capture local time-frequency features and the global long-term trend simultaneously. The experimental results on real time series show that our proposed approach achieves better performance than various baseline methods. |
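As a rough illustration of splitting a univariate series into frequency bands, the sketch below applies a multi-level orthonormal Haar transform in plain Python. The Haar wavelet, the function names, and the sample series are assumptions made for this example; the abstract does not say which wavelet or library the talk used, and a scalogram is normally built from a continuous wavelet transform.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return [detail_1, ..., detail_L, approx_L], finest scale first."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        bands.append(detail)
    bands.append(current)
    return bands

# Hypothetical series of length 8, decomposed into two detail bands and one trend band.
series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
bands = haar_decompose(series, levels=2)
```

Each detail band holds the fluctuations at one scale, and the final approximation band holds the slow trend. A forecasting model could consume the bands separately, which is the general idea the abstract describes.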
2019.12.20 FL1, Wing Building for Science Complex |
|
吴新乐 (Web Group) |
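Title: Research on Semi-supervised Multi-target Domain Adaptation Abstract: This talk addresses multi-target domain adaptation. Although many domain adaptation methods have been proposed, most of them focus on the single-source, single-target setting. In many real-world scenarios, however, knowledge must be transferred among several different domains. We propose a general self-boosting deep learning model for semi-supervised multi-target domain adaptation. It applies to semi-supervised scenarios with one source domain and multiple target domains, and it can exploit labeled and unlabeled data from the source domain and all target domains at the same time. Experiments on a standard sentiment analysis dataset demonstrate the effectiveness of the model. |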
杜永杰 (Cloud Group) |
Title: Time Series Forecasting with Temporal Attention Learning Abstract: Traditional methods rely on manually specified temporal dependencies to explore related patterns in historical data, which is unrealistic for forecasting long-term series on real-world data. Instead, we propose to explicitly learn representations of hidden patterns with deep neural networks and to attend to different parts of the history when forecasting the future. In this paper, we propose an end-to-end deep-learning framework for multi-horizon time series forecasting, with temporal attention mechanisms to better capture latent patterns in historical data that are useful in predicting the future. Forecasts of multiple quantiles on multiple future horizons can be generated simultaneously based on the learned latent pattern features. |
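Quantile forecasts are commonly trained with the pinball (quantile) loss. The sketch below shows that loss in plain Python. It is an assumption that the talk used this objective, since the abstract only states that multiple quantiles are predicted, and the function names are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for one quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be in (0, 1)")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)

def multi_quantile_loss(y_true, preds_by_quantile):
    """Average the pinball loss over a dict {quantile level: predictions}."""
    losses = [pinball_loss(y_true, preds, q) for q, preds in preds_by_quantile.items()]
    return sum(losses) / len(losses)
```

For a high quantile such as 0.9, predicting too low is penalized nine times as heavily as predicting too high, which pushes the forecast toward the upper tail of the distribution.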
2019.12.13 FL1, Wing Building for Science Complex |
|
郝新丽 (Cloud Group) |
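Title: Modeling Extreme Events in Time Series Prediction Abstract: Time series forecasting is an active research topic in many fields. Deep learning methods forecast considerably better than traditional statistical methods, but they still perform unsatisfactorily when the series contains extreme events. This talk introduces a method for improving the ability of deep learning to predict extreme events, built on two main components: extreme value theory from statistics and memory networks from machine learning. |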
王雷霞 (Privacy Group) |
Title: Answering Multi-Dimensional Analytical Queries under Local Differential Privacy Abstract: Multi-dimensional analytical (MDA) queries are often issued against a fact table with predicates on (categorical or ordinal) dimensions and aggregations on one or more measures. To handle a large class of MDA queries with different types of predicates and aggregation functions, several LDP encoders and estimation algorithms are proposed. In particular, the predicate constraints include range constraints, so this work also answers range queries under LDP. Analyzing these algorithms clarifies how errors arise in LDP. In summary, we find that this problem may be improved by ESA or other space partitioning techniques. |
2019.12.06 FL1, Wing Building for Science Complex |
|
王飞 (Web Group) |
Title: Domain Adaptation for Person-Job Fit with Transferable Deep Global Match Network |
2019.11.29 FL1, Wing Building for Science Complex |
|
叶青青 (Privacy Group) |
Title: An Extension: Key-Value Data Collection with Local Differential Privacy Abstract: Local differential privacy (LDP), where each user perturbs her data locally before sending it to an untrusted data collector, is a new and promising technique for privacy-preserving distributed data collection. The advantage of LDP is that it enables the collector to obtain accurate statistical estimates on sensitive user data (e.g., location and app usage) without accessing them. However, existing work on LDP is limited to simple data types, such as categorical, numerical, and set-valued data. Currently, PrivKV, which we proposed, is the only work on key-value data collection with local differential privacy. In this report, I will first briefly introduce some preliminaries of PrivKV and then present several extensions of it. Finally, I will briefly summarize, from my own experience, how to extend a conference paper to a journal version. |
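To make "each user perturbs her data locally" concrete, the sketch below uses binary randomized response, the textbook LDP mechanism. It is not PrivKV, which handles key-value pairs; the function names, the privacy budget of 1.0, and the simulated population are assumptions for illustration only.

```python
import math
import random

def perturb_bit(bit, epsilon, rng):
    """Report the true bit with probability p = e^eps / (e^eps + 1), else flip it."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_frequency(reports, epsilon):
    """Unbiased estimate of the fraction of users whose true bit is 1."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # E[observed] = f * (2p - 1) + (1 - p), solved here for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Simulated population (hypothetical): 30% of 10,000 users hold the bit 1.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
reports = [perturb_bit(b, 1.0, rng) for b in true_bits]
estimate = estimate_frequency(reports, 1.0)
```

The collector never sees a true bit, yet the corrected average over many reports approaches the true frequency. The estimation error grows as epsilon shrinks, which is the privacy and utility trade-off that LDP work tries to improve.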
马超红 (Cloud Group) |
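Title: Data Alchemist and learned index for spatial queries Abstract: Data structures play a central role in computer science, and their design appears in systems, algorithms, and many other areas. How can data structures for a specific application be designed automatically from basic building blocks? The key is to identify the basic elements of data structure design and to measure the performance impact of combining them. Based on the idea of automated data structure design, the Data Systems Laboratory at Harvard University proposed the Data Alchemist. This talk briefly introduces the concept of automated data structure design, reports on recent work on learned index design that draws on ideas from that paper, and presents the application of learned indexes to spatial queries. |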
2019.11.23 FL1, Wing Building for Science Complex |
|
王雷霞 (Privacy Group) |
Title: ESA: From Local to Central Differential Privacy via Shuffling Abstract: The large-scale monitoring of computer users’ software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. To protect user privacy in this setting, Google proposed a principled systems architecture, Encode, Shuffle, Analyze (ESA), for performing such monitoring with high utility while also protecting user privacy. ESA extends existing best-practice methods for sensitive-data analytics by using cryptography and statistical techniques to make explicit how data is elided and reduced in precision, how only common-enough, anonymous data is analyzed, and how this is done for only specific, permitted purposes. In this report, we introduce ESA and its implementation, PROCHLO, and discuss its advantages and disadvantages as well as the current state of its development. |
杜永杰 (Cloud Group) |
Title: Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads Abstract: The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. Among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads, including Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistic. |
2019.11.16 FL1, Wing Building for Science Complex |
|
郝新丽 (Cloud Group) |
Title: SSIM: A Deep Learning Approach for Recovering Missing Time Series Sensor Data Abstract: Missing data are unavoidable in wireless sensor networks, due to issues such as network communication outage, sensor maintenance or failure, etc. Although a plethora of methods have been proposed for imputing sensor data, limitations still exist. The SSIM uses the state-of-the-art sequence-to-sequence deep learning architecture, and the long short-term memory network is chosen to utilize both past and future information for a given time. Moreover, a variable-length sliding window algorithm is developed to generate a large number of training samples so the SSIM can be trained with small data sets. |
吴永泰 (Web Group) |
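Title: Knowledge Representation Learning Based on Translation-model and its Application on ScholarSpace Abstract: Knowledge graphs have many applications in areas such as intelligent question answering, personalized recommendation, and user profiling. A better knowledge representation of the graph improves the performance of these applications. Knowledge representation learning for knowledge graphs has evolved from traditional RDF triples to representations in a low-dimensional vector space, and the TransE family is a classic line of such methods. This talk covers the TransE family and its development, as well as the application of TransE in the ScholarSpace system. |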
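TransE models a relation as a translation in vector space, so that head + relation lies close to tail for a true triple. The sketch below shows the scoring function and the usual margin ranking loss in plain Python. The two-dimensional embeddings are hand-picked, hypothetical values, not output from ScholarSpace or from any trained model.

```python
import math

def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; smaller means the triple is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin ranking loss between a true triple and a corrupted one."""
    return max(0.0, margin + pos_score - neg_score)

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
beijing = [1.0, 0.0]
capital_of = [0.0, 1.0]
china = [1.0, 1.0]
france = [3.0, -1.0]

true_score = transe_score(beijing, capital_of, china)
corrupt_score = transe_score(beijing, capital_of, france)
loss = margin_loss(true_score, corrupt_score)
```

Training adjusts the embeddings until true triples score lower than corrupted ones by at least the margin, at which point the loss for that pair is zero.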
2019.11.08 FL1, Wing Building for Science Complex |
|
汤庆 (Cloud Group) |
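Title: Principles and Performance Testing of TDengine Abstract: The world is made of data: every object in it produces data at every moment, so managing this data efficiently is especially important. Time series databases are flourishing. “TDengine is an open-source big data platform optimized for the Internet of Things. Besides being a time series database 10 times faster than Hadoop, it also provides caching, stream computing, message queuing, and other features that reduce the complexity and cost of development and operations.” TDengine was open-sourced on July 12, 2019; more than 100,000 lines of C code, including the core storage engine and compute engine, were uploaded to GitHub. Within only two weeks the project had more than 7,300 stars and more than 1,800 forks on GitHub. This talk has two parts: a survey of the principles of the TDengine database, and a benchmark comparison between TDengine and the mainstream time series database InfluxDB, using a test set of 100 million time series records. |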
2019.11.01 FL1, Wing Building for Science Complex |
|
刘立新 (Privacy Group) |
Title: HealthChain: A Blockchain-Based Privacy-Preserving and Accountable Electronic Health System |
唐子立 (Cloud Group) |
Title: Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations |
2019.10.18 FL1, Wing Building for Science Complex |
|
段志强 (Cloud Group) |
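Title: Data Quality Control Strategies in AstroServ Abstract: This talk covers the strategies for controlling and assessing data quality in the AstroServ system. Real observation data contain so much noise and useless data that the accuracy of anomaly detection algorithms suffers greatly, so strict quality control of the observation data is required. The talk summarizes the strategies currently in use in AstroServ. |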
刘俊旭 (Privacy Group) |
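Title: Privacy-preserving Image Recognition Abstract: With the rapid growth of deep learning applications such as face recognition, the privacy and security problems caused by misuse of personal data have become a growing concern. Take image processing on mobile devices as an example. Because current technology is mostly based on cloud platforms, a developer who wants to use deep learning in an app usually first trains a usable model on a cloud platform (server), then builds a data channel between the model and the app, sends the new data generated while the user uses the app to the server, and returns the prediction. This process lets developers collect large amounts of data without the user's knowledge. An ideal framework for this problem does not send the user's real data to the server, yet keeps the main features of the data intact so that correct predictions can still be obtained, balancing privacy and utility. This talk discusses privacy-preserving image recognition in mobile scenarios and covers the following: 1. Using 3D face recognition on the iPhone as an example, the main strategies Apple adopts to achieve both privacy and efficiency; 2. For the more general problem of 2D image recognition (the face recognition scheme used in most domestic devices), an introduction to and comparison of existing privacy-preserving frameworks, divided into data-based and model-based methods. The former mainly include image encryption and image blurring; the latter mainly refer to model splitting and federated learning. |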
2019.10.11 FL1, Wing Building for Science Complex |
|
吴新乐 (Web Group) |
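Title: How to realize human-like intelligence? Brain-like computing or combination generalization Abstract: Current artificial intelligence is widely criticized for its dependence on large-scale labeled data and its lack of reasoning ability; it is mocked as mere curve fitting and remains far from true human-like intelligence. Which of brain-like computing and combinatorial optimization, then, is the key that opens the door to human-like intelligence? This talk offers a brief introduction to research on both, with the aim of prompting discussion. |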
马超红 (Cloud Group) |
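Title: Reducing the storage overhead of main-memory OLTP databases with hybrid-index and FITing-Tree index Abstract: This talk introduces optimizations of indexes in database systems. On the one hand, indexes are essential for database queries, and they matter more as datasets grow. On the other hand, building indexes consumes a great deal of space: a 2016 study reported that indexes consume 55% of the memory in main-memory databases. Indexes bring significant gains in query performance, but memory is neither free nor unlimited. Heavy memory use raises the cost of purchasing memory and leaves the system less space for storing new data and processing data. Focusing on main-memory databases, this talk presents index optimizations in terms of both performance and space usage, mainly the hybrid index and the FITing-Tree index. |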
2019.09.27 FL1, Wing Building for Science Complex |
|
艾山 (Web Group) |
Title: Evolving Knowledge Graphs Abstract: Many practical applications have observed knowledge evolution, i.e., the continuous birth of new knowledge, with its formation influenced by the structure of historical knowledge. This observation gives rise to evolving knowledge graphs whose structure grows over time. However, both the modal characterization and the algorithmic implementation of evolving knowledge graphs remain unexplored. To this end, we propose EvolveKG, a framework that reveals cross-time knowledge interaction with desirable performance of storage and computation. The novelty of EvolveKG lies in the Derivative Graph, a static weighted snapshot of evolution at a certain time. In particular, each weight quantifies knowledge effectiveness with a temporally decaying function of consistency and attenuation, two proposed factors depicting whether or not the effectiveness of a fact fades away with time. Thanks to the cross-time interaction, EvolveKG allows future knowledge prediction by virtue of the influence of historical knowledge. |
杨鑫 (Privacy Group) |
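Title: Exploring Methods for Recommending Apps with Similar Functionality and Lower Privacy Risk Abstract: The rapid development of software, hardware, and Internet of Things technology has made mobile devices, especially smartphones, ubiquitous, and with them third-party app markets have flourished. To better protect user privacy, choosing the app with the lowest privacy risk becomes critical. This talk explores methods for recommending apps with lower privacy risk while keeping functionality similar, and presents the experimental results. |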
2019.09.20 FL1, Wing Building for Science Complex |
|
刘立新 (Privacy Group) |
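Title: Privacy-preserving and accountable electronic health systems based on blockchain Abstract: Blockchain-based electronic health systems have attracted researchers' attention because they let patients control their own electronic health records (EHRs) and make data sharing transparent. These systems bring patients better medical services, save medical resources, and provide better privacy protection for patients' EHRs. However, they store metadata on the blockchain, and the public, transparent nature of the blockchain may leak the private consultation relationship between patients and doctors. In addition, most systems do not support consultation scenarios in which several doctors make a joint diagnosis. This talk reports on recent research: an electronic health system that both protects privacy and supports accountability, offering availability, privacy protection, accountability, and transparency. |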
叶青青 (Privacy Group) |
Title: Beyond Value Perturbation: Differential Privacy in the Temporal Setting Abstract: A time series is a sequence of values in a temporal order. It has numerous applications in signal processing, Internet of things, financial trading, speech, and other multimedia domains. As with other data types, many time series originate from personal data and releasing them could cause privacy infringement. As such, many privacy-preserving time series publishing techniques have been proposed, most of which are based on the differential privacy mechanism. Differential privacy injects noise into the published time series to guarantee that no individual value that contributes to a time series is disclosed with high confidence. However, existing differential privacy mechanisms all aim at perturbing the values, each of which is an aggregate of individual values, while retaining the original temporal order. Unfortunately, in many applications the values must not be perturbed, whereas the temporal order does not need to be strictly followed. As such, we propose temporal perturbation mechanisms that move the values along their temporal axis, i.e., Differential Privacy in the Temporal Setting. |
2019.06.21 FL1, Wing Building for Science Complex |
|
樊敏 |
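Title: Automatic ECG Recognition Abstract: Cardiovascular disease is the leading cause of death from non-communicable diseases worldwide, accounting for about one third of all deaths. The summary of the Report on Cardiovascular Diseases in China 2018 notes that cardiovascular mortality still ranks first and that the number of patients grows every year, so early prevention and treatment are imperative. The electrocardiogram (ECG) is the most effective tool for diagnosing cardiovascular disease and assessing the risk of arrhythmia. With the arrival of wearable ECG devices, the demand for ECG monitoring keeps growing, and a reliable method for automatically recognizing ECG types is needed to monitor patients' health. This talk surveys existing work on ECG recognition, discusses an adaptive ECG algorithm for wearable devices, analyzes the experimental results, and outlines the challenges of ECG recognition on wearable devices. |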
陈珂锐 |
Title: Interpretation and Differential Privacy in Deep Learning Abstract: Can interpretation methods improve the usability of differential privacy in deep learning? We attempt to add “more noise” to features that interpretation methods identify as “less relevant” to the model output. The LRP (Layer-wise Relevance Propagation) algorithm might be a good solution: experiments conducted on the MNIST and CIFAR-10 datasets show that our mechanism is highly effective and outperforms existing solutions. |
2019.06.14 FL1, Wing Building for Science Complex |
|
王飞 (Web Group) |
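Title: Applications of Heterogeneous Information Networks in Recommendation Algorithms and Beyond Abstract: Traditional graph algorithms generally deal with networks whose objects are all of one type, i.e., homogeneous networks. This modeling approach strips away other information and considers only the direct links between objects. It greatly simplifies processing but easily loses information. In the real world, the links between objects and the types of objects are usually diverse, and heterogeneous information networks (HINs) were introduced to mine the latent value of data from such rich objects and links. As a new way of modeling network data, an HIN better matches the heterogeneous nature of the data and can carry more information and integrate rich semantic relations. This is both the strength of HINs and the difficulty in studying them. In academia, HINs have become an important tool for network data mining and are widely applied to tasks such as similarity measurement, classification, ranking, and recommendation. Overall, compared with traditional methods built on the user-item matrix, HIN-based recommender systems incorporate more side information and a computation framework centered on meta-paths, which lets them score very well in accuracy as well as in interpretability and diversity. |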
吴新乐 (Web Group) |
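Title: An Introduction to Domain Adaptation Methods in Transfer Learning Abstract: In classical machine learning we usually assume that the training set and the test set follow the same distribution: a model is trained on the training set and evaluated on the test set. In practice, however, the two distributions may differ greatly, and a model trained on the training set then fails to perform well on the test set, i.e., the so-called “overfitting” problem appears. Domain adaptation is a representative transfer learning method whose goal is to apply a model trained on a source domain to a target domain with a different marginal feature distribution and still obtain good results. The talk gives a brief introduction to transfer learning and then focuses on domain adaptation and meta-learning methods. |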
2019.06.06 FL1, Wing Building for Science Complex |
|
段志强 (Cloud Group) |
Title: Distribution-Aware Stream Partitioning for Distributed Stream Abstract: The performance of modern distributed stream processing systems is largely dependent on a balanced distribution of the workload across the cluster. Input streams with large, skewed domains pose challenges to these systems, especially for stateful applications. Key splitting, where the state of a single key is partially maintained across multiple workers, is a simple yet effective technique to reduce load imbalance in such systems. However, it comes with the cost of increased memory overhead, which has been neglected by existing techniques so far. We discuss a novel stream partitioning algorithm for intra-operator parallelism which adapts to the underlying stream distribution in an online manner and provides near-optimal load imbalance with minimal memory overhead. The technique relies on explicitly routing frequent items using a greedy heuristic which considers both load imbalance and space requirements. It uses hashing for infrequent items to keep the size of the routing table small. Through extensive experimentation with real and synthetic datasets, the solution consistently provides near-optimal load imbalance and memory footprint over a variety of distributions. |
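The routing idea in this abstract (an explicit table for frequent keys, hashing for everything else) can be sketched as follows. This is a simplified illustration and not the algorithm from the talk: the greedy rule here balances load only, ignores the memory cost of key splitting, and is built offline from a sample instead of adapting online. All names and the sample stream are hypothetical.

```python
import zlib
from collections import Counter

def build_routing_table(sample_keys, num_workers, top_k):
    """Greedily assign the top_k most frequent keys to the least-loaded worker."""
    loads = [0] * num_workers
    table = {}
    for key, count in Counter(sample_keys).most_common(top_k):
        worker = min(range(num_workers), key=lambda w: loads[w])
        table[key] = worker
        loads[worker] += count
    return table

def route(key, table, num_workers):
    """Frequent keys use the explicit table; all other keys fall back to hashing."""
    if key in table:
        return table[key]
    # crc32 is used instead of hash() so routing is stable across processes.
    return zlib.crc32(key.encode("utf-8")) % num_workers

# Hypothetical skewed sample: three heavy keys and a few rare ones.
sample = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + ["d", "e", "f"]
table = build_routing_table(sample, num_workers=2, top_k=3)
```

Only the heavy keys occupy table entries, so the table stays small however many distinct rare keys appear in the stream.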
杨晨 (Cloud Group) |
Title: Exploratory Analysis (Model Join) on ModelBase Abstract: Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, data used for building an ML model are aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that can support a wide variety of ML models such as generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering. We propose a cost based optimization framework that identifies appropriate ML models to combine at query time and conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models, with superior performance, achieving dramatic speedups of several orders of magnitude on very large datasets. |
2019.05.31 FL1, Wing Building for Science Complex |
|
马超红 (Cloud Group) |
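Title: A Summary of Machine-Learning-Enhanced Database Systems Abstract: This talk introduces the research background of database systems enhanced by machine learning, and how database systems can use machine learning to relieve performance bottlenecks. On the one hand, accelerated algorithms from machine learning and modern accelerators can raise the processing speed of database systems. On the other hand, the intelligence of machine learning can improve usability, allowing a database system to adjust its configuration dynamically and more intelligently under diverse workloads. Two recently read papers are presented, one for each aspect. |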
刘立新 (Privacy Group) |
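Title: Applications of Blockchain in Distributed Data Statistical Analysis and Federated Learning Abstract: People increasingly rely on big data for decision making. In distributed data scenarios such as healthcare, national security, and enterprise data, however, privacy and security concerns often prevent data from being transmitted and shared directly, which limits the value that can be realized from big data. Performing distributed statistical analysis and machine learning under privacy protection, and thereby sharing data indirectly, is an urgent problem. At present, most methods assume that the participants (data providers, the compute nodes for statistical analysis, and the master node in federated learning) are honest-but-curious, and the statistical analysis and machine learning processes do not support auditing and lack fairness. In real applications, however, any party may be untrusted: it may be faulty or malicious. Blockchain is decentralized, tamper-proof, and publicly transparent, which provides technical support for solving these problems. This talk summarizes blockchain-based statistical analysis and machine learning over distributed datasets, which supports auditing and makes big data decisions more transparent. |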
2019.05.25 FL1, Wing Building for Science Complex |
|
吴永泰 (Web Group) |
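Title: Shared Embedding Based Neural Networks for Knowledge Graph Completion Abstract: Knowledge graph embedding is currently the main approach to knowledge graph completion; it includes bilinear models, translation models, and neural-network-based completion models. With the progress of neural networks in natural language processing and computer vision, applying neural networks to knowledge graph completion has become an active research topic. This talk introduces classic methods for knowledge graph completion as well as a completion method based on shared embeddings. |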
刘俊旭 (Privacy Group) |
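Title: Federated Learning and Privacy Protection Abstract: As machine learning, represented by deep learning, is applied ever more widely in daily life, the data used to train models often contain large amounts of sensitive personal information. Federated learning is a new form of machine learning that aims to train a model collaboratively on the data of all parties while keeping user data on the user side. Compared with the traditional approach of collecting user data centrally in a data center and then training a model, federated learning has a clear privacy advantage, but potential privacy problems remain. This talk first gives an overview of the federated learning process, then points out its privacy problems and the existing privacy protection methods, and finally discusses the challenges federated learning faces in practical application and deployment and its future prospects. |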
2019.05.17 FL1, Wing Building for Science Complex |
|
朱敏杰 (Privacy Group) |
Title: PAKDD 2019 Conference Report Abstract: The report covers the following six topics: (1) Overview of PAKDD 2019 (2) AutoML Challenge - Data Contest (3) Data Driven AI Thinking (4) The good, the bad and the ugly of AI (5) Privacy Protection for Knowledge Discovery and Data Mining (6) Summary of PAKDD Papers: Data, Application Scenarios, Questions and Technology |
叶青青 (Privacy Group) |
Title: SP2019 Conference Talk Rehearsal |
2019.05.10 FL1, Wing Building for Science Complex |
|
艾山 (Web Group) |
Title: Building Your DNN Models Like Playing Lego Abstract: When building deep neural network models for natural language processing tasks, engineers often spend a lot of effort on coding details and debugging instead of focusing on model architecture design and hyper-parameter tuning. We introduce the deep neural network toolkits NeuronBlocks, ML.NET, MLflow, and Ludwig. In these platforms, a suite of neural network layers is encapsulated as building blocks, which can easily be used to build complicated deep neural network models through simple configuration. NeuronBlocks empowers engineers to build and train various NLP models in seconds without writing a single line of code. A series of experiments on real NLP datasets such as GLUE and WikiQA has been conducted, demonstrating the effectiveness of NeuronBlocks. |
汤庆 (Cloud Group) |
Title: A Demonstration of the OtterTune Subtitle: Automatic Database Management System Tuning Through Large-scale Machine Learning Abstract: Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration “knobs” that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized, not independent, and not universal. Worse, information about the effects of the knobs typically comes only from (expensive) experience. To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert. |
2019.04.26 FL1, Wing Building for Science Complex |
|
王硕 (Web Group) |
Abstract: (1) Overview of the conference; (2) knowledge graph topics; (3) recommender system topics. |
段志强 (Cloud Group) |
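Abstract: This talk reports on three ICDE 2019 themes: BDSM, query, and learning from data. As data volumes keep growing, how are big data management systems designed, and what difficulties and challenges do they face? Once data are stored and managed, how can they be queried quickly and effectively, and what kinds of queries meet our needs? And what can we learn from data at this scale? The talk presents ICDE work on big data management, big data querying, and discovering the deeper content hidden in data. |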
杜永杰 (Cloud Group) |
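Abstract: With the progress of science and technology and the spread of sensors, the volume of spatio-temporal time series data is growing explosively. In recent years spatio-temporal data have become a research hotspot in data mining and have attracted wide attention at home and abroad. Spatio-temporal data mining is applied in many fields, such as traffic management, crime analysis, disease surveillance, environmental monitoring, public health, and healthcare. As an emerging research field, it is devoted to developing and applying new computing techniques to analyze massive, high-dimensional spatio-temporal data and to uncover valuable information in them. Compared with traditional data mining, however, research on spatio-temporal data mining is far from mature. As the collection of spatio-temporal data becomes more efficient and the accumulated data grow, spatio-temporal data mining faces many challenges. The International Conference on Data Engineering (ICDE 2019) addresses the management and analysis of spatio-temporal datasets and aims to discuss real-world challenges, practical problems, and concrete solutions. This talk summarizes the main content of ICDE 2019, focusing on spatio-temporal data indexing and data mining, and shares impressions from the conference. |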
2019.04.20 FL1, Wing Building for Science Complex |
|
杨晨 (Cloud Group) |
Title: A Frequency Scaling Based Performance Indicator Framework for Big Data Systems (DASFAA rehearsal) Abstract: It is important for big data systems to identify their performance bottleneck. However, popular indicators such as resource utilization are often misleading and cannot be compared with each other. In this paper, a novel indicator framework which can directly compare the impact of different indicators with each other is proposed to identify and analyze the performance bottleneck efficiently. A methodology which can construct the indicator from the performance change with the CPU frequency scaling is described. Spark is used as an example of a big data system and two typical SQL benchmarks are used as the workloads to evaluate the proposed method. Experimental results show that the proposed method is accurate compared with the resource utilization method and easy to implement compared with the white-box method. Meanwhile, the analysis with our indicators leads to some interesting findings and valuable performance optimization suggestions for big data systems. |
段志强 (Cloud Group) |
Title: Presto: A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Abstract: Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with subsecond latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto’s Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. |
2019.04.12 FL1, Wing Building for Science Complex |
|
刘俊旭 (Privacy Group) |
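Title: SysML & XLDB (2019) Conference Report Abstract: Machine learning has been widely studied and applied in many fields. In real deployments, however, the design and implementation of machine learning systems still face many obstacles. SysML is a new academic conference at the intersection of machine learning and computer systems. It focuses on hardware and software systems that support machine learning, and on system optimization for metrics other than prediction accuracy, such as ease of use, latency, and fairness. The SysML program committee consists of experts from the systems and ML communities. Continuing the strong lineup of last year's inaugural conference, this year's speakers came from academic teams at leading universities such as Stanford, CMU, and Cambridge, and from the research arms of companies such as Google, Microsoft, and IBM. The Extremely Large Databases (XLDB) conference series addresses the management and analysis of extremely large datasets and aims to discuss real-world challenges, practical problems, and concrete solutions. Unlike traditional academic conferences based on paper submissions, XLDB mainly features reports from industry researchers on the latest applied advances, and its attendees include developers, researchers, and vendors from industry, academia, and government. With the rapid development of ML, the focus of XLDB has gradually shifted to research at the intersection of very-large-scale data management and machine learning. This talk summarizes the main content of SysML 2019 and XLDB 2019 and conveys five frontier themes: 1. ML as a workload; 2. Push AI to the edge/device; 3. Easy-to-use ML; 4. Software 2.0; 5. Fairness & Security of ML. It closes with impressions from the trip and some tips for attending conferences. |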
2019.03.29 FL1, Wing Building for Science Complex |
|
马超红 (Cloud Group) |
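Title: Plan for Attending SysML and XLDB Abstract: A report on the plan for attending SysML and XLDB, mainly introducing the topics covered by the conference themes and collecting everyone's requests. |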
王硕 (Web Group) |
Title: Knowledge Representation for Emotion Intelligence (ICDE2019 PhD Symposium rehearsal) Abstract: Emotion intelligence (EI) is a traditional topic for psychology, sociology, biology and medical science, because emotion is related to personality, interpersonal effects, social function, disease treatment, etc. Analyzing emotion from Web data with computer technology is becoming more and more popular, and scientists from non-computer domains need more helpful computing models to deal with professional problems that are not traditional for computer science. Knowledge representation is a basic and possible solution as a bridge between emotion intelligence and artificial intelligence. For sentiment words, word embedding can map the words to vectors that represent their semantic context. Sentiment embedding based on word embedding can capture both semantics and emotion information. We have introduced two improved embedding methods (MEC and Emo2Vec) for sentiment word embedding. For emotion structure based on the psychology of emotion, a knowledge graph can represent the cognitive relations between different emotion types. The same emotional expressions can affect the reaction and behaviors of the recipient in different ways due to factors such as social relations, information processing, time pressure, etc. A knowledge graph can represent these complicated situations as relations between entities and attributes. Based on this graph, we infer or predict the influence of emotion on decision making. |
张祎 (Web Group) |
Title: EMT: A Tail-Oriented Method for Specific Domain Knowledge Graph Completion (PAKDD rehearsal) Abstract: The basic unit of a knowledge graph is the triplet, consisting of a head entity, a relation and a tail entity. Knowledge graph completion has attracted more and more attention and made great progress. However, existing models are all verified on open-domain data sets. When applied to a specific domain, they are challenged by practical data distributions. For example, due to the poor representation of tail entities caused by their relation-oriented design, they cannot deal with the completion of an enzyme knowledge graph. Inspired by question answering and the rectilinear propagation of light, this paper puts forward a tail-oriented method, Embedding for Multi-Tails knowledge graph (EMT). Specifically, it first represents the head and relation in question space; then it projects them into answer space by a tail-related matrix; finally, it obtains the tail entity via a translating operation in answer space. To reduce the time and space complexity of EMT, this paper includes two improved models: EMTv and EMTs. Taking some optimal translation and composition models as baselines, link prediction and triplet classification on an enzyme knowledge graph sample and Kinship demonstrate our performance improvements, especially in tail prediction. |
2019.03.22 FL1, Wing Building for Science Complex |
|
艾山 (Web Group) |
Title: Experts for Review: Intelligent Recommendation Methods Abstract: The rapid development of the internet has led to the accumulation of massive amounts of data, and thus we find ourselves entering the age of big data. Obtaining useful information from these big data is a crucial issue. The aim of this article is to solve the problem of recommending experts to provide peer reviews for universities and other scientific research institutions. Our proposed recommendation method has two stages. An information filtering method is first offered to identify proper experts as a candidate set. Then, an aggregation model with various constraints is suggested to recommend appropriate experts for each applicant. The proposed method has been implemented in an online research community, and the results show that the proposed method is more effective than existing ones. |
杨鑫 (Privacy Group) |
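Title: Applications of Blockchain in Machine Learning Abstract: In the big data era, differing user requirements pose new challenges for machine learning tasks. Blockchain, a new computing paradigm and collaboration model that establishes trust at low cost in an untrusted, competitive environment, is changing the application scenarios and operating rules of many industries. This talk considers two machine learning scenarios, outsourced machine learning tasks and distributed machine learning, and uses the tamper-proof and non-repudiation properties of blockchain to obtain the best machine learning model while better protecting user privacy from leakage. It concludes with some thoughts on future work. |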
2019.03.15 FL1, Wing Building for Science Complex |
|
吴新乐 (Web Group) |
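Title: Entity Recognition and Relation Extraction in Low-Resource Scenarios Abstract: With the rapid development of deep learning, supervised entity recognition and relation extraction have made great progress. In practice, however, manually labeling supervised datasets is laborious and error-prone, so it is necessary to explore entity recognition and relation extraction under weak supervision. Distant supervision, a successful weak supervision method, performs well in both tasks, but it cannot be used when the target entity types do not exist in general-purpose knowledge bases. We therefore need methods with very low resource requirements, for example for building entity recognition and relation extraction models when only a few hundred relation instances are available. The talk mainly introduces state-of-the-art methods for entity recognition and relation extraction in semi-supervised and unsupervised settings. |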
唐子立 (Cloud Group) |
Title: Learned Cardinalities: Estimating Correlated Joins with Deep Learning Abstract: This report describes a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. The evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization. |
2019.03.09 FL1, Wing Building for Science Complex |
|
刘立新 (Privacy Group) |
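Title: Applications of Blockchain in Data Trading Abstract: Data trading is currently an important form of data circulation that helps realize the value of data, and many data marketplaces have appeared, such as Dataexchange and Datacoup. Existing marketplaces, however, suffer from problems such as data providers losing control of their data, lack of fairness, and difficulty of accountability. Building a decentralized data trading platform on the decentralized, tamper-proof properties of blockchain offers a new way to address these problems. This talk first analyzes the advantages of applying blockchain to data trading and the new challenges it brings. It then describes in detail three classes of existing blockchain-based protocols for fair trading. Finally, it points out applications of blockchain-based fairness in other scenarios such as cloud services. |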
2019.03.01 FL1, Wing Building for Science Complex |
|
王硕 (Web Group) |
Title: Knowledge Representation for Emotion Intelligence Abstract: Emotion intelligence (EI) is a traditional topic for psychology, sociology, biology and medical science, because emotion is related to personality, interpersonal effects, social function, disease treatment, etc. Analyzing emotion from Web data with computer technology is becoming more and more popular, and scientists from non-computer domains need more helpful computing models to deal with professional problems that are not traditional for computer science. Knowledge representation is a basic and possible solution as a bridge between emotion intelligence and artificial intelligence. Sentiment embedding based on word embedding can capture both semantics and emotion information. We have introduced the MEC method for sentiment word embedding. For emotion structure based on the psychology of emotion, a knowledge graph can represent the cognitive relations between different emotion types. The same emotional expressions can affect the reaction and behaviors of the recipient in different ways due to factors such as social relations, information processing, time pressure, etc. A knowledge graph can represent these complicated situations as relations between entities and attributes. Based on this graph, we can infer or predict the influence of emotion on decision making. |
杜永杰 (Cloud Group) |
Title: SageDB: A Learned Database System Abstract: Modern data processing systems are designed to be general purpose, in that they can handle a wide variety of different schemas, data types, and data distributions, and aim to provide efficient access to that data via the use of optimizers and cost models. This general purpose nature results in systems that do not take advantage of the characteristics of the particular application and data of the user. With SageDB we present a vision towards a new type of a data processing system, one which highly specializes to an application through code synthesis and machine learning. By modeling the data distribution, workload, and hardware, SageDB learns the structure of the data and optimal access methods and query plans. These learned models are deeply embedded, through code synthesis, in essentially every component of the database. As such, SageDB presents a radical departure from the way database systems are currently developed, raising a host of new problems in databases, machine learning and programming systems. |
2019.02.22 FL1, Wing Building for Science Complex |
|
At the first group meeting, Prof. Meng gave an outlook for the new semester, and each group reported its plans for this semester. |
