Seminars (2021)

2021.6.17  FL1, Wing Building for Science Complex

马超红

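Title: SeriesLearned index for heterogeneous storage

Abstract: Index structures enable efficient database queries but consume a large amount of storage space. Studies show that in in-memory databases, indexes account for 55% of memory space. Research on learned indexes can significantly reduce the space occupied by indexes. Current learned index research considers only a single storage device, with data stored either entirely in memory or entirely on external storage, and does not consider heterogeneous storage. This talk briefly introduces learned indexes and explores building a learned index on top of heterogeneous storage.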
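
As a rough illustration of the learned index idea mentioned in this abstract, the sketch below replaces a tree with a single linear model that predicts a key's position in a sorted array, then corrects the prediction with a bounded local search. The class name `LinearLearnedIndex`, the endpoint-based line fit, and the in-memory list are assumptions made for brevity; they are not taken from the talk and do not cover the heterogeneous storage setting.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model plus a bounded local search."""

    def __init__(self, keys):
        # keys must be a non-empty, sorted sequence of distinct numbers.
        self.keys = list(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit a line through the first and last key (an assumption made for
        # brevity; a least-squares fit would also work).
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Record the worst prediction error so lookups know how far to search.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        guess = self._predict(key)
        left = max(guess - self.max_err, 0)
        right = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, left, right)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The only state kept besides the data is the slope, the intercept, and the error bound, which is where the space saving over a tree index comes from in this simplified setting.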

李梓童

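Title: Privacy Risk Analysis of Mobile Applications

Abstract: The report has three parts. The first covers a tool for analyzing the privacy risks of mobile users' apps and its implementation, mainly introducing the problem background, related work, design approach, and solution analysis for building an app that computes privacy risk. The second covers research on privacy risk quantification and tiered governance for mobile applications, mainly introducing a risk quantification model designed specifically for mobile applications, a method for grading apps by risk, and the data analysis results after grading. The third is paper reading, briefly introducing some of the papers read this semester.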

2021.6.24  FL1, Wing Building for Science Complex

刘立新


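Title: Research on Blockchain-Based Transparency Techniques for Data Sharing

Abstract: In data sharing and circulation, blockchain offers data subjects properties such as transparency and decentralization. However, the transparency of blockchain itself can also lead to leakage of identity privacy and data privacy. How to balance transparency and privacy is one of the key issues affecting the practical application of blockchain. This talk mainly introduces the speaker's recent work, "A blockchain-based data aggregation framework", which proposes a data aggregation framework that simultaneously protects user identity privacy and data confidentiality while keeping results verifiable.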

彭迎涛

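Title: A Cross-Category Diversified Recommendation Method Based on Knowledge Graphs

Abstract: As material life grows richer, people increasingly pursue richness, diversity, and novelty in their material and cultural lives. Diversified recommendation, as one way to meet this demand, has attracted great interest from researchers; such methods can address this demand while also maintaining recommendation accuracy. Existing work on diversifying knowledge-graph-based recommendation usually relies on existing nodes and relations and uses high-order message aggregation and propagation to compute similarity, so most recalled results are the same item or substitutable items of the same kind, sharing attributes and category. This talk briefly introduces a proposal for a cross-category diversified recommendation method on knowledge graphs that is inspired by stone skipping.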


但唐朋

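Title: Progress Report on the SIGSPATIAL GISCUP 2021 Competition

Abstract: This week we trained and tested on the dataset provided by the competition. The talk focuses on the difficulties encountered in actual training and the corresponding solutions, and presents the latest submission results of the current model.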


2021.7.1  FL1, Wing Building for Science Complex

王雷霞

Title: Frequency Estimation of Set-Valued Data in the Shuffle Model

Abstract: Recently, the Encode-Shuffle-Analyze framework (also known as the shuffle model) has been widely studied; via shuffling, it improves the accuracy of results by O(√n) compared with local differential privacy. However, existing research on the shuffle model focuses on the case where each user holds a single numerical or categorical value, and no work studies the set-valued case, which is more challenging because the number of values varies from user to user.

In this paper, we systematically study frequency estimation over set-valued data in the shuffle model and propose two mechanisms, SS (Sampling & Shuffling) and PSS (Padding & Sampling & Shuffling), which gain additional privacy amplification by combining sampling with shuffling and padding with shuffling, respectively. Further, we analyze the optimal sampling rate for the SS mechanism to minimize the mean squared error, and we are working on optimizing the error of the PSS mechanism. The SS and PSS mechanisms are best suited to different ranges of the user item domain.

范卓娅


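Title: Price Discrimination and Fairness

Abstract: The talk first surveys definitions of fairness and research problems across fields from both non-technical and technical perspectives, and then introduces a paper that adds fairness constraints to big-data-enabled price discrimination against existing customers. Would banning such price discrimination necessarily increase total social welfare? Economists conclude that it would not, but this does not mean that price discrimination should go unrestricted. The paper therefore takes the government's perspective and examines how introducing fairness constraints into price discrimination affects total social welfare.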

2021.9.24  FL1, Wing Building for Science Complex

张旭康

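Title: Intelligent Management of Spatial Data

Abstract: With the rapid development of the Internet of Vehicles, cloud computing, and the mobile Internet, spatial data has entered the big data era. The development of concepts such as smart cities and smart transportation has moved spatial data from a traditional single data dimension toward multimodal forms. Spatial data is collected with high precision and in near real time, its scale is growing ever larger, its organization is becoming more complex, and it carries a risk of privacy leakage, so traditional spatial data processing and storage models are no longer suitable. Hence the need for intelligent spatial data management, which aims to use data intelligence methods to optimize spatial data storage, improve data processing efficiency, and strengthen data privacy and security. However, intelligent discovery over spatial data still lacks an overall framework design: there is no integrated "collection-storage-analysis" system for spatial data and no efficient knowledge fusion mechanism for heterogeneous data, and long-term storage and mining of massive historical data is inefficient. From a data management perspective, this talk proposes a framework for intelligent spatial data management and the related challenges, with the aim of advancing spatial data governance and development.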


但唐朋

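Title: Optimal Path Queries with Multi-Rule Constraints in Dynamic Road Networks

Abstract: The shortest path query has long been a hot topic in spatial database research and is widely used in navigation, trajectory planning, traffic flow management, logistics, routing algorithms, and other areas. In recent years, with the development of smart cities and intelligent transportation, a traditional, single, simple data model can hardly abstract the whole real world. To measure the real world more reasonably and to answer personalized queries driven by real needs, attention has turned to queries under complex constraints, and the Optimal Sequenced Route (OSR) query emerged in response. OSR queries come with many constraints. Specifically, I abstract the constraints in these queries into concrete query rules and propose a new OSR query method to support OSR queries in dynamic road networks.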
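
For readers unfamiliar with OSR queries, the following is a minimal sketch of the basic problem on a static graph: find the shortest route from a start node that visits one node of each required category in the given order. It runs Dijkstra over (node, stage) states. The function name, the dictionary-based graph, and the category labels are illustrative assumptions; the sketch does not model the query rules or the dynamic road network setting that the talk addresses.

```python
import heapq


def optimal_sequenced_route(graph, category, start, sequence):
    """Length of the shortest route from start that visits one node of each
    category in `sequence`, in order. Returns None if no such route exists.

    graph:    dict node -> list of (neighbor, non-negative edge weight)
    category: dict node -> category label (unlabeled nodes are omitted)
    Nodes must be mutually comparable (e.g. all strings) for heap ordering.
    """
    # Dijkstra over states (node, stage), where stage counts how many
    # categories of the sequence have been visited so far.
    dist = {(start, 0): 0}
    heap = [(0, start, 0)]
    while heap:
        d, node, stage = heapq.heappop(heap)
        if d > dist.get((node, stage), float("inf")):
            continue  # stale heap entry
        if stage == len(sequence):
            return d
        moves = [(nbr, stage, d + w) for nbr, w in graph.get(node, [])]
        # Standing on a node of the next required category advances the
        # stage at no extra cost.
        if category.get(node) == sequence[stage]:
            moves.append((node, stage + 1, d))
        for nxt, nxt_stage, nd in moves:
            if nd < dist.get((nxt, nxt_stage), float("inf")):
                dist[(nxt, nxt_stage)] = nd
                heapq.heappush(heap, (nd, nxt, nxt_stage))
    return None
```

The state space has one copy of the road network per stage, so the cost grows linearly with the length of the category sequence.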


彭迎涛

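Title: A Cross-Category Diversified Recommendation Method Based on Knowledge Graphs

Abstract: As people's needs diversify, diversified recommendation has drawn attention from both industry and academia. Many existing methods for diversified recommendation via knowledge graphs have achieved good results; they usually make recommendations by aggregating and propagating signals between different nodes of the knowledge graph. However, most of this work ignores category-level signals, which can reveal customers' latent diversified needs. To address this, we propose a knowledge graph model (KSSM) to increase recommendation diversity. KSSM develops a knowledge-aware LSTM to capture users' historical interests from their click history sequences, and proposes a category-level signal capture mechanism to find latent intents between users and items of other categories.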


2021.10.8  FL1, Wing Building for Science Complex

王文礼

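Title: The Relationship of Admission Essay Content to Family Income and SAT Scores

Abstract: Ample evidence suggests a potential class bias when standardized tests (the SAT) are used to evaluate undergraduate applicants to US universities, but few studies have considered the application essays that applicants are required to submit. The paper uses 240,000 admission essays submitted by 60,000 University of California applicants in November 2016 to measure the relationship among essay content, family income, and SAT scores, quantifying the essays with topic modeling (CTM) and the Linguistic Inquiry and Word Count (LIWC) system. The results show that essay content correlates more strongly with family income than SAT scores do, and that this relationship weakens as family income rises. Essay content also explains part of the variance in SAT scores, suggesting that essays implicitly carry some of the same information as the SAT.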

李梓童

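Title: A Survey of Adversarial Examples for Deep Neural Networks

Abstract: Deep neural networks are now widely used, but it should be noted that during training they are easily affected by specially crafted inputs, namely adversarial examples. Adversarial examples are often very similar to ordinary examples and hard to tell apart with human senses, yet in the training of a deep learning model they can influence the model's final decision boundary and thus "fool" the neural network. This group meeting shares a survey on adversarial examples, covering why they arise and how they are classified, focusing on how they work in computer vision, speech recognition, and natural language processing, and closing with possible defenses.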

2021.10.15  FL1, Wing Building for Science Complex

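Dr. Huanchen Zhang (张焕晨), Assistant Professor, Institute for Interdisciplinary Information Sciences (Yao Class), Tsinghua University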

报告题目:Memory-Efficient Search Trees for Database Management Systems

报告摘要: The growing cost gap between DRAM and storage together with increasing database sizes means that database management systems (DBMSs) now operate with a lower memory to storage size ratio than before. On the other hand, modern DBMSs rely on in-memory search trees (e.g., indexes and €filters) to achieve high throughput and low latency. ‘These search trees, however, consume a large portion of the total memory available to the DBMS. ‘This dissertation seeks to address the challenge of building compact yet fast in-memory search trees to allow more efficient use of memory in data processing systems. We €first present techniques to obtain maximum compression on fast read-optimized search trees. We identi€fied sources of memory waste in existing trees and designed new succinct data structures to reduce the memory to the theoretical limit. We then introduce ways to amortize the cost of modifying static data structures with bounded and modest cost in performance and space. Finally, we approach the search tree compression problem from an orthogonal direction by building a fast string compressor that can encode arbitrary input keys while preserving their order. Together, these three pieces form a practical recipe for achieving memory-efficiency in search trees and in DBMSs.

报告人简介:Huanchen Zhang is an Assistant Professor in the IIIS (Yao Class) at Tsinghua University. His research interest is in database management systems with particular interests in indexing/filtering data structures, data compression, and cloud databases. He received his Ph.D. degree from the Computer Science Department at Carnegie Mellon University. He is the recipient of the 2021 SIGMOD Jim Gray Dissertation Award.


但唐朋

Title: The Why-Not Problem in Top-k Spatio-Temporal Trajectory Similarity Search

Abstract: Extensive efforts have been made to improve the efficiency of top-k trajectory similarity search (TkTSS), which retrieves the k most similar trajectories for a given trajectory under a similarity function. When a user issues an initial query, s/he may find that some desired trajectories are not in the result and may question why these expected trajectories are missing. To address this problem, we develop a so-called why-not spatio-temporal TkTSS that minimally modifies the original top-k result into one that contains the expected missing trajectories. In this paper, a novel hybrid SGP index is developed to organize the trajectories. Based on this index, an efficient time-first TkTSS framework is proposed to answer TkTSS queries. To refine the initial query so that all missing trajectories appear in the result, an innovative trajectory projection approach is designed to transform the why-not question on TkTSS into a two-dimensional geometric problem. Two types of boundary areas, the pruned area (PA) and the refined area (RA), are computed to shrink the search space. By constructing the compact area of RA, the search space can be shrunk further. Several pruning methods are proposed to accelerate query processing. Finally, extensive experiments with real-world and synthetic data show that the proposed solution performs much better than its competitors in both effectiveness and efficiency.


马超红

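Title: Index compression methods

Abstract: Index structures are critical to database performance, yet they consume a large share of system resources, especially memory. Memory cannot be expanded without limit and is relatively expensive. As indexes take up memory, less space remains for storing raw data and processing existing data, which hurts system performance. Traditional methods reduce index size mainly by compressing the space that keys occupy in nodes, for example through prefix/suffix truncation, dictionary compression, and key normalization. In recent years, learned indexes have been proposed, which compress index structures from a model perspective. This talk mainly shares related work on learned indexes and compares the strengths and weaknesses of the two approaches, traditional index compression and learned indexes.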
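
To make one of the traditional techniques named in this abstract concrete, here is a small sketch of prefix truncation within a single index node: the prefix shared by all keys in the node is stored once and only the differing suffixes are kept per key. The function names `compress_node` and `decompress_node` are hypothetical, and a real B-tree node layout would involve more bookkeeping than this.

```python
import os


def compress_node(keys):
    """Store the common prefix of a node's keys once, plus the suffixes."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]


def decompress_node(prefix, suffixes):
    """Rebuild the original keys from the shared prefix and the suffixes."""
    return [prefix + s for s in suffixes]
```

For example, the keys `"user:1001"`, `"user:1002"`, and `"user:1017"` share the prefix `"user:10"`, so the node only needs to store that prefix and the suffixes `"01"`, `"02"`, and `"17"`. Because the prefix is common to all keys, the suffixes keep the same sort order as the original keys.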

王雷霞

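Title: Research on Shuffle Differential Privacy Methods

Abstract: As data infrastructure continues to be built and improved, the large-scale collection and use of user data has increased the risk of user privacy leakage. To protect user privacy, centralized differential privacy and local differential privacy have been widely applied. The former relies on a trusted third party to process the data, which weakens privacy; the latter perturbs data directly on the user side, which markedly reduces utility. To balance utility and privacy, researchers proposed the shuffle differential privacy framework, which introduces a secure shuffling mechanism on top of the local differential privacy framework and improves utility based on privacy amplification theory. However, current shuffle differential privacy methods face many challenges, such as poor robustness and the ability to handle only statistical problems in simple, idealized settings. This talk summarizes current shuffle differential privacy methods and the open challenges, and proposes shuffle differential privacy methods for the set-valued setting as well as a personalized shuffle differential privacy method.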
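
The pipeline described in this abstract (local perturbation, then shuffling, then analysis) can be sketched for the simplest case of one categorical value per user. The example below uses generalized randomized response as the local randomizer, which is an assumption chosen for illustration and not necessarily the mechanism used in the talk. The `epsilon` here is the local privacy parameter; the amplified central guarantee after shuffling is not computed.

```python
import math
import random
from collections import Counter


def grr_perturb(value, domain, epsilon, rng):
    """Generalized randomized response: keep the value with probability p,
    otherwise report a different value chosen uniformly at random."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])


def shuffle_and_estimate(values, domain, epsilon, seed=0):
    """Perturb locally, shuffle the reports, and return unbiased frequency
    estimates (as counts) for every value in the domain.

    Requires at least two domain values and epsilon > 0.
    """
    rng = random.Random(seed)
    reports = [grr_perturb(v, domain, epsilon, rng) for v in values]
    rng.shuffle(reports)  # the shuffler breaks the link between user and report
    n, k = len(reports), len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1 - p) / (k - 1)
    counts = Counter(reports)
    # Invert the expected count n_v * p + (n - n_v) * q to recover n_v.
    return {v: (counts[v] - n * q) / (p - q) for v in domain}
```

The analyzer only needs the multiset of reports, so shuffling does not change the estimates; its role is purely to strengthen the privacy guarantee.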


2021.10.22  FL1, Wing Building for Science Complex

范卓娅

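Title: Fairness in Clustering

Abstract: Fairness in classification tasks has been studied extensively, with proposed fairness metrics such as statistical parity and equality of opportunity. Do these metrics apply to clustering tasks? This talk introduces a paper on fairness in clustering. It shows that fairness metrics for supervised classification are not suitable for unsupervised clustering in some cases, and it modifies Lloyd's heuristic implementation of the k-means clustering algorithm to propose the Fair k-means algorithm.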


郝新丽

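Title: Machine Learning Methods for Scientific Discovery

Abstract: The development of large-scale scientific facilities has brought scientific discovery into the big data era, and it also means that scientific discovery can no longer rely entirely on expert experience to capture and study rare scientific phenomena in massive data. It is therefore imperative to use rapidly developing artificial intelligence (AI) technology to promote the intelligent discovery of scientific events. After surveying the state of machine learning research in the sciences, this group meeting takes time-domain astronomy as an example and uses 3 traditional machine learning methods and 4 typical deep learning methods to detect flare events, in order to answer three questions with experimental data: Can machine learning be applied in traditional, rigorous scientific fields? How should we view the relationship between traditional methods and machine learning methods in scientific discovery? What are the challenges and future directions for putting machine learning into practice for scientific discovery?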

2021.10.29  FL1, Wing Building for Science Complex

艾山

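Title: Efficient Model Training and Improved Model Generalization in AgileML

Abstract: Training a machine learning model is a fairly complex process. In most cases we focus only on metrics such as model quality and accuracy, and rarely on time and space efficiency during training or on generalization in real-world settings. This talk discusses how our agile-ml system addresses these issues and presents our conclusions. We first discuss how the system eliminates redundant operations in model training and saves time. We then show that manual hyperparameter tuning during training can be replaced by a model, freeing up human effort and further automating training so that it quickly reaches the best converged state, which leads to our proposed hyperparameter learning model. Finally, we discuss how the same model and data produce different results under different training schemes, and analyze the results of three training schemes in terms of convergence speed and generalization.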

刘俊旭

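Title: Projected Gradient Descent for Deep Learning

Abstract: Although deep learning models have very high-dimensional parameters, prior work has found that gradient descent tends to take place in a low-dimensional subspace, which can be obtained by computing the eigenvalues and eigenvectors of the Hessian matrix. Based on this finding, the low-dimensional subspace can be used to optimize the gradient descent process and thereby address model performance issues in specific deep learning problems. This talk introduces applications of such projection-based gradient descent methods in different settings: 1) using parallel projection to reduce differential privacy noise and improve model performance; 2) using perpendicular projection to address catastrophic forgetting in continual learning; 3) using perpendicular projection to address data heterogeneity in federated learning, offering a new approach to personalized federated learning.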
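
The two projections mentioned in this abstract can be illustrated with plain vectors. Given an orthonormal basis of the low-dimensional subspace, the sketch below splits a gradient into its parallel component (inside the subspace) and its perpendicular component (the remainder). The function names are hypothetical, and how the basis is obtained (for example from Hessian eigenvectors) is outside the scope of this sketch.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(grad, basis):
    """Split grad into its component inside span(basis) and the remainder.

    basis must be a list of orthonormal vectors (for example, top Hessian
    eigenvectors); this sketch does not check or enforce that.
    """
    parallel = [0.0] * len(grad)
    for u in basis:
        coeff = dot(grad, u)
        parallel = [p + coeff * x for p, x in zip(parallel, u)]
    perpendicular = [g - p for g, p in zip(grad, parallel)]
    return parallel, perpendicular
```

With the gradient `[3.0, 4.0, 5.0]` and the basis `[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]`, the parallel component is `[3.0, 4.0, 0.0]` and the perpendicular component is `[0.0, 0.0, 5.0]`. A descent step would then use whichever component the application calls for.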


2021.11.13  FL1, Wing Building for Science Complex

张旭康

Title: The Snowflake Elastic Data Warehouse

Abstract: We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or “Snowflake” for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data. In this paper, we describe the design of Snowflake and its novel multi-cluster, shared data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.

彭迎涛

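Title: KFGN: A Knowledge-aware Fusion Graph Neural Network for Diversity Recommendation in Different Categories

Abstract: In recent years, as people's needs have diversified, diversified recommender systems have received wide attention from academia. Many existing knowledge graph (KG) based diversified recommendation methods make recommendations by aggregating and propagating information between different nodes in the KG. However, most of them ignore the category-level signals that reveal customers' latent diversified needs. As a result, the system cannot meet users' diversified recommendation needs and recommendation performance is limited. To address this, we propose a knowledge-aware fusion graph neural network (KFGN) to increase recommendation diversity. Specifically, KFGN develops a knowledge-aware LSTM-based attention module (KLAM) to capture users' latent preferences from their click history sequences. Then, at the category level, KFGN proposes a category-aware graph neural network module (CGNM) to discover the different intents between users and items of different categories. Based on the fused graph neural network, KFGN predicts the click probability of different items to find candidate items.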

2021.11.19  FL1, Wing Building for Science Complex

王文礼

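Title: MetaCI: Meta-Learning for Causal Inference in a Heterogeneous Population

Abstract: With the arrival of the era of the Internet of Everything, data has taken on greater value. Traditional statistical methods and machine learning mostly assume independent and identically distributed (i.i.d.) data, but in practice the i.i.d. assumption can be broken in many ways, for example by interventions and distribution shift. Causal models carry more information than probabilistic models, and drawing conclusions from a causal model (causal inference) is more powerful than probabilistic inference and can be used to analyze the effects of interventions and distribution shift. This group meeting introduces a paper on causal inference that proposes the MetaCI framework, which applies meta-learning to causal inference and aims to answer counterfactual questions in a causal inference setting in order to address covariate shift.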

李梓童

Title: A Lightweight Privacy-Aware Continuous Authentication Protocol-PACA

Abstract: As many vulnerabilities of one-time authentication systems have already been uncovered, there is a growing need and trend to adopt continuous authentication systems. Biometrics provides an excellent means for periodic verification of authenticated users without breaking the continuity of a session. Nevertheless, as attacks on computing systems increase, biometric systems demand more user information in their operations, raising privacy issues for users of biometric-based continuous authentication systems. This article introduces a novel, lightweight, privacy-aware, and secure continuous authentication protocol called PACA. PACA is initiated through a password-based key exchange (PAKE) mechanism, and it continuously authenticates users based on their biometrics in a privacy-aware manner. The paper also designs an actual continuous user authentication system under the proposed protocol.

2021.11.26  FL1, Wing Building for Science Complex

刘立新

Title: Blockchain-assisted differentially private aggregation

Abstract: Blockchain has emerged as a promising direction for revolutionizing existing data-driven systems that rely on a centralized service provider. In this paper, we try to employ the blockchain to replace the trusted service provider in centralized differential privacy (CDP). Although promising, this is non-trivial and has to overcome the following challenges. First, the integrity of aggregation: participants from the public blockchain may misbehave, so they may aggregate incorrectly. Second, the conflict between the confidentiality and the correctness of the noise for differential privacy: if the noise is added by the smart contract for correctness, anyone can remove it without effort, and if the noise is added through ciphertexts, we must ensure its correctness. Third, the balance between on-chain and off-chain processing: on-chain processing incurs monetary cost, so putting all computation on-chain is uneconomical. In this paper, we propose a framework for building blockchain-assisted differentially private aggregation.


艾山

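Title: Agile and Automatic Knowledge Graph Construction Methods

Abstract: With the development of artificial intelligence, knowledge graphs have become a focus in many fields, and research based on knowledge awareness is also growing. Experts have been building excellent knowledge graphs since the 1960s, but because different research and applications require different knowledge graphs, existing ones fall far short of current needs. Building a knowledge graph is a complex process: manual construction requires many domain experts, is time-consuming and laborious, and is hard to scale, while construction with deep learning cannot be done by a single model. Automated construction is therefore a pressing need. This talk studies methods for automated knowledge graph construction, mainly discussing methods and techniques for agile, automated construction under low-resource conditions. It explains core methods such as knowledge discovery, relation extraction, knowledge verification, and representation learning, and uses a knowledge search application to illustrate the research and practical value of automated knowledge graph construction.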


马超红

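Title: Learned index for larger-than-memory databases

Abstract: In practice, main-memory DBMSs significantly outperform disk-resident DBMSs, and larger-than-memory databases remove the restriction that data in a main-memory DBMS must reside entirely in memory. Index structures are essential for efficient database queries but consume a large amount of memory. Studies show that in a pure in-memory database, indexes account for 55% of memory space, and the larger-than-memory setting makes the memory consumed by indexes an even greater concern. Research on learned indexes can significantly reduce the space occupied by indexes, but current research considers only a single storage device. This talk introduces the use of learned structures to solve the indexing problem in larger-than-memory settings that span storage devices.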


2021.12.2  FL1, Wing Building for Science Complex

马超红

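Title: Learned index for larger-than-memory databases

Abstract: In practice, main-memory DBMSs significantly outperform disk-resident DBMSs, and larger-than-memory databases remove the restriction that data in a main-memory DBMS must reside entirely in memory. Index structures are essential for efficient database queries but consume a large amount of memory. Studies show that in a pure in-memory database, indexes account for 55% of memory space, and the larger-than-memory setting makes the memory consumed by indexes an even greater concern. Research on learned indexes can significantly reduce the space occupied by indexes, but current research considers only a single storage device. This talk introduces the use of learned structures to solve the indexing problem in larger-than-memory settings that span storage devices.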

郝新丽

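Title: Machine Learning Methods for Scientific Discovery

Abstract: The development of large-scale scientific facilities and major scientific experiments has brought science into the data-intensive fourth research paradigm, making it imperative to use flourishing artificial intelligence technology to promote the intelligent discovery of scientific events. Machine learning is an important research area of today's AI and has been widely applied to scientific discovery across scientific fields. This talk introduces the state of machine learning research in the sciences, summarizes three characteristics of scientific big data and scientific discovery tasks, and further identifies five challenges that machine learning faces when applied to scientific discovery on scientific big data. On this basis, it proposes a general intelligent scientific discovery framework and describes an efficient mode of intelligent scientific discovery for applying machine learning in the sciences. Finally, taking wide-field, short-timescale stellar flare events as an example, it uses 7 machine learning methods and traditional domain methods to discover scientific events, validating the effectiveness of the framework.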

范卓娅

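Title: Fairness in Insurance

Abstract: The debate between the principle of solidarity and actuarial fairness in insurance is long-standing, but both perspectives overlook the theory of external economies of scale. This talk introduces the paper "Better Together? How Externalities of Size Complicate Notions of Solidarity and Actuarial Fairness" and discusses the effect of externalities on the principle of solidarity and on actuarial fairness.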

2021.12.31  FL1, Wing Building for Science Complex

王雷霞

Title: Heterogeneous Privacy with Differential Privacy Techniques

Abstract: In the real world, privacy cannot be treated uniformly. Different survey questions, different answers, and different people may have different privacy demands. To meet such heterogeneous privacy needs, we have to consider two critical problems. The first is how to define this kind of privacy and satisfy it with various mechanisms. The second is how to estimate statistical results from data collected at different privacy levels. With these two objectives, we summarize several privacy-preserving techniques for heterogeneous privacy, all of which are based on differential privacy or local differential privacy. Finally, we try to draw inspiration from these techniques to realize a utility-optimal shuffle differential privacy scheme based on heterogeneous privacy.

刘俊旭

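Title: Differential Privacy Protection Methods for Large-Scale Deep Learning Models

Abstract: In research on differential privacy for deep learning models, when the number of model parameters is very large, the added random noise severely degrades model utility. This talk introduces an advanced representation learning approach whose main idea is that, starting from a pre-trained model, a new learning task no longer fine-tunes the original network parameters but instead learns a linear model, greatly reducing the complexity of training. The approach has two benefits. First, when differential privacy is introduced, the simplified model offers a way to address the privacy-utility trade-off. Second, the approach can be extended to federated learning, where each participant only needs to train a linear model locally, effectively reducing computation and communication costs.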

