Lab Weekly Meetings, 2023

W425:

January 4, 2024. Venue: Room 101, Science and Engineering Annex

425-1:蒋希文

(cloud group)

Talk title: OOD Research on Graph Neural Networks

Abstract: Mounting evidence shows that neural networks are highly sensitive to distribution shifts, which has made out-of-distribution (OOD) generalization a research focus. Nevertheless, current work concentrates mainly on Euclidean data, while graph-structured data remains under-explored because of two fundamental challenges: 1) the interconnections between nodes in a graph cause data points to be generated non-IID even within the same environment; 2) the structural information of the graph is itself a source of predictive signal. This talk presents two ICLR 2022 papers and tries to distill a common paradigm for solving the OOD problem on graphs.

Key concepts: OOD; distribution shift; GNN

References:

[1]Wu, Qitian, et al. "Handling Distribution Shifts on Graphs: An Invariance Perspective." International Conference on Learning Representations. 2022.

[2]Wu, Yingxin, et al. "Discovering Invariant Rationales for Graph Neural Networks." International Conference on Learning Representations. 2022.

425-2:李维

(cloud group)

Talk title: Timely Stopping: Iterative Annotation Method for Real-time Classification of Time Series Stream

Abstract: In recent years, early time series classification (ETSC) has made significant breakthroughs in scenario-specific applications by producing effective classifications at an early stopping (ES) point. However, ES lacks elasticity and accuracy. In particular, when applied to a time series stream (TSS), imbalanced information makes it difficult to find the right stopping point, high cost breaks real-time guarantees, and concept drift leads to a significant decrease in accuracy. We therefore introduce timely stopping (TS), a new primitive for TSS classification that emphasizes accurate, stable, and efficient timely stopping rather than merely early stopping, and that satisfies the differing demands of TSS on accuracy and earliness at the same time. On this basis, this paper proposes timely time series classification (TTSC) for TSS, which fully exploits known information to annotate and manage classification. Experiments on 19 UCR time-series datasets verify that TTSC compares favorably with many state-of-the-art ETSC models in terms of accuracy and earliness. Importantly, it provides a timely stopping point that improves the elasticity and accuracy of the model, leading to stable and effective classifications.

Key concepts: timely stopping; early time series classification; real-time classification; machine learning

References:

[1] Yehuda Y, Freedman D, Radinsky K., “Self-supervised Classification of Clinical Multivariate Time Series using Time Series Dynamics,” KDD, 2023, pp. 5416-5427.

[2] Kim S H, Kim H, Yun E G, et al., “Probabilistic imputation for time-series classification with missing data,” ICML, 2023, pp. 16654-16667.

W424:

December 28, 2023. Venue: Room 101, Science and Engineering Annex

424-1:郝新丽

(cloud group)

Talk title: From Chaos to Clarity: Anomaly Detection in Astronomical Observations

Abstract: With the development of astronomical facilities, large-scale time series data observed by these facilities is being collected. Analyzing anomalies in these astronomical observations is crucial for uncovering potential celestial events and physical phenomena, thus advancing the scientific research process. However, existing time series anomaly detection methods fall short in tackling the unique characteristics of astronomical observations, where each celestial object is independent but interfered with by correlated yet random concurrent noise, resulting in a high rate of false alarms. To overcome these challenges, we propose AERO, a novel two-stage framework tailored for unsupervised anomaly detection in astronomical observations. In the first stage, we employ a Transformer-based encoder-decoder architecture on each channel, in alignment with the characteristic of object independence. In the second stage, we enhance the graph neural network with window-wise graph structure learning to tackle concurrent noise characterized by spatial and temporal randomness. In this way, AERO is not only capable of distinguishing normal temporal patterns from potential anomalies but also effectively differentiates concurrent noise, thus decreasing the number of false alarms. We conducted extensive experiments on three synthetic datasets and three real-world datasets. The results demonstrate that AERO outperforms the compared baselines.

Key concepts: concurrent noise; false positives; graph structure learning

References:

[1] S. Tuli, G. Casale, and N. R. Jennings, “TranAD: Deep transformer networks for anomaly detection in multivariate time series data,” Proc. VLDB Endow., vol. 15, no. 6, pp. 1201–1214, Feb. 2022.

[2] S. Han and S. S. Woo, “Learning sparse latent graph representations for anomaly detection in multivariate time series,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD ’22. New.

424-2:吴弘博

(cloud group)

Talk title: Urban Region Embedding via Multi-View Contrastive Prediction

Abstract: Understanding the spatial distribution of various socioeconomic factors in a city, such as land use and population distribution, is of great significance for urban planning and management. In recent years, an increasingly popular trend in the urban computing community is to divide a city into regions and learn latent representations of these regions from various urban sensing data; the learned representations can then be used in different urban sensing tasks, such as land-use clustering, house price prediction, and population density inference.

Key concepts: urban representation learning; contrastive learning; mutual information; conditional entropy

References:

[1] Li, Z., Huang, W., Zhao, K., Yang, M., Gong, Y., & Chen, M. (2023). Urban Region Embedding via Multi-View Contrastive Prediction. arXiv preprint arXiv:2312.09681.

[2] Zhang, M.; Li, T.; Li, Y.; and Hui, P. 2021. Multi-view joint graph representation learning for urban region embedding. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 4431–4437.

W423:

December 21, 2023. Venue: Room 101, Science and Engineering Annex

423-1:王文礼

(web group)

Talk title: To be forgotten or to be fair

Abstract: The right to be forgotten has spurred the development of machine unlearning, and related studies have emerged rapidly. However, fairness objectives have received almost no attention in unlearning research; the relevant studies are few, and some researchers argue that forgetting and fairness are inherently in conflict. This talk covers why fairness and unlearning can be incompatible and presents solutions for fair unlearning.

Key concepts: fair machine unlearning; right to be forgotten

References:

[1]Oesterling A, Ma J, Calmon F P, et al. Fair machine unlearning: Data removal while mitigating disparities[J]. arXiv preprint arXiv:2307.14754, 2023.

[2]Zhang D, Pan S, Hoang T, et al. To be forgotten or to be fair: Unveiling fairness implications of machine unlearning methods[J]. arXiv preprint arXiv:2302.03350, 2023.

[3]Kadhe S R, Halimi A, Rawat A, et al. FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs[J]. arXiv preprint arXiv:2312.07420, 2023.

423-2:但唐朋

(cloud group)

Talk title: Can Large Language Models Be Good Path Planners?

Abstract: Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, this work proposes a new benchmark, termed Path Planning from Natural Language (PPNL). The benchmark evaluates LLMs' spatial-temporal reasoning by formulating path planning tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, the authors systematically investigate LLMs, including GPT-4 via different few-shot prompting methodologies, and BART and T5 of various sizes via fine-tuning.

Key concepts: path planning; LLM; prompting

References:

[1] M. Aghzal, E. Plaku, and Z. Yao. Can Large Language Models Be Good Path Planners? A Benchmark and Investigation on Spatial-Temporal Reasoning. arXiv:2310.03249.

[2] H. Xiao and P. Wang. LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics. arXiv:2312.01797.

W422:

December 14, 2023. Venue: Room 101, Science and Engineering Annex

422-1:许婧楠

(privacy group)

Talk title: DPMLBench: Holistic Evaluation of Differentially Private Machine Learning

Abstract: DP-SGD is a machine learning algorithm that applies differential privacy to protect privacy and can be used in deep learning tasks. Many improved variants of DP-SGD exist, e.g., changes to the clipping threshold or the amount of added noise, but whether these improvements are genuinely effective, and how much they affect utility and privacy leakage, remains unknown. This talk introduces a benchmark that evaluates 12 DP-SGD-related improved algorithms on 4 network architectures, compares them, and draws the corresponding conclusions.

Key concepts: DP-SGD; DP synthetic data; model aggregation, etc.

References:

[1] DPMLBench: Holistic Evaluation of Differentially Private Machine Learning. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). Association for Computing Machinery, New York, NY, USA, 2621–2635.

[2] Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 308–318.
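
The step that these DP-SGD variants modify, per-sample gradient clipping followed by calibrated Gaussian noise, can be sketched in plain numpy (a minimal illustration under our own naming and hyperparameters, not the benchmark's implementation):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD update: clip each per-sample gradient to clip_norm,
    average, add Gaussian noise scaled by noise_mult, then descend."""
    rng = np.random.default_rng(rng)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_sample_grads]
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_sample_grads),
                       size=avg.shape)
    return params - lr * (avg + noise)
```

Variants in the benchmark typically change the clipping rule or the noise term; privacy accounting is a separate component on top of this step.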

422-2:李梓童

(privacy group)

Talk title: Unlearning in LLMs

Abstract: With the widespread use of large language models, a number of unlearning methods targeting LLMs have appeared. This talk introduces three prior works on LLM unlearning, which respectively use gradient ascent, trained adapters, and fine-tuning on a sanitized dataset. We discuss how unlearning in LLMs differs from traditional machine unlearning, and how growing model size affects both the definition and the difficulty of forgetting.

Key concepts: unlearning; gradient ascent; adapters; fine-tuning

References:

[1]Yao Y, Xu X, Liu Y. Large Language Model Unlearning[J]. arXiv preprint arXiv:2310.10683, 2023.

[2] Chen, J., & Yang, D. (2023). Unlearn What You Want to Forget: Efficient Unlearning for LLMs. arXiv preprint arXiv:2310.20150.

[3] Eldan, R., & Russinovich, M. (2023). Who's Harry Potter? Approximate Unlearning in LLMs. arXiv preprint arXiv:2310.02238.
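
The gradient-ascent idea in the first line of work can be illustrated on a toy logistic model: the update climbs the loss on the forget example instead of descending it (a hedged sketch with our own function names, not the papers' code):

```python
import numpy as np

def logistic_loss(w, x, y):
    """Cross-entropy loss of a logistic model on one (x, y) example, y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_grad(w, x, y):
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def unlearn_step(w, x_forget, y_forget, lr=0.5):
    """Gradient *ascent* on the forget example: move w to increase its loss."""
    return w + lr * logistic_grad(w, x_forget, y_forget)
```

A few such steps raise the model's loss on the forget example, which is the operational notion of forgetting these works start from; the papers add safeguards so utility on retained data is preserved.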

W421:

November 30, 2023. Venue: Room 101, Science and Engineering Annex

W421-1:刘立新

(privacy group)

Talk title: Blockchainε: blockchain-assisted differentially private aggregation

Abstract: Differential privacy is a promising privacy protection method. We implement differential privacy based on blockchain and cryptographic primitives. It does not need any trusted third party and ensures the same utility as centralized differential privacy. Although promising, blockchain also brings new challenges. Firstly, due to the transparency of blockchain, it is hard to maintain the confidentiality of both data and results simultaneously; simple encryption does not protect against collusion attacks. Secondly, we should reduce economic costs while preserving the integrity of results. Because blockchain nodes come from an open environment and may misbehave, it helps to employ smart contracts for aggregation to preserve result integrity, but this may incur significant economic costs. Thirdly, how to generate noise for differential privacy and keep it confidential on the blockchain is also a challenge. In this paper, we propose a framework for blockchain-assisted differentially private aggregation, where blockchain nodes can rent out their computing resources to serve aggregation, with effective mechanisms to tackle the above challenges. Security analysis and extensive experiments demonstrate the practicality of our design.

Key concepts: VRF; ZKP; smart contracts

References:

[1] C. Cai, Y. Zheng, Y. Du, Z. Qin, and C. Wang, “Towards private, robust, and verifiable crowdsensing systems via public blockchains,” IEEE Trans. Dependable Secur. Comput., vol. 18, no. 4, pp. 1893–1907, 2021.

[2] W. Dai, C. Dai, K. R. Choo, C. Cui, D. Zou, and H. Jin, “SDTE: A secure blockchain-based data trading ecosystem,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 725–737, 2020.

[3] A. R. Chowdhury, C. Wang, X. He, A. Machanavajjhala, and S. Jha, “Cryptε: Crypto-assisted differential privacy on untrusted servers,” in Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo, Eds. ACM, 2020, pp. 603–619.

[4] D. Froelicher, J. R. Troncoso-Pastoriza, J. S. Sousa, and J. Hubaux, “Drynx: Decentralized, secure, verifiable system for statistical queries and machine learning on distributed datasets,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 3035–3050, 2020.

[5] E. Roth, D. Noble, B. H. Falk, and A. Haeberlen, “Honeycrisp: large-scale differentially private aggregation without a trusted core,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, T. Brecht and C. Williamson, Eds. ACM, 2019, pp. 196–210

[6] H. Duan, Y. Zheng, Y. Du, A. Zhou, C. Wang, and M. H. Au, “Aggregating crowd wisdom via blockchain: A private, correct, and robust realization,” in 2019 IEEE International Conference on Pervasive Computing and Communications, PerCom, Kyoto, Japan, March 11-15, 2019. IEEE, 2019, pp. 1–10.

W421-2:李晨阳

(cloud group)

Talk title: A Deep Reinforcement Learning-Based Approach to Discovering Editing Rules

Abstract: Editing rules specify the conditions for applying high-quality master data to repair low-quality input data. However, discovering editing rules is challenging because it is extremely difficult to consider not only carefully curated master data but also large-scale input data. Unlike traditional enumeration methods, researchers have recently proposed a reinforcement-learning-based approach to editing rule discovery, which avoids mining all rules with possible conditions from both master data and input data at high cost.

Key concepts: editing rules; reinforcement learning

References:

[1] Mei Y, Song S, Fang C, et al. Discovering editing rules by deep reinforcement learning[C]//2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2023: 355-367.

[2] Fan W, Li J, Ma S, et al. Towards certain fixes with editing rules and master data[J]. The VLDB journal, 2012, 21: 213-238.

W420:

November 23, 2023. Venue: Room 101, Science and Engineering Annex

W420-1:张旭康

(cloud group)

Talk title: Ditto: An Elastic and Adaptive Memory-Disaggregated Caching System

Abstract: In-memory caching systems are a fundamental building block of cloud services. However, because CPU and memory are coupled on monolithic servers, existing caching systems cannot elastically adjust resources in a resource-efficient and agile way. To achieve better elasticity, the authors propose porting in-memory caching systems to the disaggregated memory (DM) architecture, where compute and memory resources are decoupled and can be allocated flexibly. Building an elastic caching system on DM is challenging, because accessing cached objects via CPU-bypassing remote memory accesses hinders the execution of caching algorithms. Moreover, elastic changes of compute and memory resources on DM alter the access patterns of cached data, which in turn affects the hit rates of caching algorithms. This talk introduces the latest caching system design based on disaggregated memory, and discusses how problem solving and model design under new hardware and architectures differ from traditional settings.

Key concepts: cloud services; caching systems; memory disaggregation; elasticity

References:

[1] Jiacheng Shen et al. Ditto: An Elastic and Adaptive Memory-Disaggregated Caching System. SOSP 2023: 675-691

[2] Hasan Al Maruf, Mosharaf Chowdhury: Memory Disaggregation: Advances and Open Challenges. ACM SIGOPS Oper. Syst. Rev. 57(1): 29-37 (2023)

[3] Jiacheng Shen et al. FUSEE: A Fully Memory-Disaggregated Key-Value Store. FAST 2023: 81-98

W420-2:徐冰冰

(privacy group)

Talk title: SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions

Abstract: The growing intertwining of big models with everyday human life poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans so that they better serve humans and satisfy human preferences. Large "instruction-tuned" language models depend heavily on human-written instruction data, which is often limited in quantity, diversity, and creativity, hindering the generality of the model. Today we discuss the relevant concepts of AI alignment and a method that uses LM-generated instructions to align and optimize language models. Finally, I summarize and reflect on how to study alignment from a data perspective.

Key concepts: AI alignment; instruction generation

References:

[1] Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., ACL 2023).

[2] Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks (Wang et al., EMNLP 2022).

[3] From Instructions to Intrinsic Human Values--A Survey of Alignment Goals for Big Models (Yao et al., arXiv 2023).

[4] Unpacking the Ethical Value Alignment in Big Models[J] (Yi et al., Journal of Computer Research and Development 2023).

W419:

November 16, 2023. Venue: Room 101, Science and Engineering Annex

W419-1:彭迎涛

(web group)

Talk title: DCKR: A Diffusion Contrastive Model for Knowledge-aware Recommendation

Abstract: The mainstream approach employed in knowledge graph (KG) based recommendation systems (RS) aggregates information from higher-order nodes within the graphs. However, this aggregation paradigm has limitations in capturing uncertain user preferences. Recently, diffusion models have excelled in computer vision (CV) because they can handle uncertainty and noise through representation generation. Inspired by this, we propose a Diffusion Contrastive model for Knowledge-aware Recommendation (DCKR), which enhances the system's performance and alleviates the impact of noise by injecting uncertainty signals and fusing multi-preference information. Specifically, we embed user representations into Gaussian distributions by adding noise in the diffusion module, thereby achieving preference distribution generation and uncertainty injection. Subsequently, we inject the generated user distribution through the reverse process into a multiple preference awareness module. DCKR effectively models complex interactions and mitigates noise-related issues through denoising training and iterative feedback. We also design a diffusion contrastive learning component to refine preference representations, eliminate noise, and improve system performance.

Key concepts: diffusion model; contrastive learning; multi-preference information

References:

[1] Wang, W., Xu, Y., et al.: Diffusion recommender model. SIGIR pp. 832–841 (2023).

[2] Li, Z., Sun, A., Li, C.: Diffurec: A diffusion model for sequential recommendation. arXiv (2023).

[3] Wang, C., Ma, W., Chen, C., et al.: Sequential recommendation with multiple contrast signals. TOIS 41(1), 1–27 (2023).
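
The "uncertainty injection" step, embedding a representation into a Gaussian via forward diffusion, follows the standard closed form q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I). A generic numpy sketch (not DCKR's code; the schedule and names are ours):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0), where alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    rng = np.random.default_rng(rng)
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))[t]  # cumulative signal retention
    eps = rng.standard_normal(np.shape(x0))             # injected Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```

As t grows, alpha_bar shrinks and the representation drifts toward pure noise; the reverse process then denoises it back into a preference distribution.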

W419-2:王雷霞

(privacy group)

Talk title: Privacy Preserving from a Perspective of Mutual Information

Abstract: Besides differential privacy, mutual information (mutual-information privacy) is another way to measure privacy leakage. Under data perturbation, mutual information measures the correlation between the true data and the distorted data, thereby bounding the extent of privacy leakage. Taking the privacy problem of time-series data sharing in IoT scenarios as an example, this talk introduces mutual-information privacy and the privacy-utility trade-off in this setting. Finally, we explore the connection between mutual-information privacy and (ε, δ)-differential privacy, and discuss definitions of privacy from different data perspectives.

Key concepts: mutual-information privacy; Markov chains; reinforcement learning

References:

[1] Erdemir E, Dragotti P L, Gündüz D. Privacy-aware time-series data sharing with deep reinforcement learning[J]. IEEE Transactions on Information Forensics and Security, 2020, 16: 389-401.

[2] Cuff P, Yu L. Differential privacy as a mutual information constraint[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016: 43-54.

[3] Near J P, Abuah C. Programming Differential Privacy[J]. URL: https://uvm, 2021.
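
For discrete data, the mutual information that bounds leakage between the true value X and its distorted release Y can be computed directly from the joint distribution (a small self-contained sketch, not taken from the cited papers):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats from a joint pmf matrix with joint[i, j] = p(x_i, y_j)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0                        # skip zero-probability cells
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())
```

An independent release gives I = 0 (no leakage), while a faithful copy of a uniform bit gives I = ln 2; mutual-information privacy constrains the mechanism to keep this value below a budget.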

W418:

November 9, 2023. Venue: Room 101, Science and Engineering Annex

W418-1: 蒋希文

(cloud group)

Talk title: Pre-Training Across Different Cities for Next POI Recommendation

Abstract: Across different cities, point-of-interest (POI) transition behaviors can exhibit absolute and relative sparsity in very different ways. For the POI recommendation problem, it is therefore intuitive to transfer knowledge across cities to alleviate data sparsity and imbalance. Recently, pre-training on large-scale datasets has achieved great success in many related fields such as computer vision and natural language processing. By designing various self-supervised objectives, pre-trained models can produce more powerful representations for downstream tasks. However, because different cities lack common semantic objects (users or items), directly applying existing pre-training techniques to next-POI recommendation is non-trivial. This paper therefore proposes a new pre-training model that learns to transfer category-level universal transition knowledge across cities, addressing the new research problem of cross-city pre-training.

Key concepts: POI recommendation; pre-trained models; sparsity

References:

[1] Ke Sun, Tieyun Qian, Chenliang Li, et al. 2023. Pre-Training Across Different Cities for Next POI Recommendation. ACM Trans. Web 17, 4, Article 31.

[2] Yue Cui, Hao Sun, Yan Zhao, Hongzhi Yin, and Kai Zheng. 2021. Sequential-knowledge-aware next POI recommendation: A meta-learning approach. TOIS 40, 2 (2021), 1–22.

W418-2: 李维

(cloud group)

Talk title: WaterDrift: An Iterative Annotation Method in Time-Series Streams

Abstract: Most existing research focuses on prediction and anomaly detection in time-series data streams rather than real-time classification. In particular, it is difficult to obtain highly accurate classification results for concept-drifting streams with little ground truth. Inspired by "water drift", which is stabilized through multiple iterations, we propose an iterative annotation method for time series streams. By optimizing the time window, state management, and distribution strategy on a DSPS, we achieve improved classification accuracy with high efficiency and low cost, and essentially address the concept drift problem. To better exploit this approach, we also design a generalized intelligent framework that allows users to observe classification results in real time and intervene, effectively reducing the training cost and improving the recognition ability of the model.

Key concepts: stream classification; time series; active learning

References:

[1] C. Fahy, S. Yang, and M. Gongora, “Classification in Dynamic Data Streams With a Scarcity of Labels,” TKDE, vol. 35, no. 4, pp. 3512–3524, Apr. 2023.

[2] J. Karimov and H.-A. Jacobsen, “SASPAR: Shared Adaptive Stream Partitioning,” ICDE, pp. 922–935, Apr. 2023.

W417:

November 2, 2023. Venue: Room 101, Science and Engineering Annex

W417-1: 郝新丽

(cloud group)

Talk title: Pre-training Research for Time Series

Abstract: The current mainstream approach to time-series pre-training is a self-supervised paradigm: build a pre-trained model from scratch for the time-series data of one domain, then use it for multiple downstream tasks in that domain. Such methods require relatively little data and compute and yield small models, but they are hard to extend to time-series data from multiple domains. With the huge wave set off by ChatGPT, people have begun to ask: can large language models (LLMs) serve as foundation models for time-series data? How to leverage existing LLMs for time-series analysis has become a new research hotspot. This talk surveys pre-training methods for time series, in both the LLM and non-LLM categories, and discusses the solutions proposed in current research.

Key concepts: parameter-efficient fine-tuning (PEFT); model reprogramming

References:

[1] Chang C, Peng W C, Chen T F. LLM4TS: Two-stage fine-tuning for time-series forecasting with pre-trained llms[J]. arXiv preprint arXiv:2308.08469, 2023.

[2] Zhou T, Niu P, Wang X, et al. One Fits All: Power General Time Series Analysis by Pretrained LM[J]. arXiv preprint arXiv:2302.11939, 2023.

[3] Xue H, Salim F D. PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting[J]. 2022.

[4] Jin M, Wang S, Ma L, et al. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models[J]. arXiv preprint arXiv:2310.01728, 2023.

W417-2: 吴弘博

(cloud group)

Talk title: Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics

Abstract: With the rapid spread of GPS devices, massive trajectory data can be collected in cities. The analysis and management of trajectory data, such as trajectory-based prediction, traffic forecasting, urban hazardous materials management, and trajectory similarity computation, has become a hot topic in the data engineering community. Traditional trajectory analysis studies require manual feature engineering and unique task-specific models, making them hard to transfer across applications. To improve the generality of trajectory analysis tools, the trajectory representation learning (TRL) task has emerged in recent years. TRL aims to transform raw trajectories into general-purpose low-dimensional representation vectors that can be applied to various downstream tasks rather than being limited to one specific task.

Key concepts: trajectory representation learning (TRL); self-supervised learning

References:

[1] J. Jiang, D. Pan, H. Ren, X. Jiang, C. Li and J. Wang, "Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics," 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 2023, pp. 843-855, doi: 10.1109/ICDE55515.2023.00070.

[2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” CoRR, vol. abs/1710.10903, 2017.

W416:

October 26, 2023. Venue: Room 101, Science and Engineering Annex

W416-1:王文礼

(web group)

Talk title: Fairness Issues in Large Models

Abstract: As large language models (LLMs) develop rapidly, fairness issues are becoming prominent. Unlike traditional machine learning, LLMs rely on large corpora and have enormous parameter counts, so the fairness concerns involved also differ. Drawing on recent study, this talk shares the causes of unfairness in LLM settings, related solutions, and the similarities and differences with traditional fairness problems.

Key concepts: LLM fairness; social bias

References:

[1] Gallegos I O, Rossi R A, Barrow J, et al. Bias and fairness in large language models: A survey[J]. arXiv preprint arXiv:2309.00770, 2023.

[2] Li Y, Du M, Song R, et al. A Survey on Fairness in Large Language Models[J]. arXiv preprint arXiv:2308.10149, 2023.

W416-2: 但唐朋

(cloud group)

Talk title: Self-Supervised Spatial-Temporal Bottleneck Attentive Network for Efficient Long-term Traffic Forecasting

Abstract: Aiming at solving the long-term traffic forecasting problem and facilitating the deployment of traffic forecasting models in practice, this paper proposes an efficient and effective Self-supervised Spatial-Temporal Bottleneck Attentive Network (SSTBAN). Specifically, SSTBAN follows a multi-task framework by incorporating a self-supervised learner to produce robust latent representations of historical traffic data, so as to improve its generalization performance and robustness for forecasting. Besides, the authors design a spatial-temporal bottleneck attention mechanism that reduces computational complexity while encoding global spatial-temporal dynamics. Extensive experiments on real-world long-term traffic forecasting tasks, including traffic speed forecasting and traffic flow forecasting under nine scenarios, demonstrate that SSTBAN not only achieves the overall best performance but also has good computation efficiency and data utilization efficiency.

Key concepts: attention mechanism; self-supervised learning

References:

[1] Self-Supervised Spatial-Temporal Bottleneck Attentive Network for Efficient Long-term Traffic Forecasting. Shengnan Guo, Youfang Lin, Letian Gong, Chenyu Wang, Zeyu Zhou, Zekai Shen, Yiheng Huang, Huaiyu Wan, In: ICDE 2023.

[2] Set transformer: A framework for attention-based permutation-invariant neural networks. J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, In: ICML 2019.

W415:

October 19, 2023. Venue: Room 101, Science and Engineering Annex

W415-1:艾山

(web group)

Talk title: Active Few-Shot Prompting for Knowledge Graph Construction

Abstract: With the rapid development of large language models, prompt-based NLP methods have shown strong advantages on information extraction tasks. Traditional supervised information extraction requires large amounts of labeled data, is costly, and needs multiple models, whereas large language models, with their strong language understanding and generation abilities, can complete multiple tasks without much data. However, the performance of an LLM is closely tied to the prompt it uses, and designing the best prompt remains difficult. When an LLM performs information extraction, examples must be chosen for the few-shot prompt, and existing methods do not adequately consider this example selection problem. This paper therefore proposes an active-learning-based prompt example selection method that computes uncertainty to choose the most uncertain yet credible examples, improving the efficiency of LLMs on the task. The core idea is to learn simple, clear knowledge first and then apply it to complex tasks. Preliminary experimental results on one dataset show that the method outperforms traditional approaches.

Key concepts: zero-shot; few-shot; large language models (LLMs); prompting

References:

[1] S. Diao, P. Wang, Y. Lin, and T. Zhang, “Active Prompting with Chain-of-Thought for Large Language Models,” arXiv, 23-May-2023.

[2] C.-L. Liu, H. Lee, and W. Yih, “Structured Prompt Tuning,” arXiv, 24-May-2022.

[3] H. Luo, P. Liu, and S. Esping, “Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification,” arXiv, 26-Sep-2023.

[4] A. Köksal, T. Schick, and H. Schütze, “MEAL: Stable and Active Learning for Few-Shot Prompting,” arXiv, 22-May-2023.
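
The uncertainty-driven selection step can be sketched as scoring each candidate question by the entropy of the answers an LLM samples for it, then keeping the most uncertain ones for annotation (a generic sketch in the spirit of active prompting; the names and data layout are ours):

```python
from collections import Counter
import math

def answer_entropy(samples):
    """Entropy (nats) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def select_most_uncertain(candidates, k):
    """candidates: dict mapping a question to its list of sampled answers.
    Return the k questions whose sampled answers disagree the most."""
    ranked = sorted(candidates, key=lambda q: answer_entropy(candidates[q]),
                    reverse=True)
    return ranked[:k]
```

Questions with unanimous sampled answers score zero and are skipped; annotation effort goes to the questions the model is least sure about.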

W415-2: 李梓童

(privacy group)

Talk title: Efficient Graph Unsummarization for the Right to Be Forgotten

Abstract: Graph data has become increasingly important in the big data era. Since analyzing such data at large volume can be time-consuming, graph summaries have been proposed. However, graph summaries raise privacy concerns when users request to delete their information from the original graph, because the deletion must be synchronized to the summary. Since re-summarizing the graph from scratch can be costly, this paper presents a novel approach to graph summarization that takes potential deletion requests into account. Inspired by machine unlearning, the paper defines this problem as graph unsummarization and proposes SUGPT, a graph summarization and unsummarization method based on matrix partitioning and tries. The essence of SUGPT is to efficiently identify similarities between vertices by embedding matrix partitions into a trie structure, so as to accelerate summary updates upon deletion requests. Extensive experiments demonstrate that SUGPT offers significant speedups in summary updating while retaining high utility in graph analysis.

Key concepts: graph summary; supergraph; vertex embedding

References:

[1]Shabani, N., Wu, J., Beheshti, A., Foo, J., Hanif, A., & Shahabikargar, M. (2023). A Survey on Graph Neural Networks for Graph Summarization. arXiv preprint arXiv:2302.06114.

[2] Kifayat Ullah Khan, Waqas Nawaz, and Young-Koo Lee. 2015. Set-Based Approximate Approach for Lossless Graph Summarization. Computing 97, 12 (dec 2015), 1185–1207. https://doi.org/10.1007/s00607-015-0454-9.

[3] Kijung Shin, Amol Ghoting, Myunghwan Kim, and Hema Raghavan. 2019. SWeG: Lossless and Lossy Summarization of Web-Scale Graphs. In The World Wide Web Conference (San Francisco, CA, USA) (WWW ’19). Association for Computing Machinery, New York, NY, USA, 1679–1690. https://doi.org/10.1145/3308558.3313402.

[4] Jihoon Ko, Yunbum Kook, and Kijung Shin. 2020. Incremental Lossless Graph Summarization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Disco.

W414:

October 12, 2023. Venue: Room 101, Science and Engineering Annex

W414-1: 许婧楠

(privacy group)

Talk title: Measuring Forgetting of Memorized Training Examples

Abstract: Machine learning models often exhibit two seemingly contradictory phenomena: memorization of training data and various forms of forgetting. Under memorization, overfit training data is more vulnerable to privacy attacks; under forgetting, data seen early in training is forgotten. This talk introduces a method for measuring the extent of a model's forgetting, which uses an auditing strategy to quantify the forgetting phenomenon and identifies three causes of forgetting.

Key concepts: data forgetting

References:

[1] Measuring Forgetting of Memorized Training Examples. Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, Chiyuan Zhang. Proceedings of the 11th International Conference on Learning Representations.

W414-2: 李晨阳

(cloud group)

Talk title: An Influence Analysis Method for Repairing RNNs

Abstract: Deep neural networks are vulnerable to adversarial attacks. Due to their black-box nature, it is rather challenging to interpret and properly repair their incorrect behaviors. This paper focuses on interpreting and repairing the incorrect behaviors of recurrent neural networks (RNNs). Compared with existing techniques, the technique in this paper can effectively interpret and properly repair incorrect behaviors from not only an entire testing sequence but also a segment within that sequence.

Key concepts: state abstraction; influence analysis

References:

[1] Xie X, Guo W, Ma L, et al. RNNRepair: Automatic RNN Repair via Model-based Analysis[C]//International Conference on Machine Learning. PMLR, 2021.

[2] Satoshi Hara, Atsushi Nitanda, Takanori Maehara. Data Cleansing for Models Trained with SGD. NeurIPS 2019.

W413:

September 28, 2023. Venue: Room 101, Science and Engineering Annex

W413-1: 刘立新

(privacy group)

Talk title: Verifiable Ledger Database Systems

Abstract: A verifiable database protects the integrity of user data and of query execution on untrusted database providers. The blockchain is an example of a verifiable ledger database. However, the blockchain does not support database transactions and is less efficient. I will introduce LedgerDB and GlassDB, which try to address these shortcomings, analyze their principles in detail, and compare them.

Key concepts: blockchain; integrity

References:

[1] Xinying Yang, Yuan Zhang, Sheng Wang, Benquan Yu, Feifei Li, Yize Li, Wenyuan Yan: LedgerDB: A Centralized Ledger Database for Universal Audit and Verification. Proc. VLDB Endow. 13(12): 3138-3151 (2020).

[2] Cong Yue, Tien Tuan Anh Dinh, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, Xiaokui Xiao: GlassDB: An Efficient Verifiable Ledger Database System Through Transparency. Proc. VLDB Endow. 16(6): 1359-1371 (2023).

W413-2: 马超红

(cloud group)

Talk title: Learned Index Benefits: Machine Learning Based Index Performance Estimation

Abstract: Index selection remains one of the most challenging problems in relational database management systems. Creating an appropriate set of indexes can boost query performance by orders of magnitude. However, it remains challenging to find an optimal index configuration, since accurately and efficiently quantifying the benefit of each candidate index configuration is indispensable. Most index tuners rely on cost estimates from optimizers with a "what-if" API, to avoid the high cost of materializing each candidate index configuration and physically executing queries. Nonetheless, "what-if"-based index benefit estimation has many limitations. This paper proposes an effective end-to-end machine-learning-based index benefit estimator that does not rely on "what-if" calls. Extensive experimental results show that the proposed method outperforms "what-if"-based index benefit estimation in terms of accuracy and efficiency.

Key concepts: index configuration; index selection

References:

[1] Jiachen Shi, Gao Cong, and Xiao-Li Li. Learned Index Benefits: Machine Learning Based Index Performance Estimation. VLDB 2023, 15(13): 3950 - 3962.

[2] Ding B, Das S, Marcus R, et al. Ai meets ai: Leveraging query executions to improve index recommendations. SIGMOD 2019: 1241-1258.

W412:

September 21, 2023. Venue: Room 101, Science and Engineering Annex

W412-1: 张旭康

(cloud group)

Talk title: Intra-Query Runtime Scaling on MPP SQL Engine

Abstract: The MPP architecture is widely used in building cloud-native databases for its high elasticity, high parallelism, and support for separating storage and compute. However, when a single MPP OLAP query runs in the cloud, choosing its degree of parallelism is difficult: too high wastes resources, while too low prevents the query from sustaining its best execution rate. Moreover, in existing MPP frameworks the resources used are fixed once the query plan is determined, yet changes in the cloud's software and hardware environment and in the query workload can significantly affect the efficiency of a running query. With the development of cloud computing, auto-scaling has gradually been applied to building cloud applications. It offers a potential opportunity to add fresh nodes to a single MPP query at runtime so that its execution rate stays at the optimal level. To combine MPP queries with auto-scaling, we explore building an IQRS (Intra-Query Runtime Scaling) MPP SQL engine. The engine allows runtime adjustment of the parallelism of a single MPP query, efficiently adding or removing nodes during execution so that the query's execution rate can be tuned. This talk introduces the implementation of this engine.

Key concepts: cloud-native databases; OLAP; MPP; elasticity; auto-scaling

References:

[1] Yutian Sun et al. Presto: A Decade of SQL Analytics at Meta. Proc. ACM Manag. Data 1(2): 189:1-189:25 (2023).

[2] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, Thierry Cruanes: Building An Elastic Query Engine on Disaggregated Storage. NSDI 2020: 449-462.

[3] Gaurav Saxena et al. Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift. SIGMOD Conference Companion 2023: 225-237.

W412-2: 徐冰冰

(privacy group)

Talk title: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Abstract: Large language models (LLMs) are increasingly being integrated into various applications. The functionality of recent LLMs can be flexibly modulated via natural language prompts. This renders them susceptible to targeted adversarial prompting; e.g., prompt injection (PI) attacks enable attackers to override original instructions and employed controls. So far, it was assumed that the user directly prompts the LLM. But what if it is not the user prompting? Today we discuss a new attack vector, indirect prompt injection, that enables adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved, and we systematically analyze the security and privacy threats it poses.

Key concepts: indirect prompt injection; LLM-integrated applications

References:

[1] Greshake K, Abdelnabi S, Mishra S, et al. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models[J]. arXiv preprint arXiv:2302.12173, 2023.

[2] Pedro R, Castro D, Carreira P, et al. From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?[J]. arXiv preprint arXiv:2308.01990, 2023.

W410:

2023年6月21日 会议地点:理工配楼101会议室

W410-1许婧楠(privacy group)

报告题目:PATE & Scalable PATE

报告摘要:DP-SGD是深度学习与差分隐私相结合的经典算法,但DP-SGD的训练过程比较缓慢,且加入的噪声与训练周期数成正比,模型性能受限。因此Nicolas Papernot等人提出PATE训练框架,该框架利用在不相交的私有数据子集上训练的大型“教师”模型集合,将知识转移给“学生”模型,然后在保证隐私的情况下发布。本次汇报将介绍PATE框架及其改进框架。
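
PATE中"教师投票 + 噪声聚合"的思想可以用下面的极简Python草图说明(示意性实现,laplace、pate_aggregate等命名为本文假设,并非原论文代码):每个教师对公开样本独立预测,对各类别票数加入拉普拉斯噪声后取噪声最大的类别,作为传给"学生"模型的标签。

```python
import math
import random
from collections import Counter

def laplace(scale, rng):
    # 逆 CDF 法采样 Laplace(0, scale),仅用标准库
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def pate_aggregate(teacher_votes, num_classes, gamma, rng):
    """PATE 噪声聚合示意:统计各教师的投票,
    对每个类别的票数加 Laplace(1/gamma) 噪声后取最大者。"""
    counts = Counter(teacher_votes)
    noisy = [counts.get(c, 0) + laplace(1.0 / gamma, rng)
             for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])

rng = random.Random(0)
votes = [1] * 90 + [0] * 10   # 100 个教师对某个公开样本的预测
label = pate_aggregate(votes, num_classes=2, gamma=10.0, rng=rng)
```

教师间共识越强,噪声对结果的影响越小,消耗的隐私预算也越少,这正是PATE优于直接给梯度加噪的直觉来源。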

知识概念:semi-supervised learning;GAN;

参考文献:

[1] Papernot, Nicolas, et al. "Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data." (2016).

[2] Papernot N, Song S, Mironov I, et al. Scalable Private Learning with PATE[J]. 2018. DOI: 10.48550/arXiv.1802.08908.

W410-2王雷霞(privacy group)

报告题目:Robust Histogram Estimation Against Poisoning Attack to Local Randomizer

报告摘要:被广泛应用的本地化差分隐私(LDP)由于本地随机器(Local Randomizer,LR)的存在,易遭受投毒攻击,从而进一步影响了数据的可用性。我们从基础的直方图估计出发,对该问题进行研究。在许多现实场景下,直方图往往体现出一定程度的平滑特征,而当前的投毒攻击往往会打破这种平滑性,呈现出破坏结果可用性与维持结果平滑性之间的trade-off。由此,我们从该角度出发,提出投毒后直方图的矫正方法MDR及其优化方法MDR*,使得矫正后的直方图尽可能满足一定程度的平滑特征,以减轻攻击者对结果的影响。该方法作为一种后处理方法,可以适应不同的LR算法与场景。我们将其扩展至当前广泛应用的混洗模型(shuffle model),通过噪声添加的方式解决该模型下攻击者带来的隐私损失问题。最终,实验结果验证了该方法的有效性。
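
作为背景,下面用k元随机响应(k-RR)给出本地随机器做直方图估计的极简示意(假设性示例,非文中的MDR方法):每个用户在本地扰动自己的取值,收集端对扰动报告做无偏频率估计;投毒攻击正是通过伪造这些本地报告来扭曲估计结果。

```python
import math
import random

def krr_perturb(value, k, eps, rng):
    """k 元随机响应(LR 的一种典型实现):
    以概率 p 如实上报,否则在其余 k-1 个取值中均匀上报一个。"""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)          # 在除 value 外的取值中均匀选取
    return other if other < value else other + 1

def krr_histogram(reports, k, eps):
    """对扰动报告做无偏频率估计(负值截断为 0)。"""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    counts = [0] * k
    for r in reports:
        counts[r] += 1
    return [max(0.0, (c / n - q) / (p - q)) for c in counts]

rng = random.Random(42)
true_data = [0] * 5000 + [1] * 3000 + [2] * 2000   # 真实频率 [0.5, 0.3, 0.2]
reports = [krr_perturb(v, k=3, eps=2.0, rng=rng) for v in true_data]
est = krr_histogram(reports, k=3, eps=2.0)
```

攻击者若控制一部分用户并让其集中上报某个目标取值,估计出的直方图就会出现违背平滑性的尖峰,这正是文中矫正方法可以利用的信号。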

知识概念:Local Randomizer,Histogram Estimation,Poisoning Attack

参考文献:

[1] Albert Cheu, Adam Smith, and Jonathan Ullman. 2021. Manipulation attacks in local differential privacy. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 883–900.

[2] Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. 2021. Data poisoning attacks to local differential privacy protocols. In 30th USENIX Security Symposium (USENIX Security 21).

[3] Yongji Wu, Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. 2022. Poisoning Attacks to Local Differential Privacy Protocols for Key-Value Data. In 31st USENIX Security Symposium (USENIX Security 22). 519–536.

W409:

2023年6月13日 会议地点:理工配楼101会议室

W409-1:马超红(cloud group)

报告题目:Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems

报告摘要:数据驱动系统做出决策的可靠性,取决于数据是否与系统初始假设相一致(conformance)。当服务数据(serving data)偏离训练数据(training data)的概况(profile)时,决策的推理将变得不可靠。本次分享的论文提出一致性约束(conformance constraints),一种新的数据概况原语(data profiling primitive),用于量化serving data相对于training data的不一致性程度,进而描述系统对某个样本的推理是否可信。

知识概念:数据概况,一致性,可信机器学习,数据漂移

参考文献:

 [1] Anna Fariha,Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, and Alexandra Meliou. Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems. SIGMOD 2021.

 [2] Papenbrock, Thorsten, et al. "Functional dependency discovery: An experimental evaluation of seven algorithms." VLDB 2015.

W409-2:徐冰冰(privacy group)

报告题目:Attacks on Prompt Learning

报告摘要:目前大语言模型的安全隐私问题备受关注,我们主要探讨集中在prompt上的安全隐私问题。本次报告总结了三类针对prompt的攻击,以探索大语言模型中的安全隐私问题,并选取其中典型的一篇论文BadPrompt: Backdoor Attacks on Continuous Prompts作为本次汇报的主要内容。其核心是在连续提示上进行后门攻击,以探索连续提示的脆弱性,以期引起人们对prompt安全隐私问题的关注。

知识概念:  Backdoor Attack,Prompt Learning

参考文献:

 [1]Cai X, Xu H, Xu S, et al. Badprompt: Backdoor attacks on continuous prompts[J]. Advances in Neural Information Processing Systems, 2022, 35: 37068-37080.

 [2]Li Y, Jiang Y, Li Z, et al. Backdoor learning: A survey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022.

 [3]Liu X, Zheng Y, Du Z, et al. GPT understands, too[J]. arXiv preprint arXiv:2103.10385, 2021.

W408:

2023年6月6日 会议地点:理工配楼101会议室

W408-1:刘俊旭(privacy group)

报告题目:A Generalized Sampling Approach for Personalized Differential Privacy

报告摘要:实现个性化的隐私保护已日渐成为保障个人数据隐私安全的一项基本要求。本次报告介绍一种有效的个性化差分隐私解决方法,该方法能够灵活应用在三种具体且常见的隐私保护场景——隐私统计分析、隐私梯度下降和隐私联邦学习,从而实现通用、高效、精确的个性化的隐私分析。实验结果证明了方法的有效性。
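
报告涉及的基于采样的个性化差分隐私思想可概括为(参考文献[2]的思路,以下符号与函数名为示意性假设):以统一阈值 eps_t 运行一个 eps_t-DP 机制,个人预算较小的用户先按一定概率被采样,利用采样带来的隐私放大效应满足其个性化预算。

```python
import math

def inclusion_probability(eps_i, eps_t):
    """基于采样的个性化 DP 示意(函数名为本文假设):
    当用户个人预算 eps_i 低于机制阈值 eps_t 时,
    先以该概率对其采样,再对采样集合运行 eps_t-DP 机制;
    采样的隐私放大效应使该用户的实际隐私损失不超过 eps_i。"""
    if eps_i >= eps_t:
        return 1.0
    return (math.exp(eps_i) - 1.0) / (math.exp(eps_t) - 1.0)

# 个人预算越小,被采样的概率越低
probs = [inclusion_probability(e, eps_t=1.0) for e in (0.0, 0.3, 0.6, 1.0)]
```

这一采样视角也解释了该方法为何能统一套用到统计分析、梯度下降与联邦学习三种场景:只需把"对采样集合运行某个 eps_t-DP 机制"替换为相应场景的基础机制。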

知识概念:Personalized Differential Privacy; Renyi Differential Privacy; Federated Learning

参考文献:

 [1] Y. Zhu and Y.-X. Wang. Poisson subsampled Rényi differential privacy. In ICML, 2019

 [2] Z. Jorgensen, T. Yu, and G. Cormode. Conservative or liberal? personalized differential privacy. In ICDE, 2015

 [3] K. Liu, S. Hu, S. Z. Wu, and V. Smith. On privacy and personalization in cross-silo federated learning. In NeurIPS, 2022

W408-2:刘立新(privacy group)

报告题目:联盟链身份隐私保护

报告摘要:为实现用户的准入控制并符合交易监管要求,联盟链关注强监管环境下的用户身份识别和认证。目前,联盟链通常通过部署PKI来管理身份。这类方案的实施依赖于大量的证书管理,且不能支持面向组织(organization-friendly)的应用场景。本次汇报首先介绍联盟链身份隐私保护技术,并在此总结的基础上,提出一种新的隐私保护方法来解决上述问题。

知识概念: PKI, ID-based signature, Authentication

参考文献:

 [1] Su Q, Zhang R, Xue R, et al. An Efficient Traceable and Anonymous Authentication Scheme for Permissioned Blockchain. ICWS, 2019

 [2] Zheng H, Wu Q, Xie J, et al. An organization-friendly blockchain system. Computers & Security, 2020

 [3] Wan Z, Liu W, Cui H. HIBEChain: A Hierarchical Identity-based Blockchain System for Large-Scale IoT. TDSC, 2023.


W407:

2023年5月30日 会议地点:理工配楼101会议室

W407-1:但唐朋(cloud group)

报告题目:基于时空自监督学习的交通流量预测

报告摘要:对不同时间段全市交通流量的准确预测在智能交通系统中发挥着至关重要的作用。尽管先前的工作在建模时空相关性方面付出了巨大的努力,但现有的方法仍然存在两个关键的局限性:1)大多数模型在没有考虑空间异质性的情况下集体预测所有区域的流量,即不同区域的交通流量分布可能存在偏差。2)这些模型未能捕捉到由时变交通模式引起的时间异质性,因为它们通常对所有时间段的共享参数化空间的时间相关性进行建模。为了应对这些挑战,本文提出了一种新的时空自监督学习(ST-SSL)流量预测框架,该框架通过辅助自监督学习范式增强了流量嵌入表示,使其能够反映空间和时间的异质性。具体来说,ST-SSL是在一个集成模块上构建的,该模块具有时间和空间卷积,用于跨空间和时间对信息进行编码。由于时空异质性广泛存在于实际数据集中,因此所提出的框架也可以为其他时空流量预测任务提供借鉴。

知识概念: 图神经网络、自监督学习

参考文献:

   [1] Ji, J., Wang, J., et al. Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction.  AAAI 2023

   [2] Ji, J.; Wang, J.; Jiang, Z.; Jiang, J.; and Zhang, H. STDEN: Towards physics-guided neural networks for traffic flow prediction. In AAAI 2022, volume 36, 4048–4056.

W407-2:刘俊旭(privacy group)

报告题目:A Generalized Sampling Approach for Personalized Differential Privacy

报告摘要:实现个性化的隐私保护已日渐成为保障个人数据隐私安全的一项基本要求。本次报告介绍一种有效的个性化差分隐私解决方法,该方法能够灵活应用在三种具体且常见的隐私保护场景——隐私统计分析、隐私梯度下降和隐私联邦学习,从而实现通用、高效、精确的个性化的隐私分析。实验结果证明了方法的有效性。

知识概念:Personalized Differential Privacy; Renyi Differential Privacy; Federated Learning

参考文献:

 [1] Y. Zhu and Y.-X. Wang. Poisson subsampled Rényi differential privacy. In ICML, 2019

 [2] Z. Jorgensen, T. Yu, and G. Cormode. Conservative or liberal? personalized differential privacy. In ICDE, 2015

 [3] K. Liu, S. Hu, S. Z. Wu, and V. Smith. On privacy and personalization in cross-silo federated learning. In NeurIPS, 2022

W406:

2023年5月23日 会议地点:理工配楼101会议室

W406-1:李梓童(privacy group)

报告题目:Unlearning toxic information in language models (406-1)

报告摘要:语言模型受训练数据中有毒信息(如包含暴力、歧视等内容的信息)的影响,可能会产生不恰当的行为。本次报告介绍两种祛除语言模型中有毒信息的方法,一是reinforcement unlearning,在已有预训练模型的前提下,通过调整强化学习的框架,使模型减少不良输出;二是unlikeliness training,即在模型训练过程中,通过调整损失函数来惩罚模型的不良输出(如包含重复或不连贯的语言等)。

知识概念:Unlearning toxicity;Reinforcement unlearning;Unlikeliness training

参考文献:

[1] Lu, X., Welleck, S., Hessel, J., Jiang, L., Qin, L., West, P., ... & Choi, Y. (2022). Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35, 27591-27609.

[2] Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don’t Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4715–4728, Online. Association for Computational Linguistics.

W405:

2023年5月16日 会议地点:理工配楼101会议室

W405-1:王文礼(web group)

报告题目:What should we do if the sensitive attribute is unknown?

报告摘要:机器学习领域持续关注公平问题,涌现系列高水平研究。大部分相关研究关注“如何消除受保护 (敏感)属性可能造成的不公平”,希望通过降低敏感属性在任务中的作用占比,以达到增强公平性的目标。尽管此类研究具有相对成熟的研究机制,但普遍存在如下问题:现有研究往往假设敏感属性已知,且实际中常被处理为离散、二值的,敏感属性选择的“主观性”常被忽略。本次组会针对该问题,结合近期的学习和思考,探讨进行敏感属性发现研究的可能性。

知识概念:Sensitive attribute selection; Stable sensitive variables;

参考文献:

   [1] Le Quy T, Roy A, Iosifidis V, et al. A survey on datasets for fairness‐aware machine learning[J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022, 12(3): e1452.

   [2] Grari V, Lamprier S, Detyniecki M. Fairness without the sensitive attribute via causal variational autoencoder[J]. arXiv preprint arXiv:2109.04999, 2021.

   [3] Yu H, Cui P, He Y, et al. Stable Learning via Sparse Variable Independence[J]. arXiv preprint arXiv:2212.00992, 2022.

W405-2:郝新丽(cloud group)

报告题目:Graph-Guided Network for Irregularly Sampled Multivariate Time Series

报告摘要:在医疗保健、气候科学等领域,时间序列的采样是不规则的,具体包括两方面:同一传感器的采样间隔不规则、不同传感器的采样非同步。本次报告关注不规则时间序列的建模,介绍一种基于图神经网络的模型:RainDrop(雨滴)。其抛弃了常用的补全思路,类比雨滴在水面产生涟漪的过程,通过信息传播过程来建模多种传感器之间在非对齐时间上的影响,可以更好的建模不规则采样的多元时间序列。

知识概念: 不规则的时间序列;异步;信息传播

参考文献:

 [1]Zhang X, Zeman M, Tsiligkaridis T, et al. Graph-guided network for irregularly sampled multivariate time series[C]. ICLR 2022.

 [2]Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):1–12, 2018.

W404:

2023年5月9日 会议地点:理工配楼101会议室

W404-1:彭迎涛(web group)

报告题目:KFMC: A Knowledge-aware Recommendation with Fine-grained Multi-intents Contrastive Learning

报告摘要:知识图谱(KG)以其丰富的语义信息在推荐系统中得到广泛应用。最近的技术趋势是开发基于自监督学习(SSL)的图神经网络来提升推荐系统的性能。然而,现有的基于知识图谱的模型在关系建模中是粗粒度的,无法捕获细粒度级别的意图。由此,我们提出了一种具有细粒度多意图对比学习(KFMC)的知识图谱推荐方法来缓解上述问题。我们方法的主要思想是探索交互级别的意图和序列级别的意图,并通过多意图对比学习增强数据表征,以改进知识图谱推荐。具体来说,我们设计了一个细粒度多意图的信息感知模块,以准确地从KG中学习交互意图和序列意图特征。此外,我们设计了多意图对比学习模块,利用自监督学习来学习用户交互、项目交互和序列内的相关性特征。

知识概念:Fine-grained intent、Multi-intention Contrastive Learning、KG-based RS

参考文献:

   [1] Yang Y, Huang C, Xia L, et al. Knowledge graph contrastive learning for recommendation[C]. SIGIR 2022: 1434-1443.

   [2] Wang X, Huang T, Wang D, et al. Learning intents behind interactions with knowledge graph for recommendation[C]. WWW 2021: 878-887.

W404-2:艾山(web group)

报告题目:Generative Knowledge Graph Construction.

报告摘要:随着大型语言模型的迅速发展,基于prompt的自然语言处理方法在信息抽取任务方面表现出极强的优势。传统的基于监督学习的信息抽取方法需要大量标注数据,成本较高,并且需要多个模型才能完成;而大型语言模型具有强大的语言理解和生成能力,可以在不需要过多数据的情况下完成多个任务。本报告介绍一种基于大型语言模型的知识图谱构建方法:生成式知识图谱构建(KGC),即利用seq2seq架构构建知识图谱的方法。这种方法灵活且可适应广泛的任务(实体识别、关系抽取、事件抽取等),其核心思想是通过prompt-tuning(instruction)技术使语言模型同时完成多种类型的任务。虽然该方法表现出色,但仍存在一些不足。最后我们探讨该方法存在的问题,并展望未来可能的解决方向。


知识概念: Large Language Model(LLM)、Prompt(提示)、Prompt-tuning(提示微调)、Seq2seq(序列到序列,AIGC)

参考文献:

 [1]Z. Kan, L. Feng, Z. Yin, L. Qiao, X. Qiu, and D. Li, “A Unified Generative Framework based on Prompt Learning for Various Information Extraction Tasks,” arXiv, 23-Sep-2022.

 [2]Zhen Bi, Jing Chen, Yinuo Jiang, Feiyu Xiong, Wei Guo, Huajun Chen, and Ningyu Zhang. 2023. CodeKGC: Code Language Model for Generative Knowledge Graph Construction. Retrieved May 3, 2023.

 [3]H. Ye, N. Zhang, H. Chen, and H. Chen, “Generative Knowledge Graph Construction: A Review”, arXiv, 01-Dec-2022.

W403

2023年4月25日 会议地点:理工配楼101会议室

W403-1:张旭康(cloud group)

报告题目:Self-Tuning Query Scheduling for Analytical Workloads

报告摘要:大多数数据库系统将调度决策委托给操作系统。虽然这种方法简化了整个数据库的设计,但也带来了一些问题:面对并发查询,自适应资源分配变得困难;此外,结合领域知识来改进查询调度也很困难。为了缓解这些问题,许多现代系统采用了基于任务的并行形式:单个查询的执行被分解为独立的小块工作(任务),而基于这些任务的细粒度调度决策由数据库系统负责。本次组会介绍数据库系统中基于任务的调度如何为优化开辟新的领域。文章提出了一种新的无锁、自调优的步幅调度器,可以优化分析工作负载的查询延迟。通过自适应地管理查询优先级和任务粒度,该调度器提供了很高的调度弹性;通过将领域知识纳入调度决策,文章的系统能够应对其他系统难以应对的工作负载,即使在高负载下也可以为短时间运行的查询保持接近最佳的延迟。与传统的数据库系统相比,文章的设计通常能将尾部延迟改善10倍以上。
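
文中采用的步幅调度(stride scheduling)的基本思想可以用下面的极简Python草图说明(仅示意算法本身,非文中的无锁自调优实现;为简化,新任务的pass从自己的stride起步):每个查询按票数获得一个步幅,调度器每次选取pass值最小的任务执行,从而实现与票数成比例的资源分配。

```python
import heapq

BIG = 1 << 20  # 步幅常数:stride = BIG // tickets

class StrideScheduler:
    """步幅调度的极简示意:票数(tickets)越多的任务 stride 越小,
    被调度得越频繁;每次选取当前 pass 值最小的任务执行。"""

    def __init__(self):
        self._heap = []   # 堆元素:(pass 值, 序号, 任务名, stride)
        self._seq = 0     # 单调递增序号,用于稳定地打破平局

    def add(self, name, tickets):
        stride = BIG // tickets
        # 简化:新任务的 pass 从自己的 stride 开始
        heapq.heappush(self._heap, (stride, self._seq, name, stride))
        self._seq += 1

    def next_task(self):
        pass_val, _, name, stride = heapq.heappop(self._heap)
        self._seq += 1
        heapq.heappush(self._heap, (pass_val + stride, self._seq, name, stride))
        return name

sched = StrideScheduler()
sched.add("A", tickets=3)   # A 应获得约 3/4 的执行机会
sched.add("B", tickets=1)
order = [sched.next_task() for _ in range(40)]
counts = {name: order.count(name) for name in ("A", "B")}  # {'A': 30, 'B': 10}
```

文中的自调优体现在运行时动态调整票数与任务粒度,而上面的骨架只展示了"按pass取最小"的确定性比例分配。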

知识概念:Self-Tuning,Query Scheduling,OLAP

参考文献:

[1] Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann:Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age. SIGMOD Conference 2014: 743-754

[2] Benjamin Wagner, André Kohn, Thomas Neumann:Self-Tuning Query Scheduling for Analytical Workloads. SIGMOD Conference 2021: 1879-1891

W403-2:李晨阳(cloud group)

报告题目:Data Debugging of Machine Learning Models.

报告摘要:即便现阶段生产活动中的机器学习技术已经取得了前所未有的准确性,但对于提高机器学习模型性能的研究仍在持续。大量研究都致力于训练方法和推理方法的探究,但实际上监测输入机器学习模型的数据质量是同等重要的,这意味着以数据为中心的机器学习方法应运而生。本次报告将简单介绍一种特定类型模型和一种通用模型的调试框架。

知识概念:Ordinary Least Square regression、Schema validation、Training-Serving skew

参考文献:

[1]Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2019. Data validation for machine learning. In Conference on Systems and Machine Learning (SysML).

[2] Gabriel Cadamuro, Ran Gilad-Bachrach, and Xiaojin Zhu. 2016. Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild.

W402:

2023年4月18日 会议地点:理工配楼101会议室

W402-1:徐冰冰(privacy group)

报告题目:Ignore Previous Prompt: Attack Techniques For Language Models

报告摘要:基于Transformer的大型语言模型(LLM)为面向客户的大型应用程序中的自然语言任务提供了强大的基础。然而,这些大模型也存在偏见、隐私泄露等问题,而针对其在恶意用户交互下出现的漏洞的研究却很少。本次报告研究目前在生产中部署较为广泛的语言模型GPT-3,通过简单的手工输入提示使其错误对齐。我们从攻击的角度讨论如何通过让模型忽略原始设定的规则,使其输出攻击者想要的任意内容(包含恶意内容),并通过这些攻击成功的实验分析prompt潜在的数据安全问题。

知识概念:prompt injection(提示注入),prompt learning(提示学习)

参考文献:

[1]Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9, Article 195 (September 2023), 35 pages. https://doi.org/10.1145/3560815

[2]Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun. 2022. OpenPrompt: An Open-source Framework for Prompt-learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 105–113, Dublin, Ireland. Association for Computational Linguistics.

W402-2:范卓娅(privacy group)

报告题目:语言模型的公平性

报告摘要:大规模预训练语言模型在文本表示与文本生成中存在公平风险。针对该问题,已有众多技术从文本表示与生成的角度研究如何去除自然语言中存在的偏见。本次组会首先介绍文本表示中基于投影的2种去偏方法,然后介绍文本生成中的公平指标与1种去偏方法。
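
两种基于投影的去偏方法共享同一几何直觉:把词向量在"偏见方向"上的分量去掉,使其与该方向正交。下面给出这一操作的极简numpy草图(示意性实现;INLP实际会对一系列学到的方向迭代执行该投影,此处数据均为随机假设):

```python
import numpy as np

def project_out(vectors, bias_direction):
    """投影式去偏的几何核心:对每个向量 v,
    计算 v' = v - (v·g)g,其中 g 为单位化的偏见方向。"""
    g = bias_direction / np.linalg.norm(bias_direction)
    return vectors - np.outer(vectors @ g, g)

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))   # 5 个 8 维的(假设的)词向量
g = rng.normal(size=8)        # 假设已求得的偏见方向(如 he-she 差向量)
V_debiased = project_out(V, g)
```

投影后所有向量在偏见方向上的分量为零,线性分类器无法再从该方向恢复受保护属性,这正是零空间投影(null space projection)命名的由来。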

知识概念:局部偏见(Local bias)、全局偏见(Global bias)、词嵌入(word embedding)、零空间(null space)

参考文献:

[1]Bolukbasi T, Chang K W, Zou J Y, et al. Man is to computer programmer as woman is to homemaker? debiasing word embeddings[J]. Advances in neural information processing systems, 2016, 29.

[2]Ravfogel S, Elazar Y, Gonen H, et al. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7237-7256.

[3]Liang P P, Wu C, Morency L P, et al. Towards understanding and mitigating social biases in language models[C]//International Conference on Machine Learning. PMLR, 2021: 6565-6576.


W401

2023年4月11日 会议地点:理工配楼101会议室

W401-1许婧楠(privacy group)

报告题目:Proof of DP

报告摘要:DP算法的正确与否通常需要很复杂的数学证明,但算法的设计过程及其数学证明过程并不简单,并且容易出错。很多经典的DP算法虽然被多次改进,但并不是所有的改进算法都是正确的。如何让算法设计者在自己的算法设计过程中,验证自己的方法是否正确,是proof of DP这一部分工作所解决的问题。它与数学证明不同,通常是利用形式化语言结合已有的验证平台对DP算法进行验证。本次汇报将简单介绍几种proof of DP方法,以及它和隐私审计之间的区别与联系。

知识概念:proof of DP;Alignment randomness;Coupling/Lifting(probability);

参考文献:

[1]Gilles Barthe, Boris Köpf, Federico Olmedo, and Santiago Zanella-Béguelin. 2013. Probabilistic Relational Reasoning for Differential Privacy. ACM Trans. Program. Lang. Syst. 35, 3, Article 9 (November 2013), 49 pages.

[2]Danfeng Zhang and Daniel Kifer. 2017. LightDP: towards automating differential privacy proofs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL '17). Association for Computing Machinery, New York, NY, USA, 888–901.

[3]Yuxin Wang, Zeyu Ding, Guanhong Wang, Daniel Kifer, and Danfeng Zhang. 2019. Proving differential privacy with shadow execution. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 655–669.

W401-2王雷霞(privacy group)

报告题目:PriPL-tree: Private Piecewise Linear Tree for Range Query in Local Differential Privacy(401-2)

报告摘要:范围计数查询是数据库查询中的经典问题,在DP和LDP场景下都有着广泛的研究,涌现出了基于树结构、网格结构和机器学习的多种方法。但目前,所有的方法均假设数据是均匀分布的,存在着明显的非均匀误差;且大部分论文所提出的索引结构不考虑数据分布与查询工作负载(query workload),适用性有限。为解决上述问题,我们提出Data Distribution-aware Private Piecewise Linear Tree(PriPL-Tree),用分段线性函数拟合底层分布,并基于此构建查询树。有效的分段线性拟合可以抵御LDP的噪声,减少叶节点的数量,从而降低树的构建误差。此外,我们拟提出query workload-aware parameter optimization,优化树结构构建中用户的分配策略,以适应不同的查询工作负载。最终,我们基于该1-D PriPL-tree构建高维的网格索引,并通过互信息剪枝减少不必要的属性收集,提高可用性。本次报告,主要分享该论文的主要思想,以及目前LDP下分段线性拟合的理论与实验结果。

知识概念:range count query(范围计数查询), piecewise linear tree(分段线性树), mutual information (互信息)

参考文献:

[1] Cormode G, Kulkarni T, Srivastava D. Answering range queries under local differential privacy[J]. Proceedings of the VLDB Endowment, 2019, 12(10): 1126-1138.

[2] Yang J, Wang T, Li N, et al. Answering multi-dimensional range queries under local differential privacy[J]. Proceedings of the VLDB Endowment, 2020, 14(3): 378-390.

[3] Du L, Zhang Z, Bai S, et al. AHEAD: adaptive hierarchical decomposition for range query under local differential privacy[C]//Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021: 1266-1288.


W400:

2023年4月4日 会议地点:理工配楼101会议室

W400-1刘立新(privacy group)

报告题目:An efficient traceable scheme for privacy-preserving blockchain systems(400-1)

报告摘要:In blockchain systems, identity privacy protection is a desirable property. However, strong identity privacy protection facilitates illegal activities such as ransomware attacks. How to balance privacy protection and traceability has become an important issue for blockchain systems. In this paper, we propose TPE, which integrates privacy protection and traceability while remaining efficient. We first design a new identity-based signature that is used to prove the user's identity. Then, we bind the signature to the user's trading public key without any zero-knowledge proofs. Moreover, we design a binary tree-pruning method for enabling traceability. Finally, we conduct extensive analysis and experiments to verify the performance of TPE.

知识概念:identity privacy protection in blockchain systems ; identity-based signature;

参考文献:

[1] T. ElGamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” IEEE Transactions on Information Theory, vol. 31, no. 4, pp. 469–472, 1985.

[2] Niu C, Zheng Z, et al. Achieving Data Truthfulness and Privacy Preservation in Data Markets[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 31(1): 105-119.

[3] Shao W, Jia C, Xu Y, et al. AttriChain: Decentralized traceable anonymous identities in privacy-preserving permissioned blockchain. Computers & Security, 2020, 99: 1-17.

W400-2马超红(cloud group)

报告题目:FILM+: a Fully Updatable Learned Index for Larger-than-Memory and Concurrent Databases(400-2)

报告摘要:Learned indexes以其性能优势受到广泛关注。大部分的工作集中在单一存储场景下,即数据存放在内存或磁盘中。我们的上一篇工作探索了learned indexes在超内存(memory and disk)场景下的应用,但该工作主要关注append-only的场景,并假设数据以单线程的方式被访问。本次报告将介绍FILM+,解决HTAP场景中的问题,同时考虑并发操作。
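
作为背景,learned index的基本查找流程可以用一个极简草图说明(单段线性模型 + 误差窗口内二分,仅示意思想,并非FILM+或参考文献中任何系统的实现):

```python
import bisect

class LinearLearnedIndex:
    """Learned index 查找思想的极简草图:
    用一个线性模型把 key 映射到有序数组中的预测位置,
    记录训练时的最大预测误差,查找时只在误差窗口内做二分。"""

    def __init__(self, keys):
        self.keys = keys
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        # 最小二乘闭式解:position ≈ a * key + b
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys)
        self.a = cov / var
        self.b = mean_p - self.a * mean_k
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.a * key + self.b)

    def lookup(self, key):
        # 在 [预测位置 ± 最大误差] 的窗口内二分查找
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, max(lo, hi))
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return -1

keys = [3, 7, 12, 20, 31, 44, 58, 73, 89, 100]
idx = LinearLearnedIndex(keys)
```

FILM+要解决的难点正在此草图之外:数据更新会使模型误差漂移、数据可能溢出到磁盘,且并发读写下模型与数据的重训练和一致性都需要专门设计。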

知识概念:混合事务分析处理(HTAP),并发

参考文献:

[1]Galakatos, Alex, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. “FITing-Tree: A Data-Aware Index Structure.” SIGMOD2019.

[2]Ding, Jialin, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, et al. “ALEX: An Updatable Adaptive Learned Index.” SIGMOD2020.

[3]Wongkham, Chaichon, Baotong Lu, Chris Liu, Zhicong Zhong, Eric Lo, and Tianzheng Wang. “Are Updatable Learned Indexes Ready?,”  VLDB 2022.

[4]Li, Pengfei, Yu Hua, Jingnan Jia, and Pengfei Zuo. “FINEdex: A Fine-Grained Learned Index Scheme for Scalable and Concurrent Memory Systems.”  VLDB 2021.

W399

2023年3月28日 会议地点:理工配楼101会议室

W399-1:但唐朋(cloud group)

报告题目:用于交通预测的解耦时空动态图神经网络(399-1)

报告摘要:路网上的车辆运输影响着我们大多数人的日常生活。因此,预测道路网络中的交通流量,并依此进行交通疏导、路径规划,是近年来空间计算上的研究重点。交通数据通常从部署在道路网络中的传感器中获得。最近的工作将交通流量以节点扩散的形式建模,所提出的新型时空图神经网络在挖掘复杂路网环境下的时空相关性上取得了巨大的进展。然而,现实生活中的交通流量至少包含两类不同的隐藏时间序列信号,即扩散(外在)信号和内在信号,而现有工作通常忽视了路网中的内在信号。为了解决这个问题,本文提出了解耦时空动态图神经网络来进行交通流量预测。我这次报告将围绕文中所提的新网络,分析学习作者的设计思路。

知识概念:图神经网络;交通流量预测

参考文献:

[1] Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting. Zezhi Shao, Zhao Zhang, Wei Wei, Fei Wang, Yongjun Xu, Xin Cao, Christian S. Jensen, In: PVLDB 2022.

[2] DSTAGNN: Dynamic Spatial-Temporal Aware Graph Neural Network for Traffic Flow Forecasting. Shiyong Lan, Yitong Ma, Weikang Huang, Wenwu Wang, Hongyu Yang, Pyang Li, in: ICML 2022.

W399-2:刘俊旭(privacy group)

报告题目:Machine Unlearning for Adversarial Training Models(399-2)

报告摘要:近年来,对抗训练模型作为一种防御对抗样本攻击、提升模型鲁棒性的有效方法,已受到了极大的关注。然而,现有的模型遗忘方法只适用于传统训练模型的遗忘问题,却不适用于对抗训练模型。由于对抗模型训练中,训练数据既用于外层优化以最小化经验损失,还用于内层优化以产生特定的扰动。这种双层优化结构使得对于待删除数据点的影响的度量更为复杂,进而使得模型遗忘问题更具挑战性。本次报告将介绍一种解决对抗训练模型遗忘问题的新方法,该方法通过近似计算数据点的影响,结合一系列近似和转换方法,有效降低了对抗训练模型遗忘的计算代价。

知识概念:Machine Unlearning; Adversarial Training Models; Approximate unlearning method

参考文献:

[1] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In S&P, 2021: 141–159

[2] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In CVPR, 2020, pages 9304–9312.

[3] Ronak Mehta, Sourav Pal, Vikas Singh, and Sathya N. Ravi. Deep unlearning via randomized conditionally independent hessians. CVPR, 2022, pages 10422–10431

W398

2023年3月21日 会议地点:理工配楼101会议室

W398-1:郝新丽(cloud group)

报告题目:时间序列中的预训练研究(398-1)

报告摘要:预训练模型在自然语言处理与计算机视觉领域已经得到广泛研究,但在时间序列分析领域,相关问题尚未得到充分研究。当将上述两个领域的预训练方法迁移至时间序列任务时,存在缺乏有效的归纳偏置等独特挑战需要解决。本次报告将总结梳理时间序列中的预训练方法,将其归类为"完形填空"和对比学习两大类,并具体讨论当前研究的解决方案。

知识概念:cloze type masking;contrastive learning;pretext task

参考文献:

[1] Shao Z, Zhang Z, Wang F, et al. Pre-training Enhanced Spatial-temporal Graph Neural Network for Multivariate Time Series Forecasting[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022: 1567-1577.

[2] Zerveas G, Jayaraman S, Patel D, et al. A transformer-based framework for multivariate time series representation learning[C]//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021: 2114-2124.

[3] Ren H, Wang J, Zhao W X, et al. Rapt: Pre-training of time-aware transformer for learning robust healthcare representation[C]//Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021: 3503-3511.

[4] Yue Z, Wang Y, Duan J, et al. Ts2vec: Towards universal representation of time series[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(8): 8980-8987.

[5] Zhang X, Zhao Z, Tsiligkaridis T, et al. Self-supervised contrastive pre-training for time series via time-frequency consistency[J]. arXiv preprint arXiv:2206.08496, 2022.

W398-2:李梓童(privacy group)

报告题目:Two ways to make a chatbot(398-2)

报告摘要:近年来,聊天机器人领域发展迅速。本次组会将介绍两种训练聊天机器人的方法:一是使用seq2seq模型(可将一个数据序列映射到另一个数据序列),并说明如何将其应用于聊天机器人以生成类人响应;二是使用seq2seq与强化学习相结合的模型,通过设计奖励来改进聊天机器人的响应。

知识概念:Seq2Seq; Reinforcement learning; Dialog generation;

参考文献:

[1]Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision (pp. 4534-4542).

[2]Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., & Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

W397

2023年3月14日 会议地点:理工配楼101会议室

W397-1:艾山(web group)

报告题目:ChatGPT: Jack of all trades, master of none(397-1)

报告摘要:OpenAI的聊天机器人ChatGPT在人工智能与人类交互领域被誉为突破性进展。为了验证ChatGPT在专业领域的能力,研究人员评估了ChatGPT在情感分析、情绪识别、自然语言推理和问答等25个不同的自然语言处理任务中的表现。他们发现,在大多数任务中,ChatGPT的表现比最先进的解决方案低约25%,并且任务越难,ChatGPT的损失就越大。该研究还揭示了ChatGPT存在偏见,这可能是OpenAI标注人员所加的prompt造成的。本报告主要讨论ChatGPT对现有NLP任务的影响及其不足之处,最后讨论未来可做的方向。

知识概念:One-shot learning;Few-shot learning;

参考文献:

[1] J. Kocoń et al., “ChatGPT: Jack of all trades, master of none.” arXiv, Feb. 21, 2023. doi: 10.48550/arXiv.2302.10724.

[2] A. Borji, “A Categorical Archive of ChatGPT Failures.” arXiv, Mar. 06, 2023. doi: 10.48550/arXiv.2302.03494.

W397-2:王文礼(web group)

报告题目:因果公平:因果思想在公平性问题中的应用(397-2)

报告摘要:公平性问题愈发受到关注,研究人员围绕算法公平、数据去偏开展了广泛探索。受益于因果推断与公平问题的内生联系,因果思想开始被引入公平问题的研究,反事实公平与主体公平相继被提出。本次组会首先简要回顾因果与公平相关概念,然后阐释反事实公平与主体公平的含义,并结合具体案例说明。

知识概念:因果公平;反事实公平;主体公平;

参考文献:

[1] Mehrabi N, Morstatter F, Saxena N, et al. A survey on bias and fairness in machine learning[J]. ACM Computing Surveys (CSUR), 2021, 54(6): 1-35.

[2]Kusner M J, Loftus J, Russell C, et al. Counterfactual fairness[J]. Advances in neural information processing systems, 2017, 30.

[3]Imai K, Jiang Z. Principal fairness for human and algorithmic decision-making[J]. arXiv preprint arXiv:2005.10400, 2020.

W396:

2023年3月7日 会议地点:理工配楼101会议室

W396-1:李晨阳(cloud group)

报告题目:A Statistical Approach for Repairing GPS Data(396-1)

报告摘要:轨迹数据已被许多应用广泛使用,例如智能交通、城市规划以及自动驾驶。实现上述应用的一个基本任务是获得高质量的GPS位置。尽管在多数市区,其精度已相对较高,但在一些特殊地点,如隧道等,则仍会有产生误差的空间。为了解决这一问题,本次汇报介绍一种利用概率分布和最大似然值来考虑数据修复的统计方法及其变体。
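
该类统计修复方法中"最小改动 + 约束"的思想可用如下简化草图说明(一维示意,假设给定最大速度smax;仅保留核心思想,并非原文完整的似然模型):将违反速度约束的点移动到恰好满足约束的最近取值,使修复量最小。

```python
def repair_speed_violations(times, values, smax):
    """最小改动式修复的简化示意:逐点检查与前一修复点之间
    隐含的速度;若超过 smax,则把当前点截断到可行区间的边界,
    即在满足约束的前提下对原值做最小的改动。"""
    repaired = [values[0]]
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        lo = repaired[-1] - smax * dt   # 可行区间下界
        hi = repaired[-1] + smax * dt   # 可行区间上界
        repaired.append(min(max(values[i], lo), hi))
    return repaired

times = [0, 1, 2, 3, 4]
values = [0.0, 1.0, 10.0, 3.0, 4.0]   # 第三个点是明显的离群值
repaired = repair_speed_violations(times, values, smax=1.0)  # 离群点被拉回 2.0
```

原文的统计方法在此之上更进一步:不再用硬性的速度上限,而是用相邻变化量的概率分布与最大似然来决定修复后的取值。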

知识概念:Minimum change principle、Dynamic programming、0-1 Knapsack

参考文献:

[1] Aoqian Zhang, Shaoxu Song, Jianmin Wang:Sequential Data Cleaning: A Statistical Approach. SIGMOD Conference 2016: 909-924

[2] Peng Zhao, Aoqian Zhang, Chenxi Zhang, Jiangfeng Li, Qinpei Zhao, Weixiong Rao:ATR: Automatic Trajectory Repairing With Movement Tendencies. IEEE Access 8: 4122-4132 (2020)

W396-2:彭迎涛(web group)

报告题目:推荐系统上的先令攻击方法(396-2)

报告摘要:推荐系统在引导用户购买方面发挥着关键作用,这使其成为攻击者的攻击目标。先令攻击(又称数据投毒攻击)是一种典型攻击方式,其通过注入恶意的用户属性文件(user profile)来提升或抑制目标项目。然而,传统的先令攻击基于简单的启发式方法,很容易被检测,已不能满足攻击者的要求。随着深度学习技术的发展,更多实用的先令攻击方法被研究和提出。本次报告探究基于生成对抗网络(Generative Adversarial Network)和自监督学习(Self-Supervised Learning)的先令攻击技术在推荐系统上的应用。

知识概念:Shilling Attack, Generative Adversarial Network, Cross-system Attacks, Self-Supervised Learning

参考文献:

[1] Lin C, Chen S, Zeng M, et al. Shilling Black-Box Recommender Systems by Learning to Generate Fake User Profiles[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022.

[2] Zeng M, Li K, Jiang B, et al. Practical Cross-system Shilling Attacks with Limited Access to Data[C]. AAAI, 2023.

W395:

2023年2月28日 会议地点:理工配楼101会议室

W395-1范卓娅(privacy group)

报告题目:对比学习的公平性研究(395-1)

报告摘要:机器学习模型的表现依赖于训练数据。当数据分布不平衡时,数据中的性别、种族等受保护属性与目标属性之间存在伪相关性,这使得模型容易根据这种伪相关性进行捷径学习。针对该问题,已有众多去偏方法从采样与非采样两个角度进行解决。本次组会首先介绍非采样方法中的对比学习、对抗训练、域独立训练、组分布鲁棒性优化4种方法;然后将对比学习与过采样、欠采样、重赋权3种采样方法相结合;最后对比多种单一数据增强与组合数据增强方法对去偏对比学习方法效果的影响。

知识概念:对比学习(contrastive learning)、对抗训练(adversarial training)、域独立训练(domain independent training)、组分布鲁棒性优化(group DRO)

参考文献:

[1] Hong Y, Yang E. Unbiased classification through bias-contrastive and bias-balanced learning. NIPS 2021.

[2] Wang Z, et al. Towards fairness in visual recognition: Effective strategies for bias mitigation. CVPR 2020.

[3] Sagawa S, et al. Distributionally Robust Neural Networks. ICLR 2020.

[4] Qraitem M, et al. Bias Mimicking: A Simple Sampling Approach for Bias Mitigation. arXiv 2022.


W395-2张旭康(cloud group)

报告题目:Enable GPU Acceleration As a Service for Cloud OLAP Database(395-2)

报告摘要:在云上执行OLAP查询是当前云原生数据库建设的热点。如AWS提供Presto服务运行OLAP查询,提供云函数支持serverless功能等。这些工作在云上对运行OLAP查询的灵活性、弹性、实现成本效益最大化做了诸多努力。本次组会介绍一个动态查询干涉模型,该模型能够以查询为单位,在查询执行的过程中对查询进行干涉,以query pipeline为粒度迅速改变查询执行的并发度,并动态调整查询对GPU资源的使用来改善查询效率,允许查询执行过程中,根据需要开启或者关闭对GPU资源的使用。该模型的意义在于,可以进一步解放查询生成后对于资源使用的限制,方便云数据库以query为单位进行运行时的资源管理和干涉。该模型也提供了一种机会,使得用户可以在客户端对单个查询使用的资源进行动态干涉,让用户可以在查询执行过程中对查询资源有一定的选择和干涉能力。

知识概念:cloud-native,OLAP,dynamic query intervention

参考文献:

[1] Perron M, Fernandez R C, DeWitt D, et al. Starling: A Scalable Query Engine on Cloud Functions. SIGMOD 2020.

[2] Guoliang Li, Haowen Dong, Chao Zhang. Cloud Databases: New Techniques, Challenges, and Opportunities. VLDB 2022.

[3] Viktor Leis and Maximilian Kuschewski. Towards cost-optimal query processing in the cloud. VLDB 2021.

[4] Anil Shanbhag, Samuel Madden, and Xiangyao Yu. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. SIGMOD 2020.

Maintained by WAMDM Administrator. Copyright © 2007-2017 WAMDM, All rights reserved