Seminars, Database Research Group of Renmin University of China (WAMDM)

Lab Meetings

2022-12-01 DataPrism: Exposing Disconnect between Data and Systems by Chaohong Ma

Abstract: As data becomes a central component of many modern systems, the cause of a system malfunction may reside in the data. Like software debugging, which aims to find bugs in the source code or runtime conditions, data debugging aims to identify potential sources of disconnect between the assumptions about some data and the systems that operate on that data. In this presentation, we report on DataPrism, a framework that identifies the data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system.

2022-12-01 Debiased contrastive learning by Zhuoya Fan

Abstract: Contrastive learning is widely used in representation learning. Its core idea is to pull similar samples (positive examples) closer together and push dissimilar samples (negative examples) farther apart. In unsupervised settings, however, sample labels are unavailable, so positive examples are usually obtained by data augmentation and negative examples by random sampling. The randomly sampled negatives may therefore include samples that are actually similar to the anchor (sampling bias). This group meeting will introduce a NeurIPS 2020 paper that eliminates this sampling bias by modifying the loss function.

Past Meetings

2023/1/12	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Lab members	【Content】: (1) Summarize and report on this semester's study and research. (2) Based on that work, draft your own "Data Flavor of WAMDM" piece; those who have finished writing may fold it into their presentation at the Thursday group meeting. Follow the format of the previously published "Data Flavor of WAMDM" pieces: keep it short, simple, and clear, highlight one keyword, and include two or three pictures. Summarize the title as "[One Keyword]" (you can take the illustrations from your own group meeting slides of this semester and derive the title from the main concepts of the report).
2023/1/5	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Ai Shan	【Title】: Span-tagger: Nested and Fine-grained Named Entity Recognition 【Abstract】: Named Entity Recognition (NER) is often regarded as a sequence labeling task, in which nested entity recognition and fine-grained recognition present significant challenges. Existing methods have partially addressed these challenges, but issues remain. Span-based methods have largely solved the problem of nested recognition, but current methods treat spans independently of their context, which leads to incomplete semantics. This paper proposes a span-tagger model for span-based nested and fine-grained entity recognition. The method converts character-level sequence labeling into span-level sequence labeling, and the reduced number of categories makes fine-grained recognition more reliable. Span-tagger effectively addresses both nested and fine-grained entity recognition, and experiments show that it outperforms existing methods on four datasets. 【Concept】: Nested NER; Fine-grained NER 【References】: [1] Luan Y, He L, Ostendorf M, et al. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 3219-3232. [2] Jeong Y, Kim E. SciDeBERTa: Learning DeBERTa for Science and Technology Documents and Fine-tuning Information Extraction Tasks[J]. IEEE Access, 2022. [3] Trivedi I, Majhi S. Span level model for the construction of scientific knowledge graph[C]//2020 5th International Conference on Computing, Communication and Security (ICCCS). IEEE, 2020: 1-6.
Junxu Liu	【Title】: Example-level Privacy Analysis in Federated Learning 【Abstract】: The widespread application of federated learning in fields such as healthcare and finance has raised the demands on data privacy protection. Example-level privacy protection means that an independent privacy analysis can be conducted for any user participating in federated learning, so that different users can receive different levels of privacy protection. At the same time, to keep the final machine learning model usable, we should avoid the bias that non-uniform random noise introduces into the model, and thus balance the privacy and the utility of the algorithm. To achieve these goals, we propose a privacy-preserving federated learning framework based on non-uniform data sampling, which achieves personalized privacy protection under the guidance of privacy amplification theory. One of the main challenges in implementing this framework is how to set the sampling probability of each sample. To address this, we studied the approximate relation between the sampling probabilities and the privacy cost, and used fitting methods to construct a mathematical model of the two. The method does not depend on any specific federated learning method and can be applied to any SGD-based training framework. This report mainly shares the current research progress. 【Concept】: Personal Privacy; Uniform Privacy; Individual RDP 【References】: [1] Feldman V, Zrnic T. Individual privacy accounting via a renyi filter[J]. Advances in Neural Information Processing Systems, 2021, 34: 28080-28091. [2] Yu D, Kamath G, Kulkarni J, et al. Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent[J]. arXiv preprint arXiv:2206.02617, 2022. [3] Rogers R M, Roth A, Ullman J, et al. Privacy odometers and filters: Pay-as-you-go composition[J]. Advances in Neural Information Processing Systems, 2016, 29. [4] Zhu Y, Wang Y X. Poisson subsampled Rényi differential privacy[C]//International Conference on Machine Learning. PMLR, 2019: 7634-7642. [5] Girgis A, Data D, Diggavi S, et al. Shuffled model of differential privacy in federated learning[C]//International Conference on Artificial Intelligence and Statistics. PMLR, 2021: 2521-2529.
Jingnan Xu	【Title】: A General Framework for Auditing Differentially Private Machine Learning 【Abstract】: Existing privacy audit methods usually audit only the DPSGD method, which is a significant limitation. This presentation will introduce a NeurIPS 2022 paper that proposes an audit framework for differentially private algorithms across various types of machine learning. The framework can also be used to detect privacy leaks caused by errors in the implementation of an algorithm. 【Concept】: Logistic Regression; Naive Bayes; Random Forest 【References】: [1] Lu F, Munoz J, Fuchs M, et al. A General Framework for Auditing Differentially Private Machine Learning[J]. arXiv preprint arXiv:2210.08643, 2022.
2022/12/29	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Xukang Zhang	【Title】: A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics 【Abstract】: Many GPU databases now claim performance gains of dozens or even hundreds of times over CPU databases. Hardware experts are skeptical of such claims, because they believe that the maximum performance gain from GPUs should be bounded by the GPU-to-CPU memory bandwidth ratio. This group meeting will introduce a SIGMOD 2020 paper that investigates and analyzes the true effectiveness of current GPU database acceleration and identifies design flaws in GPU operators. The paper proposes a tile-based GPU operator design method that brings the acceleration gain of GPU operators as close as possible to the memory bandwidth ratio. 【Concept】: GPU DBMS; GPU-CPU heterogeneous analysis; High concurrency programming 【References】: [1] Shanbhag A, Madden S, Yu X. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics[C]//Proceedings of the 2020 ACM SIGMOD international conference on Management of data. 2020: 1617-1632.
Tangpeng Dan	【Title】: Shortest Path Query under Query Load Awareness 【Abstract】: Computing shortest-path distances in road networks is core functionality in a range of applications. To enable the efficient computation of such distance queries, existing proposals frequently apply 2-hop labeling that constructs a label for each vertex and enables the computation of a query by performing only a linear scan of labels. However, few proposals take into account the spatio-temporal characteristics of query workloads. We observe that real-world workloads exhibit (1) spatial skew, meaning that only a small subset of vertices are queried frequently, and (2) temporal locality, meaning that adjacent time intervals have similar query distributions. We propose a Workload-aware CoreForest label index (WCF) to exploit spatial skew in workloads. In addition, we develop a Reinforcement Learning based Time Interval Partitioning (RL-TIP) algorithm that exploits temporal locality to partition workloads to achieve further performance improvements. Extensive experiments with real-world data offer insights into the performance of the proposals, showing that they achieve 62% speedup on average for query processing with less preprocessing time and space overhead when compared with the state-of-the-art proposals. 【Concept】: 2-hop; Tree decomposition 【References】: [1] Zheng B, Wan J, Gao Y, et al. Workload-aware shortest path distance querying in road networks[C]//2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022: 2372-2384.
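As a rough illustration of the 2-hop labeling query mentioned in the abstract above, the following Python sketch answers a distance query with a single linear scan over two labels. The label layout (lists of (hub, distance) pairs sorted by hub id), the function name two_hop_distance, and the toy path graph are assumptions made for illustration; they are not taken from the cited paper, and the sketch covers neither the WCF index nor RL-TIP.

```python
from math import inf


def two_hop_distance(label_s, label_t):
    """Merge-scan two hub labels (sorted by hub id) and return the
    smallest dist(s, hub) + dist(hub, t) over their common hubs."""
    i = j = 0
    best = inf
    while i < len(label_s) and j < len(label_t):
        hub_s, dist_s = label_s[i]
        hub_t, dist_t = label_t[j]
        if hub_s == hub_t:
            # Common hub: candidate path s -> hub -> t.
            best = min(best, dist_s + dist_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best  # inf if the labels share no hub


# Hypothetical labels for the path graph a - b - c with unit edge weights.
# Hub ids: 0 = a, 1 = b, 2 = c. Every pair of vertices shares hub 1 (b).
labels = {
    "a": [(0, 0), (1, 1)],
    "b": [(1, 0)],
    "c": [(1, 1), (2, 0)],
}

dist_ac = two_hop_distance(labels["a"], labels["c"])
dist_ab = two_hop_distance(labels["a"], labels["b"])
```

The scan touches each label entry at most once, so the query cost is linear in the combined label size, which is why index designs of this kind focus on keeping the labels of frequently queried vertices small.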
2022/12/22	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Lixin Liu	【Title】: An Efficient Scheme for Traceability in Blockchain Systems 【Abstract】: Overemphasis on identity privacy protection makes blockchain systems difficult to regulate in practical applications, which enables malicious activity such as ransomware and money laundering. It is therefore worth studying how to protect user identity privacy while still being able to trace user behavior in blockchain systems. Existing methods mostly rely on zero-knowledge proofs, which have a high verification cost and do not support batch verification. This presentation introduces a tracing scheme that is efficient and supports batch verification. 【Concept】: Identity Privacy; Identity-based Signature 【References】: [1] Li Y, Yang G, Susilo W, et al. Traceable monero: Anonymous cryptocurrency with enhanced accountability[J]. IEEE Transactions on Dependable and Secure Computing, 2019, 18(2): 679-691. [2] Shao W, Jia C, Xu Y, et al. Attrichain: Decentralized traceable anonymous identities in privacy-preserving permissioned blockchain[J]. Computers & Security, 2020, 99: 102069. [3] Li P, Xu H, Ma T. An efficient identity tracing scheme for blockchain-based systems[J]. Information Sciences, 2021, 561: 130-140.
Zitong Li	【Title】: DeltaGrad: A Method for Accelerating Model Retraining 【Abstract】: When the dataset of a machine learning model changes slightly, the simplest way to obtain a model for the new dataset is to retrain the model on it from scratch. However, this often takes a long time. In this group meeting, we will introduce an ICML 2020 paper that addresses this issue. The paper proposes to save the gradients and parameters obtained at each training epoch on the old dataset and to reuse them to accelerate the computation of gradients and parameters on the new dataset, thereby reducing the time required for retraining. 【Concept】: general ML techniques; rapid retraining; exact unlearning 【References】: [1] Wu Y, Dobriban E, Davidson S. Deltagrad: Rapid retraining of machine learning models[C]//International Conference on Machine Learning. PMLR, 2020: 10355-10366.
2022/12/15	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Leixia Wang	【Title】: A Study of Range Queries under Differential Privacy 【Abstract】: Range queries are common in geographical location, product, and database searches, in 1-D, 2-D, and higher-dimensional settings. Under differential privacy, researchers usually answer 1-D range queries with hierarchical trees and 2-D range queries with grids, and combine the two approaches to answer multi-dimensional range queries. They optimize the accuracy of range queries by tuning parameters such as the ε allocation strategy, the fanout of the hierarchical tree, and the granularity of the grid partition. In recent work, Yufei Wang (ICDE 2022) used a Prefix-Sum Cube to answer multi-dimensional queries under local differential privacy (LDP), and Sepanta Zeighami (VLDB 2022) used machine learning to learn 2-D queries under centralized differential privacy (CDP). In this report, we summarize range query methods under current differential privacy settings and show that all current range query methods for LDP depend on a uniformity assumption about the data. This assumption causes additional utility loss on real-world data. We therefore aim to improve the utility of range queries under LDP by making the data within each leaf node of the hierarchy, or each cell of the grid, as uniform as possible, and to answer 1-D, 2-D, and multi-dimensional queries more accurately on that basis. 【Concept】: tree-based range query; grid-based range query; Prefix-Sum Cube; ML for range query 【References】: [1] Wang Y, Cheng X. PRISM: Prefix-Sum based Range Queries Processing Method under Local Differential Privacy[C]//2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022: 433-445. [2] Zeighami S, Ahuja R, Ghinita G, et al. A neural database for differentially private spatial range queries[J]. Proceedings of the VLDB Endowment, 2022, 15(5): 1066-1078.
Yingtao Peng	【Title】: Black-box Attack Techniques for Recommendation Systems 【Abstract】: Recommendation systems based on deep neural networks are vulnerable to adversarial attacks, in which attackers inject carefully crafted false information into the target system to achieve malicious goals such as promoting or demoting target items. Because of the security and privacy measures of the target system, attackers cannot easily obtain its structure, parameters, or training data, so black-box adversarial attacks are more practical in real-world scenarios. However, black-box attacks on recommendation systems, which are sparse-feature tasks, are also more challenging. This presentation introduces a knowledge graph-enhanced black-box attack framework (KGAttack) that uses deep reinforcement learning to learn attack policies and improve attack effectiveness. 【Concept】: Black-box attacks; Reinforcement learning; Knowledge graph 【References】: [1] Chen J, Fan W, Zhu G, et al. Knowledge-enhanced Black-box Attacks for Recommendations[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022: 108-117. [2] Fan W, Derr T, Zhao X, et al. Attacking black-box recommendations via copying cross-domain user profiles[C]//2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021: 1583-1594.
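The prefix-sum idea named in the range-query abstract of this session (Leixia Wang's talk) can be sketched in two dimensions as follows. This is a minimal, non-private illustration: it shows only how a prefix-sum table answers a rectangular count with four lookups, and it omits the LDP perturbation and any post-processing used in the cited work. The function names and the toy grid are assumptions made for illustration.

```python
def build_prefix_sums(grid):
    """Return a (rows+1) x (cols+1) table where prefix[r][c] is the sum of
    grid[0..r-1][0..c-1]; row 0 and column 0 are zero padding."""
    rows, cols = len(grid), len(grid[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c + 1] = (
                grid[r][c] + prefix[r][c + 1] + prefix[r + 1][c] - prefix[r][c]
            )
    return prefix


def range_count(prefix, r1, c1, r2, c2):
    """Sum of grid cells in rows r1..r2 and columns c1..c2 (inclusive),
    computed by inclusion-exclusion on four prefix-sum entries."""
    return (
        prefix[r2 + 1][c2 + 1]
        - prefix[r1][c2 + 1]
        - prefix[r2 + 1][c1]
        + prefix[r1][c1]
    )


# Hypothetical 3 x 3 grid of cell counts.
grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
prefix = build_prefix_sums(grid)
total = range_count(prefix, 0, 0, 2, 2)         # whole grid
lower_right = range_count(prefix, 1, 1, 2, 2)   # cells 5, 6, 8, 9
```

Every query reads a constant number of table entries regardless of the size of the queried rectangle, which is the property that makes the structure attractive when each stored value carries noise.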
2022/12/8	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Wenli Wang	【Title】: Big Data Causal Inference: Methods and Challenges 【Abstract】: The exploration of causality originated in philosophy and can be traced back to the time of Aristotle. In the 1990s, causal science received widespread attention and was successfully applied in fields such as medicine and statistics. In recent years, as the development of machine learning has reached a bottleneck, some scholars have begun to rethink the limitations of association analysis and to conduct research on causal relationship discovery and causal inference. Big data provides new research methods for causal inference, but it also poses new challenges. This seminar will report on causal inference in the context of big data from the perspectives of methods and challenges. 【Concept】: Data debugging; Data profile; Intervention. 【References】: [1] Yao D, Gong C, Zhang L, et al. CausalMTA: Eliminating the User Confounding Bias for Causal Multi-touch Attribution[J]. arXiv preprint arXiv:2201.00689, 2021. [2] Zhong K, Xiao F, Ren Y, et al. DESCN: Deep Entire Space Cross Networks for Individual Treatment Effect Estimation[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022: 4612-4620. [3] Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms[C]//International Conference on Machine Learning. PMLR, 2017: 3076-3085.
Xinli Hao	【Title】: Anomaly Detection Study of GWAC Light Curve Data 【Abstract】: Current research on anomaly detection in time series can be divided into two categories, univariate and multivariate, and the two can be transformed into each other. In the context of astronomical discovery, however, neither univariate nor multivariate analysis meets the requirements: the brightness of each celestial body is autonomous and the bodies do not influence one another, yet the observation process introduces noise that is correlated in time and space. It is therefore inappropriate to analyze each celestial body separately as a univariate time series, and the series cannot simply be spliced into one multivariate time series either. This report proposes a new type of network for time series anomaly detection in the context of astronomical discovery. By combining univariate and multivariate time series modeling methods, incorporating time and space constraints, and introducing a new anomaly score calculation method, it aims to improve the accuracy of scientific discovery while reducing the false positive rate. 【Concept】: Spectral Graph Convolutional Networks; Transformer; Chebyshev Polynomials 【References】: [1] Chuang C Y, Robinson J, Lin Y C, et al. Debiased contrastive learning[J]. Advances in neural information processing systems, 2020, 33: 8765-8775.
2022/12/1	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Chaohong Ma	【Title】: DataPrism: Exposing Disconnect between Data and Systems 【Abstract】: As data becomes a central component of many modern systems, the cause of a system malfunction may reside in the data. Like software debugging, which aims to find bugs in the source code or runtime conditions, data debugging aims to identify potential sources of disconnect between the assumptions about some data and the systems that operate on that data. In this presentation, we report on DataPrism, a framework that identifies the data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. 【Concept】: Data debugging; Data profile; Intervention. 【References】: [1] Galhotra S, Fariha A, Lourenço R, et al. DataPrism: Exposing Disconnect between Data and Systems[C]//Proceedings of the 2022 International Conference on Management of Data. 2022: 217-231. [2] Rezig E K, Cao L, Simonini G, et al. Dagger: a data (not code) debugger[C]//CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. 2020.
Zhuoya Fan	【Title】: Debiased Contrastive Learning 【Abstract】: Contrastive learning is widely used in representation learning. Its core idea is to pull similar samples (positive examples) closer together and push dissimilar samples (negative examples) farther apart. In unsupervised settings, however, sample labels are unavailable, so positive examples are usually obtained by data augmentation and negative examples by random sampling. The randomly sampled negatives may therefore include samples that are actually similar to the anchor (sampling bias). This group meeting will introduce a NeurIPS 2020 paper that eliminates this sampling bias by modifying the loss function. 【Concept】: Contrastive learning; Sampling bias. 【References】: [1] Chuang C Y, Robinson J, Lin Y C, et al. Debiased contrastive learning[J]. Advances in neural information processing systems, 2020, 33: 8765-8775.
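To make the "modify the loss function" idea in the debiased contrastive learning abstract above more concrete, here is a minimal Python sketch for a single anchor. It follows one common reading of the estimator in Chuang et al. [1]: the average negative term is corrected by subtracting the expected contribution of false negatives, weighted by an assumed class prior tau_plus, and is clamped from below at exp(-1/temperature). The function name, the default parameter values, and the toy similarity scores are assumptions made for illustration, and the formula should be checked against the paper before use.

```python
import math


def debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.1, temperature=0.5):
    """Loss for one anchor, given its cosine similarity to the positive
    (pos_sim) and to N randomly sampled 'negatives' (neg_sims).

    tau_plus is the assumed probability that a sampled negative actually
    shares the anchor's class; tau_plus = 0 gives the standard loss."""
    n = len(neg_sims)
    pos = math.exp(pos_sim / temperature)
    neg_mean = sum(math.exp(s / temperature) for s in neg_sims) / n
    # Remove the estimated false-negative mass from the negative term.
    g = (neg_mean - tau_plus * pos) / (1.0 - tau_plus)
    # Clamp at the theoretical minimum of exp(sim / t) for sim in [-1, 1].
    g = max(g, math.exp(-1.0 / temperature))
    return -math.log(pos / (pos + n * g))


# Hypothetical similarity scores for one anchor.
pos_sim = 0.9
neg_sims = [0.1, -0.2, 0.3]

standard_loss = debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.0)
debiased_loss = debiased_contrastive_loss(pos_sim, neg_sims, tau_plus=0.1)
```

When the positive is more similar to the anchor than the average sampled negative, the correction shrinks the negative term, so the debiased loss penalizes the anchor less for "negatives" that may in fact belong to the same class.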
2022/11/24	Venue: FL1, Meeting Room(101B), Wing Building for Science Complex
Bingbing Xu	【Title】: Academic Expert Finding via (k,P)-Core based Embedding over Heterogeneous Graphs 【Abstract】: Finding relevant experts in specified areas is often crucial for a wide range of applications in both academia and industry. Given a user input query and a large amount of academic knowledge (e.g., academic papers), expert finding aims to find and rank the experts who are most relevant to the given query, from the academic knowledge. This talk focuses on the background of the problem and on the overall method and process for solving it. 【Concept】: Heterogeneous graphs; (k, P)-core; Contrastive learning; Expert finding 【References】: [1] Xu X, Liu J, Wang Y, et al. Academic Expert Finding via (k, P) -Core based Embedding over Heterogeneous Graphs[C]//2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022: 338-351. [2] Kong Y X, Shi G Y, Wu R J, et al. k-core: Theories and applications[J]. Physics Reports, 2019, 832: 1-32. [3] Zhang C, Song D, Huang C, et al. Heterogeneous graph neural network[C]//Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2019: 793-803.
Chenyang Li	【Title】: Leveraging Currency for Repairing Inconsistent and Incomplete Data 【Abstract】: Consistency, completeness, and currency are three important problems that affect data quality in relational databases. However, the solutions to these problems are often not independent of each other. This report aims to provide a solution to the more complex data cleaning problem caused by missing, misplaced, and unavailable timestamps. 【Concept】: Currency order; Edit distance; Naive Bayes 【References】: [1] Ding X, Wang H, Su J, et al. Leveraging currency for repairing inconsistent and incomplete data[J]. IEEE Transactions on Knowledge and Data Engineering, 2020.

Maintained by WAMDM Administrator