Talk Title: Challenges of Big Data in Scientific Discovery

Big Data has emerged as one of the hottest multi-disciplinary research fields in recent years. Big data innovations are transforming science, engineering, medicine, healthcare, finance, business, and ultimately society itself. In this presentation, we examine the key properties of big data (volume, velocity, variety, and veracity) and their relation to applications in science and engineering. To truly handle big data, new paradigm shifts (as advocated by the late Dr. Jim Gray) will be necessary. Successful big data applications will require in situ methods that automatically extract new knowledge from big data, without requiring the data to be centrally collected and maintained. Traditional theory on algorithmic complexity may no longer hold, since the scale of the data may be too large for it to be stored or accessed. To realize the potential of big data in scientific discovery, challenges in data complexity, computational complexity, and system complexity will need to be solved. We illustrate these challenges by drawing on examples from various applications in science and engineering.

Benjamin W. Wah is currently the Provost and Wei Lun Professor of Computer Science and Engineering at the Chinese University of Hong Kong. Before that, he served as the Director of the Advanced Digital Sciences Center in Singapore, as well as the Franklin W. Woeltge Endowed Professor of Electrical and Computer Engineering and Professor in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He received his Ph.D. degree in computer science from the University of California, Berkeley, in 1979. He has received a number of awards for his research contributions, including the IEEE CS Technical Achievement Award (1998), the IEEE Millennium Medal (2000), the IEEE-CS W. Wallace McDowell Award (2006), the Pan Wen-Yuan Outstanding Research Award (2006), the IEEE-CS Richard E. Merwin Award (2007), the IEEE-CS Tsutomu Kanai Award (2009), and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley (2011). Wah's current research interests are in the areas of big data applications and multimedia signal processing.
      Wah cofounded the IEEE Transactions on Knowledge and Data Engineering in 1988 and served as its Editor-in-Chief from 1993 to 1996; he is now the Honorary Editor-in-Chief of Knowledge and Information Systems. He currently serves on the editorial boards of Information Sciences, the International Journal on Artificial Intelligence Tools, the Journal of VLSI Signal Processing, and World Wide Web. He has served the IEEE Computer Society in various capacities, including Vice President for Publications (1998 and 1999) and President (2001). He is a Fellow of the AAAS, ACM, and IEEE.


This talk identifies several scientific problems in big data research from the perspective of the cross-fertilization of mathematics, information, data, and computation, including: the ultra-high dimensionality of big data, computational theory, subsampling, distributed real-time computation, unstructured information processing, and visual analytics. For each problem, we present a corresponding research approach and show preliminary exploratory results.

Xu Zongben is an applied mathematician, an expert in signal and information processing, and a professor at Xi'an Jiaotong University. He was born in January 1955 in Zhashui County, Shaanxi Province, with ancestral roots in Yuexi, Anhui. He graduated from the Department of Mathematics at Northwest University in 1976 and received his M.S. (1982) and Ph.D. (1987) degrees in science from Xi'an Jiaotong University.
      His research focuses on the fundamental theory of intelligent information processing, machine learning, and data modeling. He proposed the L(1/2) regularization theory for sparse information processing, which provided an important foundation for sparse microwave imaging. He discovered and proved the "Xu-Roach" theorem in machine learning, which resolved several difficult problems in neural networks and simulated evolutionary computation and provided general quantitative criteria for machine learning and nonlinear analysis in non-Euclidean frameworks. He also proposed new principles and methods for data modeling based on visual cognition, yielding a series of core data mining algorithms for clustering, discriminant analysis, and latent variable analysis that are widely applied in science and engineering. He has received the State Natural Science Award (Second Class), the State Science and Technology Progress Award (Second Class), and the CSIAM Su Buqing Applied Mathematics Prize, and he delivered a 45-minute invited lecture at the International Congress of Mathematicians (2010, India). He was elected an academician of the Chinese Academy of Sciences in December 2011.
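For reference, the L(1/2) regularization named above is commonly written in the following standard form (a generic sparse-recovery formulation, not necessarily the exact notation of Xu's papers):

\[
\min_{x \in \mathbb{R}^n} \; \|Ax - b\|_2^2 + \lambda \sum_{i=1}^{n} |x_i|^{1/2},
\]

where the non-convex \(|x_i|^{1/2}\) penalty promotes sparser solutions than the convex L1 penalty used in standard compressed sensing.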

Talk Title: Challenges in Accelerating Big Data Processing on Modern Clusters
Ohio State University

Modern clusters feature multi-/many-core architectures, high-performance RDMA-enabled interconnects, and SSD-based storage devices. The Hadoop framework is now widely used for Big Data processing, and Memcached is widely deployed in Web 2.0 data centers. This talk will provide an overview of the challenges in accelerating Hadoop and Memcached on modern clusters. It will present RDMA-based designs for multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase) and for Memcached, and show the benefits of these designs on various cluster configurations. The talk will also address the need to design benchmarks using a multi-layered and systematic approach that can be used to evaluate the performance of this middleware. These benchmarks and evaluations will demonstrate the interplay among high-performance interconnects, storage systems (HDD and SSD), and multi-core platforms needed to achieve the best solutions for Hadoop and Memcached.

Dhabaleswar K. (DK) Panda is a Professor of Computer Science and Engineering at the Ohio State University. His research interests include parallel computer architecture, high performance networking, InfiniBand, exascale computing, Big Data, programming models, GPUs and accelerators, high performance file systems and storage, virtualization, and cloud computing. He has published over 300 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, High-Speed Ethernet, and RDMA over Converged Enhanced Ethernet (RoCE). The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X software libraries, developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,085 organizations worldwide (in 71 countries). This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade. More than 194,000 downloads of this software have taken place from the project's website alone. This software package is also available with the software stacks of many network and server vendors and Linux distributors. The new Hadoop-RDMA package, consisting of acceleration for HDFS, MapReduce, and RPC, is publicly available from http://hadoop-rdma.cse.ohio-state.edu. Dr. Panda's research has been supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners, including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda


The talk first identifies four requirements that big data places on storage systems: first, high availability, meaning data can be accessed at any time and is never lost; second, low cost, with modest demands on disk capacity; third, high performance, meaning fast access; and fourth, low overhead, consuming few CPU and network resources. Since I/O is far slower than the CPU and network, these requirements should be prioritized as: high availability > low cost > performance > low overhead. High availability and low cost are thus the two primary goals of a big data storage system. To improve availability, the talk introduces multi-replica techniques using MeePo, a cloud storage system developed at Tsinghua University, and disaster-recovery techniques using a structure-independent fast-recovery scheme, also developed at Tsinghua. To reduce cost, it introduces erasure coding and primary-storage deduplication. A primary-storage deduplication technique based on partial lookup is proposed and implemented, addressing the problems that traditional deduplication cannot be applied on the primary storage side and suffers from poor bandwidth and latency.
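The cost-saving idea behind deduplication can be illustrated with a minimal sketch. This is a generic fingerprint-indexed chunk store, not the partial-lookup design described in the talk: each chunk is hashed, and a duplicate chunk is stored only as a reference to the existing copy.

```python
import hashlib

def dedup_store(chunks):
    """Store a sequence of data chunks, keeping one physical copy per unique chunk.

    Returns (store, refs): `store` holds unique chunks; `refs` maps each
    logical chunk to its slot in `store`.
    """
    index = {}   # SHA-256 fingerprint -> slot in the physical store
    store = []   # physical chunk store (unique chunks only)
    refs = []    # logical-to-physical mapping
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index:
            index[fp] = len(store)
            store.append(chunk)
        refs.append(index[fp])
    return store, refs

# Example: three logical chunks, two of them identical.
store, refs = dedup_store([b"aaaa", b"bbbb", b"aaaa"])
print(len(store), refs)  # 2 unique chunks; refs are [0, 1, 0]
```

In a real primary-storage system, the expensive part is the fingerprint index lookup on the write path; the partial-lookup technique mentioned in the talk targets exactly that bottleneck, which this sketch does not model.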

Zheng Weimin entered Tsinghua University in 1965, graduated in 1970, and remained on the faculty; he is now a professor in the Department of Computer Science and Technology at Tsinghua University and currently serves as President of the China Computer Federation. He has long been engaged in research and teaching on network storage and distributed processing. He has led and completed 35 research projects funded by the 973 Program, the 863 Program, and the National Natural Science Foundation, and has led or participated in 10 engineering projects. With collaborators, he has published over 300 papers. His major achievements and contributions are as follows: (1) he proposed a highly reliable, scalable, and efficient network storage system and commercialized it; it was one of the earliest such systems in China with independent intellectual property rights, and nearly a hundred systems have been deployed in public security, auditing, telecommunications, education, and other sectors; (2) he proposed a whole-process evaluation method for large-scale computer systems and evaluated more than 40 high-performance computers in China, making a major contribution to national information security. Major awards: the State Science and Technology Progress Award, First Class, in 2002 for a national information security management system; the State Science and Technology Progress Award, Second Class, in 2007 for a high-performance cluster computer and mass storage system; the State Science and Technology Progress Award, Second Class, in 2008 for the China Education and Research Grid; and 13 provincial- and ministerial-level science and technology progress awards. He has received the State Council special government allowance since 1993, was named an Outstanding Teacher of Beijing in 2006 and a Beijing Education Innovation Model in 2007, saw his course "Computer Architecture" rated a national-level excellent course in 2008, and was named a Distinguished Teacher of Beijing in 2009.