Large-scale telescopes, large particle accelerators, high-throughput gene sequencers, and similar instruments are now coming into operation. They continuously produce huge amounts of scientific data, and as a result global scientific and technological innovation has entered an unprecedented era. Scientific data management and analysis is key to achieving major scientific discoveries in the future. In astronomy, large-scale observation technology allows researchers to observe new astronomical phenomena, many of which can be used to verify existing physical models. The latest astronomical findings depend on the near-real-time generation, management, and analysis of large-scale astronomical data, which poses new challenges to current data management systems. At the same time, collecting large-scale, multi-source data in real time raises new problems of its own, creating an urgent need for a basic framework, technologies, and analytical methods for scientific data management.
National Key Research and Development Program of China (No. 2016YFB1000602)
With the emergence of new observation technologies, time-domain astronomy faces an era of information explosion, and astronomical data management is the first area to feel its impact. In the 21st century, astronomy has entered a data-rich era: astronomical data is growing rapidly, with volumes reaching the TB or even PB scale. As the collection capability of large-scale astronomical equipment increases, the database system behind it faces three main challenges: (1) rapid acquisition and aggregation of multi-source data, (2) real-time transient source analysis and discovery, and (3) low-latency querying of suspected transient sources in short-term historical data. In short, an astronomical database system faces not only long-term storage problems but also real-time analysis challenges. A unified architecture is needed to balance the two so that the database system achieves its best performance.
Against this background of astronomical observation, we design a large-scale relational data management system to manage astronomical catalogs on the order of hundreds of billions of rows. The real-time sub-system focuses on the monitoring, analysis, and management of abnormal astronomical data. The off-line sub-system focuses on the persistent storage of large-scale short-term astronomical data and constructs the summary data model. Work in the real-time system is divided into distributed query tasks, and a pipelined processing mechanism speeds up large-scale data analysis and system tuning. The core technologies include: (1) efficient organization and persistent storage of relational data at the scale of 100 billion rows, (2) efficient distributed querying based on summary data and optimal scheduling, and (3) automated pipeline analysis for massive astronomical data processing and anomaly discovery.
Efficient organization and persistent storage mechanism
Large astronomical equipment can collect huge amounts of data. For example, about 7 PB of disk space is required to store 10 years of GWAC data. We therefore build on the main characteristics of relational data in astronomy, study partitioning methods for relational data at the hundred-billion-row scale, and design a scheme with high reliability and scalability. In addition, to minimize query latency, we study spatial-temporal indexing methods for large-scale short-term astronomical data held in long-term storage, and design an index storage scheme with low memory consumption. Assuming that the value of data decreases over time, we design different indexing strategies for data from different periods, both to speed up indexing of large-scale data and to control the granularity of the indexes.
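The idea of age-dependent index granularity can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the system itself: the tier boundaries, the cell sizes, the declination-zone partitioning, and the function names `cell_size_for_age`, `partition_key`, and `index_cell` are all hypothetical.

```python
from datetime import date

# Hypothetical tiers: (maximum age in days, index cell size in degrees).
# Recent data gets fine cells; older data gets coarser, cheaper cells.
INDEX_TIERS = [(7, 0.1), (90, 1.0), (None, 10.0)]


def cell_size_for_age(age_days: int) -> float:
    """Return the index cell size (degrees) for data of the given age."""
    for max_age, cell_deg in INDEX_TIERS:
        if max_age is None or age_days <= max_age:
            return cell_deg
    raise ValueError("INDEX_TIERS must end with an open-ended tier")


def partition_key(obs_date: date, dec_deg: float,
                  zone_height_deg: float = 10.0) -> tuple[str, int]:
    """Assumed partitioning: one partition per (night, declination zone)."""
    zone = int((dec_deg + 90.0) // zone_height_deg)
    last_zone = int(180.0 / zone_height_deg) - 1
    return obs_date.isoformat(), min(zone, last_zone)  # dec = +90 joins the last zone


def index_cell(ra_deg: float, dec_deg: float, cell_deg: float) -> tuple[int, int]:
    """Map a sky position to a grid cell of the given size."""
    return int(ra_deg // cell_deg), int((dec_deg + 90.0) // cell_deg)
```

A row observed three days ago would be indexed with `cell_size_for_age(3)`, while the same position in year-old data would fall into a much coarser cell, which keeps the index for old data small at the cost of scanning more rows per cell.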
An efficient distributed query method based on digested data
For scientists, the latency of most analysis queries must be low enough to support interactive use. Based on the characteristics of astronomical data, we analyze the access patterns on large-scale short-term astronomical data and build an abstract model, from which we study methods for rapidly generating typical digested data and for fast distributed querying over that digested data. To ensure data availability and real-time query response, we optimize the replica placement and scheduling scheme for distributed queries, which achieves load balancing and greatly improves data access speed and query efficiency.
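As a rough illustration of how digested data and replica scheduling can work together, the sketch below keeps a small per-partition digest, uses it to skip partitions that cannot match a query, and sends each remaining task to the least-loaded replica. The digest contents (row count and magnitude range), the least-loaded policy, and all names are assumptions chosen for the example, not the project's actual design.

```python
from dataclasses import dataclass


@dataclass
class Digest:
    """Hypothetical per-partition summary; assumes a non-empty partition."""
    partition: str
    count: int
    mag_min: float
    mag_max: float


def build_digest(partition: str, mags: list[float]) -> Digest:
    """Summarize one partition's magnitude column."""
    return Digest(partition, len(mags), min(mags), max(mags))


def partitions_to_scan(digests: list[Digest], lo: float, hi: float) -> list[str]:
    """Keep only partitions whose magnitude range overlaps [lo, hi]."""
    return [d.partition for d in digests if d.mag_max >= lo and d.mag_min <= hi]


def pick_replica(replicas: list[str], load: dict[str, int]) -> str:
    """Assumed policy: route the task to the replica with the fewest tasks."""
    node = min(replicas, key=lambda n: load.get(n, 0))
    load[node] = load.get(node, 0) + 1
    return node
```

A query for magnitudes between 11 and 13 would first call `partitions_to_scan` and then `pick_replica` once per surviving partition, so partitions outside the range are never read and tasks spread across replicas.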
Automatic pipeline analytics and anomaly discovery
Astronomical equipment is usually designed for real-time observation of anomalous astronomical phenomena. We therefore need a pipeline that automates the processing of astronomical data, designed with respect to the following aspects: (1) the model, (2) the algorithm, (3) data partitioning, and (4) performance optimization of the entire pipeline. We mainly study methods for real-time recognition of abnormal astronomical phenomena and implement a system framework for low-latency warning of such phenomena. In addition, we study a fast distributed caching method for large-scale short-term anomalous astronomical data, which achieves low-latency writes of large-scale astronomical data with low memory consumption. To meet the requirements of real-time and interactive queries, we also study spatial-temporal indexing mechanisms for large-scale short-term anomalous astronomical data held in the cache.
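The flow from incoming measurements to a low-latency alert and a bounded cache can be sketched as follows. The text above does not specify the recognition method, so a simple per-source sigma threshold on brightness is used here purely as a stand-in; the class `AnomalyDetector`, the function `run_pipeline`, the record layout `(source_id, time, magnitude)`, and all parameter values are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class AnomalyDetector:
    """Stand-in detector: flags a value far from a source's recent baseline."""

    def __init__(self, window: int = 20, min_history: int = 5,
                 threshold: float = 3.0):
        # Bounded per-source history keeps memory use predictable.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.min_history = min_history
        self.threshold = threshold

    def observe(self, source_id: str, mag: float) -> bool:
        past = self.history[source_id]
        is_anomaly = False
        if len(past) >= self.min_history:
            mu, sigma = mean(past), pstdev(past)
            is_anomaly = sigma > 0 and abs(mag - mu) > self.threshold * sigma
        if not is_anomaly:
            past.append(mag)  # only normal points update the baseline
        return is_anomaly


def run_pipeline(records, detector: AnomalyDetector, cache: deque):
    """Stream records through detection; cache and yield anomalies at once."""
    for source_id, t, mag in records:
        if detector.observe(source_id, mag):
            cache.append((source_id, t, mag))
            yield source_id, t, mag
```

Because `run_pipeline` is a generator, an alert is emitted as soon as the anomalous record is seen rather than after the whole batch, and a `deque` with `maxlen` bounds the memory used by the short-term anomaly cache.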
Maintained by Zhongyuan Wang
Copyright © 2007-2009 WAMDM, All rights