Building the Next-Generation Data Processing System


The Large Synoptic Survey Telescope (LSST) is a large optical survey project funded by the National Science Foundation and the Department of Energy. It will continually image the sky, identify changes in near real time, and, over a decade of operations, collect tens of petabytes of data, building up the deepest, widest image of the Universe. Its data will enable a range of science goals, from identifying Near Earth Asteroids to understanding the nature of Dark Energy.

A survey of this scale requires not only significant computing resources but also a modern, high-performance, scalable data processing and analysis system. The LSST Data Management (DM) team is guiding an effort to build such a suite. Written primarily in Python and C++, open source, and composed of modular codes ranging from science pipelines to web user interfaces, the LSST software stack will power the LSST and form a basis that other projects can reuse in the future.

The LSST DM team is distributed across a number of partner institutions (the LSST Project Office, the Infrared Processing and Analysis Center, the National Center for Supercomputing Applications, Princeton University, SLAC National Accelerator Laboratory, and the University of Washington) and is also helped by contributors from the community, the LSST science collaborations, and other project subsystems.

Learn more about LSST data processing »

Science Pipelines

The LSST Science Pipelines will implement the core image processing and data analysis algorithms needed to process optical survey imaging data at low latency and unprecedented scale and accuracy. We are writing pipelines for single-epoch image processing, coaddition, image differencing, optimal multi-epoch measurements, and (global) photometric and astrometric calibration, among others.

Scalable Database

To satisfy the need to efficiently store, query, and analyze catalogs running into trillions of rows and petabytes of data, we are developing Qserv, a distributed shared-nothing SQL database query system.
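The shared-nothing idea can be illustrated with a toy sketch. This is not Qserv's API or its partitioning scheme; the table layout, the modulo partitioning, and all function names below are assumptions made for illustration. Each "worker" is an independent in-memory SQLite database holding only its own rows, and a coordinator sends the same query to every worker and merges the partial results.

```python
import sqlite3

def make_worker(rows):
    """Create one independent worker: an in-memory database with only its own rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (id INTEGER, mag REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?)", rows)
    return conn

def partition(rows, n_workers):
    """Assign each row to exactly one worker by id modulo the worker count."""
    shards = [[] for _ in range(n_workers)]
    for row in rows:
        shards[row[0] % n_workers].append(row)
    return shards

def count_brighter_than(workers, mag_limit):
    """Run the same query on every worker, then merge the partial counts."""
    sql = "SELECT COUNT(*) FROM objects WHERE mag < ?"
    partials = [w.execute(sql, (mag_limit,)).fetchone()[0] for w in workers]
    return sum(partials)

# Hypothetical catalog of (id, magnitude) pairs, invented for this example.
catalog = [(i, 15.0 + (i % 10)) for i in range(1000)]
workers = [make_worker(shard) for shard in partition(catalog, n_workers=4)]
total = count_brighter_than(workers, mag_limit=20.0)
```

Because no worker shares storage with another, adding workers adds capacity; the cost is that the coordinator must know how to combine partial answers for each kind of query.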

User Interface

One of the most important jobs of a large survey is to provide access. This includes access to catalogs, processed images, and raw images. Access in the next generation of surveys will extend to visualization and analysis. We are writing interfaces that will allow thousands of users to query, download, visualize, and analyze petabytes of LSST data.

Data Access Middleware

In order to build a scalable, portable processing system, we are creating extensible middleware to transparently access data irrespective of storage location or format.
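As a rough sketch of what "irrespective of storage location or format" can mean in practice, the example below resolves a logical dataset name through a small registry, so the caller never sees where the data lives or how it is encoded. The stores, the registry, and the `get` function are all hypothetical names invented here; they are not part of the LSST middleware.

```python
import csv
import io
import json

# Hypothetical in-memory "storage locations", standing in for disks or remote stores.
STORES = {
    "local": {"catalog.json": '[{"id": 1, "mag": 18.2}]'},
    "archive": {"catalog.csv": "id,mag\n2,19.5\n"},
}

def read_json(text):
    """Decode a JSON list of records."""
    return json.loads(text)

def read_csv(text):
    """Decode CSV rows into the same record shape as the JSON reader."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "mag": float(r["mag"])} for r in rows]

READERS = {"json": read_json, "csv": read_csv}

# Registry mapping a logical dataset name to (store, key, format).
REGISTRY = {
    "night1": ("local", "catalog.json", "json"),
    "night2": ("archive", "catalog.csv", "csv"),
}

def get(dataset):
    """Return records for a logical dataset name, hiding location and format."""
    store, key, fmt = REGISTRY[dataset]
    return READERS[fmt](STORES[store][key])

records = get("night1") + get("night2")
```

Pipeline code written against `get` keeps working if a dataset moves to another store or is rewritten in another format; only the registry entry changes.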

Distributed Execution

The LSST data processing pipelines will need to scale efficiently from single-core execution to tens of thousands of cores. To meet this requirement, we are building an orchestration framework to launch and monitor jobs on many different systems at many different scales.
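The core requirement, that the same work runs unchanged at different scales, can be sketched with Python's standard library. This is only an illustration, not the LSST orchestration framework: `process_patch` and `run` are hypothetical names, and a thread pool stands in for whatever backend (processes, batch queues, clusters) a real system would dispatch to.

```python
from concurrent.futures import ThreadPoolExecutor

def process_patch(patch_id):
    """Stand-in for one unit of pipeline work (hypothetical)."""
    return patch_id, sum(i * i for i in range(patch_id + 1))

def run(patch_ids, max_workers=1):
    """Run the same task list serially or on a pool; return results keyed by patch id."""
    if max_workers == 1:
        results = [process_patch(p) for p in patch_ids]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(process_patch, patch_ids))
    return dict(results)

serial = run(range(8), max_workers=1)
pooled = run(range(8), max_workers=4)
```

The task function knows nothing about how it is scheduled, so changing the scale is a matter of changing the executor rather than the science code.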

Getting the Code

The LSST data processing codes are being developed in an iterative, agile fashion. Though engineering first light is still six years away, prototype versions of a number of LSST codes are already being tested on simulations and applied to existing data (e.g., reprocessing SDSS Stripe 82, or processing HSC Survey data).

While already state-of-the-art in many areas, LSST software is still in its infancy when it comes to end-user friendliness, documentation, and API stability. There is no binary distribution yet, so builds must be done from source. Knowledge of Python (and willingness to write some Python code) is necessary to work with the current code base.

Warning: At this stage, the LSST software will be of greatest interest to the LSST Science Collaborations, large survey builders (or those reprocessing large survey data sets), and astronomical image processing enthusiasts. If you're just looking to reduce a few observations with a ready-to-use tool, it may be better to look into one of the more polished and/or established packages such as AstroPy or the AstrOmatic suite.

Installing

There are several ways of installing the LSST Stack, including from source, through Anaconda, or as Docker containers. Our installation documentation will get you started.

Here's how to install the latest (v12) release from source, assuming the prerequisites are already installed:

curl -OL https://raw.githubusercontent.com/lsst/lsst/12.0/scripts/newinstall.sh
bash newinstall.sh

source loadLSST.bash

eups distrib install -t v12_0 lsst_apps

Once you've installed the stack, see here for examples of what you can do with it.

Cloning the sources

All LSST DM code is visible on GitHub, spread across 100+ repositories.

The LSST software build tool is helpful for cloning and (re)building from git. Feel free to join DM Developers on the LSST Community forum and ask for help in the Support category.

Getting Involved

The real work to construct the LSST data processing system is just beginning, and there's ample room to get involved.

Join Us

We're in the process of assembling the team of 45+ scientists, software engineers, and IT experts needed to build, commission, and operate the data system for LSST.

Current LSST DM job openings:

Hint: To receive e-mails for new LSST DM job openings, subscribe to the dm-announce mailing list. For more LSST positions, see the main LSST Hiring page.

Use, Modify, Contribute

The LSST data processing system, though still in an early construction phase, is an open source (GPLv3) software project that is free for anyone to use and open to contributions from the community.

We invite you to:

The LSST Survey

8.4 meter, wide-field, f/1.2 telescope.
3.2 Gigapixel camera of 189 4k x 4k CCDs, with 2-second readout.
PetaFLOPS of computing power, hundreds of PB of storage, gigabit long-haul networks.

Turning the sky into a database.

Petascale Era of Optical Astronomy

Beginning early in the next decade, the LSST will collect over 50 PB of raw data, resulting in over 30 trillion observations of 40 billion astronomical sources. It will measure the positions and properties of over 20 billion stars, or 10% of all stars in the Milky Way.
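A quick back-of-the-envelope check shows what these headline numbers imply. The inputs are the rounded figures quoted above (treated as exact, though the text says "over"), plus the ten-year survey length stated earlier; the daily rate assumes 1 PB = 1000 TB and averages over every calendar day, which is a simplification.

```python
# Rounded figures quoted in the text (lower bounds, treated as exact here).
observations = 30 * 10**12      # 30 trillion observations
sources = 40 * 10**9            # 40 billion astronomical sources
stars_measured = 20 * 10**9     # 20 billion stars
percent_of_milky_way = 10       # quoted as 10% of all Milky Way stars
raw_data_tb = 50 * 1000         # 50 PB, assuming 1 PB = 1000 TB
survey_days = 10 * 365          # a decade of operations, ignoring leap days

# Average number of observations per source.
obs_per_source = observations // sources  # 750

# Total Milky Way star count implied by the 10% figure.
implied_milky_way_stars = stars_measured * 100 // percent_of_milky_way  # 200 billion

# Mean raw data rate in TB per calendar day.
mean_tb_per_day = raw_data_tb / survey_days
```

In other words, each source is observed several hundred times on average, and the mean raw data rate works out to a bit under 14 TB per day.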

Rapid Discovery

The LSST will scan the visible sky once every three days, charting objects that change or move: from exploding supernovae to potentially hazardous near-Earth asteroids.

Open Data, Open Source

LSST data will be available with no proprietary period to all astronomers in the United States and Chile, and to International Partners. Alerts about variable sources will be available world-wide within 60 seconds. The LSST data processing stack will be open source (GPL v3).

Learn more about the science of LSST »