Data Services Newsletter

Volume 20 : No 2 : Summer 2018

GeoSciCloud: Exploring the potential for hosting a geoscience data center in the cloud

The IRIS Data Management Center (DMC) has operated a public repository of seismological data for three decades, supporting thousands of researchers. Since its founding, the DMC has operated its own infrastructure to provide the computational and storage resources its mission requires. In the GeoSciCloud project, a Building Block project supported by EarthCube (see footnote 1), the DMC is deploying a subset of its archive and key software components into two cloud environments. This project will allow the DMC to evaluate the realities of operating in the cloud and explore the potential advantages and disadvantages.

The two cloud environments selected for this project are Amazon’s AWS and XSEDE’s Jetstream and Wrangler systems. The XSEDE resources are operated on behalf of NSF by Indiana University jointly with the Texas Advanced Computing Center. The DMC deployed a ~40 terabyte test data set and a subset of its web service-based data access architecture to both environments. The DMC is conducting an extensive evaluation of the capabilities of these deployments. To ensure these systems support and, ideally, improve upon real-world research use cases, the DMC is collaborating with scientists who have performed their own tests designed to meet their research needs.

A promising, expected gain from cloud-like environments over DMC-operated systems is the ability to scale out in order to handle more simultaneous users, for both storage I/O and processor-intensive tasks. Another potential advantage is providing data within, or very near to, a powerful computing environment that researchers may also use. A key aspect to evaluate is the cost of the cloud environments relative to the DMC’s own infrastructure.

Based on testing thus far, preliminary results indicate that both cloud environments can deliver raw and processed data at a much higher level of concurrent requests. Results also indicate that transmission of data across the internet quickly becomes the limiting factor as data volume increases. Comparison between the two cloud environments is illustrated in the following figures:

Comparison of cloud environments for raw data access
Comparison of raw data access response times for both cloud environments with a variety of supporting cores. This result shows that XSEDE provides faster access at lower core counts but only roughly matches AWS at higher core counts. The increase in response time as cores are added at XSEDE is due to saturation of the connection between the virtual machines and the storage system. Note that the levels of concurrent requests (cq) are not something we can currently offer from the DMC.

Comparison of cloud environments for processed data access
Comparison of processed data access response times for both cloud environments with a variety of supporting cores. This result shows roughly equivalent response times for the two cloud environments. Based on diagnostics, the processing of these requests was CPU-bound, and CPUs are roughly equivalent between the two environments. Network transmission time and protocol overhead were not significant factors for this kind of request. Note that the levels of concurrent requests (cq) are not something we can currently offer from the DMC.
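
For readers who want to try a similar measurement, the following is a minimal Python sketch of timing requests at several levels of concurrent requests (cq). It is not the DMC's test harness: the function names are hypothetical, and `fake_fetch` is a stand-in that sleeps rather than contacting any web service. The recorded time covers only the call itself, not any time a request spends waiting for a free worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(fetch, request):
    """Return the wall-clock seconds taken by one fetch(request) call."""
    start = time.perf_counter()
    fetch(request)
    return time.perf_counter() - start


def mean_response_time(fetch, requests, concurrency):
    """Run all requests with at most `concurrency` in flight; return the mean seconds per call."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda r: timed_call(fetch, r), requests))
    return mean(durations)


def fake_fetch(request):
    # Hypothetical stand-in for a data request; sleeps instead of using the network.
    time.sleep(0.01)


if __name__ == "__main__":
    # Compare mean response time across a few illustrative concurrency levels.
    for cq in (1, 4, 16):
        print(cq, round(mean_response_time(fake_fetch, range(32), cq), 4))
```

Replacing `fake_fetch` with a function that issues a real request would turn this into a simple client-side benchmark. With a real service, results would also reflect the network path between client and server.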

1 The GeoSciCloud project is supported by the National Science Foundation’s (NSF) EarthCube program, ICER-1639719.
EarthCube Building Blocks: Collaborative Proposal: Deploying Multi-Facility Cyberinfrastructure in Commercial and Private Cloud-based Systems

by Chad Trabant and Mike Stults (IRIS Data Management Center)
