Thread: current best practice to access USArray data set

Started: 2020-06-12 10:40:07
Last activity: 2020-06-17 13:55:04
Gregor Hillers
2020-06-12 10:40:07
Dear IRIS data masters,

What is the recommended way these days to download or obtain a large
fraction of the USArray 3C data (say, everything west of 90 degrees W)?

We have been using ObsPy's FDSN download facilities. That works in
principle, but we run into more failures to retrieve station metadata or
data files than we expected; admittedly, our expectation was few or no
problems at all.

We have tried running several download jobs in parallel, experimenting
with 8, 16, 32, ... 128 parallel jobs. The more jobs we run, the more
problems we see: less total data retrieved per "launch", and more
drop-outs.

Before we start exchanging stdout details, we were wondering whether
there is some kind of advice we could follow to execute the download in
a manner that serves everybody optimally, i.e., both the data center and us.

Do you recommend a specific number of parallel request jobs?
Are there any other suggestions, tricks, or good-practice recommendations?

We appreciate any suggestions,

With kind regards,

-Gregor Hillers

bit.ly/HILatHEL


  • Hi Gregor,

    For downloading a large set of data, you might look into the ROVER tool:
    https://iris-edu.github.io/rover/

    I think this might provide exactly what you're looking for -- a management system around the ObsPy code. This probably implements all the best practices for the kind of job you're doing.

    As for best practices themselves, here are a few of my personal thoughts:

    - Error handling is required

    I can only speak to the IRIS data indirectly (and to other data centers not at all), but our servers are wholly optimized for speed and scale, and not at all for safety -- for instance, they don't provide any recovery for low-level HTTP errors, so if anything goes wrong the whole request will fail.

    Most of the time the failure is just a hiccup somewhere in the network, and a retry will succeed. One common error, though, is 503 Service Unavailable (which might not appear in exactly those terms), which means the server is overloaded or throttling you. It's important *not* to retry these, at least not immediately -- the server is already getting too many requests.
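    In code, that retry policy might look like the following minimal sketch (plain Python, not tied to ObsPy or any particular client; the `request` callable, the retry counts and delays, and the check for "503" in the exception text are all assumptions for illustration):

```python
import time


def call_with_backoff(request, max_tries=5, base_delay=10.0, sleep=time.sleep):
    """Call request() (e.g. one waveform download), retrying on failure.

    Ordinary network hiccups are retried immediately; a 503 (server
    overloaded or throttled) triggers an exponentially growing wait first.
    """
    delay = base_delay
    last_exc = None
    for _ in range(max_tries):
        try:
            return request()
        except Exception as exc:
            last_exc = exc
            if "503" in str(exc):
                # Server is already overloaded: back off before retrying.
                sleep(delay)
                delay *= 2
            # Other errors: retry immediately; likely a transient hiccup.
    raise last_exc
```

    The `sleep` parameter exists only so the waiting behavior can be tested or swapped out; in real use the default is fine.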

    - Parallelism has rapidly diminishing returns

    Parallelism will have very limited effectiveness, because most data centers throttle concurrent connections. You will probably see the 503 error mentioned above, or your requests may simply time out on your end.
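    As a sketch of what "limited parallelism" could look like in practice (the worker count and the `fetch` callable here are placeholders, not recommended values):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical starting point; tune downward if you start seeing 503s.
MAX_WORKERS = 4


def download_all(jobs, fetch):
    """Run fetch(job) for every job using a small, bounded worker pool.

    Returns (results, failures); failed jobs are collected with their
    exceptions so they can be retried later instead of aborting the run.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch, job): job for job in jobs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                failures.append((futures[fut], exc))
    return results, failures
```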

    - Data is organized by channel/time (and nothing else)

    Internally (at IRIS at least) the raw data is laid out in channel-days, i.e., one channel for one day (UTC), so it's faster to make requests that break on these boundaries. Beyond that, there's no "grouping" of things -- if a station has 36 channels, requesting a year of data for that station is simply 36 x 365 = 13,140 channel-day requests.

    So in theory you could make one request, or 13,140 separate requests, or anything in between, and at some level these are exactly equivalent. I think a lot of the "best practices" here are about how to split these up.
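    One way to split a time span along those lines -- a sketch using plain Python datetimes (the function name is illustrative; ObsPy users would convert to and from UTCDateTime at the edges):

```python
from datetime import datetime, timedelta, timezone


def channel_day_windows(start, end):
    """Split [start, end) into request windows aligned to UTC midnights.

    Requests that break on day boundaries match how the archive stores
    the data (one file per channel per UTC day).
    """
    windows = []
    t = start
    while t < end:
        midnight = datetime(t.year, t.month, t.day, tzinfo=timezone.utc)
        day_end = midnight + timedelta(days=1)  # next UTC midnight after t
        t1 = min(day_end, end)
        windows.append((t, t1))
        t = t1
    return windows
```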

    - Response size is the thing to manage for reliability

    The upshot of all this is that it's important for the client to manage the amount of data returned by each request. The more data a request returns, the longer the connection stays open and the more likely some transient problem will break it (and force the entire request to be run again). This has to be balanced against the overhead of managing the larger number of requests required when each individual request is smaller. Where exactly the balance lies depends a lot on your network speed and reliability; I think shooting for something like 10 MB per request is probably a good starting point.
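    A back-of-the-envelope helper for that sizing (the ~1 byte per sample figure for Steim-compressed miniSEED is a rough assumption; real compression ratios vary with the signal):

```python
def channel_days_per_request(sample_rate_hz, target_mb=10.0,
                             bytes_per_sample=1.0):
    """Roughly how many channel-days fit into one ~target_mb request.

    E.g. one 100 Hz channel-day is about 100 * 86400 * 1 byte ~ 8.6 MB,
    so a ~10 MB request holds roughly one such channel-day.
    """
    day_mb = sample_rate_hz * 86400 * bytes_per_sample / 1e6
    return max(1, int(target_mb // day_mb))
```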


    Sorry, this got a little long. I hope it helps!

    Cheers,
    Adam



    ----- Original Message -----
    From: "Gregor Hillers (via IRIS)" <data-request-help<at>lists.ds.iris.edu>
    To: "Data Request Help" <data-request-help<at>lists.ds.iris.edu>
    Sent: Monday, June 15, 2020 4:15:38 PM
    Subject: [IRIS][data-request-help] current best practice to access USArray data set


