For the first article in this new series for the Data Services Newsletter, we at IRIS want to cover a very commonly reported issue that users encounter when they attempt to access data. We frequently receive emails asking why data that was expected from one of our services did not arrive. Often, the answer lies in knowing where to look to find out exactly what data we actually have.
As a matter of discoverability, it is simple enough to use tools like the MetaData Aggregator [1] and GMAP [2] to locate geophysical stations of interest anywhere on the globe and use this as a basis to ask for data for these stations. The user can then request time series data with any of a number of tools, including IRIS Web Services [3]. Such requests, however, do not always proceed as expected and can leave the user confounded as well as disappointed when they do not receive the data they asked for.
What happens when you request data?
Regardless of the tool used, all requests access data through the dataselect web service, which is the mainstay of our seismic data retrieval architecture [4]. Since this is an official FDSN web service, we formally call it fdsnws-dataselect.
Modes of access like timeseries, timeseriesplot, WILBER, and BREQ_FAST, as well as client tools like ROVER and FetchData, all use this common pathway to get at data from the IRIS archive.
When dataselect receives parameters relating to network, station, channel, location code, as well as time extents, it has the primary identifiers to know where to get data. However, the system doesn’t do so blindly. It first makes a cursory check for the time extents as recorded in our database catalog. It is not enough that a station was actually operating during a given time period to know that IRIS has data. The reality is that data availability can be sparse, latent, or not available for great lengths of time. Sometimes the data exists, but not necessarily in the repository you’d expect.
So what is a user to do? Well, we have another web service that helps with this, called fdsnws-availability, that contains very detailed station, channel, and time information for the IRIS archive. We will refer to it by its shorter name, the availability service, for the purposes of this article.
How we track the data we have
When data first arrives at IRIS Data Services, it is usually kept in a temporary holding area while the collection is completed and verification checks are performed. Only data that passes verification with our station instrument metadata will be moved to the archive. The primary check is that there is a station and channel available during an operational time period, referred to as an epoch. For example:
#Network | Station | Latitude | Longitude | Elevation | SiteName | StartTime | EndTime
IM|EKB|55.3339|-3.19229|308.0|Eskdalemuir Array, site EKB, Scotland|2008-11-25T00:00:00|
This example shows an International Monitoring station in Scotland (IM.EKB), which has been operational since 2008. Any new data arriving for this station for channel BHZ (and an empty location code) could be matched to this entry and considered valid.
The collecting and curing of data in real time, before it moves to the archive, can take a couple of days. The process of transcribing data to the archive records the station, channel, and detailed time segments of data samples along with the destination archive file, file offset, and length in a large database. This information not only provides a reference for users to get a precise catalog of data holdings, but also serves as the address book for data retrieval by the dataselect web service.
It should be noted that not all data that arrives in a data stream is truly in real time. While some stations return data within seconds of the point it is recorded, others take minutes, hours, or days before their data arrives. We measure this time lag as three different latency metrics, which can be examined through our MUSTANG quality assurance web service [5].
Using the availability service
The availability service provides a translation of the data catalog in an easily digestible form and provides a powerful set of tools to filter and sort the information into a useful presentation. The main page is presented here, with instructions on its use.
Though the station EKB above has been operational since November 2008, that doesn’t mean that there is data going back that far, nor that there is data arriving presently. We can ask the availability service to tell us. First, let’s get the short version: an overall summary of the total time extent of data in the archive, from the earliest time, to the latest time.
The main operative terms here are:
- extent = show just the earliest times and latest times, selectively also filtering for sample rate changes and overlaps
- net,sta,loc,cha = the network, station, location, and channel identifiers of the data channel that we wish to access. This channel does not have a location code, so we signify this with two dashes: ‘loc=--’
- nodata = indicate the HTTP code to be returned when no data is found (default is 204, which returns nothing to the client)
- format = set the format of data we want returned. Text is a good option to be compact and easy to read.
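As a rough illustration of how these terms combine, the sketch below assembles an extent request for IM.EKB BHZ. The base URL is an assumption inferred from the service addresses that appear later in this article, and the helper name `extent_url` is hypothetical. Setting nodata=404 is a choice made here so that an empty result is reported explicitly rather than returned silently.

```python
from urllib.parse import urlencode

# Assumed base URL, following the IRIS service layout seen elsewhere in this article.
AVAILABILITY_BASE = "https://service.iris.edu/fdsnws/availability/1"


def extent_url(net, sta, loc, cha, nodata=404, fmt="text", merge=None):
    """Build an availability 'extent' request URL for one channel (hypothetical helper)."""
    params = {
        "net": net,
        "sta": sta,
        "loc": loc,        # use "--" for an empty location code
        "cha": cha,
        "nodata": nodata,  # HTTP code to return when nothing matches
        "format": fmt,
    }
    if merge:
        params["merge"] = merge  # e.g. "samplerate"
    return f"{AVAILABILITY_BASE}/extent?{urlencode(params)}"


print(extent_url("IM", "EKB", "--", "BHZ"))
```

The printed URL can be pasted into a browser or passed to any HTTP client.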
What we discover is the following:
#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
IM EKB -- BHZ M 39.55 2016-02-23T16:00:05.899000Z 2016-02-23T16:00:07.795334Z 2020-02-03T22:06:23Z 1 OPEN
IM EKB -- BHZ M 39.7 2016-02-25T03:33:09.350000Z 2016-02-25T03:33:09.979723Z 2017-01-01T01:15:26Z 1 OPEN
IM EKB -- BHZ M 39.87 2016-02-24T13:06:15.775000Z 2016-02-24T13:06:19.988695Z 2017-01-01T01:15:23Z 1 OPEN
IM EKB -- BHZ M 40.0 2014-03-17T00:00:00.000000Z 2017-04-18T10:04:59.975000Z 2020-02-03T22:06:37Z 1296 OPEN
Four entries are shown instead of just one because slightly different assessed sample rates (39.55 to 40.0) have been cataloged. The very last entry is the best reflection of the time extent: the earliest data in the archive is from March 2014, and the latest is from April 2017. To show just one record with all of the sample rates collapsed, use the ‘merge=samplerate’ flag in the URL.
#Network Station Location Channel Quality Earliest Latest Updated TimeSpans Restriction
IM EKB -- BHZ M 2014-03-17T00:00:00.000000Z 2017-04-18T10:04:59.975000Z 2020-02-03T22:06:37Z 1299 OPEN
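Text output like the listings above is easy to read by eye, and it is also simple to load into a script. The following is a minimal sketch, assuming the columns are whitespace-separated and that no field contains spaces, which holds for the listings shown here. The function name `parse_availability_text` is hypothetical.

```python
def parse_availability_text(text):
    """Turn availability text output into a list of dicts keyed by column name."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header line names the columns, e.g. "#Network Station ..."
            header = line.lstrip("#").split()
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        rows.append(dict(zip(header, line.split())))
    return rows


sample = """\
#Network Station Location Channel Quality SampleRate Earliest Latest Updated TimeSpans Restriction
IM EKB -- BHZ M 39.55 2016-02-23T16:00:05.899000Z 2016-02-23T16:00:07.795334Z 2020-02-03T22:06:23Z 1 OPEN
IM EKB -- BHZ M 40.0 2014-03-17T00:00:00.000000Z 2017-04-18T10:04:59.975000Z 2020-02-03T22:06:37Z 1296 OPEN
"""

rows = parse_availability_text(sample)
# Pick the entry backed by the most time spans as the representative extent.
best = max(rows, key=lambda r: int(r["TimeSpans"]))
print(best["SampleRate"], best["Earliest"], best["Latest"])
```

Because the column names come from the header line, the same function handles both the per-sample-rate and the merged form of the output.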
We can now try to ask for data using dataselect. Let’s go for a small time window of BHZ data on Christmas Eve 2014:
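A request of this kind might be put together as in the sketch below. The base URL follows the fdsnws pattern and is an assumption, the helper names are hypothetical, and the one-minute window is only an illustrative choice on that date, not necessarily the request that produced the response shown next.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

# Assumed base URL for the IRIS fdsnws-dataselect service.
DATASELECT_BASE = "https://service.iris.edu/fdsnws/dataselect/1"


def dataselect_url(net, sta, loc, cha, start, end, nodata=404):
    """Build a dataselect query URL for one channel and time window."""
    params = {"net": net, "sta": sta, "loc": loc, "cha": cha,
              "start": start, "end": end, "nodata": nodata}
    # Keep ':' unescaped so the timestamps stay readable in the URL.
    return f"{DATASELECT_BASE}/query?{urlencode(params, safe=':')}"


def describe_status(code):
    """Give a plain-language reading of the HTTP status of a dataselect reply."""
    if code == 200:
        return "data returned"
    if code in (204, 404):
        return "no data for this selection"
    return f"unexpected status {code}"


def fetch(url):
    """Return (status, body); a 'no data' 404 is reported rather than raised."""
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


# Illustrative window only (an assumption, see the note above).
url = dataselect_url("IM", "EKB", "--", "BHZ",
                     "2014-12-24T20:54:00", "2014-12-24T20:55:00")
print(url)
# status, body = fetch(url)   # uncomment to contact the service
# print(describe_status(status))
```

A status of 204 or 404 means the selection matched nothing in the archive for that window.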
Error 404: handler exited, no error message from handler More Details: handler exited, code: 0 reason: No Content
We get no data back! Why is this?
Let’s look at a more detailed account of the data contained in the archive for this station using the query operator in the availability service. The parameters vary only slightly since the records do not aggregate as they do in extent mode.
#Network Station Location Channel Quality SampleRate Earliest Latest
IM EKB -- BHZ M 40.0 2014-03-17T00:00:00.000000Z 2014-03-18T00:14:55.175000Z
IM EKB -- BHZ M 40.0 2014-03-18T00:15:00.000000Z 2014-03-18T00:15:59.975000Z
IM EKB -- BHZ M 40.0 2014-03-18T00:16:30.000000Z 2014-03-18T00:26:18.625000Z
IM EKB -- BHZ M 40.0 2014-03-18T00:29:00.000000Z 2014-03-18T00:29:49.725000Z
....etc....
IM EKB -- BHZ M 40.0 2014-12-23T23:30:00.000000Z 2014-12-24T20:14:39.975000Z
IM EKB -- BHZ M 40.0 2014-12-24T20:14:50.000000Z 2014-12-24T20:44:39.975000Z
IM EKB -- BHZ M 40.0 2014-12-24T20:44:50.000000Z 2014-12-24T20:53:26.100000Z
IM EKB -- BHZ M 40.0 2014-12-24T20:55:10.000000Z 2014-12-24T20:59:49.975000Z
IM EKB -- BHZ M 40.0 2014-12-24T21:00:00.000000Z 2014-12-26T20:08:09.975000Z
IM EKB -- BHZ M 40.0 2014-12-26T20:09:00.000000Z 2014-12-26T20:10:57.975000Z
IM EKB -- BHZ M 40.0 2014-12-26T20:11:10.000000Z 2014-12-26T20:59:19.975000Z
IM EKB -- BHZ M 40.0 2014-12-26T20:59:30.000000Z 2014-12-26T20:59:57.525000Z
What we get is a very long list of time extents for this channel. Clearly, the data is quite discontinuous, so there are many gaps where a query might fail. In our example above, we happened to request data right in a gap. With really gappy data, it is generally easier to ask for a larger time span:
We now get data back! We can make more refined time span queries to the availability service to know whether a query will work. Let’s test this out for a time period in 2015:
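A time-bounded availability query could be assembled along these lines. The base URL is assumed and the helper name is hypothetical. The window of 2015-03-01 to 2015-03-10 is inferred from the first and last timestamps in the listing that follows, so treat it as a plausible reconstruction rather than the exact request.

```python
from urllib.parse import urlencode

# Assumed base URL for the IRIS fdsnws-availability service.
AVAILABILITY_BASE = "https://service.iris.edu/fdsnws/availability/1"


def availability_query_url(net, sta, loc, cha, start, end,
                           nodata=404, fmt="text"):
    """Build an availability 'query' URL restricted to a time window."""
    params = {"net": net, "sta": sta, "loc": loc, "cha": cha,
              "start": start, "end": end,
              "nodata": nodata, "format": fmt}
    # Keep ':' unescaped so the timestamps stay readable in the URL.
    return f"{AVAILABILITY_BASE}/query?{urlencode(params, safe=':')}"


# Window inferred from the listing below (an assumption).
print(availability_query_url("IM", "EKB", "--", "BHZ",
                             "2015-03-01T00:00:00", "2015-03-10T00:00:00"))
```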
#Network Station Location Channel Quality SampleRate Earliest Latest
IM EKB -- BHZ M 40.0 2015-03-01T00:00:00.000000Z 2015-03-02T11:37:19.975000Z
IM EKB -- BHZ M 40.0 2015-03-02T11:38:20.000000Z 2015-03-05T23:59:59.975000Z
IM EKB -- BHZ M 40.0 2015-03-06T00:14:30.000000Z 2015-03-06T16:40:31.150000Z
IM EKB -- BHZ M 40.0 2015-03-06T16:40:40.000000Z 2015-03-06T16:44:45.950000Z
IM EKB -- BHZ M 40.0 2015-03-06T16:45:00.000000Z 2015-03-08T00:48:45.000000Z
IM EKB -- BHZ M 40.0 2015-03-08T00:48:50.000000Z 2015-03-08T01:00:49.975000Z
IM EKB -- BHZ M 40.0 2015-03-08T01:01:00.000000Z 2015-03-08T01:13:59.975000Z
IM EKB -- BHZ M 40.0 2015-03-08T01:15:00.000000Z 2015-03-08T07:46:15.350000Z
IM EKB -- BHZ M 40.0 2015-03-08T07:46:20.000000Z 2015-03-10T00:00:00.000000Z
This is a manageable list to process and can help determine whether or not to make a dataselect call with that time span.
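One way to turn such a list into a go/no-go decision is sketched below. It assumes the Earliest/Latest pairs have already been pulled out of the text output as strings, and the function names are hypothetical. A request window that overlaps no span will come back empty, while a window fully inside one span should return continuous data.

```python
from datetime import datetime


def parse_time(stamp):
    """Parse timestamps of the form 2015-03-02T11:37:19.975000Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def covering_span(spans, start, end):
    """Return the (earliest, latest) pair that fully contains [start, end], else None."""
    for earliest, latest in spans:
        if parse_time(earliest) <= start and end <= parse_time(latest):
            return (earliest, latest)
    return None


def overlaps_any(spans, start, end):
    """True if the window [start, end] touches at least one available span."""
    return any(parse_time(e) < end and start < parse_time(l) for e, l in spans)


# First three spans from the 2015 listing above.
spans = [
    ("2015-03-01T00:00:00.000000Z", "2015-03-02T11:37:19.975000Z"),
    ("2015-03-02T11:38:20.000000Z", "2015-03-05T23:59:59.975000Z"),
    ("2015-03-06T00:14:30.000000Z", "2015-03-06T16:40:31.150000Z"),
]

in_gap = (datetime(2015, 3, 2, 11, 37, 30), datetime(2015, 3, 2, 11, 38, 0))
in_span = (datetime(2015, 3, 3, 0, 0, 0), datetime(2015, 3, 3, 1, 0, 0))

print(overlaps_any(spans, *in_gap))    # window sits entirely inside a gap
print(covering_span(spans, *in_span))  # window sits inside the second span
```

If `overlaps_any` is false, the dataselect call can be skipped or the window widened. If `covering_span` returns a span, the request should come back without gaps.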
Knowing which repository to access
IRIS Data Services has two separate data repositories for seismic data. In the examples above, the FDSN-compliant services point to our primary miniSEED archive. There is a second archive that houses experiment data, often data sets that use active-source excitation of the ground signal. We refer to this repository by its archival format: PH5.
The PH5 web services (station, dataselect, and availability) function in generally the same way as their FDSN counterparts, though you can also collect shot gathers and the like from a PH5 data query. Still, when it comes to data availability, if you can’t find the station in one service, you might check the other.
For instance, the experiment labeled 1D_2008 (Erebus) is something you won’t find when looking in the FDSN station service or the FDSN availability service:
Error 404 NOT_FOUND: No data found!
However, you will find it in the PH5 archive, swapping out fdsnws with ph5ws in the URL:
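The swap itself is a one-line change. In the sketch below, the station query parameters are illustrative assumptions and the helper name `to_ph5` is hypothetical. Only the fdsnws to ph5ws substitution is taken from the article.

```python
def to_ph5(url):
    """Point an IRIS fdsnws URL at the equivalent ph5ws path."""
    # Replace only the first path segment match, leaving the query untouched.
    return url.replace("/fdsnws/", "/ph5ws/", 1)


# Illustrative station query for network 1D (parameters are assumptions).
fdsn_url = "https://service.iris.edu/fdsnws/station/1/query?net=1D&format=text"
print(to_ph5(fdsn_url))
```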
#Network | Station | Latitude | Longitude | Elevation | SiteName | StartTime | EndTime
1D|1001|-77.5245|166.964417|2100.0|1001|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1002|-77.548567|166.97205|2066.0|1002|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1003|-77.508133|166.931617|1952.0|1003|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1004|-77.4965|166.965167|2055.0|1004|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1005|-77.492167|167.051167|2450.0|1005|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1006|-77.492083|167.105167|2588.0|1006|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1007|-77.5628|166.9777|1786.0|1007|2008-11-24T20:50:50|2009-01-09T03:13:46
1D|1008|-77.504183|167.336983|2420.0|1008|2008-11-24T20:50:50|2009-01-09T03:13:46
...etc...
And the data time extents:
#Network Station Location Channel Quality Earliest Latest Updated TimeSpans Restriction
1D 1001 -- EH2 D 2008-12-10T04:26:27.850000Z 2008-12-24T01:26:27.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1001 -- EHZ D 2008-12-10T04:26:27.850000Z 2008-12-24T01:26:27.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1002 -- EH1 D 2008-12-10T04:55:20.190000Z 2008-12-24T01:55:20.190000Z 2019-09-18T17:53:29Z 8 OPEN
1D 1002 -- EH2 D 2008-12-10T04:55:20.190000Z 2008-12-24T01:55:20.190000Z 2019-09-18T17:53:29Z 8 OPEN
1D 1002 -- EHZ D 2008-12-10T04:55:20.190000Z 2008-12-24T01:55:20.190000Z 2019-09-18T17:53:29Z 8 OPEN
1D 1003 -- EH1 D 2008-12-10T05:07:20.850000Z 2008-12-24T02:07:20.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1003 -- EH2 D 2008-12-10T05:07:20.850000Z 2008-12-24T02:07:20.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1003 -- EHZ D 2008-12-10T05:07:20.850000Z 2008-12-24T02:07:20.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1004 -- EH1 D 2008-12-10T04:44:41.850000Z 2008-12-24T01:44:41.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1004 -- EH2 D 2008-12-10T04:44:41.850000Z 2008-12-24T01:44:41.850000Z 2019-09-18T17:53:29Z 7 OPEN
1D 1004 -- EHZ D 2008-12-10T04:44:41.850000Z 2008-12-24T01:44:41.850000Z 2019-09-18T17:53:29Z 7 OPEN
...etc...
Though we do not have a single tool that spans both of these archives, we do have a couple of convenient tools that can tell you which archive holds a given network and year. The first is the MetaData Aggregator web tool. The thing to remember with temporary experiments is that the network identifier gets recycled, so it can be tied to different experiments, depending on the year. In this case, 1D_2008 is a two-year experiment, and we can see this in the tool view:
Another way to find which repository holds a dataset is to use the fedcatalog web service. This is a search directory that lists stations and channels available at more than 20 FDSN data repositories around the globe. So, when you do a search here, you can find out whether data is available at IRIS, or perhaps the Netherlands, or many other locations.
nodata=404
DATACENTER=IRISPH5,http://ds.iris.edu
DATASELECTSERVICE=http://service.iris.edu/ph5ws/dataselect/1/
STATIONSERVICE=http://service.iris.edu/ph5ws/station/1/
EVENTSERVICE=http://service.iris.edu/ph5ws/event/1/
AVAILABILITYSERVICE=http://service.iris.edu/ph5ws/availability/1/
1D 1001 -- EH1 2008-11-24T20:50:50 2009-01-09T03:13:46
1D 1001 -- EH2 2008-11-24T20:50:50 2009-01-09T03:13:46
1D 1001 -- EHZ 2008-11-24T20:50:50 2009-01-09T03:13:46
1D 1002 -- EH1 2008-11-24T20:50:50 2009-01-09T03:13:46
...etc...
The top header in the fedcatalog return shows you the URLs for the repository to reach out to. Keep in mind that many centers in the FDSN do not have an availability service, but more and more of these centers are beginning to offer one as time goes on.
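For client scripts, the header can be read mechanically. The sketch below assumes the layout seen in the listing above: KEY=VALUE lines, with each DATACENTER line opening a new block, followed by whitespace-separated channel lines. The function name `parse_fedcatalog` is hypothetical, and a center that offers no availability service simply will not have an AVAILABILITYSERVICE entry.

```python
def parse_fedcatalog(text):
    """Split fedcatalog text output into request parameters and per-datacenter blocks."""
    params, centers, current = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current datacenter block
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            if key == "DATACENTER":
                current = {"name": value.split(",")[0],
                           "services": {}, "channels": []}
                centers.append(current)
            elif current is None:
                params[key] = value  # request-level settings such as nodata
            else:
                current["services"][key] = value
        elif current is not None:
            # net, sta, loc, cha, start, end
            current["channels"].append(tuple(line.split()))
    return params, centers


sample = """\
nodata=404
DATACENTER=IRISPH5,http://ds.iris.edu
DATASELECTSERVICE=http://service.iris.edu/ph5ws/dataselect/1/
AVAILABILITYSERVICE=http://service.iris.edu/ph5ws/availability/1/
1D 1001 -- EH1 2008-11-24T20:50:50 2009-01-09T03:13:46
1D 1001 -- EH2 2008-11-24T20:50:50 2009-01-09T03:13:46
"""

params, centers = parse_fedcatalog(sample)
for center in centers:
    avail = center["services"].get("AVAILABILITYSERVICE", "(none listed)")
    print(center["name"], avail, len(center["channels"]))
```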
IRIS Data Services performs detailed scans on the data that it archives and makes this information available to users through the availability web service. This service provides a helpful guide to users and client software in making smart and informed data access choices to IRIS Web Services. Though we do not have the space here to cover the more esoteric aspects of data availability, the crossover to real-time data in the pre-archive stage, or the time it takes for data availability information to trickle up to the service, you can count on these services to give you reliable information on what we have in our holdings and which repository they can be found in.
We hope you learned from this How-To Corner segment and encourage you to make use of these services to make your data gathering efforts productive and efficient.
1. MetaData Aggregator: https://ds.iris.edu/mda
2. GMAP: https://ds.iris.edu/gmap
3. Data Access Tools: https://ds.iris.edu/ds/nodes/dmc/tools
4. Note: we also support SeedLink for real-time feeds to clients
5. MUSTANG measures data_latency, feed_latency, and total_latency. See: https://service.iris.edu/mustang/metrics/1
by Rob Casey (IRIS Data Management Center)