Advanced: Discovering and Downloading Data with Python
This tutorial builds on previous knowledge of searching for and downloading Analysis Ready Data (ARD) from a dynamic Spatio-Temporal Asset Catalog (STAC) using the Python programming language. As queries become more advanced, new tools can help speed up the process. In this tutorial we will use `asyncio` to help us retrieve a list of the URLs we want before using `rclone` to download them from the cloud. Rclone is preferred for accessing the cloud over `wget` or `urllib.request.urlretrieve`.
For this tutorial you will need `pystac_client` (refer to “Discovering and Downloading Data with Python” for an example of how to set it up), `pandas`, `geopandas`, `shapely`, `pvl`, `aiohttp`, and `asyncio`. These packages can all be installed with pip or conda, depending on preference.
Why use async? Async allows you to write more efficient and responsive code. Traditionally, Python programs execute code sequentially, which can lead to delays while waiting for tasks to complete (such as network requests). Async is based on asynchronous programming and is implemented through Python’s `asyncio` module. It enables non-blocking execution. In this tutorial we will use it to quickly find the data we need.
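As a concrete illustration of that non-blocking style, here is a minimal sketch of fetching several URLs concurrently with `asyncio` and `aiohttp`. The function names and URLs are placeholders, not part of the original tutorial.

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # Issue one GET request without blocking the event loop.
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    # Share one session and run every request concurrently instead of one by one.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Example usage (placeholder URLs):
# pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```

Because the requests run concurrently, the total wait is roughly that of the slowest request rather than the sum of all of them.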
In this tutorial we want to use the API to access several databases and then retrieve URLs that match a certain criterion. For the purposes of this example, we will be looking at the incidence angle of each image. Let's start by getting our file set up; for this example, we will say we are trying to get images within a certain bounding box.
With this setup code we can then go on to create a GeoDataFrame to store the results of the query: for example, the link to the image, the image id, the geometry, the incidence angle, etc. In the example below, the data is stored in `items.gdf`.
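A sketch of one way to do this is below, assuming each STAC item exposes an `incidence_angle` property and a data asset keyed `"image"` (both assumptions; adjust the keys to match the collections you are querying).

```python
import geopandas as gpd
from shapely.geometry import shape

def items_to_gdf(items, max_incidence=60.0):
    """Collect STAC items into a GeoDataFrame and filter on incidence angle."""
    records = []
    for item in items:
        records.append({
            "id": item.id,
            "url": item.assets["image"].href,  # hypothetical asset key
            "incidence_angle": item.properties.get("incidence_angle"),
            "geometry": shape(item.geometry),
        })
    gdf = gpd.GeoDataFrame(records, crs="EPSG:4326")
    # Keep only the images at or below the incidence-angle threshold.
    return gdf[gdf["incidence_angle"] <= max_incidence]
```

The resulting frame holds everything needed later: the ids for bookkeeping, the geometries for spatial checks, and the URLs to hand off to the downloader.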
Now that we have all the images that intersect our given bounding box, we want to download them locally. All of these images are hosted in the cloud, so this is a great time to use rclone. First, let's build the paths to where the data is hosted in the cloud.
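For example, if the search returned `s3://` URLs, they can be rewritten into rclone remote paths and saved to a text file for the download script. The remote name `s3_noauth` and the list filename follow the rest of this tutorial, but the `s3://bucket/key` URL layout is an assumption.

```python
def to_rclone_path(url, remote="s3_noauth"):
    """Convert an s3:// URL into an rclone remote path like remote:bucket/key."""
    return url.replace("s3://", f"{remote}:", 1)

def write_download_list(urls, path="kaguyatc_images_to_download.txt"):
    """Write one rclone remote path per line, ready for the download script."""
    with open(path, "w") as f:
        for url in urls:
            f.write(to_rclone_path(url) + "\n")
```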
Here `s3_noauth` is the remote name as it appears in your config file. To use rclone you must first set up an `rclone.conf` file. It should be stored in your home directory at `~/.config/rclone/rclone.conf`. Open it with vim (`vim ~/.config/rclone/rclone.conf`) and insert the following:
[s3_noauth]
type = s3
provider = AWS
env_auth = false
region = us-west-2
Now rclone is all set up and ready to use! Although there are Python packages that let you call rclone from Python, they are a bit clunky and not as well supported. As such, it is recommended to create the following script to quickly and easily download your files. Save this script as `download_files.sh`.
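A sketch of what `download_files.sh` could look like: it reads one rclone remote path per line from the list file and copies each into the destination folder. The function wrapper and the `--progress` flag are choices made for this sketch; `rclone copy` itself is a standard rclone command.

```shell
#!/bin/sh
# download_files.sh: download each rclone remote path listed in a text file.
# Usage: sh download_files.sh <list_file> <destination_folder>

download_files() {
    list_file="$1"
    dest="$2"
    while IFS= read -r remote_path; do
        [ -z "$remote_path" ] && continue        # skip blank lines
        rclone copy "$remote_path" "$dest" --progress
    done < "$list_file"
}

# Run only when both arguments are supplied.
if [ "$#" -eq 2 ]; then
    download_files "$1" "$2"
fi
```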
It can then be run with a command like: `sh download_files.sh kaguyatc_images_to_download.txt /path/to/download/folder`
That's it! You have successfully downloaded images using rclone and async. Hopefully this will speed up your downloading and fetching processes.