This repository contains scripts to download the data presented in the paper MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning. The scripts download large-scale satellite data from different sensors and satellites (Sentinel-2, Sentinel-1, ERA5 temperature and precipitation, Aster GDEM, etc.), which we call modalities. The data is downloaded from Google Earth Engine (GEE).
Last Updated on 2024.11.07
The MMEarth data can be downloaded using the following links. To make development with multi-modal data easier, we also provide two smaller "taster" datasets alongside the original MMEarth data. The license for the data is CC BY 4.0.
Dataset | Image Size | Number of Tiles | Dataset size | Data Link | Bash Script |
---|---|---|---|---|---|
MMEarth | 128x128 | 1.2M | 597GB | download | bash |
MMEarth64 | 64x64 | 1.2M | 152GB | download | bash |
MMEarth100k | 128x128 | 100k | 48GB | download | bash |
All three datasets share the structure shown below:
.
├── data_1M_v001/ # root data directory
│ ├── data_1M_v001.h5 # h5 file containing the data
│ ├── data_1M_v001_band_stats.json # json file containing information about the bands present in the h5 file for each data stack
│ ├── data_1M_v001_splits.json # json file containing information for train, val, test splits
│ └── data_1M_v001_tile_info.json # json file containing additional meta information of each tile that was downloaded.
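As a minimal sketch of how this layout can be used, the helper below builds the four expected file paths from a root folder and a dataset name, and reports which of them are missing. The function names `dataset_paths` and `missing_files` are hypothetical and not part of this repository. The default name `data_1M_v001` is taken from the tree above; the directory names of the two smaller datasets are not stated here, so pass them explicitly.

```python
from pathlib import Path


def dataset_paths(root, name="data_1M_v001"):
    """Return the four files expected inside <root>/<name>/, keyed by role."""
    base = Path(root) / name
    return {
        "data": base / f"{name}.h5",                      # h5 file containing the data
        "band_stats": base / f"{name}_band_stats.json",   # band information per data stack
        "splits": base / f"{name}_splits.json",           # train, val, test splits
        "tile_info": base / f"{name}_tile_info.json",     # meta information per tile
    }


def missing_files(root, name="data_1M_v001"):
    """List the roles whose file is absent, so problems surface before loading."""
    return [role for role, path in dataset_paths(root, name).items() if not path.exists()]


if __name__ == "__main__":
    # "." is only a placeholder for the folder the dataset was downloaded into.
    print(missing_files("."))
```

Running such a check before opening the h5 file gives a clearer error than a failed read halfway through a training script.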
A sample Jupyter Notebook that shows how to load the data using PyTorch is available here. Alternatively, the dataloader has also been added to TorchGeo.
To get started with this repository, install the dependencies and packages with this command:
pip install -r requirements.txt
Once this is done, you need to set up gcloud and earthengine to make the code work. Follow the steps below:
- Authenticate your account by running earthengine authenticate.
- earthengine authenticate may not work directly, since it gives you multiple links to click and these links do not work when opened from the browser on your local machine. In that case, run earthengine authenticate --quiet instead and follow the instructions in your terminal.
- Add the project name in every file that has earthengine.initialize(project = '$PROJECT_NAME').
This repository allows downloading data from various sensors. The code is currently written to download the sensors/modalities introduced above (Sentinel-2, Sentinel-1, ERA5 temperature and precipitation, Aster GDEM, etc.).
Data is downloaded only for the tiles listed in a GeoJSON file. Here, a tile is the ROI (polygon) of a location that you want. Once you have the tiles, the data stacks (the data for each modality) are downloaded for each tile in the GeoJSON. The download follows this broad structure, and each point is explained further below:
- create_tiles_polygon.py is the file used to create the tiles. The corresponding config is config/config_tiles.yaml. For a global sample, the various sampling techniques are based on the biomes and ecoregions from the RESOLVE ECOREGIONS.
- main_download.py is the main script to download the data. The corresponding config is config/config_data.yaml. The config file contains various parameters for the different modalities and for paths. Based on the GeoJSON file created in the previous step, this script downloads the data stacks for each tile.
- ee_utils/ee_data.py contains custom functions for retrieving each modality in the data stack from GEE. It merges all these modalities into one array and exports it as a GeoTIFF file. The band information and other tile information are stored in a JSON file (tile_info.json).
- post_download.py performs 4 operations sequentially, one of which operates on the tile_info.json files (these files are created when downloading in parallel using SLURM, explained further below).
- redownload.py can be used to redownload any tiles that failed to download. When downloading the data stacks, the script can skip tiles for various reasons (lack of a Sentinel-2 reference image, network issues, GEE issues), so there is an option to redownload these tiles. An alternative is to simply download more tiles than needed.
(NOTE: The files are executed using SLURM. More information on this is provided in the Slurm Execution section.)
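To make the notion of a tile GeoJSON concrete, here is a minimal sketch that builds a FeatureCollection with one square polygon per location. It is not the output format of create_tiles_polygon.py: the function names, the `tile_id` property, and the tile size expressed in degrees are all assumptions chosen for illustration.

```python
import json


def square_tile(lon, lat, half_size_deg, tile_id):
    """Build one GeoJSON Feature: a square polygon centred on (lon, lat)."""
    west, east = lon - half_size_deg, lon + half_size_deg
    south, north = lat - half_size_deg, lat + half_size_deg
    # GeoJSON positions are [lon, lat]; a ring repeats its first point at the end.
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "properties": {"tile_id": tile_id},  # hypothetical property name
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def tiles_to_geojson(centres, half_size_deg=0.005):
    """Wrap one square tile per (lon, lat) centre into a FeatureCollection."""
    features = [
        square_tile(lon, lat, half_size_deg, tile_id)
        for tile_id, (lon, lat) in enumerate(centres)
    ]
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    collection = tiles_to_geojson([(12.57, 55.68), (-0.13, 51.51)])
    with open("tiles_example.geojson", "w") as f:
        json.dump(collection, f)
```

A file of this shape holds one polygon per location, which matches the description of tiles as ROIs. The actual script samples locations based on biomes and ecoregions rather than taking a hand-written list of centres.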
Downloading Data Stacks: GEE provides a function called getDownloadUrl() that allows you to export images as GeoTIFF files. We extend this by merging all modalities for a single location into one image and exporting it as a single GeoTIFF file. To further speed up the download, we use parallel processing with SLURM. The above figures give an idea of how this is done. The tile information (tile GeoJSON) contains the location and other information about the N tiles we need to download. The tiles are divided among 40 SLURM jobs, each of which downloads N/40 tiles (we set the maximum number of jobs to 40 since this is the maximum number of concurrent requests allowed by the GEE API).
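The N/40 split can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `tiles_for_job` is a hypothetical helper, and handing out contiguous slices (with the remainder spread over the first jobs) is one of several reasonable ways to divide the tiles.

```python
def tiles_for_job(tiles, job_index, num_jobs=40):
    """Return the contiguous slice of `tiles` assigned to job `job_index` (0-based)."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be in [0, num_jobs)")
    base, extra = divmod(len(tiles), num_jobs)
    # The first `extra` jobs take one additional tile so that no tile is dropped
    # when N is not a multiple of num_jobs.
    start = job_index * base + min(job_index, extra)
    stop = start + base + (1 if job_index < extra else 0)
    return tiles[start:stop]


if __name__ == "__main__":
    all_tiles = list(range(103))           # stand-in for the N tiles in the GeoJSON
    print(tiles_for_job(all_tiles, 0))     # tiles handled by the first job
    print(tiles_for_job(all_tiles, 39))    # tiles handled by the last job
```

If the jobs were launched as a SLURM job array, the job index could be read from the `SLURM_ARRAY_TASK_ID` environment variable. Whether slurm_download_parallel.sh uses a job array is not stated here.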
To run the SLURM parallel download, execute the following command:
sbatch slurm_scripts/slurm_download_parallel.sh
Please cite our paper if you use this code or any of the provided data.
Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, & Nico Lang (2024). MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning.
@misc{nedungadi2024mmearth,
title={MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning},
author={Vishal Nedungadi and Ankit Kariryaa and Stefan Oehmcke and Serge Belongie and Christian Igel and Nico Lang},
year={2024},
eprint={2405.02771},
archivePrefix={arXiv},
primaryClass={cs.CV}
}