This is a work in progress!
Use this software with caution!
If you find it helpful please usie the Issues
tab to identify problems or suggest improvements (there are already several things there)!
This repository contains code needed to access weather and climate data commonly re-used in the Doss-Gollin lab. None of the actual data is contained in this repository! Instead, it is downloaded from various external sources and then stored on the Rice Research Data Facility. Use of this facility is not free, so we should only store the datasets that we expect to use frequently. If you have questions, contact James.
All data is stored as NetCDF4 files for easy use with the user-friendly, accessible, and standard Pangeo ecosystem.
You don't need to use the codes in this repository to access the data. The data is stored on James's RDF account. See the lab manual on Notion for details of accessing the data.
Our philosophy is to avoid storing too much data on a single file.
To deal with datasets spread across many files, we suggest to leverage the open_mfdataset
functionality in xarray
.
- Radar precipitation data over the continental United States. For more details, see the README.
- ERA5 reanalysis data over the continental United States at 0.25 degree resolution. For more details, see the README. Specifically, we have the following variables:
- meridional and zonal wind at 500 hPa
- surface air temperature
- elevation (time-invariant, called geopotential)
- vertical integral of meridional and zonal water vapor flux
If you want to change the files we track or edit this repository, then this section is for you. This repository uses Snakemake to define and specify dependencies and workflows. If you've never used Snakemake before, you should read up on it.
We also use conda to manage dependencies.
This includes using conda to specify the dependencies for each workflow rule, making the workflow more concise and reproducible.
If you run into problems please use the Issues
tab on GitHub to bring them to our attention.
You will need conda
installed (or mamba
for maximum performance).
Then open a terminal and run the following lines
conda env create --file environment.yml # creates the environment
conda activate climatedata # activates the environment
All other required dependencies are described using Snakemake and anaconda environments (see Snakemake docs).
The location where data will be stored is defined in config.yaml
under the datadir
argument.
This should point to the location where the RDF is mounted.
To mount the RDF locally, follow the directions here.
If you follow the default settings, you shouldn't need to adjust the path specified in config.yaml
.
If you are mounting on Linux, be aware that you need to change the uid
argument.
For example the command
sudo mount.cifs -o username=jd82,domain=ADRICE,mfsymlinks,rw,vers=3.0,sign,uid=jd82 //smb.rdf.rice.edu/research $HOME/RDF
will mount the RDF to $HOME/RDF
for user jd82
.
If you copy and paste, be aware that this documentation doesn't make very clear that you need to change the uid
.
You can find this with the id
command.
To update the entire dataset, run something like
snakemake --use-conda --cores SOME_NUMBER
where SOME_NUMBER
is the number of cores you want to use (e.g., --cores all
).
There are many additional arguments you can pass -- see Snakemake docs.
It takes Snakemake a long time to build the DAG. When you're just rapidly prototyping things, this lag time is annoying. To speed up, you can batch files like this:
nohup snakemake /home/jd82/RDF/jd82/NEXRAD/2020-11.nc --use-conda --cores all --rerun-incomplete &
See dealing with very large workflows for more details.
If you just want to build the dataset, a good default command to use is
nohup snakemake all --use-conda --cores all --rerun-incomplete --keep-going
Note:
nohup
: Keep running on a remote machine viassh
, even if the connection closes (seenohup
docs).use_conda
: use anaconda for environments--cores all
: use all cores (workstation has 12)--rerun-incomplete
: reduces errors if a job was canceled earlier--keep-going
: if a file causes an error, don't give up
If you are on a different machine you can learn about how many cores are available with the lscpu
command.
In a nutshell, this will keep the process running even after you close your ssh
session.
Linters should run automatically if you are using VS Code with relevant extensions installed.
Before submitting a pull request, please activate nexrad
and then
snakefmt .
to reformat the Snakefileblack .
to reformat the codemypy . --ignore-missing-imports
to run type checks on the code. This is a good way to catch bugs.
Additionally, if you're messing with Snakefile
then snakemake --lint
is a helpful resource with good suggestions.
We are working to get these to run automatically using GitHub workflows. See this issue and help out if you're goot at GitHub Actions.
I am terrible at shell stuff so here are some handy commands
After running Snakemake, you may get some time steps that throw an error. Snakemake will generate a log file for you with something that looks like
Complete log: .snakemake/log/2022-10-11T071442.418042.snakemake.log
To parse this log file to get all the dates that threw errors, something like this will work:
grep "output:" PATH_TO_LOG_FILE | cut -c69- | sort | uniq > lines.log