---
title: "Exploring Air Quality, Social Vulnerability, and Health Outcomes in Metro Detroit"
author: "Joshua Brinks"
date: "September 4, 2024"
df-print: paged
code-block-bg: true
code-block-border-left: "#315c86"
format:
  html:
    keep-ipynb: true
bibliography: svi-references.bib
---
## Overview
In this lesson, we will introduce you to three public health and air quality datasets: the United States [Environmental Protection Agency](https://www.epa.gov/)'s (EPA's) [Toxic Release Inventory](https://www.epa.gov/toxics-release-inventory-tri-program/tri-toolbox) (TRI), the EPA's [Integrated Compliance Information System for Air](https://www.epa.gov/enviro/icis-air-overview) (ICIS-AIR), and the U.S. [Centers for Disease Control and Prevention's](https://www.cdc.gov/) (CDC's) [PLACES health outcomes](https://www.cdc.gov/places/index.html) dataset. You will learn how to form queries and retrieve each dataset from its respective API, process the returned object into a usable format, create visualizations, and perform simple analyses. You will also use the CDC's [Social Vulnerability Index](https://sedac.ciesin.columbia.edu/data/set/usgrid-us-social-vulnerability-index-rev01) (SVI), available through NASA, to examine socioeconomic patterns in the greater Detroit metropolitan area and explore relationships with public health.
::: {.callout-tip style="color: #5a7a2b;"}
## Programming Reminder
This lesson uses the Python programming environment.
:::
::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
#### Data Science Review
An Application Programming Interface (API) is a set of protocols, tools, and definitions that enable different software applications to communicate with each other. It acts as an intermediary, allowing one application to request specific data or actions from another application or service without needing access to its entire codebase. APIs define the methods and data formats that applications can use to request and exchange information, typically through specific *endpoints* (URLs) that accept requests and return responses in standardized formats like JSON. They are crucial for integrating external services, accessing databases, and enabling communication between different parts of larger open science software systems.
:::
:::
## Learning Objectives
After completing this lesson, you should be able to:
- Retrieve multiple public health datasets through the EPA and CDC's APIs.
- Create multipanel raster maps from multilayered SVI data.
- Geolocate tabular data using latitude and longitude.
- Create maps joining geolocated points and basemaps.
- Create maps with graduated point symbols to visualize point source pollution releases.
- Interpolate a raster surface from point spatial data.
- Perform raster math to create specialized environmental and socio-economic indexes.
- Perform spatial correlation analysis.
::: callout-warning
## Preface: A Note on Data Interpretation and Ethical Considerations
The topics of environmental justice, institutional racism, socioeconomic conditions, and pollution are complex and multifaceted. The data and analyses presented in this lesson are not intended to draw definitive conclusions or suggest scientific evidence of cause and effect relationships. Rather, they are meant to equip you with the skills to investigate data and perform analyses that could be applied to your local communities.
As you work through this lesson, remember that correlation does not imply [causation](https://www.coursera.org/articles/correlation-vs-causation). The patterns you may observe in the data could be influenced by factors not represented in these datasets. Approach your findings with caution, and consider the broader historical, social, and economic contexts that shape environmental and health outcomes in different communities.
This lesson will empower you with data literacy and analytical skills. We encourage you to use these tools and skills responsibly. Consider the ethical implications of your analyses and the potential impact on the communities represented in the data. When drawing insights or making decisions based on such analyses, it’s crucial to involve community stakeholders, consider multiple perspectives, and seek additional expertise when necessary.
:::
## Introduction
In many urban areas, air quality issues disproportionately affect low-income communities and communities of color [@tessum2021; @liu2021]. This disparity is a key focus of environmental justice efforts. In cities like Detroit and Chicago, industrial facilities, highways, and other pollution sources are often concentrated in disadvantaged neighborhoods. Detroit has a history of industrial pollution and is working to address air quality issues in areas with relatively lower socio-economic status like Southwest Detroit, where residents face higher exposure to pollutants from nearby industries and heavy truck traffic [@rector2017; @air_quality_michigan; @restum2022; @hartig2024; @safta2024]. Similarly, Chicago has seen community efforts to address air pollution in areas like the Southeast Side, where residents have fought against polluting industries and advocated for stricter regulations [@gordon2022; @nature_conservancy_2021; @illgner2022].
The EPA is the primary federal agency responsible for environmental protection in the United States. It sets and enforces air quality standards, including the [National Ambient Air Quality Standards (NAAQS)](https://www.epa.gov/criteria-air-pollutants/naaqs-table) for six criteria pollutants. The EPA also maintains the [AirNow](https://www.airnow.gov/about-airnow/) system, which provides real-time air quality data to the public. While primarily focused on public health, the CDC plays a crucial role in understanding the health impacts of air pollution. It conducts research on the relationships between air quality and various health outcomes, and provides guidance on protecting public health from environmental hazards [@cdc2024].
In response to these challenges, community-driven science initiatives have emerged. These efforts involve local residents in collecting data on air quality and other environmental factors, often using low-cost sensors and mobile monitoring techniques [@morawska2018; @lu2022; @park2021; @su2024]. This approach helps fill gaps in official monitoring networks and empowers communities to advocate for themselves. Open data is crucial for community-driven solutions in several ways:
- Transparency: Open data allows communities to verify official reports and hold authorities accountable.
- Accessibility: When air quality data is freely available, communities can use it to inform local decision-making and advocacy efforts.
- Innovation: Open data enables researchers, activists, and tech developers to create new tools and analyses that can benefit communities.
- Collaboration: Open data facilitates collaboration between communities, scientists, and policymakers, leading to more effective solutions.
While the EPA and CDC provide federal networks of open-access datasets like the Air Quality System (AQS), TRI, ICIS-AIR, and Population Level Analysis and Community Estimates (PLACES) collections, community driven data collection is a large part of driving change in traditionally underrepresented communities. In Chicago, the [Array of Things](https://arrayofthings.github.io/) project has installed sensors throughout the city, providing open data on various environmental factors including air quality, while Detroit's [Community Air Monitoring Project](https://www.michigan.gov/-/media/Project/Websites/egle/Documents/Reports/AQD/monitoring/2018-04-27-48217-air-monitoring-project.pdf?rev=33ae4f2a312149219bcdf9f25a538cca) uses low-cost sensors to collect and share air quality data in areas underserved by official monitoring stations.
With access to open data, communities can:
- Identify local air quality hotspots that may be missed by sparse official monitoring networks and reveal potential discrepancies in available data.
- Detect inconsistencies or errors in federal air quality data, ensuring more accurate and localized environmental assessments.
- Correlate air quality data with health outcomes to strengthen advocacy efforts.
- Develop targeted interventions, such as promoting indoor air filtration on high-pollution days.
- Create custom alerts and information systems tailored to local needs.
Although open data provides many benefits, challenges still remain. These include: 1) ensuring data quality and consistency, especially when integrating data from various sources; 2) addressing the "digital divide" to ensure all community members can access and use the data; 3) balancing the need for detailed local data with privacy concerns; and 4) building capacity within communities to effectively use and interpret complex environmental data.
::: {.callout-tip style="color: #5a7a2b;"}
#### EJ in the News
The [Clear the Air Coalition](https://cleartheairmi.org/), a new environmental justice advocacy group in Michigan, is calling for stronger protection of vulnerable communities from pollution. Launched in Detroit, the coalition argues that state regulators focus too much on technical compliance with environmental laws rather than public health. They claim the current permitting process favors polluters and fails to consider the cumulative impacts of pollution on overburdened communities. The group seeks changes in state law and a shift in mindset from environmental regulators.
[![Clear the Air Coalition](https://npr.brightspotcdn.com/dims4/default/42a4f01/2147483647/strip/true/crop/960x720+0+0/resize/1760x1320!/format/webp/quality/90/?url=http%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2F78%2F01%2F1711cf6e442dbe579d452c0fc6f6%2Fclearair.jpg)](https://www.michiganpublic.org/environment-climate-change/2024-05-02/new-environmental-justice-coalition-aims-to-change-state-law-pressure-egle)
The [Michigan Department of Environment, Great Lakes, and Energy (EGLE)](https://www.michigan.gov/egle/) emphasizes its commitment to protecting underserved communities and notes improvements in air quality over recent decades, but Coalition members argue that the current approach is insufficient. They call for regulators to consider the combined effects of multiple pollution sources in marginalized areas and to give local governments more power to reject new polluting facilities in already burdened communities.
:::
The Michigan EGLE and the Office of the [Environmental Justice Public Advocate](https://www.michigan.gov/egle/about/organization/environmental-justice) are two of the more active state departments aiming to address issues of the environment, public health, and social justice. Here are some of the projects enacted in the greater Detroit area:
- [Detroit Environmental Agenda](https://detroitenv.org/): This community-led initiative works closely with EGLE to address environmental concerns in Detroit. It focuses on issues like air quality, water quality, and waste management.
- [48217 Community Air Monitoring Project](https://www.michigan.gov/-/media/Project/Websites/egle/Documents/Reports/AQD/monitoring/2018-04-27-48217-air-monitoring-project.pdf): Named after the zip code of a heavily industrialized area in Southwest Detroit, this project involves community members working with EGLE to monitor air quality using low-cost sensors.
- [Detroit Climate Action Plan](https://detroitenvironmentaljustice.org/climate-action-plan/): Developed in partnership with EGLE, this plan addresses climate change impacts on the city, with a focus on vulnerable communities.
- [Delray Neighborhood Initiatives](https://detroitmi.gov/departments/planning-and-development-department/neighborhood-plans/central-design-region/delray): EGLE has been involved in efforts to address air quality concerns in the Delray neighborhood, which is impacted by industrial emissions and heavy truck traffic.
- [Green Door Initiative](https://www.greendoorinitiative.org/): This Detroit-based organization collaborates with EGLE on various environmental justice projects, including lead abatement and air quality improvement efforts.
- [Detroit River Sediment Cleanup](https://www.epa.gov/great-lakes-aocs/detroit-riverwalk-great-lakes-legacy-act-project-detroit-river-aoc): EGLE has been involved in efforts to clean up contaminated sediments in the Detroit River, which disproportionately affects nearby low-income communities.
- [Asthma Prevention Programs](https://getasthmahelp.org/asthma-collaborative-of-detroit.aspx): EGLE supports community-based asthma prevention programs in Detroit, recognizing the link between air quality and asthma rates in disadvantaged neighborhoods.
These are just a few examples that demonstrate the ongoing collaboration between community groups, local government, and EGLE to address environmental justice concerns in Detroit.
Air quality issues in urban areas disproportionately affect low-income communities and communities of color, making it a critical focus for environmental justice efforts. Federal agencies like the EPA and CDC play crucial roles in setting standards and conducting research, but community-driven science initiatives have emerged as powerful tools for local action. Open data is key to these efforts, enabling transparency, accessibility, innovation, and collaboration. Cities like Detroit and Chicago are at the forefront of these initiatives, with projects that empower residents to monitor and advocate for better air quality. While challenges remain, particularly in data management and accessibility, the collaboration between community groups, local governments, and state agencies like Michigan's EGLE demonstrates a promising path forward.
## Data Analysis and Exercises
To better understand and address these complex issues of air pollution, public health impacts, and environmental inequities, we'll work through a series of coding and data science exercises. These hands-on activities will allow you to explore some of the open datasets and tools available to community members and stakeholders, offering practical insights into air quality, public health, and social vulnerability.
We'll begin with the ICIS-AIR dataset, then move on to the TRI Facility dataset, and finally work our way down to TRI Form A. After exploring these 3 air pollution datasets, we will take a look at the CDC PLACES health outcomes data and perform some simple analysis integrating the pollution and health outcomes data.
::: {.callout-tip style="color: #5a7a2b;"}
#### Data Science Review
This lesson uses the following Python modules:
- [`pandas`](https://pandas.pydata.org/): Essential for data manipulation and analysis.
- [`geopandas`](https://geopandas.org/): Extends pandas functionality to handle geospatial data.
- [`matplotlib`](https://matplotlib.org/): Used for creating static, animated, and interactive visualizations.
- [`numpy`](https://numpy.org/): Provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.
- [`requests`](https://docs.python-requests.org/en/latest/): Allows you to send HTTP requests and interact with APIs easily.
- [`contextily`](https://contextily.readthedocs.io/en/latest/): Adds basemaps to your plots, enhancing the visual context of your geospatial data.
- [`pygris`](https://pygris.readthedocs.io/en/latest/): Simplifies the process of working with US Census Bureau TIGER/Line shapefiles.
- [`rasterio`](https://rasterio.readthedocs.io/en/latest/): Reads and writes geospatial raster datasets.
- [`xarray`](https://xarray.pydata.org/): Introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.
- [`shapely`](https://shapely.readthedocs.io/en/latest/): Manipulation and analysis of geometric objects in the Cartesian plane.
- [`scipy`](https://www.scipy.org/): Used for scientific and technical computing, particularly the `stats` and `interpolate` submodules.
- [`rioxarray`](https://corteva.github.io/rioxarray/stable/): Extends xarray to make it easier to handle geospatial raster data.
- [`time`](https://docs.python.org/3/library/time.html): Provides various time-related functions (part of Python's standard library).
- [`tabulate`](https://pypi.org/project/tabulate/): Used for creating nicely formatted tables in various output formats.
- [`pysal`](https://pysal.org/): A library of spatial analysis functions.
- [`splot`](https://splot.readthedocs.io/en/latest/): Provides statistical plots for spatial analysis.
If you'd like to learn more about the functions used in this lesson, you can refer to the documentation on their respective websites.
The `pandas` module is essential for data manipulation and analysis, while `geopandas` extends its functionality to handle geospatial vector data (points, lines, and polygons with associated attributes). `matplotlib` is used for creating static, animated, and interactive visualizations. `numpy` provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
The `requests` module allows you to send HTTP requests and interact with APIs easily. `contextily` adds basemaps to your plots, enhancing the visual context of your geospatial data. `pygris` simplifies the process of working with US Census Bureau TIGER/Line shapefiles.
`rasterio` and `xarray` are used for working with geospatial raster data (spatial data format used to represent geographic information as a grid of cells or pixels). `rasterio` reads and writes geospatial raster datasets, while `xarray` introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.
`shapely` is used for manipulation and analysis of geometric objects. `scipy` provides additional tools for scientific computing, including statistical functions and interpolation methods. `rioxarray` combines the functionality of rasterio and xarray for easier handling of geospatial raster data.
`pysal` is a library for spatial analysis, providing tools for exploratory spatial data analysis. `splot` is a visualization module that works with pysal to create statistical plots for spatial analysis.
Make sure these modules are installed before you begin working with the code in this document.
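As a quick optional check, the short snippet below tries to import each of these modules and prints the name of any that are missing; install anything it reports (for example, with pip or conda) before continuing.
```{python}
# Optional sanity check: try importing each module used in this lesson and
# report any that are missing from the current environment.
import importlib

required_modules = [
    "pandas", "geopandas", "matplotlib", "numpy", "requests", "contextily",
    "pygris", "rasterio", "xarray", "shapely", "scipy", "rioxarray",
    "tabulate", "pysal", "splot",
]

for name in required_modules:
    try:
        importlib.import_module(name)
    except ImportError:
        print(f"Missing module: {name}")
```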
:::
### ICIS Air
[ICIS-AIR (Integrated Compliance Information System for Air)](https://www.epa.gov/enviro/icis-air-overview) is a comprehensive database maintained by the U.S. Environmental Protection Agency (EPA) that focuses specifically on air quality compliance and enforcement data. It tracks information related to stationary sources of air pollution, including their compliance with various Clean Air Act regulations, permit data, and enforcement actions. While ICIS-AIR and the [Toxic Release Inventory (TRI)](https://www.epa.gov/toxics-release-inventory-tri-program) are separate systems, they have significant overlap in their coverage of industrial facilities and air emissions. Many facilities that report to TRI also have data in ICIS-AIR, providing complementary information.
Where TRI focuses on the quantities of specific toxic chemicals released into the air, ICIS-AIR offers a broader picture of a facility's overall air quality compliance status, including non-toxic pollutants regulated under the Clean Air Act. Together, these systems provide a more comprehensive view of industrial air pollution sources, enabling researchers, regulators, and the public to assess both the types and quantities of air pollutants emitted (from TRI) and the regulatory compliance status of the emitting facilities (from ICIS-AIR).
#### Data Processing
##### Defining Our Study Area
Before we can query the API for ICIS-AIR regulated facilities in the greater Detroit area, we need to create a spatial object that defines our study area and that we can use throughout this analysis. The U.S. Census Bureau defines the Detroit, MI Metropolitan Statistical Area (MSA) as including 6 counties (Wayne, Macomb, Oakland, Lapeer, Livingston, and St. Clair), but for this lesson we will focus on the core 3 counties (Wayne, Macomb, and Oakland); both population and air emissions related facilities are sparser in the outer counties.
We can use **pygris** to get vector boundaries for the counties and dissolve them into a single boundary we can use for cropping data and restricting API searches.
```{python}
import geopandas as gpd
import pandas as pd
import pygris
from shapely.geometry import box
counties = ['Wayne', 'Oakland', 'Macomb']
# Fetch the county boundaries for Michigan (MI) from the pygris dataset for the year 2022.
# Then, filter the DataFrame to only include the counties in the Detroit metro area (Wayne, Oakland, Macomb).
metro_counties = pygris.counties(state="MI", year=2022)
detroit_metro = metro_counties[metro_counties['NAME'].isin(counties)]
# Combine the geometries of the selected counties into a single polygon (dissolve by state code).
detroit_metro = detroit_metro.dissolve(by='STATEFP')
# Convert the dissolved geometries into a GeoDataFrame and ensure the coordinate reference system (CRS) is EPSG:4269.
detroit_metro = gpd.GeoDataFrame(detroit_metro, geometry='geometry', crs='EPSG:4269')
# Obtain the total bounding box for the Detroit metro area polygon.
bbox = detroit_metro.total_bounds
# Create a bounding box polygon from the bounding coordinates and store it in a GeoDataFrame with the same CRS.
bbox_polygon = gpd.GeoDataFrame(
    geometry=[box(*bbox)],
    crs=detroit_metro.crs
)
```
Let's verify this by overlaying it on a basemap.
```{python}
import matplotlib.pyplot as plt
import contextily as ctx
# Initialize a figure and axis with a 7x7 inch size for plotting.
fig, ax = plt.subplots(figsize=(7, 7))
# Reproject the Detroit metro area and bounding box to Web Mercator (EPSG:3857) to align with the basemap.
detroit_metro_bm = detroit_metro.to_crs(epsg=3857)
bbox_polygon_bm = bbox_polygon.to_crs(epsg=3857)
# Plot the Detroit metro area with a light blue fill and dark blue edges at 30% opacity.
detroit_metro_bm.plot(ax=ax, color='lightblue', edgecolor='darkblue', alpha=0.3)
# Plot the boundary of the bounding box with a grey outline and a thicker line width.
bbox_polygon_bm.boundary.plot(ax=ax, color='grey', linewidth=2)
# Add a basemap from OpenStreetMap (Mapnik) to provide background geographic context.
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
# Adjust the map's visible area (extent) to fit the bounding box.
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])
# Remove axis lines and labels for a cleaner map visualization.
ax.set_axis_off()
# Add a title to the plot.
plt.title("Detroit Metro Study Area", fontsize=16)
# Optimize layout to ensure proper spacing around the plot.
plt.tight_layout()
# Display the final plot.
plt.show()
```
This looks correct, and we'll use this boundary object repeatedly throughout our analysis.
#### Querying the API
The EPA's [Enforcement and Compliance History Online (ECHO)](https://echo.epa.gov/) website serves as a comprehensive portal for environmental compliance and enforcement data. It includes the ECHO Data Service, which provides programmatic access to EPA data through [RESTful](https://echo.epa.gov/tools/web-services) APIs. Among these is the ICIS-AIR API, which specifically offers access to air quality compliance and enforcement data. This API allows users to retrieve detailed information about stationary sources of air pollution, including facility details, permit data, emissions reports, and compliance status.
The ECHO API requires that users follow a multi-step workflow to acquire data via the [Facility Search Air API](https://echo.epa.gov/tools/web-services/facility-search-air):
1. Use get_facilities to validate parameters, get summary statistics, and obtain a query_id (QID), valid for about 30 minutes.
2. Use get_qid with the obtained QID to paginate through facility results.
3. Use get_map with the QID to visualize and navigate facility locations.
4. Use get_download with the QID to generate a CSV file of facility information.
The get_facility_info endpoint operates independently, returning either clustered summary statistics or an array of individual facilities. This API structure allows for efficient querying, visualization, and extraction of air quality compliance data, making it a valuable resource for environmental analysis and research.
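Steps 1 and 2 of this workflow are implemented in the code that follows. For completeness, here is a minimal, hypothetical sketch of step 4; it assumes the `get_download` service accepts the same `qid` and `output` parameters as the other endpoints, so check the Facility Search Air documentation before relying on it.
```{python}
import requests

# Hypothetical sketch of step 4: request a CSV download of the facilities matched
# by an existing query ID (qid). The qid/output parameters are assumed to mirror
# the get_qid service; verify against the ECHO documentation before use.
def download_facilities_csv(qid, path="icis_air_facilities.csv"):
    url = "https://echodata.epa.gov/echo/air_rest_services.get_download"
    response = requests.get(url, params={"qid": qid, "output": "CSV"})
    if response.status_code == 200:
        with open(path, "w") as f:
            f.write(response.text)
        print(f"Saved facility download to {path}")
    else:
        print(f"Download failed with status code {response.status_code}")
```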
We'll begin by querying this database for just Michigan facilities in the workflow detailed above. From there we'll subset with our list of counties.
We start with the base url for the ECHO API and our parameter list.
```{python}
# Base URL for the ECHO ICIS-AIR API, which provides access to air quality and compliance data from the EPA.
base_url = "https://echodata.epa.gov/echo/air_rest_services"
# Query parameters for the API call:
# - "output": Specifies the response format as JSON.
# - "p_st": Filters results to include only the state of Michigan (MI).
params = {
    "output": "JSON",
    "p_st": "MI"
}
```
Now we define two functions to carry out the first two steps: 1) identifying the facilities in our search and 2) retrieving the facility data.
```{python}
import requests
# Define a function to query the ECHO ICIS-AIR API for facilities of interest in Michigan.
# If successful, extract the Query ID (qid) and total number of facilities from the API response.
def get_facilities():
    response = requests.get(f"{base_url}.get_facilities", params=params)
    if response.status_code == 200:
        data = response.json()
        if 'Results' in data:
            qid = data['Results']['QueryID']  # Unique Query ID to fetch detailed facility data.
            print(f"Query ID: {qid}")
            print(f"Total Facilities: {data['Results']['QueryRows']}")  # Number of facilities in the query result.
            return qid
    # Log an error if the API request fails or the data is incomplete.
    print("Failed to get facilities and QID")
    return None

# Define a function to retrieve detailed facility data using the Query ID (qid).
# Retrieves paginated data, adding all facility results to the all_facilities list.
def get_facility_data(qid):
    all_facilities = []
    page = 1
    while True:
        # Query parameters include the Query ID and the page number for paginated results.
        params = {"qid": qid, "pageno": page, "output": "JSON"}
        response = requests.get(f"{base_url}.get_qid", params=params)
        if response.status_code == 200:
            data = response.json()
            if 'Results' in data and 'Facilities' in data['Results']:
                facilities = data['Results']['Facilities']  # List of facilities on the current page.
                if not facilities:  # Break if no more facilities to retrieve.
                    break
                all_facilities.extend(facilities)  # Add the facilities from the current page to the result list.
                print(f"Retrieved page {page}")
                page += 1
            else:
                break  # Break if 'Results' or 'Facilities' key is missing.
        else:
            print(f"Failed to retrieve page {page}")  # Log an error if the API request fails for the current page.
            break
    return all_facilities
```
Now that we've defined the functions we can make a request to the API with our Detroit metro counties.
```{python}
# Step 1: Retrieve the Query ID (qid) by querying the API for ICIS-AIR facilities in Michigan.
qid = get_facilities()

if qid:
    # Step 2: If the Query ID is retrieved, use it to get all relevant facility data.
    print("Retrieving facility data...")
    facilities = get_facility_data(qid)

    # Convert the list of facilities into a pandas DataFrame for easier data analysis and manipulation.
    df_icis_air = pd.DataFrame(facilities)

    # Print the total number of facilities retrieved and display the column names in the dataset.
    print(f"\nSuccessfully retrieved {len(df_icis_air)} ICIS-AIR facilities for Michigan")
    print("\nColumns in the dataset:")
    print(df_icis_air.columns)
else:
    # If the Query ID is not retrieved, output an error message indicating the failure.
    print("Failed to retrieve facility data")
```
There are over three thousand facilities in the ICIS-AIR Michigan dataset. Here are the descriptions for some key columns.
#### Key Columns in the [ICIS-AIR ECHO API Dataset](https://echo.epa.gov/help/facility-search/air-search-results-help)
| **Key Column** | **Description** |
|--------------------------------------|------------------------------------------------------------------------|
| AIRName | The name of the facility |
| SourceID | A unique identifier for the air pollution source |
| AIRStreet, AIRCity, AIRState, AIRZip | Address components of the facility |
| RegistryID | A unique identifier for the facility in the EPA’s registry |
| AIRCounty | The county where the facility is located |
| AIREPARegion | The EPA region responsible for the facility |
| AIRNAICS | North American Industry Classification System code(s) for the facility |
| FacLat, FacLong | Latitude and longitude coordinates of the facility |
| AIRPrograms | Air quality programs applicable to the facility |
| AIRStatus | Current operational status of the facility |
| AIRUniverse | Categorization of the facility within air quality regulation |
| AIRComplStatus | Current compliance status under the Clean Air Act |
| AIRHpvStatus | High Priority Violator status |
| AIRQtrsWithViol | Number of quarters with violations |
| AIRLastViolDate | Date of the most recent violation |
| AIREvalCnt | Count of evaluations conducted |
| AIRLastEvalDate | Date of the most recent evaluation |
| AIRFceCnt | Count of Full Compliance Evaluations |
| AIRLastFceDate | Date of the last Full Compliance Evaluation |
These columns provide information about each facility's location, operational characteristics, compliance history, and regulatory oversight under the Clean Air Act, including the violations, evaluations, and compliance status used to assess a facility's environmental performance and regulatory compliance.
We can check some basic information like how many facilities are in Detroit metro and the breakdown by our 3 counties.
```{python}
# Subset the DataFrame to include only the ICIS-AIR facilities located in the Detroit metro counties (Wayne, Oakland, Macomb).
icis_air_detroit = df_icis_air[df_icis_air['AIRCounty'].isin(counties)]
# Print a summary of the total number of ICIS-AIR facilities in Michigan and those specifically in the Detroit metro area.
print(f"Total ICIS-AIR facilities in Michigan: {len(df_icis_air)}")
print(f"ICIS-AIR facilities in Detroit metro area: {len(icis_air_detroit)}")
# Display the count of ICIS-AIR facilities for each county within the Detroit metro area.
print("\nFacilities per county:")
print(icis_air_detroit['AIRCounty'].value_counts())
```
Next we should check for facilities with missing latitude or longitude coordinates.
```{python}
# Identify and count the records in the Detroit metro subset where latitude or longitude values are missing.
missing_coords = icis_air_detroit[(icis_air_detroit['FacLat'].isnull()) | (icis_air_detroit['FacLong'].isnull())]
print(f"Number of ICIS-AIR records with missing coordinates: {len(missing_coords)}")
# Remove records with missing latitude or longitude from the dataset to ensure only facilities with valid coordinates remain.
icis_air_detroit = icis_air_detroit.dropna(subset=['FacLat', 'FacLong'])
```
No missing records! That's great. As you'll find out later, you won't always be so lucky.
Now we can create a spatial object, reproject it to match the Detroit basemap, and make a map showing the locations of all the ICIS-AIR facilities in the Detroit metro area. The existing dataset in our Python environment includes latitude (FacLat) and longitude (FacLong) coordinates, but we need to convert it to an explicit spatial object with GeoPandas so we can perform spatial operations such as creating maps, interpolating surfaces, or finding nearby features.
```{python}
# Convert the ICIS-AIR facilities data into a GeoDataFrame with valid point geometries (using longitude and latitude).
gdf_icis_air = gpd.GeoDataFrame(
    icis_air_detroit,
    geometry=gpd.points_from_xy(icis_air_detroit.FacLong, icis_air_detroit.FacLat),
    crs="EPSG:4326"  # Coordinate reference system set to WGS 84 (EPSG:4326)
)
# Reproject the ICIS-AIR facilities' GeoDataFrame to Web Mercator (EPSG:3857) for alignment with basemap data.
gdf_icis_air_bm = gdf_icis_air.to_crs(epsg=3857)
# Set up the plot with a 7x7 inch figure.
fig, ax = plt.subplots(figsize=(7, 7))
# Plot the Detroit metro area boundary and bounding box, using previously reprojected objects.
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)
# Plot the ICIS-AIR facilities as cyan markers, with transparency (alpha=0.5) and grey outlines.
gdf_icis_air_bm.plot(ax=ax, color='cyan', markersize=50, alpha=0.5, label='ICIS-AIR Facilities', edgecolor = "grey")
# Add an OpenStreetMap basemap for geographical context.
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
# Adjust the map's extent to fit the bounding box around the Detroit metro area.
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])
# Hide the axes to make the plot cleaner.
ax.set_axis_off()
# Add a legend to the plot to label the ICIS-AIR facilities.
ax.legend()
# Add a title and improve the layout of the plot.
plt.title("Detroit Metro Area ICIS-AIR Facilities", fontsize=16)
plt.tight_layout()
plt.show()
# Output the total number of ICIS-AIR facilities plotted.
print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")
```
These are distributed across business and industrial areas as expected, but we can't infer much else from this map.
Let’s look for regulated facilities that have been in violation status. We can plot all the facilities with at least one violation during a reporting quarter and use graduated symbols that reflect the total number of quarters with violations. This information is in the `AIRQtrsWithViol` column.
```{python}
# Convert the 'AIRQtrsWithViol' column to numeric, replacing any non-numeric values with NaN.
gdf_icis_air_bm['AIRQtrsWithViol'] = pd.to_numeric(gdf_icis_air_bm['AIRQtrsWithViol'], errors='coerce')
# Subset the dataset to include only ICIS-AIR facilities with reported violations (quarters with violations > 0).
gdf_icis_air_violators = gdf_icis_air_bm[gdf_icis_air_bm['AIRQtrsWithViol'] > 0]
# Set up the plot with a 7x7 inch figure.
fig, ax = plt.subplots(figsize=(7, 7))
# Plot the Detroit metro area boundary and bounding box for geographical context.
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)
# Plot the violators (facilities with violations) using graduated symbol sizes based on 'AIRQtrsWithViol' values.
# The symbol size is scaled (multiplied by 10) to make differences more visible.
scatter = ax.scatter(gdf_icis_air_violators.geometry.x,
                     gdf_icis_air_violators.geometry.y,
                     s=gdf_icis_air_violators['AIRQtrsWithViol'] * 10,  # Adjust marker size by violation count
                     c='orangered',       # Color the violators in 'orangered' for visibility
                     edgecolor='yellow',  # Add a yellow edge to make the points stand out
                     linewidth=1,
                     alpha=0.7,           # Set transparency for better map visibility
                     label='Facilities with Violations')  # Label used by the legend below
# Add an OpenStreetMap basemap to enhance the visual context of the plot.
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
# Set the map's extent to match the bounding box around the Detroit metro area.
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])
# Hide axes for a cleaner map.
ax.set_axis_off()
# Add a legend to label and distinguish facilities with violations.
ax.legend()
# Set the plot title and finalize the layout for professional presentation.
plt.title("Detroit Metro Area ICIS-AIR Facilities w/ Violations", fontsize=16)
plt.tight_layout()
plt.show()
```
We can also create a table showing which businesses have been in violation status, their name, address, and the date of their most recent violation using the **tabulate** module.
```{python}
from tabulate import tabulate
# Filter the dataset to include only facilities with 10 or more quarters of violations.
gdf_icis_air_violators = gdf_icis_air_bm[gdf_icis_air_bm['AIRQtrsWithViol'] >= 10]
# Define the columns to be displayed in the final table.
table_columns = ['AIRName', 'AIRStreet', 'AIRCity', 'AIRQtrsWithViol', 'AIRLastViolDate']
# Create a new DataFrame with only the selected columns for table generation.
table_data = gdf_icis_air_violators[table_columns].copy()
# Convert the 'AIRLastViolDate' column to a datetime format for proper sorting and display.
table_data['AIRLastViolDate'] = pd.to_datetime(table_data['AIRLastViolDate'])
# Sort the DataFrame by the 'AIRLastViolDate' column in descending order to show the most recent violations first.
table_data = table_data.sort_values('AIRLastViolDate', ascending=False)
# Define a dictionary to map the original column names to more readable/pretty names for the table.
pretty_names = {
'AIRName': 'Name',
'AIRStreet': 'Street',
'AIRCity': 'City',
'AIRQtrsWithViol': 'Violations',
'AIRLastViolDate': 'Violation Date'
}
# Apply the new readable column names.
table_data.rename(columns=pretty_names, inplace=True)
# Optionally, format the 'Violation Date' column to display dates in a 'YYYY-MM-DD' format.
table_data['Violation Date'] = table_data['Violation Date'].dt.strftime('%Y-%m-%d')
# Generate the table using the 'tabulate' library and format it as an HTML table for presentation.
table = tabulate(table_data, headers='keys', tablefmt='html', showindex=False)
# Display the resulting table.
table
```
ICIS-AIR is a powerful tool for communities and policymakers to identify regulated facilities and their compliance status, but the dataset does have limitations. For example, we cannot examine the types or amounts of chemicals being released into the community.
Next, we’ll take a look at facilities required to report to the TRI.
::: {.callout-tip style="color: #9f0b64;"}
#### Knowledge Check
Which of the following best describes the ICIS-AIR database?
a) A database of air quality measurements from monitoring stations
b) A comprehensive database of air quality compliance and enforcement data
c) A public health database tracking respiratory illnesses
d) A database of weather patterns affecting air quality
:::
### Toxic Release Inventory
The [Toxic Release Inventory (TRI)](https://www.epa.gov/toxics-release-inventory-tri-program) is a crucial resource in understanding and addressing air quality issues in the United States. Established under the Emergency Planning and Community Right-to-Know Act of 1986, the TRI is a publicly accessible database maintained by the Environmental Protection Agency (EPA). It requires certain industrial facilities to report annually on their releases of toxic chemicals into the environment, including air emissions. The TRI provides valuable data on over 770 chemicals and chemical categories, offering insights into the types and quantities of pollutants released into the air by various industries. This information is vital to researchers, policymakers, and community advocates in assessing local air quality, identifying potential health risks, and developing targeted strategies to reduce toxic air emissions. By making this data publicly available, the TRI plays an important role in promoting transparency and supporting environmental justice initiatives focused on improving air quality in communities across the nation.
::: {.callout-tip style="color: #5a7a2b;"}
#### Data Review
The Toxics Release Inventory (TRI) and the Integrated Compliance Information System for Air (ICIS-AIR) are two important but distinct environmental reporting systems maintained by the U.S. Environmental Protection Agency (EPA). They have several key differences:
- **Regulatory Basis**
- TRI: Established under the Emergency Planning and Community Right-to-Know Act (EPCRA) of 1986
- ICIS-AIR: Part of the Clean Air Act (CAA) compliance and enforcement program
- **Focus**
- TRI: Tracks the management of certain toxic chemicals that may pose a threat to human health and the environment
- ICIS-AIR: Focuses specifically on air quality and emissions from facilities regulated under the Clean Air Act
- **Reported Information**
- TRI: Facilities report on releases, waste management, and pollution prevention activities for specific toxic chemicals
- ICIS-AIR: Tracks emissions data, compliance status, and enforcement actions related to air quality regulations
- **Facility Coverage**
- TRI: Covers facilities in specific industries that manufacture, process, or use TRI-listed chemicals above certain thresholds
- ICIS-AIR: Includes a broader range of facilities that emit air pollutants, regardless of the specific chemicals involved
- **Reporting Thresholds**
- TRI: Has specific chemical thresholds that trigger reporting requirements
- ICIS-AIR: Generally doesn't have chemical-specific thresholds; requirements are based on overall emissions and facility type
- **Public Accessibility**
- TRI: Designed with a strong focus on public right-to-know, with data easily accessible to the public
- ICIS-AIR: While public, it's primarily designed for regulatory and enforcement purposes
- **Data Frequency**
- TRI: Annual reporting is required for covered facilities
- ICIS-AIR: May involve more frequent reporting, depending on permit requirements and compliance status
- **Scope of Pollutants**
- TRI: Focuses on a specific list of toxic chemicals and chemical categories
- ICIS-AIR: Covers a wider range of air pollutants, including criteria air pollutants and hazardous air pollutants
- **Use in Environmental Management**
- TRI: Often used for assessing long-term trends in toxic chemical releases and waste management practices
- ICIS-AIR: More commonly used for day-to-day air quality management and enforcement activities
- **Geographic Coverage**
- TRI: Nationwide program with consistent reporting across states
- ICIS-AIR: While national, implementation can vary more by state or local air quality management district
:::
#### Envirofacts API
[Envirofacts](https://enviro.epa.gov/) is a comprehensive online database and information system maintained by the U.S. Environmental Protection Agency (EPA). It serves as a centralized hub for accessing a wide range of environmental data collected by the EPA and other federal agencies. Envirofacts integrates information from multiple EPA databases, covering various aspects of environmental health, including air quality, water quality, hazardous waste, and toxic releases. One of the key components of Envirofacts is the Toxic Release Inventory (TRI) data. Through Envirofacts, users can easily access and query TRI information, allowing them to investigate toxic chemical releases and waste management activities in their local areas. The integration of TRI data within the broader Envirofacts system enables researchers, policymakers, and community members to contextualize toxic release information alongside other environmental indicators.
Envirofacts is available as a [web-based search and data platform](https://enviro.epa.gov/) and as a [programmatic API](https://www.epa.gov/enviro/envirofacts-data-service-api). The web platform is a great way to familiarize yourself with the available datasets and create simple downloads; however, for analytical purposes we recommend learning to navigate the API so you can create repeatable and reliable analyses.
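Envirofacts queries are built as URL paths rather than query strings: the table name is followed by column/value pairs and an output format. The short example below previews the pattern we use later for the `tri_facility` table; the Wayne County filter is just for illustration.
```{python}
import requests

# Envirofacts path-style query: /efservice/<table>/<column>/<value>/.../<format>.
# Here we ask for TRI facilities in Wayne County, Michigan, returned as JSON.
example_url = "https://data.epa.gov/efservice/tri_facility/state_abbr/MI/county_name/Wayne/JSON"
example_response = requests.get(example_url)

if example_response.status_code == 200:
    example_records = example_response.json()
    print(f"Returned {len(example_records)} TRI facility records for Wayne County")
    if example_records:
        # Peek at a few of the column names available in the tri_facility table.
        print(list(example_records[0].keys())[:10])
else:
    print(f"Request failed with status code {example_response.status_code}")
```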
::: {.callout-tip style="color: #5a7a2b;"}
#### Environmental Justice In the News
[The Michigan Department of Environment, Great Lakes and Energy (EGLE) has settled a civil rights complaint regarding a hazardous waste facility in Detroit](https://www.detroitnews.com/story/news/local/michigan/2024/08/29/egle-settles-civil-rights-complaint-about-detroit-hazardous-waste-approval-us-ecology-north/75001763007/). The complaint, filed in 2020 by environmental groups and local residents, challenged the renewal and expansion of [U.S. Ecology North's](https://www.michigan.gov/egle/about/organization/materials-management/hazardous-waste/liquid-industrial-byproducts/us-ecology-detroit-north/license-information) license, arguing it was unjust to increase hazardous waste storage in a predominantly low-income and minority neighborhood. As part of the settlement, EGLE will now consider environmental justice concerns in future licensing decisions for hazardous waste facilities.
The agreement, described as "groundbreaking" by the [Sierra Club](https://www.sierraclub.org/), introduces several new measures. EGLE will conduct environmental justice and cumulative impact analyses for future licensing decisions, potentially denying licenses that would have an unlawful impact on human health and the environment. The settlement also includes provisions for improved community engagement, such as better translation services, public input processes, and the installation of air monitors around U.S. Ecology North. Additionally, the state will work with local residents to conduct a community health assessment. While some details remain to be clarified, environmental advocates view this as a significant step towards advancing environmental justice in Michigan.
:::
#### Querying the API
We'll continue to use the Detroit metro boundary objects we created earlier, named `detroit_metro` and `detroit_metro_bm` (the latter projected to Web Mercator to match **b**ase**m**aps for plotting).
```{python}
# Print the coordinates of the bounding box, which defines the rectangular boundary for the area of interest
print("Bounding Box:")
# Display the minimum longitude (X-axis) of the bounding box
print(f"Minimum X (Longitude): {bbox[0]}")
# Display the minimum latitude (Y-axis) of the bounding box
print(f"Minimum Y (Latitude): {bbox[1]}")
# Display the maximum longitude (X-axis) of the bounding box
print(f"Maximum X (Longitude): {bbox[2]}")
# Display the maximum latitude (Y-axis) of the bounding box
print(f"Maximum Y (Latitude): {bbox[3]}")
```
Now we'll create an empty list to store the TRI data, loop through each county making an API query, and place the retrieved data in the empty list.
```{python}
# Initialize a list to hold TRI (Toxic Release Inventory) facility data for each county in Michigan
tri_data = []

# Loop through each county in the specified list to fetch TRI data separately
# (avoiding issues encountered when querying all counties at once)
for county in counties:
    # Construct the API URL to query TRI facilities for the current county
    api_url = f"https://data.epa.gov/efservice/tri_facility/state_abbr/MI/county_name/{county}/JSON"
    response = requests.get(api_url)

    # Check if the response was successful
    if response.status_code == 200:
        county_data = response.json()
        # Append the current county's TRI data to the overall list
        tri_data.extend(county_data)
    else:
        # Print an error message if data retrieval failed for the county
        print(f"Failed to fetch data for {county} County. Status code: {response.status_code}")
```
#### Processing the Data
The TRI data comes in JSON format. This format can be confusing to interpret, because each record (traditionally a row in a table) is displayed with its columns stacked vertically. Let's look at a sample record.
```{python}
# Access and print the second entry in the TRI data list to inspect a sample facility's data structure
tri_data[1]
```
This record shows all the information for *PPG GROW DETROIT* with columns and values listed side-by-side ('city_name': 'DETROIT'). This may seem difficult to deal with, but most programming languages have tools to easily parse JSON data into traditional tables. `pd.DataFrame` does this automatically when you create a **pandas** data frame.
```{python}
# Convert the collected TRI data list into a Pandas DataFrame for easier data manipulation and analysis
tri_df = pd.DataFrame(tri_data)
# Output the total count of facilities retrieved to confirm successful data fetching
print(f"Number of facilities fetched: {len(tri_df)}")
# Display the first few rows of the DataFrame for a quick data preview (uncomment if needed for inspection)
# tri_df.head()
```
Checking the first few rows, everything looks as it should. We want to convert the facility locations into a point layer using the latitude and longitude values, so we should start by making sure all the facilities have coordinate values.
We'll check and remove those without.
```{python}
# Create a copy of the TRI DataFrame to prevent modifications that could trigger a SettingWithCopyWarning
tri_df_clean = tri_df.copy()
# Remove rows where latitude or longitude is missing to ensure data completeness
tri_df_clean = tri_df_clean.dropna(subset=['pref_latitude', 'pref_longitude'])
print(f"Number of facilities after removing empty coordinates: {len(tri_df_clean)}")
# Convert latitude and longitude columns to numeric, handling non-numeric values by setting them to NaN
tri_df_clean['pref_latitude'] = pd.to_numeric(tri_df_clean['pref_latitude'], errors='coerce')
tri_df_clean['pref_longitude'] = pd.to_numeric(tri_df_clean['pref_longitude'], errors='coerce')
```
Ouch! We lost roughly 300 records due to missing coordinate information (789 vs. 478). If this were an important analysis for policy-making or scientific research we would take the time to [geocode](https://www.mapbox.com/insights/geocoding) the listed addresses into latitude/longitude coordinates, but for the purposes of this exercise we’ll move on.
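If we did need those records, one option would be geocoding. The sketch below is hypothetical: it assumes the optional `geopy` package (not part of this lesson's module list) and assumes the `tri_facility` columns `street_address`, `city_name`, and `zip_code`; check `tri_df.columns` and the Nominatim usage policy before adapting it.
```{python}
import time

# Hypothetical sketch: fill in missing coordinates by geocoding street addresses
# with OpenStreetMap's Nominatim service via the optional geopy package.
try:
    from geopy.geocoders import Nominatim
except ImportError:
    Nominatim = None  # geopy is optional; skip the sketch if it is not installed

address_cols = {"street_address", "city_name", "zip_code"}  # assumed column names

if Nominatim is not None and address_cols.issubset(tri_df.columns):
    geolocator = Nominatim(user_agent="detroit-tri-lesson")
    missing = tri_df[tri_df["pref_latitude"].isnull()].head(5)  # small sample only

    for idx, row in missing.iterrows():
        address = f"{row['street_address']}, {row['city_name']}, MI {row['zip_code']}"
        location = geolocator.geocode(address)
        if location:
            tri_df.loc[idx, "pref_latitude"] = location.latitude
            tri_df.loc[idx, "pref_longitude"] = location.longitude
        time.sleep(1)  # Nominatim asks for roughly one request per second
```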
In my personal investigations of this data, I noticed that some longitude coordinates were not negative but had an absolute value (e.g., 83.25) that made sense for the Detroit metro area, so we will flip the sign on these records.
```{python}
# Define a function to correct longitudes that are mistakenly positive, converting them to negative values for North America
def correct_longitude(lon):
    if lon > 0:
        return -lon
    return lon

# Apply longitude correction to all facilities in the DataFrame
tri_df_clean['pref_longitude'] = tri_df_clean['pref_longitude'].apply(correct_longitude)
```
Additionally, there are some records with wild longitude values that are not simply missing a negative sign. We'll identify these outliers using the interquartile range (IQR); anything outside the range will be tossed. One of the downsides of the TRI database is that many values are provided directly by the facility (and therefore subject to errors). Again, if this were a more important analysis, we would carefully explore all of these problem records and possibly geocode using the address.
Toss the quartile outliers.
```{python}
# Calculate the Interquartile Range (IQR) for longitude to identify and remove outlier values
Q1 = tri_df_clean['pref_longitude'].quantile(0.25)
Q3 = tri_df_clean['pref_longitude'].quantile(0.75)
IQR = Q3 - Q1
# Define acceptable range for longitude, filtering out points beyond 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
lower_bound, upper_bound
# Remove facilities with longitudes outside of the defined bounds to eliminate extreme outliers
tri_df_clean = tri_df_clean[(tri_df_clean['pref_longitude'] >= lower_bound) &
                            (tri_df_clean['pref_longitude'] <= upper_bound)]
print(f"Number of facilities after removing longitude outliers: {len(tri_df_clean)}")
```
Thankfully we only lost 2 more records from outliers.
#### Visualizing the Data
The data has been cleaned up a bit, so now let's create a spatial point object from the data with **geopandas** and use **matplotlib** to plot the values on top of a Detroit basemap acquired with **contextily**, along with our original Wayne, Macomb, and Oakland counties boundary.
```{python}
# Create a GeoDataFrame from the cleaned TRI data, converting latitude and longitude into geometric points
detroit_tri = gpd.GeoDataFrame(
    tri_df_clean,
    geometry=gpd.points_from_xy(tri_df_clean.pref_longitude, tri_df_clean.pref_latitude),
    crs="EPSG:4326"  # Set the coordinate reference system to WGS 84 (latitude/longitude)
)
# Reproject TRI data to the Web Mercator projection (EPSG:3857) for compatibility with the basemap
detroit_tri = detroit_tri.to_crs(epsg=3857)
# Initialize a plot with a fixed size
fig, ax = plt.subplots(figsize=(7, 7))
# Plot the Detroit metro area boundaries and the bounding box as background layers
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)
# Plot TRI facilities as purple points with a grey edge, representing facilities' locations in the metro area
detroit_tri.plot(ax=ax, color='purple', edgecolor='grey', markersize=50, alpha=0.5, label='TRI Facilities')
# Plot ICIS-AIR facilities as cyan points with a grey edge, indicating the locations of air-quality-monitored facilities
gdf_icis_air_bm.plot(ax=ax, color='cyan', edgecolor='grey', markersize=50, alpha=0.5, label='ICIS-AIR Facilities')
# Overlay a basemap for geographical context, using OpenStreetMap tiles
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
# Set the map's viewing area to the bounding box’s coordinates, zooming in on the Detroit metro area
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])
# Hide axis labels and ticks for a cleaner map presentation
ax.set_axis_off()
# Add a legend to differentiate TRI and ICIS-AIR facilities on the map
ax.legend()
# Add a title for context and adjust layout for better spacing
plt.title("Detroit Metro Area TRI and ICIS-AIR Facilities", fontsize=16)
plt.tight_layout()
plt.show()
# Print the number of facilities included in the plot for confirmation
print(f"Number of TRI facilities plotted: {len(detroit_tri)}")
print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")
```
We can see that there are far fewer facilities tracked in TRI than in ICIS-AIR. Some of this can be attributed to the records we tossed while checking for valid geographic coordinates and outliers, but generally speaking ICIS-AIR will contain far more facilities because it tracks every facility regulated under the Clean Air Act, not just those that report to TRI.
You may have also noticed that this dataset does not contain any actual pollution data; only facility information for those that are regulated by TRI.
TRI is made up of numerous tables containing a wealth of information regarding the facilities themselves and all the regulated chemicals that are handled at each site. The TRI_FACILITY table we requested is the starter table used in most of the examples in the Envirofacts API documentation, however, it does not contain any chemical release data.
Navigating the TRI Envirofacts interface can be overwhelming at times. Multiple tables can be queried with a single URL, however, for a proper join to be carried out both tables must contain at least one column in common. In most cases this can be accomplished using TRI_FACILITY_ID. That said, air pollution release data (column name [AIR_TOTAL_RELEASE](https://enviro.epa.gov/enviro/ef_metadata_html.tri_page?p_column_name=air_total_release)) is only found in the [TRI_FORM_R](https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=TRI_FORM_R&p_topic=TRI) table, which does not contain the FACILITY_ID column.
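To make the join pattern concrete, here is a minimal sketch of the "pull two tables with separate calls, then join in pandas" approach. The tiny DataFrames below stand in for real Envirofacts downloads, and the ID values and release column are hypothetical; only the shared TRI_FACILITY_ID key comes from the discussion above.

```{python}
import pandas as pd

# Stand-in for a facility table pulled with one API call (values are hypothetical)
facilities = pd.DataFrame({
    "TRI_FACILITY_ID": ["48201EXMPL1", "48067EXMPL2"],
    "FACILITY_NAME": ["Example Plant A", "Example Plant B"],
})

# Stand-in for a second table pulled with a separate call (values are hypothetical)
releases = pd.DataFrame({
    "TRI_FACILITY_ID": ["48201EXMPL1", "48067EXMPL2"],
    "EXAMPLE_RELEASE_LBS": [1250.0, 87.5],
})

# Join the two tables on the shared key so facility attributes and release
# values end up in a single row per facility
joined = facilities.merge(releases, on="TRI_FACILITY_ID", how="inner")
print(joined)
```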
You can query the V_TRI_FORM_R_EZ table using the API even though it’s not a documented table listed on the API website. This view contains facility information and AIR_TOTAL_RELEASE reporting, but it only runs through 2021. To get the most recent data (2023) you would have to download the individual tables with separate calls and perform the joins yourself, or use the [custom form search web portal](https://enviro.epa.gov/facts/tri/form_ra_download.html). We’ll demonstrate a quick API call for V_TRI_FORM_R_EZ, but move forward with the most recent data available from the custom form search web portal.
::: {.callout-tip style="color: #9f0b64;"}
#### Knowledge Check
What is the primary purpose of the Toxic Release Inventory (TRI)?
a) To track emissions of greenhouse gases
b) To monitor compliance with the Clean Air Act
c) To provide data on toxic chemical releases from industrial facilities
d) To record air quality index values for major cities
:::
### Toxic Release Inventory: Form R
We can make the query to Envirofacts. Let's break down the call.
```{python}
# Define the URL for the CSV file containing Toxic Release Inventory Form R data
# This query requests data from the EPA Envirofacts API, specifying:
# - Reporting Year: 2021
# - Air releases greater than 0
# - State: Michigan (MI)
# - Counties: Wayne, Oakland, and Macomb
url = "https://data.epa.gov/efservice/V_TRI_FORM_R_EZ/REPORTING_YEAR/2021/AIR_TOTAL_RELEASE/>/0/STATE_ABBR/MI/COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB/CSV"
```
- Base URL: `https://data.epa.gov/efservice/`
- The table we're requesting: `V_TRI_FORM_R_EZ/`
- For just 2021: `REPORTING_YEAR/2021/`
- Only facilities that report some air release: `AIR_TOTAL_RELEASE/>/0/`
- Just Michigan: `STATE_ABBR/MI/`
- Our counties: `COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB/`
- We want it as a CSV: `CSV`
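If you find yourself writing several of these by hand, a small helper can assemble the path segments for you. This is just a convenience sketch (not part of any official client); it simply joins the pieces listed above and reproduces the same URL we defined earlier.

```{python}
# Convenience sketch: assemble an Envirofacts query URL from its path segments
def build_efservice_url(table, *filters, fmt="CSV",
                        base="https://data.epa.gov/efservice"):
    """Join the base URL, table name, filter segments, and output format."""
    return "/".join([base, table, *filters, fmt])

rebuilt_url = build_efservice_url(
    "V_TRI_FORM_R_EZ",
    "REPORTING_YEAR/2021",
    "AIR_TOTAL_RELEASE/>/0",
    "STATE_ABBR/MI",
    "COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB",
)

# Should print True: the helper reproduces the hand-written URL above
print(rebuilt_url == url)
```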
```{python}
# Read the TRI Form R data CSV file directly into a pandas DataFrame
tri_form_r = pd.read_csv(url)
# Confirm successful data loading by printing the number of records
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
# Display column names to understand the structure of the dataset
print("\nColumns in the dataset:")
print(tri_form_r.columns)
```
We have roughly the same number of facilities as before, but this table has a lot of columns (568). We should make it more manageable.
```{python}
# The dataset contains numerous columns (568), so we will retain only essential columns for analysis
# Select columns: facility name, address, city, reporting year, and air release amounts
keeps = [
'facility_name', 'street_address', 'city_name',
'reporting_year', 'air_total_release']
# Update DataFrame to only include the selected columns
tri_form_r = tri_form_r[keeps]
```
This works well, but we can get more recent data that is already filtered using the [web portal](https://enviro.epa.gov/facts/tri/form_ra_download.html). You select the location, years of interest, and columns of interest, and the portal serves up the requested dataset via Amazon S3 cloud storage. We can import the dataset directly from the address it provides.
```{python}
# Read a secondary TRI Form R CSV file directly from the URL
# This is the pre-filtered, more recent extract generated via the custom form search web portal
tri_form_r = pd.read_csv("https://dmap-epa-enviro-prod-export.s3.amazonaws.com/396975438.CSV")
# Confirm successful reading by printing the number of records and dataset structure
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
print("\nColumns in the dataset:")
print(tri_form_r.columns)
```
This is a bit more manageable, but we have multiple records per facility because each chemical release has its own record. This allows the user to investigate specific chemicals, but for the purposes of this exercise we will aggregate `AIR_TOTAL_RELEASE` for each facility so that we end up with a single row per facility.
```{python}
# Inspect the first few rows of the dataset for a quick look at its structure
# (commented out to avoid excessive output in the rendered page)
# tri_form_r.head()
```
Let's aggregate it.
```{python}
# Group the dataset by facility name and location details, summing the total air releases per facility
# This aggregation step consolidates all entries for each unique facility based on name, latitude, longitude, and address
tri_form_r = tri_form_r.groupby(['FACILITY_NAME', 'LATITUDE','LONGITUDE','STREET_ADDRESS'], as_index=False).agg({
'AIR_TOTAL_RELEASE': 'sum' # Sum the air release values for each facility
})
# Display the total number of unique facilities after grouping
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
```
There were only 211 unique listings in the 2023 data.
As before, we should check for valid coordinates and missing air release data.
```{python}
# Define column names for latitude, longitude, and air release values
# Assumes column names match; adjust 'LATITUDE', 'LONGITUDE', and 'AIR_TOTAL_RELEASE' if necessary
lat_col = 'LATITUDE'
lon_col = 'LONGITUDE'
release_col = 'AIR_TOTAL_RELEASE' # Column for the total air release per facility
# Remove any records with missing coordinates or air release data to ensure data integrity for spatial analysis
df_tri_clean = tri_form_r.dropna(subset=[lat_col, lon_col, release_col])
# Output the number of records remaining after removing rows with missing data
print(f"\nNumber of records after removing missing data: {len(df_tri_clean)}")
```
There were no missing coordinates or missing air release data.
Now we can create a spatial object with **geopandas** using the latitude and longitude coordinates in the table.
```{python}
# Convert the cleaned TRI data to a GeoDataFrame for spatial analysis
# Creates a 'geometry' column based on longitude and latitude, setting the coordinate system to EPSG:4326 (WGS 84)
gdf_tri_form_r = gpd.GeoDataFrame(
df_tri_clean,
geometry=gpd.points_from_xy(df_tri_clean[lon_col], df_tri_clean[lat_col]),
crs="EPSG:4326"
)
```
Let’s check the distribution of the air release data for outliers.
```{python}
# Plot a histogram to visualize the distribution of total air release values across facilities
plt.figure(figsize=(8, 8)) # Define plot size
plt.hist(gdf_tri_form_r['AIR_TOTAL_RELEASE'], bins=10, edgecolor='black') # Set bin count and edge color
plt.title('Distribution of Air Total Release Values') # Plot title
plt.xlabel('Air Total Release Sum') # X-axis label for air release values
plt.ylabel('Frequency') # Y-axis label for frequency of values
plt.grid(True, alpha=0.3) # Add grid with slight transparency for readability
# Display the histogram
plt.show()
```
Looks like there are outliers around 100,000 lbs and 400,000 lbs. Before we map the data, it would be a good idea to apply a log transformation to `AIR_TOTAL_RELEASE` to reduce the skew. You can't take the log of zero or negative values, so we'll use NumPy's `log1p` function, which adds 1 to every value before taking the log.
```{python}
import numpy as np
# Calculate the natural logarithm of air release values +1 to handle any zero values safely
# This transformation helps normalize the data, reducing skew for clearer visualization
gdf_tri_form_r['LOG_AIR_RELEASE'] = np.log1p(gdf_tri_form_r['AIR_TOTAL_RELEASE'])
# Plot a histogram to visualize the distribution of the log-transformed air release values
plt.figure(figsize=(8, 8))
plt.hist(gdf_tri_form_r['LOG_AIR_RELEASE'], bins=10, edgecolor='black') # Set bin count and edge color
plt.title('Distribution of Log-Transformed Air Release Values') # Plot title
plt.xlabel('Log of Air Total Release Sum') # X-axis label for log air release values
plt.ylabel('Frequency') # Y-axis label for frequency of values
plt.grid(True, alpha=0.3) # Add grid with slight transparency for readability
# Display the histogram
plt.show()
```
That looks much better.
Now we can visualize the data by overlaying the points on a basemap of Detroit, using graduated symbols to illustrate the amount of air released by each facility.
```{python}
# Reproject TRI facility data to Web Mercator (EPSG:3857) to align with the basemap projection
gdf_tri_form_r_bm = gdf_tri_form_r.to_crs(epsg=3857)
# Create the map plot
fig, ax = plt.subplots(figsize=(7, 7))
# Plot the Detroit metro area boundary and bounding box with customized colors
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=2)
# Scatter plot TRI facilities with symbol sizes proportional to the log of air releases
# Scaling factor applied for better visibility; adjust as needed
scatter = ax.scatter(gdf_tri_form_r_bm.geometry.x, gdf_tri_form_r_bm.geometry.y,
s=gdf_tri_form_r_bm['LOG_AIR_RELEASE']*20, # Scale size of symbols by log air releases
c='orangered', # Color fill of symbols
edgecolor='yellow', # Outline color of symbols
linewidth=1, # Outline thickness for clarity
alpha=0.7) # Adjust transparency level
# Add OpenStreetMap basemap for spatial context
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
# Set map view to focus on the bounding box area
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])
# Remove axes for a cleaner map view
ax.set_axis_off()
# Add a legend with sample sizes for symbol scaling reference
legend_sizes = [0,4,8,12] # Adjust sample sizes based on data scale
legend_elements = [plt.scatter([], [], s=size*20, c='orangered', edgecolor='yellow',
linewidth=1, alpha=1, label=f'{size:,}')
for size in legend_sizes]
ax.legend(handles=legend_elements, title='Log Total Air Releases (lbs)',
loc='lower right', title_fontsize=12, fontsize=10) # Customize legend title and text
# Set the plot title
plt.title("Detroit Metro Area TRI Facilities - Total Air Releases (Custom Data)", fontsize=16)
plt.tight_layout()
plt.show()
# Output summary statistics for TRI facilities plotted and air release values
print(f"\nNumber of TRI facilities plotted: {len(gdf_tri_form_r)}")
print(f"Total air releases: {gdf_tri_form_r[release_col].sum():,.2f} lbs")
print(f"Average air release per facility: {gdf_tri_form_r[release_col].mean():,.2f} lbs")
```
The point source air release data provides a good snapshot of the locations and relative amounts of chemical releases. With some basic understanding of the demographic dynamics of Detroit one might theorize about who is being subjected to high levels of pollution; however, we’ll need additional data to develop any kind of systematic analysis.
::: {.callout-tip style="color: #9f0b64;"}
#### Knowledge Check
What key information does Form R of the Toxic Release Inventory provide?
a) Facility names and locations only