Variable Name | Variable Description |
---|---|
no2 | Atmospheric Concentration of Nitrogen Dioxide (ppb) |
co | Atmospheric Concentration of Carbon Monoxide (ppm) |
so2 | Atmospheric Concentration of Sulfur Dioxide (ppb) |
no | Atmospheric Concentration of Nitrogen Monoxide (ppb) |
temp | Temperature (Celsius) |
wind | Wind Speed (knots) |
pressure | Barometric Pressure (millibars) |
humid | Relative Humidity (% rh) |
AQI | Next-day Air Quality Index, measured by Ozone |
Introduction
The EPA is an organization fundamentally invested in securing and promoting the health and well-being of the environment. In this paper, we analyze two distinct but related environmental issues: air pollution and car dependency. For both issues, we used publicly available data sourced from the EPA to construct machine learning models to make predictions. For air pollution, we predicted a county’s next-day Air Quality Index (AQI) from the previous day’s atmospheric conditions to determine which factors contribute to AQI and how accurately we can determine tomorrow’s AQI from today’s conditions. For car dependency, we predicted the walkability index of a localized area based on built environment characteristics to deduce which characteristics contribute most meaningfully to the walkability of a location.
Data Collection & Description
AQI
To collect information about atmospheric conditions and Air Quality Index, we downloaded two years’ worth of atmospheric data (2021 and 2022) from the EPA (source in the appendix). These data were collected daily by monitoring stations across 43 counties in 32 states across the United States. In the final data set, one observation corresponds to one day of atmospheric measurements for a particular county, combining measurements from the different monitoring stations in that county. We had to omit some observations because predicting next-day AQI from current-day atmospheric conditions requires pairs of sequential days in which all atmospheric measurements are present. Pairs of sequential days that were missing any of the atmospheric measurements on either day were omitted from the data. The final dataset consisted of 7,768 observations.
There are different AQIs for different atmospheric pollutants. For this project, we chose to predict the AQI based on ozone from atmospheric conditions (excluding ozone concentration) because the ozone AQI was the most frequently reported in this dataset.
For each atmospheric measurement, we extracted the minimum, average, and maximum measurement across all weather stations for a given county each day. We did not use Ozone measurements as predictors so that our models can be used when Ozone measurements are not available. The following table details the atmospheric measurements we extracted.
Table 1. Variable summary for AQI dataset
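The preparation steps described above (summarizing station readings per county and pairing each complete day with the following day's AQI) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the record layout, field names, and values are hypothetical and are not taken from the EPA files or from our actual pipeline.

```python
from datetime import date, timedelta

def summarize(readings):
    """Min, mean, and max of one measurement across a county's stations."""
    return {
        "min": min(readings),
        "mean": sum(readings) / len(readings),
        "max": max(readings),
    }

def build_pairs(daily):
    """Pair each complete day's features with the AQI of the following day.

    `daily` maps a date to (features, ozone_aqi). A pair is kept only when
    both days exist and neither day is missing a measurement.
    """
    pairs = []
    for day, (features, _aqi) in sorted(daily.items()):
        following = daily.get(day + timedelta(days=1))
        if following is None:
            continue  # no record for the next calendar day
        next_features, next_aqi = following
        today_ok = all(v is not None for v in features.values())
        next_ok = all(v is not None for v in next_features.values())
        if today_ok and next_ok and next_aqi is not None:
            pairs.append((day, features, next_aqi))
    return pairs

# Hypothetical station readings of no2 (ppb) for one county on one day.
no2_summary = summarize([12.0, 18.0, 21.0])

# Hypothetical daily records for one county.
daily = {
    date(2021, 6, 1): ({"max_no2": 21.0, "max_temp": 84.0}, 61),
    date(2021, 6, 2): ({"max_no2": 17.5, "max_temp": 79.0}, 48),
    date(2021, 6, 4): ({"max_no2": 25.0, "max_temp": 90.0}, 77),  # June 3 absent
    date(2021, 6, 5): ({"max_no2": None, "max_temp": 88.0}, 70),  # incomplete
}
pairs = build_pairs(daily)
print(no2_summary, len(pairs))
```

In this toy input only June 1 survives as an observation, because it is the only day followed by a complete next-day record.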
Walkability
We accessed data about walkability from the U.S. Government’s open data catalog, data.gov. The data was collected by the Environmental Protection Agency (EPA) and contains demographic information, built environment characteristics, and transportation infrastructure data from 2021 for each Census block group. These inputs are used to calculate the National Walkability Index as defined by the EPA. This index can be expressed either numerically or translated into qualitative categories (see Table 2 below).
Walkability Index | Category |
---|---|
1 - 5.75 | Least Walkable |
5.76 - 10.5 | Below Average Walkable |
10.51 - 15.25 | Above Average Walkable |
15.26 - 20 | Most Walkable |
Table 2: NWI Walkability Categories
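As a small illustration, the numeric-to-category translation in Table 2 could be written as the function below. The function name is our own, and treating each upper bound as inclusive (so that a value such as 5.755 falls in the lower category) is an assumption, since the table lists bounds only to two decimal places.

```python
def nwi_category(index):
    """Map a National Walkability Index value (1 to 20) to its Table 2 category."""
    if not 1 <= index <= 20:
        raise ValueError("walkability index must be between 1 and 20")
    if index <= 5.75:
        return "Least Walkable"
    if index <= 10.5:
        return "Below Average Walkable"
    if index <= 15.25:
        return "Above Average Walkable"
    return "Most Walkable"

print(nwi_category(13.83))
```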
In total, there are 220,740 observations for 117 variables in this dataset. Since there were so many predictors to choose from, we narrowed it down to those in the table below (more details on variable selection in the Exploratory Data Analysis section).
Variable Name | Description |
---|---|
D2a_EpHHm | The mix of employment types and occupied housing |
D2b_E5MixA | 5-tier employment mix (retail, office, industrial, service, and entertainment jobs) |
D3apo | Network density in terms of pedestrian-oriented paths per square mile |
D3b | Street intersection density (weighted, auto-oriented intersections eliminated) |
D3bmm4 | Intersection density in terms of multi-modal intersections |
D3a | Total road network density |
Table 3: Variable Summary for Walkability Dataset
Exploratory Data Analysis
AQI
Overall distribution of AQI
Figure 1. Overall distribution of AQI
Mean | STD | Median | IQR |
---|---|---|---|
58.617 | 36.4629 | 44 | 30 |
Table 4. Summary statistics of AQI
Overall, the distribution of AQI scores is single-peaked and right-skewed, with a median next-day AQI of 44 and an interquartile range of 30. The graph is likely right-skewed because AQI is bounded on the lower end by 0, but has no upper limit.
Overall distribution of AQI by month
Figure 2. Distribution of AQI by month
Figure 2 reveals the seasonal trend in AQI. AQI tends to be higher in the summer months and lower in the winter months for these data. The exact mechanism for this relationship is unknown, but some hypotheses include:
- Extreme heat and stagnant air can lead to an increase in ground-level ozone concentrations
- Hot and dry weather results in forest fires that can increase atmospheric pollutants
- Clouds are less common at higher temperatures but are important for preventing the production of ozone
Correlations between predictors and AQI
Figure 3. Predictor correlations with AQI, by min, mean, and max concentration
Figure 3 shows how the different atmospheric measurements correlate with next-day AQI. For most atmospheric measurements, the correlation is in the same direction for the minimum, mean, and maximum. However, wind speed is a notable exception to this trend, where we can see that although the mean and minimum wind speed are negatively correlated with next-day AQI, the maximum wind speed has a slight positive correlation with next-day AQI. The single variable with the most positive correlation to next-day AQI is maximum no2, with a correlation of ~0.59, meaning that higher maximum concentration measurements of no2 are associated with higher next-day AQIs. The single variable with the most negative correlation to next-day AQI is minimum humidity, with a correlation of ~-0.50, meaning that higher minimum humidity measurements are associated with lower next-day AQIs.
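For reference, the correlations summarized in Figure 3 are Pearson correlations between each predictor column and next-day AQI. The sketch below shows the calculation on made-up numbers; the column names and values are hypothetical and are not drawn from our dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical predictor columns and next-day AQI for five county-days.
next_day_aqi = [35, 42, 50, 68, 90]
columns = {
    "max_no2": [8.0, 12.0, 15.0, 22.0, 30.0],
    "min_humid": [70.0, 66.0, 50.0, 41.0, 30.0],
}
correlations = {name: pearson(values, next_day_aqi) for name, values in columns.items()}
for name, r in correlations.items():
    print(f"{name}: {r:+.2f}")
```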
Walkability
Overall distribution of Walkability Index
Figure 4. Histogram showing the distribution of walkability index
Mean | STD | Median | IQR |
---|---|---|---|
13.8147 | 2.34047 | 13.8333 | 3.33333 |
Table 5: Summary statistics of walkability index
The histogram of Walkability Index appears quite bell-shaped and symmetric. The median Walkability Index is 13.83 and the interquartile range is 3.33. The range for the Walkability Index is 1 to 20.
Relationship between Population and Walkability Index Category
Figure 5. Boxplots showing the relationship between population and walkability index category
We performed further exploration to examine if population may have a relationship with the Walkability Index Category. The Least Walkable category seemed to have the highest population, with the Below Average and Above Average Walkable categories having the lowest populations. These areas had a tighter interquartile range although there were some outliers having much higher populations. The places that were Most Walkable had a large variation in population.
Relationship between Employment Mix and Walkability Index Category
We conducted some preliminary investigation to look at the relationship between employment mix and walkability. Figure 6 below shows a clear upward trend in walkability as the employment mix increases. The employment mix variable is a proportion between 0 and 1 that shows how diverse jobs are in a given area (categories of jobs are retail, office, industrial, service, and entertainment). Higher values indicate greater diversity of jobs which, according to the EPA, is correlated with more walk trips.
Figure 6. Scatterplot showing the relationship between employment mix and walkability index. Red = Least Walkable, Orange = Below Average Walkable, Yellow = Above Average Walkable, Green = Most Walkable
Model Results
AQI
We used three different types of models to predict AQI: linear models (LASSO), decision trees, and random forests. Each model predicts the next day’s AQI based on the min, mean, and max values of our 8 predictor measurements. Notably, this does not include any measurements of the day’s Ozone or AQI. We measured each model’s performance by cross-validating on data the model had not been trained on. The R-Squared and Root Mean Squared Error metrics are displayed below:
Model | R-Squared | Root Mean Squared Error |
---|---|---|
Linear Model | 0.640877 | 21.8541 |
Random Forest | 0.709867 | 19.6418 |
Decision Tree | 0.604399 | 22.9356 |
Table 6. Best performance metrics of our three models for AQI
These metrics indicate that the Random Forest is our overall most effective model, with an R-Squared value of 0.710 and an RMSE of 19.642 AQI points. This indicates that the random forest accounts for 71.0% of the variance in AQI, with predictions that are typically off by about 19.6 AQI points. Our linear model was markedly worse, with an RMSE of 21.854 AQI points, and our decision tree fared the worst, with an RMSE of 22.936 AQI points.
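For readers less familiar with the two metrics, the sketch below shows how RMSE and R-squared are computed from true and predicted values. The four AQI values are invented for illustration and are unrelated to the results in Table 6.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical true and predicted next-day AQI values.
y_true = [40, 50, 60, 90]
y_pred = [45, 50, 55, 80]
print(round(rmse(y_true, y_pred), 3), round(r_squared(y_true, y_pred), 3))
```

Because RMSE squares the residuals before averaging, large misses weigh more heavily than they would in a plain average of absolute errors.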
To investigate the biases of each of our models, we have plotted the residual errors below:
Figure 7: AQI model residual plots. A moving average of the mean error is plotted to display model biases
With these results in mind, we will briefly discuss each model’s strengths and weaknesses.
Linear Model
Our final linear model ended up slightly biased, with its lowest predictions being under the true AQI and its higher predictions tending to over-predict by about 10 points. After comparing several linear model specifications, we opted to use quadratic linear regression. This quadratic model performed better than the original linear regression, but did not remove all of the bias on the high and low predictions.
Despite its biases and inaccuracies, the linear model is useful for its interpretability. We found three of our predictors to be far more important than the others by using LASSO regression, giving us a simple understanding of the basic associations at play. See the formula below:
\[ AQI = 58.61 + 12.20 * max\_no2 + 8.16 * max\_temperature - 6.45 * min\_humidity \]
Note that in this formula, max_no2, max_temperature, and min_humidity are standardized, measured in standard deviations rather than typical units. This gives us a simple understanding that high maximum no2 and temperature in a day indicate high next-day AQI, while high minimum humidity indicates lower next-day AQI. This is consistent with the correlation trends in Figure 3 in the Exploratory Data Analysis section. The AQI this model predicts is based on ozone, indicating that there is a correlation between no2 and ozone. This makes sense, as both are typically associated with fossil-fuel-burning cars and industrial facilities, so we would expect them to be highly correlated. Despite being so simplified, this model still has an RMSE of only 21.85 and an R-squared of 0.641. This means that this relatively simple model accounts for 64.1% of the variance in next-day AQI, and its next-day AQI predictions are typically off by roughly 21.9 points.
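To make the standardization concrete, the sketch below applies the formula above to one day of raw measurements. The intercept and coefficients come from the formula; the means and standard deviations used to standardize the inputs are placeholder values chosen for illustration, since the training-set scaling statistics are not reported here.

```python
# Intercept and coefficients from the LASSO formula above.
INTERCEPT = 58.61
COEFFICIENTS = {"max_no2": 12.20, "max_temperature": 8.16, "min_humidity": -6.45}

# HYPOTHETICAL (mean, standard deviation) pairs for each predictor.
# The real values would come from the training data.
SCALING = {
    "max_no2": (20.0, 10.0),
    "max_temperature": (70.0, 15.0),
    "min_humidity": (40.0, 20.0),
}

def predict_next_day_aqi(raw):
    """Standardize each raw measurement, then apply the linear formula."""
    prediction = INTERCEPT
    for name, coefficient in COEFFICIENTS.items():
        mean, std = SCALING[name]
        prediction += coefficient * (raw[name] - mean) / std
    return prediction

# A day one standard deviation above average in no2 and temperature,
# and one standard deviation below average in humidity.
polluted_day = {"max_no2": 30.0, "max_temperature": 85.0, "min_humidity": 20.0}
print(round(predict_next_day_aqi(polluted_day), 2))
```

A day with exactly average measurements returns the intercept, 58.61, which matches the mean AQI in Table 4.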
Decision Tree
Unlike our other models, our decision tree’s predictions form discrete vertical lines on the residuals plot because the decision tree only splits its predictions into a relatively small number of distinct values. This limitation makes it impossible for the decision tree to perfectly predict the data, and it performs slightly worse than the linear model (Table 6), even though it can respond to non-linear relationships very effectively. The decision tree also allows us to directly see how it makes its decisions. The following decision tree flowchart is simplified down to only 4 layers of splits, but should grant insight into which variables are most important in which circumstances:
At each step, this decision tree chooses one value of one variable that best splits the data into higher AQI and lower AQI categories. The first, largest, split separates based on no2: days with a very low maximum no2, below 18 ppb, are directed to the left side of the tree, which tends to have lower next-day AQI. This is consistent with our finding from the linear model, as we know there is a meaningful positive correlation between no2 and ozone (and therefore AQI based on ozone).
For low no2 days, the tree finds that high maximum temperature is the best indicator of high next-day AQI, splitting at a maximum temperature of 81.65 degrees Fahrenheit. For high no2 days, the tree finds that high minimum pressure is the best indicator of high next-day AQI, splitting at a minimum pressure of 982.86 millibars.
This goes to show that next-day AQI is complicated to predict, with the different sides of the tree using different sets of variables. This tree indicates that there are complex interactions between atmospheric conditions that contribute to next-day AQI.
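The first two layers of splits described above can be read as nested conditions. The sketch below encodes only the thresholds stated in the text; it returns a description of the branch rather than a predicted AQI because the leaf values are not listed here, and whether each threshold is inclusive is an assumption.

```python
def first_two_splits(max_no2_ppb, max_temp_f, min_pressure_mb):
    """Follow the top two layers of the simplified AQI decision tree."""
    if max_no2_ppb < 18:
        # Low-no2 side: the next split is on maximum temperature.
        if max_temp_f <= 81.65:
            return ("low no2", "lower maximum temperature")
        return ("low no2", "higher maximum temperature")
    # High-no2 side: the next split is on minimum pressure.
    if min_pressure_mb <= 982.86:
        return ("high no2", "lower minimum pressure")
    return ("high no2", "higher minimum pressure")

print(first_two_splits(max_no2_ppb=10, max_temp_f=90, min_pressure_mb=1000))
```

Note that pressure is never consulted on the low-no2 side and temperature is never consulted on the high-no2 side at this depth, which is the asymmetry discussed above.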
Random Forest
Our final optimized random forest model averages the results of 500 decision trees, each of which is trained on a random subset of 6 of our predictors, a random subset of our data, and splits into many more AQI categories than the decision tree above. This leads each of the individual decision trees to over-fit to different small trends in our training data. By taking the average of so many over-fit trees, each of which recognizes different indicators of AQI, the random forest is very effective at predicting AQI, but does not lend itself to a more detailed human understanding.
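The averaging idea can be illustrated with a deliberately tiny stand-in: each "tree" below is a one-split regression stump fit on a bootstrap sample and a single randomly chosen feature, and the forest prediction is the mean of the stump predictions. This is a simplified sketch of bootstrap aggregation with feature subsampling, not our tuned model (which used 500 much deeper trees); all data and names are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on one feature by minimizing squared error."""
    values = sorted(set(row[feature] for row in rows))
    overall = sum(targets) / len(targets)
    best = (float("inf"), None, overall, overall)  # (sse, threshold, left, right)
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return (feature,) + best[1:]

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    if threshold is None:  # the sample had a single value for this feature
        return left_mean
    return left_mean if row[feature] <= threshold else right_mean

def fit_forest(rows, targets, n_trees, rng):
    """Fit each stump on a bootstrap sample and one randomly chosen feature."""
    forest = []
    n = len(rows)
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(len(rows[0]))          # random feature subset of size 1
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], feature))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all stumps."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)

rng = random.Random(0)
# Feature 0 drives the target; feature 1 is pure noise.
rows = [(float(x), rng.random()) for x in range(20)]
targets = [10.0 if x < 10 else 50.0 for x in range(20)]
forest = fit_forest(rows, targets, n_trees=25, rng=rng)
low = predict_forest(forest, (2.0, 0.5))
high = predict_forest(forest, (17.0, 0.5))
print(low < high)
```

Stumps that happen to draw the noise feature contribute little useful signal, but averaging across many stumps lets the informative feature dominate the final prediction.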
Walkability
We used four different types of regression models to predict walkability index: Linear Regression, Decision Tree, KNN, and Random Forest. For each model, we calculated the R-squared and RMSE based on 10-fold cross-validation.
Model Specification | R-Squared | Root Mean Squared Error |
---|---|---|
Linear Regression | 0.7503 | 1.16619 |
K-Neighbors Regression | 0.8895 | 0.77201 |
Decision Tree Regression | 0.8929 | 0.762693 |
Random Forest Regression | 0.8934 | 0.761183 |
Table 7. Best performance metrics of our four models for walkability index
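The 10-fold cross-validation behind these metrics divides the rows into ten folds and holds each fold out once for validation while training on the other nine. The sketch below shows one way to generate such splits; the interleaved fold assignment and the function name are illustrative choices, and a library routine would normally be used instead.

```python
def k_fold_splits(n_rows, k=10):
    """Return k (training_indices, validation_indices) pairs covering all rows."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Assign row i to fold i % k so fold sizes differ by at most one.
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    splits = []
    for held_out, validation in enumerate(folds):
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((training, validation))
    return splits

splits = k_fold_splits(25, k=10)
for training, validation in splits[:2]:
    print(len(training), validation)
```

Each row is used for validation exactly once, so the averaged metrics reflect performance on data the model was not trained on.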
The model that resulted in the best cross-validated metrics was our Random Forest, with an R-squared of 0.893 and an RMSE of 0.761 (Table 7). This means that this model is able to account for about 89% of the variance in walkability index, and its predictions were typically off by only about 0.76 walkability index points.
On the other hand, our most interpretable model, the decision tree, still resulted in reasonable performance metrics, with an R-squared of 0.719 and an RMSE of 1.223. The following sections describe our methodologies in more detail.
Linear Regression
Our linear model achieved an R-squared value of 0.819, meaning it accounts for a non-trivial amount of variance in walkability index. As a result, interpretations of the largest coefficients are especially valuable. The following plot shows the six coefficients of largest magnitude.
Figure 9: Bar plot of top six variables, as indicated by OLS
There are three particularly large coefficients, corresponding to D3B, D2B_E8MIXA, and D2A_EPHHM. Because these three coefficients are all close to one, a single-unit increase in non-car-oriented street intersection density, employment entropy, or household entropy is roughly associated with a single-unit increase in the walkability index.
Conversely, the network density has a negative coefficient of about -0.5. The variable D3BPO3 specifically refers to the density of pedestrian-oriented 3-way intersections. This could indicate that 3-way intersections are not as good at promoting street connectivity. It is possible that this is because stop signs are easier for pedestrians to cross than traffic lights are. Overall, these coefficients indicate that areas crowded with vehicles tend to be less walkable even if there are a lot of pedestrian intersections.
Decision Tree Regression
A decision tree, much like our linear regression model, has the benefit of being highly interpretable, even if its predictive ability is not optimal. We created our decision tree with very small maximum depth to examine and interpret the whole tree easily. This did result in substantial underfit, hence why it had the worst cross-validated performance metrics of all the models. Fitting this model on the walkability data resulted in the following tree:
The first split was made on the variable D3B. This shows that having a lower density of street intersections (not including auto-oriented intersections) is an indicator of a lower walkability score. This makes sense, as areas with fewer intersections would be more disconnected for pedestrians, making walking between destinations less feasible.
For areas with a lower density of street connections, the next split considers the mix of employment types (D2B_E8MIXA), showing that areas with lower job diversity tend to have lower walkability indices.
For areas with a higher density of street connections, the next split considers the diversity in residence types, showing that areas with fewer housing options tend to have lower walkability indices.
The fact that the different branches of the tree consider different variables alludes to the fact that walkability index is a complicated metric that weighs many factors in a nuanced manner.
Random Forest Regression
With such a large dataset, a random forest algorithm was a very compute-intensive modeling option. Regardless, we tuned a Random Forest model, resulting in the best cross-validated metrics of any of our models. This random forest has excellent predictive accuracy, so we would recommend it when trying to predict walkability scores, but we cannot use it to gain any further interpretable insight.
K-Neighbors Regression
To tune the optimal number of neighbors for our KNN model, we performed a grid search. To perform this grid search, we fit each KNN model on the same training data and compared their performance on the same validation set. The following plot shows how validation R-squared changes with k.
For the walkability data, it turned out that a relatively low value of k (in comparison to the entire size of the dataset) optimized validation R-squared. We settled on the optimal number of neighbors being around 10, as this is where the so-called ‘elbow’ in the graph is.
KNN is a computationally expensive algorithm due to both the breadth and depth of the walkability dataset. To reduce this cost, we fit our different KNN models on only the top 6 predictors as indicated by OLS (see Figure 9). The final model specification achieved a cross-validated RMSE of 0.772 and an R-squared of 0.889. Although KNN performs worse than Random Forest for this data, the decent metrics from KNN demonstrate that there is more than one way to effectively predict walkability index. However, just like the Random Forest, our KNN model lacks interpretability.
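The tuning procedure for k can be sketched as follows: fit a KNN regressor for each candidate k on the same training data, score each on the same validation set, and compare validation R-squared. The data below are invented one-dimensional points, and the sketch simply picks the k with the highest score, whereas we chose k by inspecting the elbow of the curve.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Predict by averaging the targets of the k nearest training points."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    return sum(y for _, y in by_distance[:k]) / k

def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical training and validation data (one feature, target = 2 * feature).
train_x = [(float(i),) for i in range(10)]
train_y = [2.0 * i for i in range(10)]
val_x = [(2.5,), (6.5,)]
val_y = [5.0, 13.0]

# Grid search: same training data and validation set for every candidate k.
scores = {}
for k in (1, 2, 5):
    predictions = [knn_predict(train_x, train_y, q, k) for q in val_x]
    scores[k] = r_squared(val_y, predictions)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Every prediction requires a distance to each training row, which is why restricting the model to a handful of predictors makes a noticeable difference on a dataset of this size.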
Discussion
AQI
Our model results demonstrate that we can forecast ozone-based AQI effectively without knowledge of ozone levels. Our random forest was typically off by only 19.642 AQI points, rarely predicting far from the actual AQI. As the estimates are in the same ballpark as the real AQI, they could potentially serve to give the public a general idea of how unsafe atmospheric ozone levels are in the absence of ozone measurements. We tuned several models across various hyperparameters to arrive at a random forest predictor this accurate. While our other models didn’t prove to be as impactful, the linear model gives us an intuitive understanding that ozone is associated with high no2, high temperatures, and low humidity.
Walkability
Our model results indicate that walkability index can be predicted quite well using the publicly available data. We were able to explain 89.3% of the variance in walkability index using a Random Forest model, which is an impressive number. KNN was able to achieve comparable but slightly worse results, explaining 89% of the variance in walkability index. Although neither of these models is particularly interpretable, their predictive power is impressive.
Our Linear Regression and Decision Tree models were highly interpretable and still explained 75% and 89.3% of the variation in Walkability Index, respectively. For such interpretable models, these performance metrics are remarkable. Linear Regression was the weaker of the two, however: its predictions were off by more than 1 point on average (RMSE of 1.17), and it explained only 75% of the variation in walkability index.
Ethics
Good uses of our work
It is important that as we give you these powerful machine learning model fits, you are aware of the importance of their proper usage. We hope that you will use the work detailed in this report to better understand the atmospheric factors that contribute to the future air quality of an area and its characteristics that determine its walkability. For the AQI model in particular, we also expect that you can use this model to effectively predict the air quality based on ozone even when direct ozone measurements are unable to be taken, using other atmospheric conditions as a proxy measurement.
For the walkability model, we would support using our models to identify areas with poor walkability where it is currently unknown, so that changes could be implemented to make those areas more pedestrian-friendly. The goal of these models is to highlight a small number of factors that have a large impact on how walkable a place is. By identifying these characteristics, it is our hope that civil projects can be undertaken and that public policy changes can be implemented to improve walkability for all communities, but especially those that need it most. For example, there could be changes to local codes that require a certain intersection density, or federal incentives for maintaining high public transportation service levels. Walkability is very strongly linked to physical health and quality of life metrics, so it is of the utmost importance to ensure all communities are walkable.
Misuses
Although there are many appropriate uses of our work, we also want to forewarn you about the potential misuses to avoid. Though our models predict ozone-based AQI, they should not be used as a proxy for AQI when proper measurements are available, as it could miss some important and dangerous fluctuations. The model’s intended use is as an emergency measure when ozone measurements are not available.
As the EPA, it is your responsibility to use the knowledge gained from this work and the corresponding machine learning models to help the environment, not to harm it. If misused, these models could ultimately end up doing more harm than good. For example, entities responsible for emitting large amounts of pollutants could use these models to find ways to disguise their harmful ozone production. If these enterprises learn that cold, humid days see low ozone, they may choose to emit more fumes during those times, which could cause more harm to the environment.
Care should be taken to protect public transportation services even if the area would be “walkable enough” without having such high service levels. It would also not be appropriate to target low-income housing developments to places that are less walkable simply because these areas tend to have a higher percentage of low-wage workers.
Bias
We must acknowledge that our project is subject to biases - deliberate or otherwise. The primary bias in the AQI analysis came from the data collection process. Our final data set consisted of atmospheric measurements from only 43 counties across 32 states, meaning that a substantial portion of the country is left unaccounted for. Also, we did not investigate the geographic locations of the weather stations that were monitoring the atmospheric conditions; hence it is possible that these measurements are not representative of the general population of atmospheric conditions (For example, perhaps weather monitoring stations tend to be close to cities, making this analysis less representative of rural areas). Finally, the data we analyzed came only from the years 2021 and 2022 - meaning that insights and trends illuminated in this report may not reflect the historical patterns.
For the walkability dataset, we have very serious concerns about bias. Many variables in the dataset required information about public transit services. Local transit agencies should use the General Transit Feed Specification (GTFS) to share transit information in a standardized format. Although 95% of total transit ridership in the US is covered by GTFS, not all places develop GTFS data for their systems or share it, particularly smaller transit agencies. Since many variables depended on this information, observations without it were removed. Unfortunately, this had a severely disproportionate impact on the less walkable places. Prior to removal, 43,250 places were designated as “Least walkable” and 68,640 places were “Below average walkable.” After removal, only 10 least walkable places and 7,583 below average walkable places remained. Most walkable and above average walkable places lost only 4% and 15% of their locations, respectively. Since the Walkability Index categories are very unevenly represented in the data our models were trained on, they will likely (and unrealistically) predict that areas are more walkable than they are.
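The scale of this imbalance is easier to see as retention rates. The short calculation below uses only the counts quoted above for the two least walkable categories.

```python
# Counts quoted above, before and after removing observations without GTFS data.
before = {"Least Walkable": 43_250, "Below Average Walkable": 68_640}
after = {"Least Walkable": 10, "Below Average Walkable": 7_583}

# Fraction of each category that survived the removal.
retained = {category: after[category] / before[category] for category in before}
for category, share in retained.items():
    print(f"{category}: kept {share:.2%}, lost {1 - share:.2%}")
```

Roughly 0.02% of the least walkable places and about 11% of the below average walkable places were kept, compared with losses of only 4% and 15% for the two most walkable categories.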
Conclusion
Our goal was to leverage machine learning to assist you, the EPA, with your mission of furthering environmental protection, advancing environmental justice, and safeguarding communities. We specifically created a tool to help you understand the factors that contribute to next-day AQI and predict the harmful effects of ozone without any direct measurements of ozone concentrations. We also developed a model that can capture over 80% of the variance in walkability index using built environment characteristics. This can help you inform the public of air quality danger and help lobby for projects to reduce the car dependency of locations with low walkability.
For both the AQI and walkability datasets, random forests proved to be the model specifications with the highest predictive power as measured by R-squared and RMSE. We discovered that the most important variables for predicting next-day AQI were maximum no2, maximum temperature, and minimum humidity. On the other hand, the most important variables for predicting the walkability index were those relating to diversity of employment, road network density, and diversity of housing types.
The single model we would recommend to you for predicting next-day AQI is our random forest model, as it has the best predictive ability. This report has sufficient information on the contributing factors to next-day ozone AQI such that the more interpretable models are not worth using at the cost of predictive power.
The single model we would recommend to you for predicting the walkability index of a location is the linear regression model, because it has a good balance of predictive power and interpretability. Because walkability index is such a complicated metric and this dataset has so many features, it is important to preserve the ability to interpret how and why the model is making the predictions it is.
Appendix
Sources
“AirData Website File Download Page.” EPA, Environmental Protection Agency, 1 Jan. 2023, https://aqs.epa.gov/aqsweb/airdata/download_files.html.
“How Weather Affects Air Quality.” UCAR Center for Science Education, https://scied.ucar.edu/learning-zone/air-quality/how-weather-affects-air-quality.
Chapman, Jim, et al. “Smart Location Database Technical Documentation and User Guide.” EPA, Environmental Protection Agency, June 2021, https://www.epa.gov/smartgrowth/smart-location-database-technical-documentation-and-user-guide.
Thomas, John, et al. “National Walkability Index Methodology and User Guide.” EPA, Environmental Protection Agency, June 2021, https://www.epa.gov/smartgrowth/national-walkability-index-user-guide-and-methodology.
“Walkability Index.” Catalog, Responsible Party U.S. Environmental Protection Agency, Office of Sustainable Communities (Point of Contact), 3 July 2021, https://catalog.data.gov/dataset/walkability-index.