California Fire History

Analyzing official California fire history data collected by government agencies dating back to the late 1800s. This is done using the R programming language.

Tommy Smale true
2022-03-02

Introduction

California has had its largest fires in recorded history in the last 2 years. Fires have had an effect on everyone in California, whether due to air quality or loss of property and loved ones. One school morning I woke up to a sky full of dark smoke; the smoke was so bad I could barely see down the block. Since that day in Chico there have been multiple other instances where smoke has blanketed the skies of California and ash has fallen from the sky. Smoke over the horizon is a terrible feeling, as life as you know it can rapidly change without there being anything you can do about it. As data science students, we are often just on our computers, but we are learning complex skills which we can apply to the real world to make a difference. This project is a great opportunity to spend our time helping in the effort to solve current real-world problems.

Goal

I will be exploring this fire perimeter dataset to learn more about what measures CAL FIRE is taking to combat fires, and whether any of these measures have been successful or unsuccessful. In addition, I would like to see what is causing fires in California, just how many acres are burning, and what areas have been affected the most.

My initial data science questions are:

  1. What were some of the biggest fires in California history?
  2. Which agencies respond to the most fires?
  3. How long does it take for fires to be contained?
  4. What causes the biggest fires?
  5. Are certain causes of fires becoming more or less frequent?
  6. Can the fire cause be predicted?
  7. Is there a relationship between year and fire size?
  8. Are no burn days effective in preventing human caused fires?
  9. Can we calculate what troops are the busiest?
  10. Is there a linear relationship between fire size and duration?

About the dataset

I am using the 2020 California Fire Perimeters data set available from gis.data.gov. In this data set, a fire perimeter is the boundary of a fire, measured in acres. The data comes from many different government agencies covering different time periods: it includes prescribed burns and wildfire perimeters dating back to 1950 for CAL FIRE, while the United States Forest Service (USFS) has submitted records as far back as 1878 and the National Park Service (NPS) as far back as 1921. Data from 2021 will not be released until the fire season is over, which is ongoing at the time of this writing (Dec 18 2021).

The types of fires included in this report have varied over the decades, which may add bias to the results:

  - From 1950 to 2001, the data set included USFS fires 10 acres and greater and CAL FIRE fires 300 acres and greater.
  - BLM and NPS began inputting data in 2002, collecting fires 10 acres and greater.
  - From 2002-2014, CAL FIRE expanded its criteria by including timber fires 10 acres and greater, brush fires 50 acres and greater, grass fires 300 acres and greater, wild land fires destroying 3 or more structures, and wild land fires causing $300,000 or more in damage.
  - From 2014 on, the monetary requirement was dropped and the damage requirement became 3 or more habitable or commercial structures.

In 1989 each unit was assigned to verify all 300-plus-acre fires from before 1989, and as a result there is a statewide GIS layer covering 1950-1999. One source of error when exploring this data is that duplicates may exist. For example, USFS and CAL FIRE could both capture a fire perimeter and submit it, and in some cases they could even report different perimeters for the same fire. While duplicate records are rare, an asterisk marks the records that are the most accurate.

Explanation of variables

Discovery:

[1] "The fire dataset dimensions are (21318, 17)"
#Count NA's, placeholder strings, and non-positive values in every column
bad_strings <- c("", " ", "UNKNOWN", "UKNOWN", "N/A")
missing_data <- numeric(ncol(fires))
for(col in 1:ncol(fires)) { 
  nans <- sum(is.na(fires[, col]))
  bad_strings_count <- sum(fires[, col] %in% bad_strings)
  #na.rm so columns containing NA's still report their non-positive values
  zeros <- sum(fires[, col] <= 0, na.rm = TRUE)
  missing_data[col] <- nans + bad_strings_count + zeros
}
missing_data <- data.frame(colnames(fires), missing_data)
colnames(missing_data) <- c("Column", "Num NA's, '', or <= 0's")
missing_data
         Column Num NA's, '', or <= 0's
1          year                      77
2         state                       6
3        agency                      10
4       unit_id                      46
5     fire_name                   13416
6       inc_num                    1852
7    alarm_date                   10728
8     cont_date                   25360
9         cause                      48
10     comments                   36974
11    report_ac                   12551
12    gis_acres                       7
13     c_method                   12222
14    objective                     195
15     fire_num                    8123
16 shape_length                       0
17   shape_area                   21317

There are many ways to deal with missing data, like ignoring it or replacing it with the mean or median. The important thing is to not change the integrity of the data when you manipulate missing values. The majority of the NA's are in the columns report_ac and c_method, which means I will use gis_acres instead. Most of the remaining bad values are empty strings like ''. This does not matter so much for comments but is a concern for inc_num, alarm_date, cont_date, and fire_num. For fire_name, many of the "UNKNOWN" fire names may stem from small prescribed burns. This is something to keep in mind as I use this data, and I will check whether those values originate from the older records or not.
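As a quick sketch of the two simplest options, here they are applied to a toy data frame (hypothetical values, not the real `fires` table):

```r
#Toy stand-in for the fires data (hypothetical values)
toy <- data.frame(
  fire_name = c("CANYON", "UNKNOWN", "", "RIVER"),
  report_ac = c(120, NA, 45, NA),
  stringsAsFactors = FALSE
)
bad_strings <- c("", " ", "UNKNOWN", "UKNOWN", "N/A")

#Option 1: convert placeholder strings to NA so one rule covers everything
toy$fire_name[toy$fire_name %in% bad_strings] <- NA

#Option 2: impute a numeric column with its median (this changes the
#distribution, so only do it when the analysis can tolerate it)
toy$report_ac[is.na(toy$report_ac)] <- median(toy$report_ac, na.rm = TRUE)
toy
```

Whichever option is used, it should be applied consistently so downstream counts and averages remain comparable.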

Brief summaries of every variable

YEAR
[1] "The range of years in the dataset is from 1878 to 2020"

This graph does not prove that more fires are occurring every year, because we do not know how accurate reporting was in the early 1900s. CAL FIRE's data goes back to 1950, but the USFS data here goes back to 1878. The two years with the most data are 2017 and 2020, which saw the worst fires in recent history.

STATE
Table 1: Number of observations for each state
states Freq
AZ 1
CA 21226
NV 73
OR 15

None of the fires recorded outside of California were under CAL FIRE's jurisdiction, which suggests those fires originated outside the state. These records can be misleading, since a single perimeter can span both California and the neighboring state, and there is no way to determine how much of the boundary lies on the California side.

AGENCY
Table 2: Number of reports by each agency
agency num_fires
USF 9556
CDF 6563
CCO 3222
NPS 1011
BLM 652
DOD 163
LRA 111
FWS 22
BIA 12
OTH 1

Surprisingly, the United States Forest Service has been in charge of more fires than CAL FIRE. However, this is likely due to it reporting many smaller prescribed burns. I wonder who is in charge of more land, and how jurisdiction is delegated. Private (PVT) was an option in the official documentation, but there appear to be no occurrences in this data set. It seems like the groups to pay the most attention to are BLM, CCO, CDF, LRA, NPS, and USF.

We continue to see that the United States Forest Service has jurisdiction over more large fires than CAL FIRE. Despite the increase in fires in recent years, the number of contract counties in charge of fires has decreased. However, the Bureau of Land Management and the National Park Service have been increasing their aid in combating California fires.

UNIT_ID

out_of_state_units <- fires %>% 
  select(state, unit_id) %>% 
  filter(state != "CA" & state != "" & !is.na(state) & unit_id != "") %>% 
  unique()
in_state_units <- fires %>% 
  select(state, unit_id) %>% 
  filter(state == "CA" & unit_id != "") %>%
  unique()
#See if there are any units belonging to both states
cross_state_units <- intersect(in_state_units$unit_id, 
                               out_of_state_units$unit_id)
cross_state_units_df <- fires %>% 
  filter(unit_id %in% cross_state_units & state != "" & !is.na(state)) %>% 
  select(state, agency, unit_id, gis_acres) %>% 
  group_by(state, agency, unit_id) %>% 
  summarise(gis_acres_total = sum(gis_acres)) %>% 
  arrange(unit_id)
kable(cross_state_units_df, caption = 
        "The following units were responsible for fires in multiple states")
Table 3: The following units were responsible for fires in multiple states
state agency unit_id gis_acres_total
CA BLM CCD 1392.92076
NV BLM CCD 25.46643
CA USF HTF 26751.67800
NV USF HTF 14267.09300
CA USF KNF 235463.13349
OR USF KNF 476.97339
CA BLM NOD 18492.59674
NV BLM NOD 17.40460
CA USF TMU 3092.50918
CA USF TOI 18235.83105
NV USF TOI 10366.60557

The reason that some units are in charge of fires in multiple states is that they belong to federal organizations like the United States Forest Service or the Bureau of Land Management.

calfire_units_biggest_fire <- fires %>% 
  filter(agency == "CDF" & !is.na(gis_acres)) %>% 
  group_by(unit_id) %>% 
  slice(which.max(gis_acres)) %>% 
  select(year, agency, unit_id, fire_name, gis_acres) %>% 
  arrange(desc(gis_acres)) %>% 
  head(n=10)
kable(calfire_units_biggest_fire, caption="CAL FIRE Units that Responded to the Top 10 Biggest Fires in California History")
Table 4: CAL FIRE Units that Responded to the Top 10 Biggest Fires in California History
year agency unit_id fire_name gis_acres
2018 CDF LNU RANCH 410202.47
2020 CDF SCU SCU COMPLEX 396399.00
2007 CDF SBC ZACA 240358.70
2018 CDF SHU CARR 229651.41
1977 CDF BEU MARBLE-CONE 173333.45
2007 CDF MVU WITCH 162070.47
2018 CDF BTU CAMP 153335.56
1990 CDF TGU CAMPBELL 131504.22
2020 CDF CZU CZU LIGHTNING COMPLEX 86553.46
1985 CDF SLU LAS PILITAS 84271.42

I would like to calculate the busiest years for each unit; however, I have not settled on an accurate way of calculating this. Factors to consider are the center, spread, range, count, and maximum fire size for each unit every year. I need a function that will return the hardest year based on this data.
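One candidate definition, sketched on hypothetical toy records rather than the real `fires` table: call the hardest year the one with the most fires, breaking ties by total acres burned.

```r
library(dplyr)

#Hypothetical toy records; the real call would pass the fires data frame
toy_fires <- data.frame(
  unit_id   = c("BTU", "BTU", "BTU", "SCU", "SCU"),
  year      = c(2018, 2018, 2020, 2019, 2020),
  gis_acres = c(153335, 200, 50, 1000, 396399)
)

#Hardest year per unit: most fires, ties broken by total acreage
busiest_year_per_unit <- function(df) {
  df %>%
    group_by(unit_id, year) %>%
    summarise(num_fires = n(),
              total_acres = sum(gis_acres), .groups = "drop") %>%
    group_by(unit_id) %>%
    arrange(desc(num_fires), desc(total_acres), .by_group = TRUE) %>%
    slice(1) %>%
    ungroup()
}

busiest_year_per_unit(toy_fires)
```

Other definitions (for example, weighting by median fire size or duration) would just change the columns computed in `summarise` and the sort order in `arrange`.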

FIRE_NAME
#There's still other bad fire names in the data set
bad_fire_names <- c("UNKNOWN", "N/A", "", "UKNOWN", " ")
popular_fire_names <- fires %>%
  filter(!is.na(fire_name)) %>% 
  subset(!(fire_name %in% bad_fire_names)) %>% 
  group_by(fire_name) %>% 
  count(name="count") %>% 
  arrange(desc(count)) %>% 
  head(n=10)
kable(popular_fire_names, caption="10 Most Popular Names for Fires")
Table 5: 10 Most Popular Names for Fires
fire_name count
CANYON 45
RIVER 39
LAKE 37
CREEK 35
COTTONWOOD 33
BEAR 32
RIDGE 30
RANCH 28
SPRING 28
PINE 27
biggest_fire_names <- fires %>% 
  arrange(desc(gis_acres)) %>% 
  select(year, fire_name, gis_acres) %>% 
  head(n=5)
kable(biggest_fire_names, caption="Top 5 Biggest Fires in California History and their Names")
Table 6: Top 5 Biggest Fires in California History and their Names
year fire_name gis_acres
2020 AUGUST COMPLEX FIRES 1032699.0
2002 BISCUIT 501082.0
2018 RANCH 410202.5
2020 SCU COMPLEX 396399.0
2020 CREEK 379882.2
[1] "A moment of condolences for anyone affected by the Camp Fire in Paradise or any other fire."
  fire_name year unit_id gis_acres alarm_date  cont_date
1      CAMP 2018     BTU  153335.6 2018-11-08 2018-11-26

Fire names are not unique, and some fires share the same name. Many fire names have also been entered incorrectly and contain backslashes or other gibberish.
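A small base-R sketch of how those names could be scrubbed, using hypothetical messy values (the real column would be `fires$fire_name`):

```r
#Hypothetical messy fire names
raw_names <- c("CAMP\\", "  RIVER ", "BEAR#2", "CREEK")

clean <- gsub("\\\\", "", raw_names)          #remove literal backslashes
clean <- gsub("[^A-Za-z0-9 #.-]", "", clean)  #drop other stray characters
clean <- trimws(clean)                        #trim surrounding whitespace
clean
```

The whitelist in the second `gsub` is an assumption about which characters are legitimate in a fire name; it would need adjusting after inspecting the real values.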

ALARM_DATE

new_dates <- fires
new_dates$alarm_date <- as.Date(new_dates$alarm_date, "%Y/%m/%d")
new_dates$cont_date <- as.Date(new_dates$cont_date, "%Y/%m/%d")
busiest_days <- new_dates %>% 
  filter(!is.na(alarm_date)) %>% 
  group_by(alarm_date) %>% 
  summarise(num_fires = n(), 
            mean = mean(gis_acres), 
            median = median(gis_acres), 
            sd=sd(gis_acres)) %>% 
  arrange(desc(num_fires)) %>% 
  head(n=5)
kable(busiest_days, caption="Top 5 Number of Daily Fires Since 1878")
Table 7: Top 5 Number of Daily Fires Since 1878
alarm_date num_fires mean median sd
2008-06-21 122 5178.631 245.33860 16386.395
1987-08-30 61 8144.967 1932.24430 13913.601
2003-09-03 56 515.189 58.68177 1629.138
2015-07-30 50 4250.587 389.58737 10438.913
1999-08-23 47 7053.712 1571.95800 19095.666

It is hard to imagine that 122 fires were logged on June 21, even if it is a summer day. Before I analyze more dates, let's find out the meaning behind this. One explanation is that it could be a batch of controlled burns. Many of these did share the same inc_num; however, many had different names and containment dates. While the primary cause of the many fires on July 30, 2015 was smoking, on June 21, 2008 it was just lots of lightning.

CONT_DATE
longest_fires <- new_dates %>% 
  select(fire_name, alarm_date, cont_date, cause, objective) %>% 
  filter(!is.na(alarm_date) & !is.na(cont_date)) %>% 
  mutate(fire_duration = 
           difftime(cont_date, alarm_date, units="days")) %>% 
  arrange(desc(fire_duration)) %>% 
  head(n=10)
kable(longest_fires, caption="Longest Lasting Fires in California History Since 1878")
Table 8: Longest Lasting Fires in California History Since 1878
fire_name alarm_date cont_date cause objective fire_duration
PIUTE 1990-08-13 1999-09-14 1 1 3319 days
RALSTON 2006-09-05 2007-09-17 9 1 377 days
WESTLAKE 2006-07-08 2007-07-09 2 1 366 days
LAVAL 2012-06-03 2013-06-03 14 1 365 days
GORGE 2018-08-19 2019-08-19 14 1 365 days
PANTHER 2013-05-01 2013-12-09 9 1 222 days
POWERHOUSE 2013-05-30 2013-12-18 9 1 202 days
LAKE 2015-06-17 2015-12-31 9 1 197 days
ROSASCO 2019-05-23 2019-12-02 9 1 193 days
PONY 2016-06-07 2016-12-15 1 1 191 days

In the data set some alarm_dates and cont_dates are entered incorrectly, such as the cont_date occurring before the alarm_date, so there is a possibility that these results have also been entered incorrectly. The reason some of these fires last so long is that fire responders allow it: the fire is in a rural area, and fires can be beneficial. The cause of most of these fires is lightning, unknown, or miscellaneous.
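Those impossible date pairs can be screened out before computing durations. A sketch on toy dates (hypothetical values standing in for alarm_date and cont_date):

```r
#Toy rows, one with a cont_date entered before the alarm_date
toy <- data.frame(
  fire_name  = c("GOOD", "SWAPPED"),
  alarm_date = as.Date(c("2018-11-08", "2019-06-01")),
  cont_date  = as.Date(c("2018-11-26", "2019-05-01"))
)

#Keep only rows whose containment follows the alarm, then compute duration
valid <- toy[!is.na(toy$alarm_date) & !is.na(toy$cont_date) &
               toy$cont_date >= toy$alarm_date, ]
valid$duration_days <- as.numeric(valid$cont_date - valid$alarm_date)
valid
```

A stricter version could also flag suspiciously long durations (like the 3319-day PIUTE record) for manual review rather than dropping them.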

[1] "Fire Duration (Hours) Summary Statistics"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     24      24      72     368     288    8760

A problem is that there are so many missing values in cont_date that a fire duration can be computed for only a small fraction of the data set. Among the fires where it is available, there is a huge variance in duration. The shorter-lasting fires are not just prescribed burns and have many different causes.

CAUSE
Table 9: Causes of the Biggest Fires on Average since 1878
cause string_cause avg_size count
4 Campfire 3530.47878 380
1 Lightning 2969.34811 3454
16 Aircraft 2474.22459 14
9 Miscellaneous 2285.43451 3379
7 Arson 2200.11588 903
10 Vehicle 1534.41743 454
14 Unknown/Unidentified 1496.36036 9541
2 Equipment Use 1460.92934 1246
11 Power Line 1274.56765 412
6 Railroad 1169.29763 80
3 Smoking 1048.65940 342
13 Non-Firefighter Training 1013.68027 11
5 Debris 875.22631 723
8 Playing with Fire 657.54545 196
18 Escaped Prescribed Burn 623.48734 90
12 Firefighter Training 594.01999 5
15 Structure 391.06574 21
19 Illegal Alien Campfire 83.46374 17

As I previously mentioned, the reason some fires' durations are so long is that they are in remote areas and allowed to continue burning. Despite this, human-induced fires, like campfires, still produce bigger fires on average than naturally induced fires like lightning. In the next sections we will explore whether certain causes of fires are becoming more or less frequent.

COMMENTS

The maximum length of a comment is 260 characters; longer comments are cut off with an * indicating there is more to the comment somewhere else. Reading through them, the comments most interesting to me were "The cause was target shooting", "… Total Cost 18,600,600", "Children playing with fire", and names of people. Most of the comments in the data set are empty.

GIS_ACRES
Table 10: Highest Acres Burnage in California History since 1878
year gis_acres square_miles
2020 4159334.0 6498.959
2018 1590431.2 2485.049
2017 1424559.4 2225.874
2008 1382462.2 2160.097
2007 1040224.3 1625.350
2003 970479.3 1516.374
2002 963898.5 1506.091
1987 862910.9 1348.298
2012 847714.6 1324.554
1999 801137.0 1251.777

For reference, 500,000 acres is about 780 square miles, a square mile being a square with each side 1 mile in length. Yosemite National Park is 1,169 square miles, and New York City is 302 square miles.
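The conversion behind these comparisons is simply 640 acres per square mile, which is also how the square_miles column above can be derived from gis_acres:

```r
#640 acres per square mile
acres_to_sq_miles <- function(acres) acres / 640

acres_to_sq_miles(500000)   #781.25, roughly the 780 quoted above
acres_to_sq_miles(4159334)  #the record 2020 total, about 6499 square miles
```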

[1] "Number of fire reports where the difference between report_ac and gis_acres is greater than 100: 1250"

There are many discrepancies between REPORT_AC and GIS_ACRES, and many of them come from the 2000s as well, so it is not the fault of the older data. There is less missing data for GIS_ACRES, so we will primarily depend on it. There is also the possibility of using REPORT_AC data if GIS_ACRES is missing, and vice versa.
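That fallback is a one-liner with dplyr's `coalesce`, shown here on toy rows (hypothetical acreages, not the real columns):

```r
library(dplyr)

#Toy rows where one of the two acreage columns is missing
toy <- data.frame(
  report_ac = c(100, NA, 250),
  gis_acres = c(98, 45, NA)
)

#Prefer gis_acres, fall back to report_ac when it is missing
toy$best_acres <- coalesce(toy$gis_acres, toy$report_ac)
toy$best_acres
```

`coalesce` takes the first non-NA value position by position, so swapping the argument order would instead prefer REPORT_AC.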

C_METHOD

get_mode <- function(vec) { 
  frequencies <- table(vec) %>% sort(decreasing=TRUE)
  strtoi(names(frequencies)[1])
} 

c_method_yr <- fires %>% 
  filter(!is.na(c_method) & c_method != 8) %>%
  group_by(year) %>%
  summarise(c_method = get_mode(c_method))
year <- fires %>% filter(year == "2020")
#Collection methods 
first_yr_c_method <- fires %>% select(year, c_method) %>% 
  drop_na() %>% arrange(year) %>% head()
total_c <- table(fires$c_method)
labels <- c("GPS Ground", "GPS Air", "Infrared", "Other Imagery", 
            "Photo Interpretation", "Hand Drawn", "Mixed Collection Tools",
            "Unknown")
df <- data.frame(Method = labels, count = as.vector(total_c))
df <- arrange(df, desc(count))

Here we can see the transition of the most frequently used collection method as time goes on.

OBJECTIVE
Table 11: Top 5 Biggest Wildfires in California since 1878
year fire_name gis_acres cause objective
2020 AUGUST COMPLEX FIRES 1032699.0 1 1
2002 BISCUIT 501082.0 1 1
2018 RANCH 410202.5 14 1
2020 SCU COMPLEX 396399.0 1 1
2020 CREEK 379882.2 14 1
Table 11: Top 5 Biggest Prescribed Burns since 1878
year fire_name gis_acres cause objective
2008 CLOVER 15788.590 1 2
2008 TEHIPITE 11648.092 1 2
1977 FERGUSON 10421.790 1 2
2005 COMB 9756.467 1 2
2010 SHEEP COMPLEX 9021.522 1 2

As we saw previously with the longest-lasting fires in recent history, most of them originated as wildfires. It is interesting that the cause of the biggest prescribed burns has in every case been lightning. In fact, among records with GIS_ACRES data, there are no prescribed burns in this data set that do not originate from lightning.
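That last claim is easy to verify as a count; here is the check on a hypothetical toy subset (in the real data, objective 2 marks prescribed burns and cause 1 is lightning, per the tables above):

```r
library(dplyr)

#Toy rows: objective 2 = prescribed burn, cause 1 = lightning
toy <- data.frame(
  objective = c(1, 2, 2),
  cause     = c(14, 1, 1),
  gis_acres = c(410202, 15788, NA)
)

#Among prescribed burns with acreage data, count non-lightning causes
non_lightning_rx <- toy %>%
  filter(objective == 2, !is.na(gis_acres), cause != 1) %>%
  nrow()
non_lightning_rx
```

Run against the full `fires` data frame, a result of 0 would confirm the pattern described above.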

#Are number of prescribed burns increasing? 
#resource holds the prescribed-burn subset (objective 2)
resource <- fires %>% filter(objective == 2)
yearly_resource_burns <- resource %>% 
  filter(!is.na(gis_acres) & gis_acres > 0) %>% 
  select(year, gis_acres) %>% 
  group_by(year) %>% 
  summarise(num_fires = n(), total_acres_burned = sum(gis_acres)) 

The number of acres burned in prescribed burns reached its lowest point in this last decade. At the same time, California has experienced some of its worst fires in that same decade. While correlation does not prove causation, poor fire management in the off season may explain the recent surge in fires.

Exploratory Data Analysis

# Number of acres burned 2010-2020
decade_fires <- fires %>% select(year, gis_acres) %>%
  filter(year >= "2010" & !is.na(gis_acres)) 
decade_fires_sum <- decade_fires %>% group_by(year) %>% 
  summarise(gis_acres = sum(gis_acres))
decade_fires_sum$gis_acres <- as.integer(decade_fires_sum$gis_acres)
ggplot(decade_fires_sum, aes(x=year, y=gis_acres)) + 
  geom_bar(stat="identity", fill="gray70") + 
  geom_text(aes(label=gis_acres)) + 
  xlab("Years") + ylab("Acres Burned") + 
  ggtitle("Number of Acres Burned 2010-2020") 

There is not much statistical analysis to do on arson here; it is just surprising how much land is burned every year due to this crime.

Conclusion

The biggest discovery is that the number of prescribed burns from 2010-2020 has never been lower in recent California fire history. This may be part of why we have seen such a big spike in fires recently. Also, R's ggplot makes it extremely difficult to customize your graph.