Introduction

California has had our largest fires in recorded history the last 2 years. Fires have had an effect on everyone in California weather it be due to air quality, or loss of property and loved ones. One school morning I woke up to a sky full of dark smoke, the smoke was so bad I could barely see down the block. Since that day in Chico there have been multiple other instances where smoke has blanketed the skies of California and ash has fallen from the skies. Smoke over the horizon is a terrible feeling as life as you know it can rapidly change without there being anything you can do about it. As a data science student, we are often just on our computer but we are learning complex skills which we can apply to the real world to make a difference. This project is a great opportunity to spend our time helping in the effort to solve current real world problems.

Goal

I will be exploring this fire perimeter dataset to learn more about what measures cal fire is taking to combat fires. I would like to see if I can tell if any of these measures have been successful or unsuccessful. In addition I would like to see what is causing these fires in California. I would like to see just how many acres are burning and what areas have been affected the most.

My initial data science questions are

What were some of the biggest fires in California history?
Which agencies respond to the most fires?
How long does it take for fires to be contained?
What causes the biggest fires?
Are certain causes of fires becoming more or less frequent?
Can the fire cause be predicted?
Is there a relationship between year and fire size?
Are no burn days effective in preventing human caused fires?
Can we calculate what troops are the busiest?
Is there a linear relationship between fire size and duration?

About the dataset

I am using the 2020 California Fire Perimeters data set available from gis.data.gov. A fire perimeter is the boundary of a fire measured by acres in this data set. This data set includes data from many different government agencies from different time periods. This data set includes prescribed burns and wildfire perimeters dating back to 1950 for CAL FIRE. Data from 2021 is not released until the fire season is over, which is on going at the time of this (Dec 18 2021). The United States Forest Service, USFS, has submitted records as far back as 1878 and the National Park Service, NPS, has submitted records from as far back as 1921. The types of fires included in this report has varied over the centuries which may add bias to the results. A couple important notes about the data is that from 1950 to 2001 it included USFS fires 10 acres and greater and CAL FIRE fires 300 acres and greater. BLM and NPS started inputting data since 2002 collecting fires 10 acres and greater. From 2002-2014 CAL FIRE expanded its criteria by including timber fires 10 acres or greater, brush fires 50 acres and greater, grass fires 300 acres and greater in size, wild land fires destroying 3 or more structures, and wild land fires causing 300,000 or more in damage. From 2014 and on the monetary requirement was dropped and the damage requirement is 3 or more habitable structures or commercial structures. In 1989 each unit was assigned to verify all 300 plus Acre fires from pre 1989 and as a result there is a statewide GIS layer from 1950-1999. Some errors that could occur when exploring this data is that duplicates may exist. For example, USFS and CAL FIRE could both capture the fire perimeter and submit it. In some cases they could even report different parameters of the same fire. While duplicate records is rare, there is an asterisk next to the cells that are the most accurate.

Explanation of variables

YEAR This set contains fire data from 1878 to 2020.
STATE While this data set is from CALFIRE there is some data from Oregon, Nevada, and Arizona. This data reports fires that occurred in both states, it does not specify the origin.
AGENCY Different services may respond to the fires depending on jurisdiction. These services provide data to Cal Fire as a courtesy Different values we may see are CDF for California Department of Forestry and Fire Protection (Cal Fire), United States Forest Service (USFS), Bureau of Land Management (BLM), National Park Service (NPS), Contract Country (CC), Other FED (Federal Fire Protection Agency).
UNIT_ID This is a series of digits to uniquely identify what units are responding to the fire.
FIRE_NAME Fires are often named for geographic location or nearby landmarks like roads, lakes, rivers, and mountains.
INC_NUM Number assigned by thee Emergency Command Center of the responsible agency for the fire.
ALARM_DATE The date the fire was brought to the attention of CALFIRE.
CONT_DATE The date the fire was contained. A fire is 100% contained when a perimeter has been formed around the fire that will prevent it from spreading beyond the line. To form the perimeter fire fighters may use trenches (normally 10-12 feet and shallow), natural barriers like rivers, or even already burned patches of land. Once a fire is contained it may still be burning but within the perimeter.
CAUSE An enumeration of values 1-19 for the reason the fire started. Enumeration 4 is campfire but enumeration 19 is illegal alien campfire which is confusing. Another confusing enumeration is 12 and 13 for firefighter training and non-firefighter training. Wouldn’t every fire that wasn’t firefighter training fall under this category of non-firefighter training. Interesting enumerations is 18 for escaped prescribed burn, 17 volcanic, 11 power line, 7 arson, 14 unknown, and 16 aircraft.
COMMENTS Miscellaneous comments that can provide more information about the fire.
REPORT_AC Estimated area consumed in fire reported in acres.
GIS_ACRES GIS is a geographic information system that uses information from satellites to make inferences. This is numerical data and the units are acres. Given the data is more complete for GIS_ACRES we will be working with this mostly.
C_METHOD The method used to collect perimeter data. C stands for collection here. This is a range of digits from 1-8 that can be GPS, infrared, photos, hand drawn, or mixed collection methods.
OBJECTIVE Either 1 suppression (wildfire) or 2 resource benefit (WFU). A WFU is allowing naturally ignited wild land fires like those started by lightning or lava to burn when in inaccessible terrain where people are not threatened. This is to avoid putting firefighters at risk and keep the land healthy.
FIRE_NUM This has no description and is unclear at the moment. It is probably a method used to identify fires. There is not much research on it either, this will mostly be ignored.
SHAPE_Length This is most likely GIS data. These map units are based on the coordinate system it could be square meters or something else. It could also be angular or linear.
SHAPE_Area The units are unknown.

Discovery:

[1] "The fire dataset dimensions are (21318, 17)"

missing_data <- c(1:ncol(fires))
for(col in 1:ncol(fires)) { 
  colname <- colnames(fires)[col]
  nans <- sum(is.na(fires[, col]))
  bad_strings <- c('', ' ', "UNKNOWN", "UKNOWN", "N/A")
  bad_strings_count <- sum(fires[, col] %in% bad_strings)
  zeros <- sum(fires[, col] <= 0)
  total <- 0
  if(!is.na(nans)) { 
    total <- nans 
  }
  if(!is.na(bad_strings_count)) { 
    total <- total + bad_strings_count
  }
  if(!is.na(zeros)) { 
    total <- total + zeros
  }
  missing_data[col] <- total
}
missing_data <- data.frame(colnames(fires), missing_data)
colnames(missing_data) <- c("Column", "Num NA's, '', or <= 0's")
missing_data

         Column Num NA's, '', or <= 0's
1          year                      77
2         state                       6
3        agency                      10
4       unit_id                      46
5     fire_name                   13416
6       inc_num                    1852
7    alarm_date                   10728
8     cont_date                   25360
9         cause                      48
10     comments                   36974
11    report_ac                   12551
12    gis_acres                       7
13     c_method                   12222
14    objective                     195
15     fire_num                    8123
16 shape_length                       0
17   shape_area                   21317

There are many ways to deal with missing data like ignoring it or changing them to the mean or median. It is important to not change the integrity of the data if you manipulate missing data. The majority of the NA’s are in columns report_ac and c_method which means I will use shape_area instead. The rest of the values mostly come from missing values like ’’. This does not matter so much for comments but is a concern for inc_num, alarm_date, cont_data, and fire_num. For fire_name, many of the “UNKNOWN” fire names may stem from small prescribed burns. This is something to keep in mind as I use this data and will check to see if those old values originate from the older data or not.

Brief summaries of every variable

YEAR

[1] "The range of years in the dataset is from 1878 to 2020"

This graph does not prove that more fires are occurring every year because we do not know how accurate reporting is in the early 1900s. CAL FIRE has data set going back to 1950 but USFS has data from 1878 in here. The two years with the most amount of data is 2017 and 2020 which has had the worse fires in recent history.

STATE

Table 1: Number of observations for each state
states	Freq
AZ	1
CA	21226
NV	73
OR	15

No fires that happened out of this state were under jurisdiction of Cal Fire which suggests these fires may have originated outside of California. These fires can be potentially misleading since the boundary can include both California and the other state. There is no way to determine what the boundary is for the California part of the fire.

AGENCY

Table 2: Number of reports by each agency
agency	num_fires
USF	9556
CDF	6563
CCO	3222
NPS	1011
BLM	652
DOD	163
LRA	111
FWS	22
BIA	12
OTH	1

Surprisingly the United States Forest Service has been in charge of more fires than CalFire. However, this is likely due to them reporting many smaller prescribed burns. I wonder who is in charge of more land, and how jurisdiction is delegated. Private (PVT) was an option in the official documentation but there appears to be no occurrences in this data set. It seems like the groups to pay the most attention to are BLM, CCO, CDF, LRA, NPS, and USF.

We continue to see that the National Park Service has jurisdiction over more large fires than CAL Fire. Despite the increase of fires in recent years, the number of contracted counties in charge of fires has decreased. However, the Bureau of Land Management and National Park Service have been increasing their aid in combating California Fires.

UNIT_ID

out_of_state_units <- fires %>% 
  select(state, unit_id) %>% 
  filter(state != "CA" & state != "" & !is.na(state) & unit_id != '') %>% 
  unique()
in_state_units <- fires %>% 
  select(state, unit_id) %>% 
  filter(state == "CA" & state != '' & !is.na(state) & unit_id != '') %>%
  unique()
#See if there are any units belonging to both states
cross_state_units <- c() 
for(row in 1:nrow(in_state_units)) { 
  unit <- in_state_units[row, "unit_id"]
  if(sum(out_of_state_units[, "unit_id"] == unit) > 0) { 
    cross_state_units <- c(cross_state_units, unit)
    }
}
cross_state_units_df <- fires %>% 
  filter(unit_id == cross_state_units & state != '' & !is.na(state)) %>% 
  select(state, agency, unit_id, gis_acres, fire_name) %>% 
  group_by(state, agency, unit_id) %>% 
  summarise(gis_acres_total = sum(gis_acres)) %>% 
  arrange(unit_id)
kable(cross_state_units_df, caption = 
        "The following units were responsible for fires in multiple states")

Table 3: The following units were responsible for fires in multiple states
state	agency	unit_id	gis_acres_total
CA	BLM	CCD	1392.92076
NV	BLM	CCD	25.46643
CA	USF	HTF	26751.67800
NV	USF	HTF	14267.09300
CA	USF	KNF	235463.13349
OR	USF	KNF	476.97339
CA	BLM	NOD	18492.59674
NV	BLM	NOD	17.40460
CA	USF	TMU	3092.50918
CA	USF	TOI	18235.83105
NV	USF	TOI	10366.60557

The reason that there are some units in charge of fires in multiple states is because they belong to federal organizations like United States Forest Service or Bureau of Land Management.

calfire_units_biggest_fire <- fires %>% 
  filter(agency == "CDF" & !is.na(gis_acres)) %>% 
  group_by(unit_id) %>% 
  slice(which.max(gis_acres)) %>% 
  select(year, agency, unit_id, fire_name, gis_acres) %>% 
  arrange(desc(gis_acres)) %>% 
  head(n=10)
kable(calfire_units_biggest_fire, caption="CAL FIRE Units that Responded to the Top 10 Biggest Fires in California History")

Table 4: CAL FIRE Units that Responded to the Top 10 Biggest Fires in California History
year	agency	unit_id	fire_name	gis_acres
2018	CDF	LNU	RANCH	410202.47
2020	CDF	SCU	SCU COMPLEX	396399.00
2007	CDF	SBC	ZACA	240358.70
2018	CDF	SHU	CARR	229651.41
1977	CDF	BEU	MARBLE-CONE	173333.45
2007	CDF	MVU	WITCH	162070.47
2018	CDF	BTU	CAMP	153335.56
1990	CDF	TGU	CAMPBELL	131504.22
2020	CDF	CZU	CZU LIGHTNING COMPLEX	86553.46
1985	CDF	SLU	LAS PILITAS	84271.42

I would like to calculate the busiest years for each unit, however an accurate way of calculating this is stopping me. Factors to consider are center, spread, range, count, and max fires for each unit every year. I need a function that will return the hardest year based off this data.

FIRE_NAME

#There's still other bad fire names in the data set
bad_fire_names <- c("UNKNOWN", "N/A", "", "UKNOWN", " ")
popular_fire_names <- fires %>%
  filter(!is.na(fire_name)) %>% 
  subset(!(fire_name %in% bad_fire_names)) %>% 
  group_by(fire_name) %>% 
  count(name="count") %>% 
  arrange(desc(count)) %>% 
  head(n=10)
kable(popular_fire_names, caption="10 Most Popular Names for Fires")

Table 5: 10 Most Popular Names for Fires
fire_name	count
CANYON	45
RIVER	39
LAKE	37
CREEK	35
COTTONWOOD	33
BEAR	32
RIDGE	30
RANCH	28
SPRING	28
PINE	27

biggest_fire_names <- fires %>% 
  arrange(desc(gis_acres)) %>% 
  select(year, fire_name, gis_acres) %>% 
  head(n=5)
kable(biggest_fire_names, caption="Top 5 Biggest Fires in California History and their Names")

Table 6: Top 5 Biggest Fires in California History and their Names
year	fire_name	gis_acres
2020	AUGUST COMPLEX FIRES	1032699.0
2002	BISCUIT	501082.0
2018	RANCH	410202.5
2020	SCU COMPLEX	396399.0
2020	CREEK	379882.2

[1] "A moment of condolences for anyone affected by the Camp Fire in Paradise or any other fire."

  fire_name year unit_id gis_acres alarm_date  cont_date
1      CAMP 2018     BTU  153335.6 2018-11-08 2018-11-26

Fire names are not unique and there are some fires that share the same name. There are many fire names that have been entered in incorrectly, that also contain back slashes or other gibberish.

ALARM_DATE

new_dates <- fires
new_dates$alarm_date <- as.Date(new_dates$alarm_date, "%Y/%m/%d")
new_dates$cont_date <- as.Date(new_dates$cont_date, "%Y/%m/%d")
busiest_days <- new_dates %>% 
  filter(!is.na(alarm_date)) %>% 
  group_by(alarm_date) %>% 
  summarise(num_fires = n(), 
            mean = mean(gis_acres), 
            median = median(gis_acres), 
            sd=sd(gis_acres)) %>% 
  arrange(desc(num_fires)) %>% 
  head(n=5)
kable(busiest_days, caption="Top 5 Number of Daily Fires Since 1878")

Table 7: Top 5 Number of Daily Fires Since 1878
alarm_date	num_fires	mean	median	sd
2008-06-21	122	5178.631	245.33860	16386.395
1987-08-30	61	8144.967	1932.24430	13913.601
2003-09-03	56	515.189	58.68177	1629.138
2015-07-30	50	4250.587	389.58737	10438.913
1999-08-23	47	7053.712	1571.95800	19095.666

It is hard to imagine that 122 fires are logged on June 6 even if it is a summer day. Before I analyze more time’s, lets find out the meaning behind this. One explanation is it could be a bunch of controlled burns. Many of these did share the same inc_number. However, many had different names and containment dates. While the primary cause on July 30, 2015 for lots of fires was smoking, on July 6, 2008 it was just lots of lightning.

CONT_DATE

format <- "%Y/%m/%d %H:%M:%S %z"
longest_fires <- new_dates %>% 
  select(fire_name, alarm_date, cont_date, cause, objective) %>% 
  filter(!is.na(alarm_date) & !is.na(cont_date)) %>% 
  mutate(fire_duration = 
           difftime(cont_date, alarm_date, format, units="days")) %>% 
  arrange(desc(fire_duration)) %>% 
  head(n=10)
kable(longest_fires, caption="Longest Lasting Fires in California History Since 1878")

Table 8: Longest Lasting Fires in California History Since 1878
fire_name	alarm_date	cont_date	cause	objective	fire_duration
PIUTE	1990-08-13	1999-09-14	1	1	3319 days
RALSTON	2006-09-05	2007-09-17	9	1	377 days
WESTLAKE	2006-07-08	2007-07-09	2	1	366 days
LAVAL	2012-06-03	2013-06-03	14	1	365 days
GORGE	2018-08-19	2019-08-19	14	1	365 days
PANTHER	2013-05-01	2013-12-09	9	1	222 days
POWERHOUSE	2013-05-30	2013-12-18	9	1	202 days
LAKE	2015-06-17	2015-12-31	9	1	197 days
ROSASCO	2019-05-23	2019-12-02	9	1	193 days
PONY	2016-06-07	2016-12-15	1	1	191 days

In the data set there are some alarm_dates and cont_dates entered incorrectly, like the cont_date occurring before the alarm_date. Hence, there is a possibility that these results have also been entered in incorrectly. The reason that some of these fires last so long, is that fire responders allow it, because it is in a rural area and fires can be beneficial. The cause for most of these fires is lightning, unknown, or miscellaneous.

[1] “Fire Duration (Hours) Summary Statistics” Min. 1st Qu. Median Mean 3rd Qu. Max. 24 24 72 368 288 8760

A problem is that there are so many missing values in cont_date there are very few fires you can find the fire duration for relative to the whole data set. Among those available, there is a huge variance in the duration of fires. These shorter lasting fires are not just prescribed burns and have many different causes.

CAUSE

Table 9: Causes of the Biggest Fires on Average since 1878
cause	string_cause	avg_size	count
4	Campfire	3530.47878	380
1	Lightning	2969.34811	3454
16	Aircraft	2474.22459	14
9	Miscellaneous	2285.43451	3379
7	Arson	2200.11588	903
10	Vehicle	1534.41743	454
14	Unknown/Unidentified	1496.36036	9541
2	Equipment Use	1460.92934	1246
11	Power Line	1274.56765	412
6	Railroad	1169.29763	80
3	Smoking	1048.65940	342
13	Non-Firefighter Training	1013.68027	11
5	Debris	875.22631	723
8	Playing with Fire	657.54545	196
18	Escaped Prescribed Burn	623.48734	90
12	Firefighter Training	594.01999	5
15	Structure	391.06574	21
19	Illegal Alien Campfire	83.46374	17

As I previously mentioned, the reason some fires duration is so long is because they are in remote areas and allowed to continue to burn. Despite this, human induced fires like campfires, still produce bigger fires on average than nature induced fires like lightning. In the next sections we will explore if certain causes of fires are becoming more or less frequent.

COMMENTS

The max length of a comment is 260 characters, comments are longer but they are cut off by an * indicating there is more to that comment somewhere. By reading comments interesting ones to me were “The cause was target shooting”, “… Total Cost 18,600,600”, “Children playing with fire”, and names of people. Most of the comments in the data set are empty.

GIS_ACRES

Table 10: Highest Acres Burnage in California History since 1878
year	gis_acres	square_miles
2020	4159334.0	6498.959
2018	1590431.2	2485.049
2017	1424559.4	2225.874
2008	1382462.2	2160.097
2007	1040224.3	1625.350
2003	970479.3	1516.374
2002	963898.5	1506.091
1987	862910.9	1348.298
2012	847714.6	1324.554
1999	801137.0	1251.777

For reference 500,000 acres is equal to 780 square miles, a square mile being a square with each side being 1 mile in length!!! The size of Yosemite national park is 1169 square miles, and the size of New York is 302 square miles.

[1] "Number of fire reports where the difference between report_ac and gis_acres is greater than 100: 1250"

There are many differences between REPORT_AC and GIS_ACRES. Many of the discrepancies come from the 2000s as well so it is not the fault of the older data. There is less missing data for GIS_ACRES so we will primarily be depending on this. There is the possibility of using REPORT_AC data if GIS_ACRES is missing and vice versa.

C_METHOD

get_mode <- function(vec) { 
  frequencies <- table(vec) %>% sort(decreasing=TRUE)
  strtoi(names(frequencies)[1])
} 

c_method_yr <- fires %>% 
  filter(!is.na(c_method) & c_method != 8) %>%
  group_by(year) %>%
  summarise(c_method = get_mode(c_method))
year <- fires %>% filter(year == "2020")

#Collection methods 
first_yr_c_method <- fires %>% select(year, c_method) %>% 
  drop_na() %>% arrange(year) %>% head()
total_c <- table(fires$c_method)
labels <- c("GPS Ground", "GPS Air", "Infared", "Other Imagery", 
            "Photo Interpretation", "Hand Drawn", "Mixed Collection Tools",
            "Unknown")
df <- data.frame(Method = labels, count = as.vector(total_c))
df <- arrange(df, desc(count))

Here we can see the transition of the most frequently used collection method as time goes on.

OBJECTIVE

Table 11: Top 5 Biggest Wildfires in California since 1878
year	fire_name	gis_acres	cause	objective
2020	AUGUST COMPLEX FIRES	1032699.0	1	1
2002	BISCUIT	501082.0	1	1
2018	RANCH	410202.5	14	1
2020	SCU COMPLEX	396399.0	1	1
2020	CREEK	379882.2	14	1

Table 11: Top 5 Biggest Prescribed Burns since 1878
year	fire_name	gis_acres	cause	objective
2008	CLOVER	15788.590	1	2
2008	TEHIPITE	11648.092	1	2
1977	FERGUSON	10421.790	1	2
2005	COMB	9756.467	1	2
2010	SHEEP COMPLEX	9021.522	1	2

As we saw previously with the longest lasting fires in recent history most of them originated from wildfires. It is interesting that the cause of the biggest prescribed fires have all been lightning. In fact, there are no prescribed burns that do not originate from lightning in this data set that has GIS_ACRES data.

#Are number of prescribed burns increasing? 
yearly_resource_burns <- resource %>% 
  filter(!is.na(gis_acres) & gis_acres > 0) %>% 
  select(year, gis_acres) %>% 
  group_by(year) %>% 
  summarise(num_fires = n(), total_acres_burned = sum(gis_acres))

The number of acres burned in prescribed burns has never been lower until this last decade. At the same time, California has experienced some of the worst fires in the last decade. While correlation does not prove causation, poor fire management in the off season may explain the recent surge in fires.

Exploratory Data Analysis

# Number of acres burned 2010-2020
decade_fires <- fires %>% select(year, gis_acres) %>%
  filter(year >= "2010" & !is.na(gis_acres)) 
decade_fires_sum <- decade_fires %>% group_by(year) %>% 
  summarise(gis_acres = sum(gis_acres))
decade_fires_sum$gis_acres <- as.integer(decade_fires_sum$gis_acres)
ggplot(decade_fires_sum, aes(x=year, y=gis_acres)) + 
  geom_bar(stat="identity", fill="gray70") + 
  geom_text(aes(label=gis_acres)) + 
  xlab("Years") + ylab("Acres Burned") + 
  ggtitle("Number of Acres Burned 2010-2020")

Arson does not have much statistical analysis, it is just surprising how much land is burned every year due to this crime.

Conclusion

The biggest discovery is that the number of prescribed burns has never been lower from 2010-2020 in recent California Fire History. This may be cause for why we have seen such a big spike in fires in recent history. Also, R’s ggplot makes it extremely difficult to customize your graph.