Flood Prediction for Small Rivers and Streams with Limited Data

Collin Barnwell (barnwell.collin@gmail.com)

Northwestern University

EECS 349 - Machine Learning: Final Project

Synopsis

Task

I used machine learning techniques to predict the daily discharge of small rivers from daily time series of precipitation and discharge up to the day before the prediction, plus the precipitation on the day being predicted.

I also attempted to create a model that can be used to predict discharge in a river after training exclusively on other rivers.

Motivation

Existing stream discharge models are very complex and involve modeling evaporation rates, soil storage potential, climate, and many other variables. These models need a lot of input data - everything from varying soil types and plant cover across the watershed to evaporation rates while the water is flowing must be known.

These models are not very portable: to build an effective model for a river, the geology and biology of its watershed must be well understood.

My model, which uses only stream discharge and precipitation, is much easier to apply to rivers and streams in areas that have not been studied as thoroughly.

Summary of Results

I used a linear regression model to accomplish this task with decent success. I was able to predict discharge both training on the river itself, and training on rivers with very similar geography.

Report

Data

I got my data from the US Geological Survey Water Information System (discharge) and the National Oceanic and Atmospheric Administration (precipitation). Both came in the form of time series CSV data. I processed it in Python so that I would have features associated with specific dates. I also had to make the dates for the two data sources line up, and check for long periods with null values. I used data for 6 rivers and about 60,000 total days. For each day, I created the following features:

P – Precipitation today

Tminus1P – Precipitation on previous day

Tminus1Q – Discharge on previous day

Tminus2P – Precipitation two days ago

Tminus2Q – Discharge two days ago

weekBeforeP – Total precipitation the week before Tminus2

weekBeforeQ – Average discharge on the week before Tminus2

weekBeforeBeforeP – etc.

weekBeforeBeforeQ

monthBeforeP

monthBeforeQ

monthBeforeBeforeP

monthBeforeBeforeQ

fourMonthsBeforeP

fourMonthsBeforeQ

fourMonthsBeforeBeforeP

fourMonthsBeforeBeforeQ
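A minimal sketch of how features like these could be built from two aligned daily series follows. It is an illustration, not the original processing script. The function name build_features, the window lengths (7, 30 and 120 days), and the reading of "BeforeBefore" as the same-length window immediately preceding the "Before" window are all assumptions.

```python
from statistics import mean


def build_features(precip, discharge, day):
    """Build the feature dict for index `day` from two aligned daily series.

    precip and discharge are equal-length lists ordered by date.
    Returns None when there is not enough history for the longest window.
    """
    # Window lengths in days; "four months" is taken as 120 days (assumption).
    windows = [("week", 7), ("month", 30), ("fourMonths", 120)]
    # The oldest window reaches back 2 + 2 * 120 days before `day`.
    needed = 2 + 2 * windows[-1][1]
    if day < needed:
        return None

    feats = {
        "P": precip[day],
        "Tminus1P": precip[day - 1],
        "Tminus1Q": discharge[day - 1],
        "Tminus2P": precip[day - 2],
        "Tminus2Q": discharge[day - 2],
    }
    for name, length in windows:
        # "Before": the `length` days ending just before Tminus2.
        end = day - 2
        start = end - length
        feats[f"{name}BeforeP"] = sum(precip[start:end])
        feats[f"{name}BeforeQ"] = mean(discharge[start:end])
        # "BeforeBefore": the same-length window right before that one
        # (assumed interpretation).
        feats[f"{name}BeforeBeforeP"] = sum(precip[start - length:start])
        feats[f"{name}BeforeBeforeQ"] = mean(discharge[start - length:start])
    return feats
```

Precipitation windows are summed and discharge windows are averaged, matching the "total precipitation" and "average discharge" wording in the list.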

I then normalized all of the features (also using Python) by calculating their Z-scores. I did this to make the rivers more similar to each other, which makes it easier to predict one river's discharge after training on data from another river.
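As a rough illustration of the Z-score step, the sketch below standardizes each feature column using the mean and standard deviation of a single river's rows. Normalizing per river (rather than over all rivers pooled together), using the population standard deviation, and the name zscore_columns are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each feature column of one river's rows (a list of dicts).

    Returns the scaled rows and the (mean, std) used for each feature.
    """
    names = list(rows[0].keys())
    stats = {}
    for name in names:
        col = [r[name] for r in rows]
        mu, sigma = mean(col), pstdev(col)
        # A constant column has zero spread; use 1.0 to avoid dividing by zero.
        stats[name] = (mu, sigma if sigma > 0 else 1.0)
    scaled = [
        {n: (r[n] - stats[n][0]) / stats[n][1] for n in names} for r in rows
    ]
    return scaled, stats
```

After this step, a value of 2.0 means "two standard deviations above this river's average", so a large stream and a small one end up on a comparable scale.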

My Solution, Training and Testing Procedures

Two problems interested me, and each required a different training and testing procedure.

Predicting discharge in a river after training on its own past data - This would be useful for places where basic flow and rainfall data was available for a river, but not enough geological or biological data for a physical model to be effective.

Predicting discharge in a river after training on other rivers - This would be much more helpful, because it wouldn't require any learning time after a discharge sensor is first installed in a river. It is also much more difficult, as the learner has to compensate for not knowing the differences in climate and geography between the rivers. This problem can be partially solved by training only on similar streams in similar areas.

After experimenting with several different machine learning techniques, my best model was a linear regression for both problems.

Choosing Features

In solving the first problem for most of the rivers, a linear regression using just Tminus1Q, Tminus2Q and P was almost as good as anything with any more attributes. I included both models in my results below.
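To make the reduced model concrete, here is a small ordinary least squares sketch in plain Python. The actual analysis was done in Weka, so this is only an illustration of the technique: the function names fit_linear and predict are hypothetical, and each input row is assumed to hold the three reduced features in the order [Tminus1Q, Tminus2Q, P].

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    X is a list of feature rows, e.g. [Tminus1Q, Tminus2Q, P].
    A constant 1.0 is appended to each row for the intercept.
    Returns the weights, with the intercept last.
    Assumes the features are not perfectly collinear.
    """
    A = [list(row) + [1.0] for row in X]
    n = len(A[0])
    # Build the augmented matrix [X^T X | X^T y].
    M = [
        [sum(a[i] * a[j] for a in A) for j in range(n)]
        + [sum(a[i] * t for a, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(weights, row):
    """Predicted discharge for one feature row."""
    return sum(w * x for w, x in zip(weights, list(row) + [1.0]))
```

With only three inputs, the fitted weights can be read directly: they show how much of today's discharge is explained by the previous two days of discharge and by today's precipitation.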

Results

The results table shows my 10-fold cross-validation error percentages for my model. In the first column, the linear regression is trained on the same river it is tested on, and it uses all features (see "Data"). The middle column is the same, except that the model is built using a smaller feature set (see "Choosing Features"). The third column covers two runs: one where the model is trained on the stream in Lena, MS, and tested on the stream in Merigold, MS, and one with the two rivers switched.
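The two evaluation procedures can be sketched as follows. This is a hedged illustration rather than a reproduction of the Weka runs: the error measure is modeled on the relative absolute error that Weka reports as a percentage, which may not be the exact figure used in the table, the folds are contiguous rather than shuffled, and all function names are hypothetical. The fit and predict arguments stand for any model with that calling shape.

```python
def relative_absolute_error(actual, predicted, train_mean):
    """Percent error relative to always predicting the training mean.

    Assumes the actual values are not all equal to train_mean.
    """
    num = sum(abs(a - p) for a, p in zip(actual, predicted))
    den = sum(abs(a - train_mean) for a in actual)
    return 100.0 * num / den


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, test


def cross_validate(X, y, fit, predict, k=10):
    """Average percent error over k folds of a single river's data."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        y_train = [y[i] for i in train]
        model = fit([X[i] for i in train], y_train)
        preds = [predict(model, X[i]) for i in test]
        mean_y = sum(y_train) / len(y_train)
        errors.append(
            relative_absolute_error([y[i] for i in test], preds, mean_y)
        )
    return sum(errors) / k


def train_on_one_test_on_other(Xa, ya, Xb, yb, fit, predict):
    """Fit on river A, then report percent error on river B."""
    model = fit(Xa, ya)
    preds = [predict(model, row) for row in Xb]
    return relative_absolute_error(yb, preds, sum(ya) / len(ya))
```

Under this measure, 100 means the model does no better than predicting the average discharge, and 0 means a perfect prediction.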

The results were much better in streams that never ran dry (FL, MS and CA), and best in wetter areas (FL, MS).

I was very impressed with the quality of the prediction. Unfortunately, I got wild results when I tried to compare streams that weren't extremely similar geographically (even when comparing the two dry streams, or the wet streams in MS to the slower-moving stream in FL). Still, this result could potentially be very useful, as it would only require prior data for one stream in an area to predict the behavior of other streams after sensors are installed.

Other Approaches

Early in the project, I planned to implement a neural network in Python, but I wasn't confident that a neural network would be the best approach. I preferred to try a multitude of approaches and find out what worked rather than work hard to implement and fine-tune a single algorithm that wouldn't necessarily be the best for the problem. For this reason, I chose to do my actual machine learning analysis in Weka.

I think moving away from neural networks turned out to be a good call. I only ever achieved mediocre results with the multi-layer perceptron algorithm in Weka.

Using Weka, I also explored nearest neighbor, which performed nearly as well as a linear regression for some rivers, but much worse in others. In a later iteration of the project, nearest neighbor could potentially be used with multiple linear regression models to help find similar rivers (for Problem 2) and weight the linear regression models accordingly.

Limitations

There are a few things that my model did not address that could improve its usefulness, beyond just improving accuracy across the board. These are a few ways my model could be improved:

The fact that my model doesn't weight past precipitation very highly is a little bit concerning, as precipitation is definitely a major cause of water level change. I would like to find a way to address this.

Predicting discharge in larger rivers with larger watersheds - I stuck to smaller rivers so that I only needed one precipitation location. Some watersheds are big enough that there could be precipitation in one area of the watershed and none in another, and both areas affect discharge.

Weighting the importance of certain scenarios - i.e., it's much worse to predict a low water level during a destructive flood than it is to predict an average water level when the water is low.

Along those same lines, I could change the output space to be more relevant to emergency situations and possibly improve accuracy when it matters most - i.e., a binary output: Will there be a destructive flood today?

Areas with snow - I did not include any attributes that might help the model account for snow melt, and I only tested on rivers in warm areas for this project.
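Two of the ideas in this list, a binary flood output and a heavier penalty for missing a flood, can be sketched in a few lines. This is only an illustration of the idea and was not part of the project. The flood threshold, the flood_weight value and both function names are hypothetical; a real threshold would have to come from local flood-stage data.

```python
def flood_labels(discharge, threshold):
    """Binary output: 1 if the day's discharge reaches the flood threshold."""
    return [1 if q >= threshold else 0 for q in discharge]


def weighted_abs_error(actual, predicted, threshold, flood_weight=10.0):
    """Mean absolute error where under-predicting a flood day costs more.

    A day counts as a missed flood when the actual discharge is at or above
    the threshold and the prediction is below the actual value.
    """
    total = 0.0
    for a, p in zip(actual, predicted):
        weight = flood_weight if (a >= threshold and p < a) else 1.0
        total += weight * abs(a - p)
    return total / len(actual)
```

For example, with a threshold of 5.0, under-predicting a discharge of 10.0 by 2.0 counts ten times as much as over-predicting a discharge of 1.0 by the same amount.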

Conclusion

I was impressed with my results, especially that I was able to achieve the same level of accuracy when training on one river and testing on a similar one. This is the part of the project I would be most interested in continuing to investigate.

I learned a lot working on the project and was able to explore something that interests me. I believe there is a lot of potential for machine learning to be applied to the broader environmental science community and that machine learning could possibly be used for predicting other natural disasters. From what I can tell, this is an area that has not been researched widely.