I used machine learning techniques to predict the daily discharge of small rivers from daily time series of precipitation and discharge up to the day before the prediction, plus the precipitation on the day being predicted.
I also attempted to create a model that can be used to predict discharge in a river after training exclusively on other rivers.
Existing stream discharge models are very complex and involve modeling evaporation rates, soil
storage potential, climate, and many other variables. These models need a lot of input data: everything
from the soil types and plant cover across the watershed to evaporation rates while the water is flowing
must be known.
The models are not very portable because for an effective model to be built for a river, the geology and
biology of a watershed must be well understood.
My model, which uses only stream discharge and precipitation, is much easier to apply to
rivers and streams in areas that have not been studied as thoroughly.
I used a linear regression model to accomplish this task with decent success. I was able to predict discharge both when training on the river itself and when training on rivers with very similar geography.
I got my data from the US Geological Survey Water Information System (discharge) and the National Oceanic and Atmospheric Administration (precipitation). Both came in the form of time series CSV data. I processed it in Python so that I would have features associated with specific dates. I also had to make the dates for the two data sources line up, and check for long periods with null values. I used data for 6 rivers and about 60,000 total days. For each day, I created the following features:
P – Precipitation today
Tminus1P – Precipitation on previous day
Tminus1Q – Discharge on previous day
Tminus2P – Precipitation two days ago
Tminus2Q – Discharge two days ago
weekBeforeP – Total precipitation the week before Tminus2
weekBeforeQ – Average discharge on the week before Tminus2
weekBeforeBeforeP – etc
weekBeforeBeforeQ
monthBeforeP
monthBeforeQ
monthBeforeBeforeP
monthBeforeBeforeQ
fourMonthsBeforeP
fourMonthsBeforeQ
fourMonthsBeforeBeforeP
fourMonthsBeforeBeforeQ
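To make the feature definitions concrete, here is a minimal Python sketch of how the shorter windows could be computed for one day. It is an illustration, not the code I used in the project. The function name `build_features` and the exact window boundaries are assumptions: "the week before Tminus2" is read as the seven days ending three days before the predicted day, and weekBeforeBefore as the seven days before that. The month and four-month features would follow the same pattern and are left out because their exact offsets are not spelled out above.

```python
from statistics import mean


def build_features(precip, discharge, t):
    """Return the short-window features for day index t.

    precip and discharge are equal-length lists of daily values, oldest first.
    Requires t >= 16 so that every window below lies inside the series.
    """
    if t < 16:
        raise ValueError("need at least 16 days of history before day t")
    # Assumed reading of "week before Tminus2": days t-9 .. t-3 (7 days).
    week = slice(t - 9, t - 2)
    # Assumed reading of "weekBeforeBefore": days t-16 .. t-10 (7 days).
    week_before = slice(t - 16, t - 9)
    return {
        "P": precip[t],
        "Tminus1P": precip[t - 1],
        "Tminus1Q": discharge[t - 1],
        "Tminus2P": precip[t - 2],
        "Tminus2Q": discharge[t - 2],
        "weekBeforeP": sum(precip[week]),            # total precipitation
        "weekBeforeQ": mean(discharge[week]),        # average discharge
        "weekBeforeBeforeP": sum(precip[week_before]),
        "weekBeforeBeforeQ": mean(discharge[week_before]),
    }
```

Note that precipitation windows are summed while discharge windows are averaged, matching the descriptions in the list.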
I then normalized all of the features (also using Python) by calculating their Z-scores. I did this to make the rivers more similar to each other, so that it would be easier to predict one river's discharge after training on data from another river.
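A Z-score replaces each value with its distance from the mean, measured in standard deviations. The sketch below shows the calculation for a single feature column. It assumes the normalization is done separately for each river and uses the population standard deviation; both are assumptions made for the example, and the helper name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (v - mean) / standard deviation for every value in the column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Because each river is scaled by its own mean and spread, a large river and a small creek end up on the same unitless scale, which is what makes training on one and testing on another plausible.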
There were two different problems that were interesting to me that required training and testing in different ways.
For most of the rivers, a linear regression on the first problem using just Tminus1Q, Tminus2Q and P was almost as good as one using any more attributes. I included both models in my results below.
Above are my 10-fold cross-validation error percentages for my model. In the first column, the linear regression is trained on the same river it is tested on, and it uses all features (see "The Data"). The middle column is the same, except the model is built using a smaller feature set (see "Choosing Features"). In the third column, one run is done where the model is trained on the stream in Lena, MS, and tested on the stream in Merigold, MS, and another run is done with the two rivers switched.
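For readers who want to see what the small model amounts to, here is a dependency-free Python sketch of an ordinary least-squares fit on three inputs (Tminus1Q, Tminus2Q, P). The actual analysis was done in Weka, so this is a stand-in under that assumption, not a reproduction of Weka's implementation; the function names `solve`, `fit_linear` and `predict` are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Least-squares fit via the normal equations; returns [intercept, w1, ...]."""
    xs = [[1.0] + list(r) for r in rows]  # prepend a constant for the intercept
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    """Predicted discharge for one row of (Tminus1Q, Tminus2Q, P)."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))
```

Each row passed to `fit_linear` would be one day's `(Tminus1Q, Tminus2Q, P)` tuple and each target that day's discharge. Solving the normal equations directly is fine for three well-scaled features; a library solver would be the safer choice for the full feature set.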
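The two evaluation setups can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what Weka does internally: the folds here are contiguous blocks of days, the error measure is a plain mean absolute error rather than the error percentage reported in my results, and `fit` and `predict` are whatever model functions the caller supplies. All function names are hypothetical.

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_abs_error(predicted, actual):
    """Average absolute difference (an illustrative metric, not Weka's)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def cross_validate(rows, targets, fit, predict, k=10):
    """Same-river setup: train on k-1 folds, test on the held-out fold, average."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train_x = [r for i, r in enumerate(rows) if i not in held_out]
        train_y = [y for i, y in enumerate(targets) if i not in held_out]
        model = fit(train_x, train_y)
        preds = [predict(model, rows[i]) for i in fold]
        scores.append(mean_abs_error(preds, [targets[i] for i in fold]))
    return sum(scores) / len(scores)


def cross_river(train, test, fit, predict):
    """Cross-river setup: fit on one river's (rows, targets), score on another's."""
    model = fit(*train)
    test_x, test_y = test
    return mean_abs_error([predict(model, r) for r in test_x], test_y)
```

Running `cross_river` twice with the two rivers swapped corresponds to the two runs described for the third column.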
The results were much better in streams that never ran dry (FL, MS and CA), and best in wetter areas (FL, MS).
I was very impressed with the quality of the prediction. Unfortunately, I got wild results when I tried to compare streams that weren't extremely similar geographically (even when comparing the two dry streams, or the wet streams in MS to the slower-moving stream in FL). Still, this result could potentially be very useful, as it would only require prior data for one stream in an area to predict the behaviors of other streams after sensors are installed.
Early in the project, I planned to implement a neural network in Python, but I wasn't confident enough that a neural network would be the best approach. I preferred to try a multitude of approaches and find out what worked rather than work hard to implement and fine-tune a single algorithm that wouldn't necessarily be the best for the problem. For this reason, I chose to work in Weka to do my actual machine learning analysis.
I think moving away from neural networks turned out to be a good call. I only ever achieved mediocre results with the multi-layer perceptron algorithm in Weka.
Using Weka, I also explored nearest neighbor, which performed nearly as well as a linear regression for some rivers, but much worse for others. In a later iteration of the project, nearest neighbor could potentially be used with multiple linear regression models to help find similar rivers (for Problem 2) and weight the linear regression models accordingly.
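As a reference point for what nearest neighbor does on this kind of data, here is a minimal Python sketch of k-nearest-neighbor regression: the prediction for a day is the average discharge of the k most similar days in the training set. It assumes Euclidean distance on the normalized features and an unweighted average, which may differ from the Weka settings I used; the function name `knn_predict` is hypothetical.

```python
import math


def knn_predict(train_rows, train_targets, query, k=3):
    """Average the targets of the k training rows closest to query."""
    nearest = sorted(
        zip(train_rows, train_targets),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

Unlike a linear regression, this method cannot extrapolate beyond discharges it has already seen, which is one possible reason for uneven results across rivers.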
There are a few things that my model did not address that could improve its usefulness, beyond just improving accuracy across the board. These are a few ways my model could be improved:
I was impressed with my results, especially that I was able to achieve the same level of accuracy when training on one river and testing on a similar one. This is the part of the project I would be most interested in continuing to investigate.
I learned a lot working on the project and was able to explore something that interests me. I believe there is a lot of potential for machine learning to be applied to the broader environmental science community and that machine learning could possibly be used for predicting other natural disasters. From what I can tell, this is an area that has not been researched widely.