K-means Clustering for Old Faithful Geyser Eruptions Analysis

Kasumi Gunasekara
8 min read · Dec 24, 2018


A simple Machine Learning approach to Geology

Geyser Eruption in Yellowstone, Wyoming, USA

Environmental concerns are central to every industry today because of natural phenomena, resource exploration, pollution and the scarcity of resources. Machine Learning can be a useful tool in geology, meteorology and other areas of environmental research for resource estimation and prediction.

Scenario, ’cause it’s interesting!

A natural geyser is a spring that intermittently discharges hot water, heated inside the Earth, at the Earth’s surface. Geysers result from the heating of groundwater by shallow bodies of magma, which are usually found in areas with a history of volcanic activity. The spouting action is caused by the release of pressure from water boiling at considerable depth beneath the geyser, inside a narrow channel.

This pressure is released to the atmosphere through a conduit that runs from the origin point to the Earth’s surface. As the pressure drops, the water at depth exceeds its boiling point and flashes into steam, which drives more water up through the conduit and reduces the pressure further.

The mechanism of a geyser

The boiling temperature of water increases with the prevailing pressure, and therefore with depth inside the geyser. The geothermal power that can be generated from the steam depends on the volcanic heat sources that produce the geyser. When the ejected water cools, dissolved silica (silicon dioxide) precipitates on the surface.

Geysers are rare. Most of them are in Yellowstone, USA; others are in Russia, New Zealand and Iceland, also in volcanic areas. Old Faithful, the most famous geyser in Yellowstone National Park, Wyoming, USA, spouts boiling water and steam to a height of 30–55 meters.

This cone-type geyser earned the name Faithful in 1870 because it erupted faithfully every 63–70 minutes, although continuous observation has revealed that current eruptions occur irregularly and vary in duration. Moreover, the duration has become less predictable since the Borah Peak earthquake in Idaho in 1983 and other earthquakes in the area.

An eruption in Old Faithful geyser, USA

Observation of natural geysers by the community has been driven by several significant factors relating to their existence and behavior:

  • Geysers are treated as a model for volcanic eruptions: despite the difference in scale, they help to identify how an eruption develops and propagates to the Earth’s surface and into the atmosphere.
  • Geysers are a geothermal source, a gateway to hot water from inside the Earth, and can serve as a model for geothermal energy and its applications.
  • Geyser eruptions are natural phenomena that can reveal how energy propagates inside the Earth, as well as the mechanism and causes behind the geysers themselves.

Geothermal energy is a renewable, eco-friendly power source that does not contribute to environmental pollution, a serious issue these days. Observations of geysers can help in understanding their behavior and applying it to geothermal energy as a supplement where power is scarce. Moreover, the considerable energy released by geyser eruptions could itself be harnessed as an energy source, provided this is done without disturbing the geysers.

Geyser eruption data can also be used to analyze and identify other natural phenomena, including global warming and earthquakes. Such analysis could provide more detail about, and predictions of, how the environment will respond to ongoing human activities.

Therefore, scientists, geologists and government authorities have taken an interest in geysers and have observed and analyzed them and their eruptions over several decades. Old Faithful is one of the most studied and observed geysers, although it is not the largest geyser currently active. In this article, Old Faithful data is analyzed with a machine learning approach to support observations aimed at predicting the future behavior of not only geysers, but also the Earth.

  • The Old Faithful data set used here was obtained from:

https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat

https://www.picostat.com/dataset/r-dataset-faithful

  • The data set contains the waiting time between two consecutive eruptions (in minutes) and the duration of the eruption (in minutes) for the Old Faithful geyser.
  • There are 272 observations of Old Faithful eruptions.
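
As a rough sketch of how such two-column data might be read in Python, the snippet below parses whitespace-separated rows into (duration, waiting) records. The `Eruption` type, the `parse_rows` helper, the column order and the sample rows are all assumptions made for illustration; the actual file layout at the URLs above may differ.

```python
from typing import List, NamedTuple


class Eruption(NamedTuple):
    duration: float  # eruption duration in minutes
    waiting: int     # waiting time between eruptions in minutes


def parse_rows(text: str) -> List[Eruption]:
    """Parse lines whose last two fields are 'duration waiting'.

    Lines that do not end in two numeric fields (headers, comments)
    are skipped. The column order is an assumption.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            rows.append(Eruption(float(parts[-2]), int(parts[-1])))
        except ValueError:
            continue  # not a data row
    return rows


# Illustrative sample in the assumed layout, not the full 272-row data set.
sample = """eruptions waiting
3.600 79
1.800 54
3.333 74
"""
rows = parse_rows(sample)
print(len(rows), rows[0])
```

Taking the last two fields of each line means an optional leading row-index column would simply be ignored.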

Here comes K-means Clustering

Clustering is an unsupervised learning technique in Machine Learning that aims to identify interesting patterns in data by grouping objects so that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is measured using either Euclidean distance or correlation-based distance. Clustering can be performed on samples, to find groups of similar observations based on their features, or on features, to find groups of similar features based on the samples.

Clustering Mechanism

K-means is a centroid-based, iterative clustering algorithm that partitions data into K non-overlapping subgroups. It starts from a random initialization and then iterates: each data point is compared with every centroid and assigned to the nearest one, and each centroid is then recomputed from the points assigned to it. The distance between a point and a centroid is measured with Euclidean distance.

Applying K-means Algorithm to Old Faithful Geyser Data

The data set has 272 observations of the geyser, each with a waiting time and an eruption duration. There are therefore two features to consider:

1. Waiting time between two consecutive eruptions (integer)

2. Duration of the eruption (floating-point)

The following table lists 20 data pairs selected to demonstrate how the K-means algorithm is applied to the Old Faithful data set.

Old Faithful geyser eruptions

It is recommended to plot data before applying any analysis to it, so the selected data points have been plotted. In this scenario the data is easy to plot because it is two-dimensional.

Plot of data

1. Initial allocation of points for clusters

The value of K is 2 (K=2), which means there are two clusters in the data, each with its own centroid. The following tables show the data points initially assigned to the two clusters.

Data points for clusters

2. Calculation of centroids

Centroid for C1:

Centroid for C2:

3. Measuring distance to centroids

The Euclidean distance from each data point to every centroid is calculated, and each point is assigned to its nearest centroid. The Euclidean distance between two points is the square root of the sum of the squared differences between their coordinates.
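
The distance and assignment step can be sketched in a few lines of plain Python. The function names and the example centroids and point below are hypothetical values chosen for illustration; they are not the values from the tables in this article.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical (duration, waiting) centroids and one data point.
centroids = [(2.0, 55.0), (4.5, 80.0)]
point = (3.6, 79.0)

print(euclidean((0.0, 0.0), (3.0, 4.0)))    # 5.0
print(nearest_centroid(point, centroids))   # 1
```

Note that on unscaled data the waiting time, which has the larger numeric range, dominates the distance.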

The Euclidean distance for each point

The above process is repeated until the centroids no longer change, that is, until no point changes its assigned cluster.

Calculation of centroids:

Centroid for C1:

Centroid for C2:

The coordinates of the centroids differ from the previous values; therefore the process has to be repeated, calculating the Euclidean distances and finding the nearest centroid for each point, until the newly calculated centroids are the same as the previous ones.
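
The whole manual procedure (initial allocation, centroid calculation, reassignment, repeat until nothing changes) could be sketched as follows. This is a minimal illustration, not the implementation used for the results in this article: the function names, the six sample points and the initial allocation are assumptions, and the sketch simply raises an error if a cluster becomes empty.

```python
import math


def centroid(points):
    """Mean of each coordinate over the points of one cluster."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty cluster")
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        d = [math.dist(p, c) for c in centroids]  # Euclidean distance
        labels.append(d.index(min(d)))
    return labels


def kmeans_from_allocation(points, labels, k=2, max_iter=100):
    """Iterate centroid update and reassignment until labels stop changing."""
    for _ in range(max_iter):
        centroids = [
            centroid([p for p, l in zip(points, labels) if l == j])
            for j in range(k)
        ]
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
    return centroids, labels


# Illustrative (duration, waiting) pairs and an arbitrary initial allocation.
points = [(3.6, 79), (1.8, 54), (3.333, 74), (2.283, 62), (4.533, 85), (2.883, 55)]
initial = [0, 0, 0, 1, 1, 1]
centroids, labels = kmeans_from_allocation(points, initial)
print(labels)     # [0, 1, 0, 1, 0, 1]
print(centroids)
```

With this starting allocation the labels change once and are stable on the second pass, which mirrors the iteration described above.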

The Implementation of K-means Algorithmic Analysis upon Eruption Data from Old Faithful Geyser

This implementation uses the Python programming language and related libraries. All 272 observations from the source have been taken into account.

K is set to 2 for the K-means algorithm, and two clusters have been derived from the standardized data. Standardizing means rescaling each feature to have a mean of zero and a standard deviation of one, which is recommended because the features might not be in the same measurement units.
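
Standardization of a single feature can be sketched with the standard library alone. The `standardize` helper and the sample waiting times are assumptions for illustration, and the sketch uses the population standard deviation (dividing by n); the article does not state which convention its implementation used.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Assumes the feature is not constant, otherwise the deviation is zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Illustrative waiting times in minutes, not the full data set.
waiting = [79, 54, 74, 62, 85, 55]
z = standardize(waiting)
print([round(v, 3) for v in z])
```

Each feature (waiting time and duration) would be standardized separately before the distances are computed, so that neither dominates because of its units.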

After several runs with different random states, the following centroids are obtained from K-means Clustering (in standardized units):

1. Centroid 1: (0.70970327, 0.67674488)

2. Centroid 2: (-1.26008539, -1.20156744)
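
Running K-means from several random states and keeping the best result might look like the sketch below. It is a plain-Python illustration under stated assumptions, not the code that produced the centroids above: the `kmeans` and `best_of` helpers, the seeds and the six standardized-looking points are all hypothetical, and "best" is taken to mean the lowest within-cluster sum of squared distances.

```python
import math
import random


def kmeans(points, k, seed, max_iter=100):
    """One K-means run whose initial centroids are k randomly chosen points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances for this run.
    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia


def best_of(points, k, seeds):
    """Run K-means once per seed and return the run with the lowest inertia."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda run: run[2])


# Hypothetical standardized points forming two well-separated groups.
points = [(-1.2, -1.3), (-1.3, -1.1), (-1.1, -1.2),
          (0.7, 0.6), (0.8, 0.7), (0.6, 0.8)]
centroids, labels, inertia = best_of(points, k=2, seeds=range(5))
print(centroids, labels)
```

Which cluster ends up as index 0 depends on the seed, so only the grouping, not the label numbers, should be compared across runs.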

The two clusters in the data indicate that there are two kinds of eruptions at Old Faithful: short eruptions that follow short waiting intervals, and long eruptions (more than 3 minutes) that follow long waiting intervals. Eruptions after long intervals last longer than those after short intervals, because longer eruptions require more effort than short discharges. Furthermore, the geyser has more long eruptions than short ones.

Based on the above, it can be assumed that Old Faithful’s eruption behavior varies with conditions. Conditions such as atmospheric temperature, availability of water, wind speed, depth of the conduit and distant earthquakes should be analyzed further to confirm the causes of those variations. The K-means approach implemented here could support predictions of the duration and waiting time of future eruptions.

You can find the full approach described in:

Here’s the thing…

Although the K-means clustering algorithm is appropriate for this scenario, there can be issues when applying it even to similar situations. K-means is an iterative approach that depends on random initialization, so the number of iterations required and the convergence to a stable state are strongly influenced by the initial centroids. Furthermore, the number of clusters has to be specified up front, which may not be straightforward because it has to be derived from domain knowledge and intuition.

Photo by Nicole De Khors from Burst

I hope this article has been useful for you. If I missed anything, please let me know. I would like to discuss advanced methods of spam filtering in future stories.

Merry Christmas!
