Title: | Partitioning Using Local Subregions |
---|---|
Description: | A method of clustering functional data using subregion information of the curves. It is intended to supplement the 'fda' and 'fda.usc' packages in functional data object clustering. It also facilitates the printing and plotting of the results in a tree format and limits the partitioning candidates into a specific set of subregions. |
Authors: | Mark Greenwood [aut] |
Maintainer: | Tan Tran <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.2 |
Built: | 2025-02-11 03:27:41 UTC |
Source: | https://github.com/vinhtantran/puls |
A data set containing the daily ice extent in the Arctic Sea from 1978 to 2019, collected by the National Oceanic and Atmospheric Administration (NOAA).
arctic_2019
A data frame with 13391 rows and 6 variables:
Years of available data (1978–2019).
Month (01–12).
Day of the month indicated in Column Month.
Daily ice extent, to three decimal places.
Whether a day is missing (1) or not (0).
Data source in the NOAA database.
https://nsidc.org/data/G02135/versions/3
library(dplyr)
library(lubridate)
library(ggplot2)
library(puls)

data(arctic_2019)

# Create day in the year column to replace Month and Day
north <- arctic_2019 %>%
  mutate(yday = yday(make_date(Year, Month, Day)), .keep = "all") %>%
  select(Year, yday, Extent)

ggplot(north) +
  geom_linerange(aes(x = yday, ymin = Year - 0.2, ymax = Year + 0.2),
                 size = 0.5, color = "red") +
  scale_y_continuous(breaks = seq(1980, 2020, by = 5),
                     minor_breaks = NULL) +
  labs(x = "Day",
       y = "Year",
       title = "Measurement frequencies were not always the same")
An implementation of the monoClust::as_MonoClust() S3 method for a PULS object. Its purpose is to reuse the plotting and printing functions from the monoClust package.
## S3 method for class 'PULS'
as_MonoClust(x, ...)
x | A PULS object to be coerced to a MonoClust object. |
... | For extensibility. |
A MonoClust object coerced from the PULS object.
monoClust::MonoClust.object and PULS.object
Calculate the distance between functional objects over the defined range.
fdistmatrix(fd, subrange, distmethod)
fd | A functional data object. |
subrange | A vector of two values giving the range of the functional object's domain over which to calculate the distance. |
distmethod | The method for calculating the distance matrix. Choose between "usc" and "manual". |
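If choosing distmethod = "manual", the L2 distance between all pairs of functions $y_i(t)$ and $y_j(t)$ is given by:
$$d_R(y_i, y_j) = \sqrt{\int_{a_r}^{b_r} \left[y_i(t) - y_j(t)\right]^2 \, dt},$$
where $a_r$ and $b_r$ are the lower and upper limits of the range given in subrange.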
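The L2 distance over a subrange can be approximated numerically when the curves are available only as values on a common grid. The following is a minimal sketch under that assumption, using the trapezoidal rule in base R. The helper `l2_dist_matrix()` and its arguments are hypothetical and are not part of puls, and the sketch does not claim to reproduce how `fdistmatrix()` computes its result internally.

```r
# Hypothetical helper (not part of puls): pairwise L2 distances between
# curves sampled on a common grid, restricted to the subrange [a, b].
# tgrid:    numeric vector of grid points
# curves:   matrix with one row per grid point and one column per curve
# subrange: numeric vector c(a, b)
l2_dist_matrix <- function(tgrid, curves, subrange) {
  keep <- tgrid >= subrange[1] & tgrid <= subrange[2]
  t_sub <- tgrid[keep]
  y_sub <- curves[keep, , drop = FALSE]
  n <- ncol(y_sub)
  d <- matrix(0, n, n, dimnames = list(colnames(curves), colnames(curves)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sq_diff <- (y_sub[, i] - y_sub[, j])^2
      # Trapezoidal rule for the integral of the squared difference
      left <- sq_diff[-length(sq_diff)]
      right <- sq_diff[-1]
      integral <- sum(diff(t_sub) * (left + right) / 2)
      d[i, j] <- d[j, i] <- sqrt(integral)
    }
  }
  # Return a dist object that prints the diagonal and the upper half
  as.dist(d, diag = TRUE, upper = TRUE)
}

# Made-up curves on [0, 1], compared on the subrange [0.2, 0.4]
tgrid <- seq(0, 1, length.out = 101)
curves <- cbind(a = sin(2 * pi * tgrid),
                b = cos(2 * pi * tgrid),
                c = rep(0, length(tgrid)))
l2_dist_matrix(tgrid, curves, c(0.2, 0.4))
```

A finer grid gives a closer approximation to the integral; with a functional data object, the grid values would first have to be obtained by evaluating the curves.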
A distance matrix that includes the diagonal values and the upper half.
library(fda)
library(puls)

# Examples taken from fda::Data2fd()
data(gait)
# Function only works on two dimensional data
gait <- gait[, 1:5, 1]
gaitbasis3 <- create.fourier.basis(nbasis = 5)
gaitfd3 <- Data2fd(gait, basisobj = gaitbasis3)

fdistmatrix(gaitfd3, c(0.2, 0.4), "usc")
After partitioning using PULS, this function can plot the functional waves and color different clusters as well as their medoids.
ggwave(
  toclust.fd,
  intervals,
  puls.obj,
  xlab = NULL,
  ylab = NULL,
  lwd = 0.5,
  alpha = 0.4,
  lwd.med = 1
)
toclust.fd | A functional data object (i.e., having class "fd"). |
intervals | A data set (or matrix) whose rows are intervals and whose columns are the beginning and ending indexes of each interval. |
puls.obj | A PULS object. |
xlab | Labels for x-axis. If not provided, the labels stored in |
ylab | Labels for y-axis. If not provided, the labels stored in |
lwd | Line width of normal waves. |
alpha | Transparency of normal waves. |
lwd.med | Line width of medoid waves. |
A ggplot2 object.
library(fda)
library(puls)

# Build a simple fd object from already smoothed smoothed_arctic
data(smoothed_arctic)
NBASIS <- 300
NORDER <- 4
y <- t(as.matrix(smoothed_arctic[, -1]))
splinebasis <- create.bspline.basis(rangeval = c(1, 365),
                                    nbasis = NBASIS,
                                    norder = NORDER)
fdParobj <- fdPar(fdobj = splinebasis,
                  Lfdobj = 2,
                  # No need for any more smoothing
                  lambda = .000001)
yfd <- smooth.basis(argvals = 1:365, y = y, fdParobj = fdParobj)

# Each interval is the (start, end) day of a month
Jan <- c(1, 31); Feb <- c(31, 59); Mar <- c(59, 90)
Apr <- c(90, 120); May <- c(120, 151); Jun <- c(151, 181)
Jul <- c(181, 212); Aug <- c(212, 243); Sep <- c(243, 273)
Oct <- c(273, 304); Nov <- c(304, 334); Dec <- c(334, 365)

intervals <-
  rbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)

PULS4_pam <- PULS(toclust.fd = yfd$fd, intervals = intervals,
                  nclusters = 4, method = "pam")
ggwave(toclust.fd = yfd$fd, intervals = intervals, puls.obj = PULS4_pam)
Plot the PULS tree in the form of a dendrogram.
## S3 method for class 'PULS'
plot(
  x,
  branch = 1,
  margin = c(0.12, 0.02, 0, 0.05),
  text = TRUE,
  which = 4,
  digits = getOption("digits") - 2,
  cols = NULL,
  col.type = c("l", "p", "b"),
  ...
)
x | A PULS object. |
branch | Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square-shouldered branches, a value of 0 gives V-shaped branches, and other values are intermediate. |
margin | An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation.) |
text | Whether to print the labels on the tree. |
which | Labeling modes, which are: |
digits | Number of significant digits to print. |
cols | Whether to show color bars at the leaves. This helps match this tree plot with other plots whose cluster memberships are colored. It only works when |
col.type | When |
... | Arguments to be passed to |
A plot of splitting order.
library(fda)
library(puls)

# Build a simple fd object from already smoothed smoothed_arctic
data(smoothed_arctic)
NBASIS <- 300
NORDER <- 4
y <- t(as.matrix(smoothed_arctic[, -1]))
splinebasis <- create.bspline.basis(rangeval = c(1, 365),
                                    nbasis = NBASIS,
                                    norder = NORDER)
fdParobj <- fdPar(fdobj = splinebasis,
                  Lfdobj = 2,
                  # No need for any more smoothing
                  lambda = .000001)
yfd <- smooth.basis(argvals = 1:365, y = y, fdParobj = fdParobj)

# Each interval is the (start, end) day of a month
Jan <- c(1, 31); Feb <- c(31, 59); Mar <- c(59, 90)
Apr <- c(90, 120); May <- c(120, 151); Jun <- c(151, 181)
Jul <- c(181, 212); Aug <- c(212, 243); Sep <- c(243, 273)
Oct <- c(273, 304); Nov <- c(304, 334); Dec <- c(334, 365)

intervals <-
  rbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)

PULS4_pam <- PULS(toclust.fd = yfd$fd, intervals = intervals,
                  nclusters = 4, method = "pam")
plot(PULS4_pam)
Render the PULS split tree in an easy-to-read format with important information such as terminal nodes.
## S3 method for class 'PULS'
print(x, spaces = 2L, digits = getOption("digits"), ...)
x | A PULS object. |
spaces | Number of spaces to indent between two tree levels. |
digits | Number of significant digits to print. |
... | Arguments to be passed to |
A nicely displayed PULS split tree in text.
library(fda)
library(puls)

# Build a simple fd object from already smoothed smoothed_arctic
data(smoothed_arctic)
NBASIS <- 300
NORDER <- 4
y <- t(as.matrix(smoothed_arctic[, -1]))
splinebasis <- create.bspline.basis(rangeval = c(1, 365),
                                    nbasis = NBASIS,
                                    norder = NORDER)
fdParobj <- fdPar(fdobj = splinebasis,
                  Lfdobj = 2,
                  # No need for any more smoothing
                  lambda = .000001)
yfd <- smooth.basis(argvals = 1:365, y = y, fdParobj = fdParobj)

# Each interval is the (start, end) day of a month
Jan <- c(1, 31); Feb <- c(31, 59); Mar <- c(59, 90)
Apr <- c(90, 120); May <- c(120, 151); Jun <- c(151, 181)
Jul <- c(181, 212); Aug <- c(212, 243); Sep <- c(243, 273)
Oct <- c(273, 304); Nov <- c(304, 334); Dec <- c(334, 365)

intervals <-
  rbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)

PULS4_pam <- PULS(toclust.fd = yfd$fd, intervals = intervals,
                  nclusters = 4, method = "pam")
print(PULS4_pam)
PULS function for functional data. Use it only when you know that the data should not be converted into functional form because they are already smooth (e.g., your data are step functions).
PULS(
  toclust.fd,
  method = c("pam", "ward"),
  intervals = c(0, 1),
  spliton = NULL,
  distmethod = c("usc", "manual"),
  labels = toclust.fd$fdnames[2]$reps,
  nclusters = length(toclust.fd$fdnames[2]$reps),
  minbucket = 2,
  minsplit = 4
)
toclust.fd | A functional data object (i.e., having class "fd"). |
method | The clustering method you want to run in each subregion. Can be chosen between "pam" and "ward". |
intervals | A data set (or matrix) whose rows are intervals and whose columns are the beginning and ending indexes of each interval. |
spliton | Restrict the partitioning to a specific set of subregions. |
distmethod | The method for calculating the distance matrix. Choose between "usc" and "manual". |
labels | The names of the entities. |
nclusters | The number of clusters. |
minbucket | The minimum number of data points allowed in one cluster. |
minsplit | The minimum size of a cluster that can still be considered a split candidate. |
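The two size thresholds can be read as a simple admissibility rule for a candidate split. The sketch below illustrates that reading only: `is_allowed_split()` is a hypothetical helper, not a puls function, and it assumes that minsplit applies to the cluster being split while minbucket applies to each resulting child cluster.

```r
# Hypothetical helper (not part of puls): checks one candidate split
# against the size thresholds described for minbucket and minsplit.
is_allowed_split <- function(n_parent, n_left, n_right,
                             minbucket = 2, minsplit = 4) {
  # The two children must partition the parent cluster
  stopifnot(n_left + n_right == n_parent)
  # Parent must be large enough to split; each child large enough to keep
  n_parent >= minsplit && min(n_left, n_right) >= minbucket
}

is_allowed_split(10, 7, 3)  # TRUE
is_allowed_split(3, 2, 1)   # FALSE: parent is smaller than minsplit
is_allowed_split(6, 5, 1)   # FALSE: one child is smaller than minbucket
```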
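If choosing distmethod = "manual", the L2 distance between all pairs of functions $y_i(t)$ and $y_j(t)$ is given by:
$$d_R(y_i, y_j) = \sqrt{\int_{a_r}^{b_r} \left[y_i(t) - y_j(t)\right]^2 \, dt},$$
where $a_r$ and $b_r$ are the beginning and ending points of a subregion listed in intervals.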
A PULS object. See PULS.object for details.
library(fda)
library(puls)

# Build a simple fd object from already smoothed smoothed_arctic
data(smoothed_arctic)
NBASIS <- 300
NORDER <- 4
y <- t(as.matrix(smoothed_arctic[, -1]))
splinebasis <- create.bspline.basis(rangeval = c(1, 365),
                                    nbasis = NBASIS,
                                    norder = NORDER)
fdParobj <- fdPar(fdobj = splinebasis,
                  Lfdobj = 2,
                  # No need for any more smoothing
                  lambda = .000001)
yfd <- smooth.basis(argvals = 1:365, y = y, fdParobj = fdParobj)

# Each interval is the (start, end) day of a month
Jan <- c(1, 31); Feb <- c(31, 59); Mar <- c(59, 90)
Apr <- c(90, 120); May <- c(120, 151); Jun <- c(151, 181)
Jul <- c(181, 212); Aug <- c(212, 243); Sep <- c(243, 273)
Oct <- c(273, 304); Nov <- c(304, 334); Dec <- c(334, 365)

intervals <-
  rbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)

PULS4_pam <- PULS(toclust.fd = yfd$fd, intervals = intervals,
                  nclusters = 4, method = "pam")
PULS4_pam
The structure and objects contained in PULS, an object returned from the PULS() function and used as the input to other functions in the package.
Data frame in the form of a tibble::tibble() representing a tree structure with one row for each node. The columns include:
Index of the node. Depth of a node can be derived by number %/% 2.
Name of the variable used in the split at a node, or "<leaf>" if it is a leaf node.
Cluster size, the number of observations in that cluster.
Weights of observations. Unusable. Saved for future use.
Inertia value of the cluster at that node.
Position of the next split row in the data set (that position will belong to left node (smaller)).
Position of the next split variable in the data set.
Proportion of inertia value of the cluster at that node to the inertia of the root.
Position of the data point regarded as the medoid of its cluster.
y-coordinate of the splitting node to facilitate showing on the tree. See plot.PULS() for details.
Percent inertia explained as described in Chavent (2007). It is 1 - (sum(current inertia)/inertia[1]).
Indicator of an alternative cut yielding the same reduction in inertia at that split.
Vector of the same length as the number of rows in the data, containing the value of frame$number corresponding to the leaf node that an observation falls into.
Distance matrix calculated using the method indicated in the distmethod argument of PULS().
Vector of subregion names in the data that were used to split.
Named vector of positions of the data points regarded as medoids of clusters.
Indicator of having an alternate splitting route occurred when splitting.
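The inertia-explained formula above can be illustrated with a small made-up tree. The sketch below is an assumption-laden reading, not the package's code: the data frame, its numbers, and the `is_leaf` and `inertia` column names are hypothetical (only `number` is named in the text as frame$number), and "current inertia" is taken to mean the inertia of the current leaf nodes.

```r
# Hypothetical frame with made-up values; not the output of PULS().
# Row 1 is the root, so its inertia is the total inertia.
frame <- data.frame(
  number  = c(1, 2, 3, 6, 7),
  is_leaf = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  inertia = c(100, 20, 50, 10, 15)
)

# Inertia remaining in the current leaves (assumed meaning of
# "current inertia" in the formula)
current_inertia <- frame$inertia[frame$is_leaf]

# 1 - (sum(current inertia) / inertia[1])
inertia_explained <- 1 - sum(current_inertia) / frame$inertia[1]
inertia_explained
```

Here the leaves retain 20 + 10 + 15 = 45 of the root's 100 units of inertia, so the partition explains 55 percent of it.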
Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687-701. doi:10.1016/j.csda.2007.03.013.
Raw Arctic data were smoothed and then transformed into functional data using the fda package. To overcome the difficulty of exporting an fda object in a package, the object was discretized into a data set with 365 columns corresponding to the 365 days of a year and 39 rows corresponding to 39 years. The years are from 1979 to 1986, then from 1989 to 2019. The years 1978, 1987, and 1988 were removed because the measurements were not complete.
smoothed_arctic
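A data frame with 39 rows corresponding to 39 years (1979 to 1986, 1989 to 2019) and 366 columns (the examples drop the first column with smoothed_arctic[, -1] and use the remaining 365 as daily values).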
NOAA's raw data are in arctic_2019, and the code to generate this data set is in the data-raw/ folder of the source code.