Cluster similar strings

Calculate string similarity using the Levenshtein distance and return clusters of similar strings.
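
As a rough illustration of the distance measure named above, base R's utils::adist() computes the (generalized) Levenshtein distance between strings. Whether ClusterStrings() calls adist() internally is an assumption here; the snippet only shows what the metric measures.

```r
# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn one string into another.
labels <- c("kitten", "sitting", "mitten")

# adist() with a single argument returns a symmetric matrix of
# pairwise distances between the elements of `labels`.
d <- utils::adist(labels)

d[1, 2]  # kitten -> sitting: two substitutions and one insertion
d[1, 3]  # kitten -> mitten: one substitution
```

Small distances mark strings that are likely to fall in the same cluster.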

Usage

ClusterStrings(x, maxCluster = 12)

Arguments

x: Character vector.
maxCluster: Integer specifying maximum number of clusters to consider.

Value

ClusterStrings() returns an integer vector assigning each element of x to a cluster, with an attribute med specifying the median string in each cluster, and an attribute silhouette reporting the silhouette coefficient of the optimal clustering. Silhouette coefficients < 0.5 indicate weak structure; in that case, no clusters are returned. If the number of unique elements of x is less than maxCluster, all occurrences of each entry are assigned to an individual cluster.
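
The idea of choosing an "optimal" clustering by silhouette coefficient can be sketched in base R as below. This is a minimal illustration under stated assumptions, not the package's implementation: the helper MeanSilhouette(), the use of average-linkage hclust(), and the candidate range 2:5 are all hypothetical choices made for this example.

```r
# Hypothetical helper: mean silhouette width of a clustering, given a
# distance matrix `d` and a vector of integer cluster labels `cl`.
MeanSilhouette <- function(d, cl) {
  ids <- unique(cl)
  widths <- vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1L) return(0)  # singleton clusters score 0 by convention
    # a: mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1L)
    # b: mean distance to the nearest other cluster
    b <- min(vapply(setdiff(ids, cl[i]),
                    function(k) mean(d[i, cl == k]), double(1)))
    (b - a) / max(a, b)
  }, double(1))
  mean(widths)
}

x <- c(paste0("FirstCluster ", 1:5),
       paste0("SecondCluster.", 8:12),
       paste0("AnotherCluster_", letters[1:6]))

d <- utils::adist(x)  # pairwise Levenshtein distances
tree <- stats::hclust(stats::as.dist(d), method = "average")

# Score each candidate number of clusters and keep the best-scoring one.
candidates <- 2:5
sil <- vapply(candidates,
              function(k) MeanSilhouette(d, stats::cutree(tree, k)),
              double(1))
bestK <- candidates[which.max(sil)]
clusters <- stats::cutree(tree, bestK)

# Mirror the documented 0.5 threshold for "weak structure".
strongStructure <- max(sil) >= 0.5
```

Because the clustering method here differs from whatever ClusterStrings() uses, the silhouette values it produces need not match those reported by the function.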

Author

Martin R. Smith (martin.smith@durham.ac.uk)

Examples

ClusterStrings(c(paste0("FirstCluster ", 1:5),
                 paste0("SecondCluster.", 8:12),
                 paste0("AnotherCluster_", letters[1:6])))
#>  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3
#> attr(,"silhouette")
#> [1] 0.911867
#> attr(,"med")
#> [1] "FirstCluster 1"   "SecondCluster.10" "AnotherCluster_a"
