K-means clustering and its use cases in the security domain

Jul 19, 2021

The ultimate goal of creating a machine learning model is to achieve accurate predictions with the right algorithms. Learning algorithms are generally divided into two types: supervised and unsupervised.

K-means clustering is one of the unsupervised algorithms where the available input data does not have a labeled response.

First of all, what is clustering?

» Finding “natural” groupings between objects

» We want to find similar objects to treat them in the same way

Example:

» A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information.

» Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.

When machine learning is divided into supervised and unsupervised learning, classification falls under supervised learning and clustering falls under unsupervised learning.

What is K-means clustering?

K-means clustering aims to partition n observations into k clusters, with each observation belonging to the cluster whose mean (the cluster centroid) is nearest.
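Stated as a formula, this goal is usually written as minimizing the within-cluster sum of squares. The symbols below (x for an observation, S_i for the i-th cluster, \mu_i for its mean) are notation introduced here for illustration, not taken from the original text:

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```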

It works in spaces of any number of dimensions. K is a number that you must supply: it tells the algorithm how many clusters to create. For example, K = 2 means two clusters. There are heuristics for estimating a good value of K for a given dataset.
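One common heuristic for picking K is the elbow method: run k-means for several values of K and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The text does not say which method it has in mind, so this choice is an assumption. The sketch below uses a plain batch k-means with a deterministic start, and the function names and toy points are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def batch_kmeans(points, k, max_iter=100):
    """Plain batch k-means; returns (labels, centroids).

    Assumption: centroids start at k evenly spaced samples so the result
    is deterministic. Random starts are more common in practice.
    """
    n = len(points)
    centroids = [tuple(points[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


def wcss(points, labels, centroids):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

for k in range(1, 5):
    labels, centroids = batch_kmeans(points, k)
    print(k, round(wcss(points, labels, centroids), 2))
```

On this toy data the WCSS should fall steeply up to K = 3 and barely improve at K = 4, which is the "elbow" the heuristic looks for. On real data the bend is often less clear, so treat the result as a hint rather than a definitive answer.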

How does the K-means clustering algorithm work?

The flowchart of the algorithm can be read as the following steps:

Step 1: Decide on the value of k, the number of clusters.
Step 2: Choose an initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each cluster. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
Step 4: Repeat step 3 until convergence, that is, until a full pass through the training samples causes no new assignments.
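The four steps above can be sketched in Python roughly as follows. This is a minimal illustration of the sequential variant described here (centroids are updated after every single reassignment), using the systematic start from step 2. The function names, the max_passes safety cap, and the sample points are assumptions added for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid_of(samples, indices):
    """Mean of the samples at the given indices (indices must be non-empty)."""
    pts = [samples[i] for i in indices]
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))


def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: euclidean(x, centroids[c]))


def sequential_kmeans(samples, k, max_passes=100):
    """Return (labels, centroids) following steps 1 to 4."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")

    # Step 2: the first k samples become single-element clusters ...
    members = [[i] for i in range(k)]
    centroids = [tuple(samples[i]) for i in range(k)]
    labels = list(range(k))
    # ... and each remaining sample joins the nearest cluster, whose
    # centroid is recomputed immediately.
    for i in range(k, len(samples)):
        c = nearest(samples[i], centroids)
        members[c].append(i)
        labels.append(c)
        centroids[c] = centroid_of(samples, members[c])

    # Steps 3 and 4: pass over the samples until nothing moves.
    for _ in range(max_passes):
        changed = False
        for i, x in enumerate(samples):
            cur, best = labels[i], nearest(x, centroids)
            # Switch only if another centroid is strictly closer. A cluster's
            # last member sits exactly on its centroid, so it never leaves
            # and no cluster becomes empty.
            if euclidean(x, centroids[best]) < euclidean(x, centroids[cur]):
                members[cur].remove(i)
                members[best].append(i)
                labels[i] = best
                centroids[cur] = centroid_of(samples, members[cur])
                centroids[best] = centroid_of(samples, members[best])
                changed = True
        if not changed:
            break  # a full pass caused no new assignments
    return labels, centroids


# Hypothetical 2-D samples forming two loose groups.
samples = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
labels, centroids = sequential_kmeans(samples, k=2)
print(labels)
print(centroids)
```

Because the start depends on the order of the samples, a different ordering can give a different final clustering. Many library implementations instead use a batch update with randomized starts.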

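Euclidean distance: the distance in steps 2 and 3 is usually measured as the straight-line (Euclidean) distance between a sample and a centroid.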
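For reference, the standard definition of Euclidean distance between two points p and q with n coordinates each is given below. The symbols p, q, and n are generic notation, not names from the original post:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```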

Applications of K-means clustering

It is relatively efficient and fast. Its running time is O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
K-means clustering can be applied to machine learning or data mining.
It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization), and on images for image segmentation.
It is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
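As a rough illustration of the quantization use case, the sketch below maps each pixel to the closest entry in a small color palette. In practice the palette would typically be the k centroids found by k-means on the image's pixels. Here it is hard-coded, and the function names, palette, and pixel values are all hypothetical.

```python
def nearest_palette_index(pixel, palette):
    """Index of the palette color closest to pixel (squared Euclidean distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
    )


def quantize(pixels, palette):
    """Replace every pixel by the index of its nearest palette color."""
    return [nearest_palette_index(px, palette) for px in pixels]


# Hypothetical 3-color palette (RGB) standing in for k-means centroids.
palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
pixels = [(250, 10, 10), (5, 5, 200), (30, 220, 40)]

indices = quantize(pixels, palette)
print(indices)
```

Storing one small index per pixel instead of a full RGB triple is what made this approach attractive on display hardware with a limited palette.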

K-means use cases in security (cyber profiling)

The idea of cyber profiling is derived from criminal profiling, which gives investigators information for classifying the types of criminals who were at a crime scene. More specifically, profiling is based on what is known and not known about the criminal.

A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity.

The difficulty in implementing cyber profiling lies in the diversity of user data and in the fact that online behavior sometimes differs from actual behavior. Because behavior is particular to each person, inductive generalizations can be very reliable but can also lead to misreading the behavior analysis. The cyber profiling process therefore combines deductive and inductive methods.

The cyber profiling process can be used for:

Identifying the users of computers that have been used previously.
Mapping the subject's family, social life, work, or network-based organizations, including those for whom he or she worked.
Providing information about the user: his or her abilities, level of threat, and vulnerability to threats.
Identifying the suspected abuser.

A newer approach to cyber profiling uses clustering techniques to classify web-based content according to user preference data. These preferences can be interpreted as an initial grouping of the data, so that the resulting clusters reveal user profiles.

Applying the K-means algorithm to log analysis datasets for cyber profiling shows that the algorithm can group activity according to the websites that internet users visited. The groups fall into three levels of visits: low, medium, and high.
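A three-way grouping of this kind can be sketched with one-dimensional k-means over visit counts. Everything specific in the example is hypothetical: the site names, the counts, the function name, and the choice to start the three centroids at the smallest, median, and largest count. It is a toy illustration, not the procedure used in the log analysis described above.

```python
def cluster_visit_levels(visits, max_iter=100):
    """Group sites into 'low', 'medium', 'high' by visit count (k = 3).

    visits: dict mapping a site name to its number of visits.
    """
    counts = sorted(visits.values())
    # Assumption: deterministic start at the smallest, median, largest count.
    centroids = [float(counts[0]), float(counts[len(counts) // 2]), float(counts[-1])]
    assignment = {}
    for _ in range(max_iter):
        # Assign each site to the nearest centroid.
        new_assignment = {
            site: min(range(3), key=lambda c: abs(n - centroids[c]))
            for site, n in visits.items()
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean count of its sites.
        for c in range(3):
            members = [visits[s] for s, a in assignment.items() if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    # Name the clusters by ascending centroid value.
    order = sorted(range(3), key=lambda c: centroids[c])
    names = {order[0]: "low", order[1]: "medium", order[2]: "high"}
    return {site: names[a] for site, a in assignment.items()}


# Hypothetical visit counts extracted from a web access log.
visits = {
    "site-a": 2, "site-b": 3, "site-c": 5,
    "site-d": 40, "site-e": 45, "site-f": 50,
    "site-g": 300, "site-h": 310, "site-i": 320,
}
levels = cluster_visit_levels(visits)
for site, level in levels.items():
    print(site, level)
```

Real log data would usually need preprocessing first, such as parsing the log lines and aggregating requests per site or per user, before counts like these are available.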

Conclusion

The K-means algorithm is useful for undirected knowledge discovery and is relatively simple. It is widely used in many fields, including unsupervised learning of neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision.