Clustering a large volume of string data



Hi experts,

I need your help with the case study below.


The following dataset contains process start and stop events collected from individual Windows-based desktop computers and servers. Each event is on a separate line in the form of “time, user@domain, computer, process name, start/end” and represents a process event at the given time.
Well-known system accounts (SYSTEM, Local Service) were not de-identified, though any well-known administrator accounts were still de-identified. The specific timeframe used is not disclosed for security purposes. All data starts at a time epoch of 1, with a time resolution of 1 second. A sample of the dataset is attached.
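
To make the record layout concrete, here is a minimal Python sketch of how one such line could be parsed. The sample lines and the values in them (U1, DOM1, C1, P1) are made up for illustration and are not taken from the attached file. The sketch also assumes that no field contains a comma.

```python
from typing import NamedTuple


class ProcessEvent(NamedTuple):
    time: int       # seconds since the dataset's epoch (epoch starts at 1)
    user: str
    domain: str
    computer: str
    process: str
    action: str     # "start" or "end", lower-cased


def parse_event(line: str) -> ProcessEvent:
    """Parse one 'time, user@domain, computer, process name, start/end' line."""
    # Split on commas and trim whitespace around every field.
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    time_s, account, computer, process, action = fields
    # partition leaves domain empty if the account has no '@'.
    user, _, domain = account.partition("@")
    return ProcessEvent(int(time_s), user, domain, computer, process, action.lower())


# Hypothetical sample lines, only to show the shape of the data.
sample_lines = [
    "1,U1@DOM1,C1,P1,Start",
    "1,SYSTEM@DOM1,C2,P2,Start",
    "5,U1@DOM1,C1,P1,End",
]
events = [parse_event(line) for line in sample_lines]
for event in events:
    print(event)
```

Once the lines are in this shape, each event can be grouped by user, computer, or process before any clustering is attempted.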


Can anybody help me analyze the above dataset and identify clusters of users and processes based on their execution, in order to understand what is going on in these local systems inside the internal network?
I did some analysis and thought of using the K-Means algorithm for clustering, but it seems K-Means works with numeric data only, and in the above case the data are string-based.


Check this link. It can guide you and help you explore more before reaching out for direct answers. It's a critical skill in data science to keep searching for an answer.
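
On the concern that K-Means needs numeric input: one common workaround is to turn the strings into counts first, for example one vector per user holding how many times that user started each process. The sketch below illustrates that idea in plain Python with a small hand-written K-Means. It is not a recommendation of a specific method for this dataset. The (user, process) pairs are invented, and the helper names `count_vectors` and `kmeans` are hypothetical.

```python
from collections import Counter, defaultdict


def count_vectors(pairs):
    """Turn a list of (user, process) pairs into one count vector per user."""
    counts = defaultdict(Counter)
    for user, process in pairs:
        counts[user][process] += 1
    processes = sorted({p for _, p in pairs})
    users = sorted(counts)
    vectors = [[float(counts[u][p]) for p in processes] for u in users]
    return users, processes, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point that is farthest from every centroid chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: sq_dist(v, centroids[i])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented (user, process) start events, only to show the encoding.
pairs = [
    ("U1", "P1"), ("U1", "P1"), ("U1", "P2"),
    ("U2", "P1"), ("U2", "P2"),
    ("U3", "P3"), ("U3", "P3"),
    ("U4", "P3"), ("U4", "P4"),
]
users, processes, vectors = count_vectors(pairs)
labels = kmeans(vectors, k=2)
for user, label in zip(users, labels):
    print(user, "-> cluster", label)
```

On real data the count matrix will be wide and sparse, so scaling the counts or using a distance designed for categorical data may work better than raw Euclidean distance. The number of clusters `k` also has to be chosen by experiment.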