A NEW AI-BASED METHOD FOR CLUSTERING SURVEY RESPONSES

Aim: Many research projects, particularly in social science research, depend on clustering survey responses. When analyzing survey data, traditional clustering algorithms have several drawbacks. Recent developments in artificial intelligence (AI) and machine learning (ML) have made it possible to analyze survey data more effectively. The aim of this article is to present a new, AI-based method of clustering survey responses using a Variational Autoencoder (VAE). Materials and methods: To determine the effectiveness of grouping, the new VAE clustering method was compared with K-means, PCA with k-means, and Agglomerative Hierarchical Clustering by applying the Silhouette score, the Calinski-Harabasz score, and the Davies-Bouldin score metrics. Results: In the case of the Silhouette Score, the developed VAE method obtained a 69% higher average score than the other methods. For the Calinski-Harabasz Score and the Davies-Bouldin Score, the VAE method outperformed the other methods by 164% and 111%, respectively. Conclusions: The VAE method allowed for the most effective grouping of responses given by respondents. It made it possible to capture complex relationships and patterns in the data. In addition, the method is suitable for analyzing different types of survey data (continuous, categorical, and mixed data) and is resistant to noise.


Introduction
Research methodology plays an important role in research processes by shaping their formal basis and translating theoretical assumptions into the language of empirical procedures. This is especially true of surveys, which originate from the group of social methods and are widely used in the organization and management sciences, allowing the opinions of people (respondents) on certain socio-economic phenomena to be identified. The survey research method is categorized as an empirical method and approaches the research problem from the side of experience, capturing conditions as close to reality as possible. By its nature, it belongs to the nomothetic research approach, which searches for generalized judgments, laws, and rules of the organizational world and follows an inductive research path, establishing the truth of a phenomenon on the basis of statements that confirm its existence in some cases only. Thanks to their relative simplicity, speed, and low cost of implementation, surveys have been a key data collection tool in the social sciences for many years. Analysis of survey data is also a common component of organizational research, enabling researchers to identify patterns and trends, support decisions, and create plans to improve organizational performance.
The data collected from surveys is usually analyzed using traditional statistical methods such as descriptive statistics, inferential statistics, regression analysis, factor analysis, cluster analysis, conjoint analysis, and discriminant analysis. However, when it comes to evaluating survey data, these conventional techniques have some drawbacks. One of their key constraints is the assumption that the data follows a particular distribution, such as the normal distribution. For survey data, which can exhibit complicated and non-linear relationships, this assumption may not hold. These methods also do not account for the high dimensionality and heterogeneity of survey data, which can lead to inaccurate and biased results.
To overcome these limitations, recent advances in artificial intelligence (AI) and machine learning (ML) have opened up new opportunities for analyzing survey data. In particular, deep learning techniques such as the Variational Autoencoder (VAE) have shown promise in clustering survey responses. A VAE is a deep learning generative model that encodes input data into a lower-dimensional latent space and then decodes it back to the original high-dimensional space in order to learn a compact representation of the data. VAE-based data analysis methods are currently being applied successfully in various fields of science and technology, such as image and video analysis, natural language processing, anomaly detection, drug discovery, and recommendation systems.
The purpose of this article is to present and evaluate the effectiveness of a new method for grouping survey responses using Variational Autoencoder (VAE).
In order to achieve this research objective, independent analyses were made of the results of a survey on the value system of employees in the 50+ generation using VAE and three other popular data grouping methods, namely K-means, PCA with k-means, and Agglomerative Hierarchical Clustering. To determine the effectiveness of grouping, these methods were compared by applying metrics such as the Silhouette index, the Calinski-Harabasz index, and the Davies-Bouldin index.
Survey research is a prominent methodology in the social sciences, notably in organizational research. The primary goal of survey research is to collect data from a sample of respondents in order to understand more about their attitudes, habits, and opinions on a specific topic. Data must be evaluated after it has been collected in order to derive valuable findings (Fowler, 2013, p. 134-140). Some of the most common methods for analyzing survey data include descriptive statistics (Holcomb, 2016, p. 1-98), inferential statistics (Asadoorian, 2005, p. 2-28), factor analysis (Tucker, 1951, p. 1-35), regression analysis (Kleinbaum, 2013, p. 34-704), and cluster analysis (Punj, 1983, p. 134-148).
There are several methods for clustering survey response data, including:
• K-Means Clustering: This popular and straightforward approach divides the data into K clusters based on the average distance of the data points from the centroid of each cluster (Bock, 2007, p. 5-28; Likas, 2003, p. 1-27).
• Hierarchical Clustering: This approach creates a hierarchy of clusters by first treating each data point as its own cluster, then merging similar clusters into larger clusters until every data point is a member of a single cluster (Day, 1984, p. 7-24; Murtagh, 2012, p. 86-97; Campello, 2013, p. 160-172; Kriegel, 2011, p. 231-240).
• Model-Based Clustering: The distribution of data in each cluster is described using statistical models. Latent class analysis and Gaussian mixture models are two popular model-based clustering techniques (Fraley, 1998, p. 578-587; Kriegel, 2011, p. 231-240).
• Affinity Propagation: This approach is founded on the idea of "message passing" between data points. It discovers clusters by comparing the similarity of different data points (Wang, 2007, p. 1242-1246).
• Spectral Clustering: This technique divides the data into clusters using the eigenvalues and eigenvectors of a similarity matrix. It is frequently applied to non-linear clustering problems (Ng, 2001, p. 14-19).
Multiple measures can be used to assess a clustering method's performance. Frequently used metrics include:
• Silhouette score: This evaluates the quality of a clustering solution by taking into account both intra-cluster and inter-cluster similarity. A high silhouette score means that the data points have been successfully divided into distinct clusters (Shahapure, 2020, p. 124-131; Shutaywi, 2021, p. 759).
• Calinski-Harabasz score: This measures the ratio of between-cluster dispersion to within-cluster dispersion (Lima, 2020, p. 97-106).
• Davies-Bouldin score: This measures the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances (Arturo, 2018, p. 1-8; Petrovic, 2006, p. 1-12).

Data characteristic
The CAVI questionnaire was the primary research tool used to collect data. The survey was anonymous, with a sample size of 600 people (377 women and 223 men). The survey was designed to investigate the value systems of women and men representing the "silver" generation of employees, and it included respondents aged 50 and up who are professionally active (Laskowska, 2022, p. 194-224). More than half (52%) of those polled had a secondary education. The largest group of respondents (23%) lived in large cities, and most worked in commerce (27%) as well as industry and construction (15%). The sample was selected purposively. The research was carried out in the first quarter of 2022.
To investigate the value system that members of the "silver" generation live by, respondents were asked to rate characteristics chosen based on the assumptions of Shalom H. Schwartz's theory of basic human values (Schwartz, 2012, p. 663-688). The questionnaire contained 16 questions with a semantic differential scale based on Charles E. Osgood's theory of semantic differences (Osgood, 1964, p. 171-200; Themistocleous, 2019, p. 394-407). The scales used have values ranging from 1 to 10, with 1 being the least significant and 10 being the most significant. The intervals between successive scale values were designed to be equal, resulting in interval scales. The internal consistency of the survey questionnaire was examined using Cronbach's alpha (α) and McDonald's omega (ω) tests (α = 0.72-0.91 and ω = 0.81-0.90).
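The internal-consistency check described above can be reproduced for any respondents-by-items score matrix. The sketch below computes Cronbach's alpha from its standard variance formula; the ratings are hypothetical placeholders, not the actual survey data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 5 respondents rating 4 items on a 1-10 scale
scores = np.array([
    [8, 7, 9, 8],
    [3, 4, 2, 3],
    [6, 5, 6, 7],
    [9, 9, 8, 9],
    [4, 3, 4, 5],
])
print(round(cronbach_alpha(scores), 2))  # high, since the items co-vary strongly
```

Values in the 0.72-0.91 range reported for the questionnaire indicate acceptable to excellent internal consistency.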

Data clustering methods
A neural network model was created expressly for the task of clustering survey responses. The Variational Autoencoder (VAE) model serves as the foundation for the neural network's structure. To compare the results from various clustering strategies, three distinct methodologies were used:
• K-means;
• PCA and k-means;
• Agglomerative Hierarchical Clustering.
Due to the algorithms used, the total number of groups had to be set in advance. The initial number of clusters for our data was determined from a dendrogram. Figure 1 indicates that three clusters should be obtained from our data. However, this number of clusters yields poor variance between classes, because the data are split simply into mostly positive, average, and mostly negative sets. The deviation from the overall attitude mean value for each group is shown in Table 1. Therefore, the next proposed number of clusters, with lower dissimilarity, was 4; a total of four clusters produced satisfactory results.
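The dendrogram-based choice of the cluster count can be sketched with SciPy. The random matrix below is only a stand-in for the real 600 × 16 survey data, which is not reproduced here; the tree is then cut at a fixed number of clusters, as done in the study:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Placeholder for the survey matrix: 600 respondents x 16 items on a 1-10 scale
X = rng.integers(1, 11, size=(600, 16)).astype(float)

# Ward linkage builds the merge hierarchy visualized as a dendrogram (Figure 1)
Z = linkage(X, method="ward")
tree = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

# Cut the tree into the chosen number of clusters (4 in the study)
labels = fcluster(Z, t=4, criterion="maxclust")
print(len(labels), sorted(np.unique(labels)))
```

Inspecting where the dendrogram's merge heights jump gives the candidate cluster counts; the final choice (4) is then validated against between-class variance, as in Table 1.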

K-means
K-means is an algorithm that separates samples into groups of equal variance by minimizing the within-cluster sum-of-squares criterion (equation 1). This algorithm requires a predetermined number of clusters k. The set of N samples X is divided into K clusters C, where each cluster is described by the mean μ_j of the samples in that cluster (Arthur, 2007, p. 1-9):

\sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \;\rightarrow\; \min  (1)
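A minimal illustration of this criterion with scikit-learn's `KMeans`; the data matrix is a random placeholder for the survey data, and `inertia_` is exactly the minimized sum of squared distances to the cluster centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.integers(1, 11, size=(600, 16)).astype(float)  # placeholder survey matrix

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# km.inertia_ is the minimized criterion of equation (1)
print(km.labels_.shape, km.cluster_centers_.shape, km.inertia_ > 0)
```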

PCA and k-means
The K-means algorithm suffers in multidimensional spaces from the so-called "curse of dimensionality," because Euclidean distances tend to become inflated. Therefore, it is a good idea to use a dimensionality reduction algorithm to mitigate the problem and speed up calculations. In order to reduce the dimensions, the Principal Component Analysis (PCA) algorithm was chosen. PCA is a technique that reduces the dimensionality of data sets. Its use increases the interpretability of data and minimizes the loss of information. PCA creates new, uncorrelated variables that maximize variance. It is an adaptive data analysis technique, because the search for principal components is based on the available dataset, and the solution is an eigenvalue problem (Jollife, 2016, p. 374-382).
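The two-stage pipeline can be sketched as follows: project the data onto the leading principal components first, then run K-means in the reduced space. The random matrix is again a placeholder, and the choice of 2 components is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 16))  # placeholder for the survey matrix

# Step 1: project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Step 2: cluster in the reduced, decorrelated space
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_reduced)
print(X_reduced.shape, len(set(labels.tolist())))
```

`pca.explained_variance_ratio_` shows how much information each retained component preserves, which guides the choice of `n_components` in practice.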

Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering (AHC) is a family of different methods that are related to each other at the computational level. All of the methods in this family establish structured relationships between data rather than assuming an a priori data structure. AHC creates hierarchically ordered clusters that represent the proximity structure of the analyzed data. The data is not presented as a spatial cluster, but as a dendrogram or constituency tree (Campello, 2013, p. 160-172). Agglomerative clustering performs a hierarchical clustering operation using a bottom-up approach. Each observation is initially treated as a separate cluster, and then the observations are combined with each other according to the linkage criteria.
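The bottom-up merging described above is available directly in scikit-learn; the `linkage` parameter selects the merge criterion (Ward's minimum-variance criterion here, matching the later clustering step). The data is a random placeholder:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 16))  # placeholder for the survey matrix

# Each point starts as its own cluster; 'ward' merges the pair of clusters
# whose union gives the smallest increase in within-cluster variance
ahc = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = ahc.fit_predict(X)
print(labels.shape, sorted(set(labels.tolist())))
```

Other linkage criteria (`complete`, `average`, `single`) produce different hierarchies from the same data, which is why the choice of linkage is part of the method's design.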

Variational Autoencoder (VAE)
A Variational Autoencoder was used to reduce the dimension of the input data and map it to 2D space. VAE is an Artificial Neural Network (ANN) architecture that belongs to the field of generative modeling in machine learning. The main goal of this method is to capture dependencies between the components of each input vector and to maximize the probability of each X in the dataset, so that the model can generate data very similar to the input data. Let X be our data points in some high-dimensional space X and let P(X) be a distribution defined over X; let Z be a latent variable vector in a high-dimensional space Z, and assume that it is possible to sample according to a probability density function defined over Z. f(Z; θ) is a deterministic function family parametrized by a vector θ in space Θ, where f: Z × Θ → X. If we assume that Z is random and θ is a fixed variable, then f(Z; θ) is a random variable in space X. Finally, the function (2) that is maximized in the training process is (Doersch, 2016, p. 1-21):

E_{Z \sim Q}\left[\log P(X|Z)\right] - D_{KL}\left[Q(Z|X) \,\|\, P(Z)\right]  (2)

Fig. 2. Structure of VAE
A Variational Autoencoder is made up of two basic components: an encoder and a decoder. The role of the encoder is to reduce the dimensionality of the input data; the decoder must recreate the input data from the output of the encoder so that the loss function is minimized. In our case, the VAE is probabilistic, which means that latent space points are sampled from latent distributions. In the encoder part, the input data is fed to two convolution layers; it is then flattened and encoded as a distribution over the latent space. 2D point coordinates in latent space are sampled from the latent distribution. The encoded distributions are specified to be normal, which allows our encoder to return the mean and the covariance matrix. This helps in regularizing the latent space so that the returned distributions are close to the standard normal distribution. The output of the encoder part is the input to the decoder. The role of the decoder is to reconstruct the input data from the latent spatial coordinates. The data is fed to three transposed convolution layers, then flattened to size 96 and fed to a dense layer with 16 neurons. The structure of our VAE is shown in Figure 2.
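The encoder-sample-decoder flow can be sketched framework-free. This toy dense version (random, untrained weights; not the authors' convolutional architecture) only illustrates how the encoder produces z_mean and z_log_var, how a 2D latent point is sampled, and how the decoder maps it back to the input dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

# Toy dimensions: 16 input features (survey items), 2-D latent space.
# Weights are random here; in practice they are learned by backpropagation.
d_in, d_hid, d_lat = 16, 8, 2
W1, b1 = rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid)
W_mu, b_mu = rng.normal(size=(d_hid, d_lat)) * 0.1, np.zeros(d_lat)
W_lv, b_lv = rng.normal(size=(d_hid, d_lat)) * 0.1, np.zeros(d_lat)
W2, b2 = rng.normal(size=(d_lat, d_in)) * 0.1, np.zeros(d_in)

def encode(x):
    h = np.tanh(dense(x, W1, b1))
    return dense(h, W_mu, b_mu), dense(h, W_lv, b_lv)  # z_mean, z_log_var

def sample(z_mean, z_log_var):
    eps = rng.normal(size=z_mean.shape)   # reparameterization trick
    return z_mean + np.exp(0.5 * z_log_var) * eps

def decode(z):
    return dense(z, W2, b2)

x = rng.normal(size=(5, d_in))            # batch of 5 "responses"
z_mean, z_log_var = encode(x)
x_rec = decode(sample(z_mean, z_log_var))
print(z_mean.shape, x_rec.shape)
```

The 2D `z_mean` output of the trained encoder is what is later visualized and clustered in the Results section.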
In the described model, the sampled values are z_mean (the mean value) and z_log_var (the log variance), represented by equations (3) and (4), respectively.
Let q(z|x, W) be a function used to approximate the true posterior; based on the variable x, it produces a distribution over the latent variable z. The parameters W correspond to the distribution q.
The reparameterization trick described by (Kingma, 2019, p. 307-392) was used for sampling. It consists in introducing a random variable ε with a known distribution p(ε). A sample is obtained from this distribution, and z is then computed by a deterministic, differentiable function, equation (5).

Z = g(x, ε, W)  (5)

By applying this trick, we can use the Monte Carlo method to estimate the expectation and differentiate it, equation (6):

E_{p(ε)}\left[f(Z)\right] \approx \frac{1}{L} \sum_{l=1}^{L} f\big(g(x, ε^{(l)}, W)\big)  (6)

where: L - number of samples.
We use the sum of the mean squared error for reconstruction and the Kullback-Leibler divergence for the distribution as the loss function, equation (7) (Kingma, 2019, p. 307-392):

\mathcal{L} = \lVert x - \hat{x} \rVert^2 + D_{KL}\left[q(z|x, W) \,\|\, p(z)\right]  (7)

The second step is to cluster the data. For clustering the two-dimensional points, we used Ward's agglomerative clustering algorithm. This method calculates the Euclidean distance between all points in order to find the pair with the smallest possible dissimilarity. In Ward's method, the initial cluster distance is the squared Euclidean distance between points, so our metric is defined by equation (8) (Ward, 1963, p. 236-244):

d(u, v) = \lVert u - v \rVert^2  (8)
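A numeric sketch of this loss, assuming a standard-normal prior p(z): the reconstruction term is a sum of squared errors averaged over the batch, and the KL term has the usual closed form for diagonal Gaussians. The inputs are tiny hand-picked arrays, not model outputs:

```python
import numpy as np

def vae_loss(x, x_rec, z_mean, z_log_var):
    # Reconstruction term: squared error between input and reconstruction,
    # summed over features and averaged over the batch
    rec = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    # Closed-form KL divergence between N(z_mean, exp(z_log_var)) and N(0, I)
    kl = -0.5 * np.mean(
        np.sum(1 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)
    )
    return rec + kl

x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_rec = np.array([[1.1, 1.9], [2.8, 4.2]])
z_mean = np.zeros((2, 2))
z_log_var = np.zeros((2, 2))
# With z_mean = 0 and z_log_var = 0 the KL term vanishes,
# leaving only the reconstruction error
print(round(vae_loss(x, x_rec, z_mean, z_log_var), 3))  # prints 0.05
```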

Methods comparison
To determine the effectiveness of grouping, the methods presented above were compared by applying the following metrics:
• Silhouette Score;
• Calinski-Harabasz Score;
• Davies-Bouldin Score.
The Silhouette Score is determined by equation (9) (Rousseeuw, 1987, p. 53-65):

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}  (9)

where: i - sample; a(i) - the sample's average distance from every other point in its own cluster; b(i) - the sample's average distance from every point in the nearest neighboring cluster.
This score is used to evaluate how effective a clustering method is. The value varies from -1 to 1, where 1 indicates clusters that are clearly distinct and spaced widely apart. Zero means that clusters are overlapping.
The Calinski-Harabasz Score (Variance Ratio Criterion) is defined by equations (10)-(12) (Caliński, 1974, p. 1-27):

CH = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{n - k}{k - 1}  (10)

W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T  (11)

B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^T  (12)

where: C_q - set of points in cluster q; c_q - center of cluster q; n_q - number of data points in cluster q; c_E - center of the whole data set.
Clustered data with more clearly defined clusters correlates with a higher Calinski-Harabasz score. Lower values of the Davies-Bouldin Score indicate that the data is better separated between clusters. This method represents the average of a metric that contrasts the size of the clusters with the distance between clusters. The score is given by equation (13) (Davies, 1979, p. 224-227):

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}  (13)

where: k - number of clusters; i - cluster; j - the cluster most similar to i; s_i - cluster diameter; d_ij - the distance between the centroids of clusters i and j.
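All three scores are available in scikit-learn and can be computed for any labeled clustering. On two well-separated synthetic blobs (a stand-in for well-clustered survey data), the expected pattern is easy to verify: silhouette near 1, a high Calinski-Harabasz value, and a low Davies-Bouldin value:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(3)
# Two compact, well-separated synthetic blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)

print(silhouette_score(X, labels))         # close to 1: clearly distinct clusters
print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better
```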

Results and discussion
In order to acquire a 2D representation of our data, we used only the encoder's output. The variables z_1 and z_2 from the latent space were used for visualization. The mapped data in 2D space is shown in Figure 3a. The next step is to cluster the 2D data. In order to do this, the hierarchical agglomerative clustering algorithm with minimum variance clustering was used. The merge criterion in this method is a function of all individual distances from the centroid (Manning, 2009, p. 235). The results of this algorithm for our 2D representations are shown in Figure 3b. The deviation from the attitude mean value for each group is shown in Table 3. As we can see, at this stage, none of the groups formed have visible features that could provide a logical link between them. Therefore, for further analysis, the data needed to be filtered. For this purpose, a significance interval was calculated by equation (14). Table 4 shows the deviation from the attitude mean value for each group after filtering.

(14)

where: X_r - set of values from row r; N_c - number of clusters; β - size coefficient.

The use of the Variational Autoencoder allowed the data to be grouped into 4 clusters, for which, by determining the average value for each answer and applying formula (14), the results presented in Table 4 were obtained. As we can see, the groups of responses obtained were characterized by dominant sets of features, from which it is possible to determine the characteristic pattern (attitude) of each group.
To verify the effectiveness of the VAE method, independent analyses were made of the results of a survey on the value system of employees in the 50+ generation using three other popular data grouping methods, namely K-means, PCA with K-means, and Agglomerative Hierarchical Clustering. The results obtained by these algorithms are presented in Tables 5, 6, and 7, respectively. As we can see, only the use of the VAE model made it possible to group the answers from the survey into 4 clusters that, compared to the other methods, are characterized by distinguishable features. Additionally, the groups obtained in this way have a more even quantitative distribution than those of the other algorithms discussed, which also demonstrates the advantage of the VAE method. The comparison of clusters' group sizes is presented in Table 8.

Conclusion
The research has confirmed that the proposed AI-based VAE clustering method provided the most effective grouping of respondents' responses. Complex relationships, trends, and patterns in the data could be captured using the VAE method, which was not achievable with the other grouping techniques. The study also demonstrated the flexibility and scalability of VAE as a strategy for handling a variety of survey data types, including continuous, categorical, and mixed data. Additionally, VAE can handle missing data and is robust to noise, making it a suitable method for analyzing survey data, which can often be noisy and have missing responses.
Using the VAE method to cluster survey responses has a number of advantages for organizational research. First, compared to conventional clustering approaches, the suggested method is more adept at handling high-dimensional and heterogeneous survey data. Multiple questions and variables are frequently used in surveys to measure various characteristics of the topic being studied. The suggested strategy can uncover the data's underlying structure and spot trends that conventional approaches might miss. Second, the suggested approach is capable of modeling intricate, non-linear interactions between survey responses. Complex interactions between various variables that influence organizational results are possible in organizational research. The suggested approach can capture these interactions and offer a more precise and in-depth comprehension of the topic under study. Third, the outcomes produced by the suggested strategy may be easier to understand and more meaningful. VAE can learn a compressed representation of the data in a lower-dimensional latent space. As a result, it may be simpler to visualize and interpret the clustering results, which may lead to better-informed decisions.
However, there are also some challenges to using VAE for clustering survey data. One of them is choosing the proper model architecture and hyperparameters, which can have a big impact on the outcomes. Additionally, because the latent space representation may not be directly related to the original variables, the VAE results can be challenging to interpret. Therefore, additional study is required to confirm the usefulness of the suggested strategy in various scenarios and to evaluate it against other state-of-the-art AI-driven methodologies.

Fig. 3. Survey data mapped to 2D space (a) and clustered data (b)

Table 3.
Deviation from the attitude mean value for each group

Table 4.
Deviation from the attitude mean value for each group with filter -VAE

Table 5.
Deviation from the attitude mean value for each group with filter -K-means

Table 6.
Deviation from the attitude mean value for each group with filter - K-means + PCA

Table 7.
Deviation from the attitude mean value for each group with filter - Agglomerative clustering algorithm