I’m working with Spark 2.1.0, PySpark, and DataFrames to cluster documents into topics using Latent Dirichlet Allocation (LDA):
from pyspark.ml.clustering import LDA

lda = LDA(k=10)
model = lda.fit(dataset)
topics = model.describeTopics(5)  # top 5 terms per topic
transformed = model.transform(dataset)  # adds a topicDistribution column
As described in the MLlib programming guide, both topics and transformed are Spark DataFrames. topics has one row per topic, with a label and a sparse vector, while transformed has a label for each document, a sparse vector for the features, and a dense vector for the topicDistribution. I would now like to work with these vectors (running some calculations between them). I have tried slicing the vectors, converting to pandas, creating UDFs, converting to RDDs and back to DataFrames, and other methods, but none of them worked. Is it possible to convert these vectors into something easier to work with?
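To make the goal concrete, here is a minimal sketch of the kind of conversion I am after: a UDF that turns each vector into a plain array<double> column. The helper names vector_to_list and with_array_column, and the output column name topicArray, are made up for illustration; the sketch assumes the column holds pyspark.ml.linalg vectors, whose toArray() method returns the values as a NumPy array.

```python
def vector_to_list(v):
    """Convert a dense or sparse ML vector into a plain list of Python floats."""
    # toArray() is available on both DenseVector and SparseVector.
    # float() turns NumPy scalars into Python floats, which Spark can
    # serialize as DoubleType.
    return [float(x) for x in v.toArray()]


def with_array_column(df, vector_col="topicDistribution", out_col="topicArray"):
    """Return df with an extra array<double> column built from vector_col.

    Hypothetical helper; the column names are assumptions.
    """
    # Imported lazily so vector_to_list can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    to_array = udf(vector_to_list, ArrayType(DoubleType()))
    return df.withColumn(out_col, to_array(df[vector_col]))
```

With something like this, arrays = with_array_column(transformed) should give a column that can be indexed element by element, for example arrays.select(arrays["topicArray"][0]), and that arrives as ordinary Python lists after toPandas().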