In a previous post I showed how to create text-processing pipelines for machine learning in Python using scikit-learn. In many cases the core of such pipelines is the vectorization of text using the tf-idf transformation. In this post I will show some ways of analysing and making sense of the result of a tf-idf transformation. As an example I will use the same Kaggle dataset, namely webpages provided and classified by StumbleUpon as either ephemeral (content that is short-lived) or evergreen (content that can be recommended long after its initial discovery).
Tf-idf
As explained in the previous post, the tf-idf …