Articles with the kaggle tag

Jun 22 · Data Posts

Analyzing tf-idf results in scikit-learn

In a previous post I showed how to create text-processing pipelines for machine learning in Python using scikit-learn. In many cases the core of such pipelines is the vectorization of text using the tf-idf transformation. In this post I will show some ways of analysing and making sense of the result of a tf-idf transformation. As an example I will use the same Kaggle dataset, namely webpages provided and classified by StumbleUpon as either ephemeral (content that is short-lived) or evergreen (content that can be recommended long after its initial discovery).

Tf-idf

As explained in the previous post, the tf-idf …

Jun 17 · Data Posts

Pipelines for text classification in scikit-learn

Scikit-learn’s pipelines provide a useful layer of abstraction for building complex estimators or classification models. Their purpose is to aggregate a number of data transformation steps, and a model operating on the result of these transformations, into a single object that can then be used in place of a simple estimator. This allows for the one-off definition of complex pipelines that can be re-used, for example, in cross-validation functions, grid searches, learning curves and so on. I will illustrate their use, and some pitfalls, in the context of a Kaggle text-classification challenge.
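
As a rough illustration of the idea described above, here is a minimal sketch of a pipeline that chains a tf-idf vectorizer and a classifier into one estimator and passes it to a cross-validation function. The toy corpus, the labels, the step names ("tfidf", "clf") and the choice of logistic regression are assumptions made for this example only; they are not taken from the post or from the StumbleUpon data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the StumbleUpon data): 1 = evergreen, 0 = ephemeral.
texts = [
    "classic recipe for homemade bread",
    "how to tie a bow tie step by step",
    "guide to growing tomatoes at home",
    "simple stretching exercises for beginners",
    "breaking news election results tonight",
    "live score updates from the match today",
    "weather forecast for this weekend",
    "celebrity spotted at premiere last night",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The transformation step and the model are combined into a single object.
# Step names and the classifier are illustrative choices.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# The pipeline is used in place of a simple estimator: within each fold the
# vectorizer is fitted on the training part only, then applied to the test part.
scores = cross_val_score(pipeline, texts, labels, cv=2)

# It can also be fitted and used for prediction like any other estimator.
pipeline.fit(texts, labels)
predictions = pipeline.predict(["easy recipe for tomato soup"])
```

Because the vectorizer lives inside the pipeline, it is re-fitted within every cross-validation fold, so no vocabulary or idf statistics from the held-out fold reach the model.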

StumbleUpon Evergreen

The challenge

The goal in the StumbleUpon Evergreen …

Oct 23 · Reports

Titanic survival prediction

In this report I will provide an overview of my solution to Kaggle’s “Titanic” competition. The aim of this competition is to predict the survival of passengers aboard the Titanic using information such as a passenger’s gender, age or socio-economic status. I will explain my data munging process, explore the available predictor variables, and compare a number of different classification algorithms in terms of their prediction performance. All analysis presented here was performed in R. The corresponding source code is available on GitHub.

Titanic

Data munging

The data set provided by Kaggle contains 1309 records of passengers aboard the …