Oct 07 · Data Posts

Reading from distributed cache in Hadoop

The distributed cache can be used to make small files (or jars etc.) available locally to MapReduce tasks on each node. This can be useful, for example, when a global stopword list is needed by all mappers for index creation. Here are two correct ways of reading a file from the distributed cache in Hadoop 2. This has changed in the new API, and very few books and tutorials have updated their examples.

Named File

In the driver:

Job job = Job.getInstance(new Configuration());
job.addCacheFile(new URI("/path/to/file.csv" + "#filelabel"));

In the mapper:

@Override
public void setup(Context context …
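
The mapper code is cut off above. As an aside not taken from the post (whose examples use the Java API), the same cache mechanism is available to Hadoop Streaming jobs, where the #filelabel fragment becomes a symlink in each task's working directory. A minimal Python mapper sketch for the stopword example:

#!/usr/bin/env python
# mapper.py -- submitted along the lines of:
#   hadoop jar hadoop-streaming.jar -files /path/to/file.csv#filelabel \
#       -mapper mapper.py -input in -output out
import sys

# the cached file appears under its label in the task's working directory
with open('filelabel') as f:
    stopwords = set(line.strip().lower() for line in f)

for line in sys.stdin:
    for word in line.strip().split():
        if word.lower() not in stopwords:
            print('%s\t%d' % (word, 1))
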
Oct 07 · Reports

Interest rates on P2P loans

In this post I will use linear regression to model the process that determines the interest rate on peer-to-peer loans provided by the Lending Club. Like other peer-to-peer services, the Lending Club aims to connect producers and consumers directly, in this case borrowers and lenders, by cutting out the middleman. Borrowers apply for loans online and provide details about the desired loan as well as their financial status (such as their FICO score). Lenders use the information provided to choose which loans to invest in. The Lending Club, finally, uses a proprietary algorithm to determine the interest charged on an applicant …
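
To make the regression setup concrete, here is a minimal sketch on synthetic data (the variables and coefficients are invented; the real Lending Club data has many more fields):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
fico = rng.uniform(640, 830, n)        # borrower credit score
amount = rng.uniform(1000, 35000, n)   # requested loan amount
# synthetic ground truth: higher FICO scores get lower rates
rate = 25.0 - 0.02 * fico + 1e-5 * amount + rng.normal(0, 0.5, n)

X = np.column_stack([fico, amount])
model = LinearRegression().fit(X, rate)
print(model.coef_, model.intercept_)   # roughly recovers the coefficients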

Sep 15 · Data Posts

Ideological twitter communities

My current academic work revolves around the interactional “autonomy” of ideological communities in social networks. As part of my investigation I sometimes come across interesting little factoids. For example, I have been looking at the interaction of communities formed around the major political parties in Spain and the most important media outlets (newspapers and TV stations). One observation I found interesting has to do with the bias in media consumption exhibited by different communities. Without going into detail here about how I identified the individual communities, here is a graph showing the inequality in retweet activity exhibited …

Sep 13 · Data Posts

Exploration of voopter airfare data

I’ve recently started working as a data science freelancer for voopter.com.br, helping them analyze the data generated by airfare searches on their website. Voopter is a metasearch engine for flights from and to Brazil. The first thing I did was to create an interactive dashboard in R and shiny for some exploratory statistics of the millions of searches performed by users of their website (which has already led to more specific business-driven questions).

The dashboard provides a quick and easy way to filter and aggregate the data, which is stored in an SQL database. The idea is …
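
The dashboard itself is written in R and shiny, but the filter-and-aggregate pattern behind it is easy to sketch in Python with an invented toy schema (the real search log differs):

import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
pd.DataFrame({
    'origin':      ['GRU', 'GRU', 'GIG', 'GIG'],
    'destination': ['LIS', 'MIA', 'GRU', 'LIS'],
    'searches':    [120, 80, 60, 45],
}).to_sql('searches', con, index=False)

# aggregate search volume per route, most popular routes first
top = pd.read_sql_query(
    'SELECT origin, destination, SUM(searches) AS n '
    'FROM searches GROUP BY origin, destination ORDER BY n DESC', con)
print(top)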

Jul 20 · Data Posts

Elegans now features PubMed search

I’ve added a new PubMed search feature to Elegans, the visual worm brain explorer. The idea is to show the network of C. elegans neurons that are mentioned in at least n papers on PubMed, in the context of a given search query. So, for example, if one is interested in the worm’s chemotaxis behaviour, one would type in ‘chemotaxis’ and choose the citation threshold n. Initiating the search will then return the neurons that are mentioned in at least n papers along with the word ‘chemotaxis’. The search is in fact performed once for each neuron …
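
A sketch of how such per-neuron co-mention counts can be obtained from PubMed, here via Biopython's Entrez module (not the post's actual implementation; the neuron names and threshold are illustrative):

from Bio import Entrez

Entrez.email = 'you@example.com'  # NCBI asks clients for a contact address

def comention_count(neuron, query):
    """Number of PubMed records mentioning both terms."""
    handle = Entrez.esearch(db='pubmed', term='%s AND %s' % (neuron, query))
    return int(Entrez.read(handle)['Count'])

n = 3  # citation threshold
neurons = ['AWA', 'AWB', 'AWC']  # a few chemosensory neurons
counts = {name: comention_count(name, 'chemotaxis') for name in neurons}
print({name: c for name, c in counts.items() if c >= n})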

Jun 22 · Data Posts

Analyzing tf-idf results in scikit-learn

In a previous post I showed how to create text-processing pipelines for machine learning in python using scikit-learn. The core of such pipelines is in many cases the vectorization of text using the tf-idf transformation. In this post I will show some ways of analysing and making sense of the results of a tf-idf transformation. As an example I will use the same kaggle dataset, namely webpages provided and classified by StumbleUpon as either ephemeral (content that is short-lived) or evergreen (content that can be recommended long after its initial discovery).
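
As a preview of the kind of analysis discussed below, here is a self-contained sketch that ranks each document's terms by tf-idf weight (a toy corpus, not the kaggle data):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog chased the cat',
        'dogs and cats make good pets']
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

# show the three highest-weighted terms per document
for i, doc in enumerate(docs):
    weights = X[i].toarray().ravel()
    print(doc, '->', terms[weights.argsort()[::-1][:3]])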

Tf-idf

As explained in the previous post, the tf-idf …

Jun 17 · Data Posts

Pipelines for text classification in scikit-learn

Scikit-learn’s pipelines provide a useful layer of abstraction for building complex estimators or classification models. A pipeline aggregates a number of data transformation steps, and a model operating on the result of these transformations, into a single object that can then be used in place of a simple estimator. This allows for the one-off definition of complex pipelines that can be re-used, for example, in cross-validation functions, grid searches, learning curves and so on. I will illustrate their use, and some pitfalls, in the context of a kaggle text-classification challenge.
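
For a first flavour, a minimal sketch with toy data (import paths as in recent scikit-learn versions; the post itself uses the kaggle data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LogisticRegression())])

texts = ['cheap meds now', 'meeting at noon', 'win cash fast',
         'lunch tomorrow?', 'free cash offer', 'project update']
labels = [1, 0, 1, 0, 1, 0]

# the whole pipeline is cross-validated as if it were a single estimator
print(cross_val_score(pipe, texts, labels, cv=2))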

StumbleUpon Evergreen

The challenge

The goal in the StumbleUpon Evergreen …

Feb 02 · Data Posts

SQL to Excel

A little python tool to execute an SQL script (PostgreSQL in this case, but it should be easily modifiable for MySQL etc.) and store the result in a CSV or Excel (xls) file:

"""
Executes an sql script and stores the result in a file.
"""

import os, sys
import subprocess
import csv
from xlwt import Workbook


def sql_to_csv(sql_fnm, csv_fnm):
    """Write the result of executing an sql script to a csv file."""

    with open(sql_fnm, 'r') as sql_file:
        # wrap the query so that postgres streams the result out as csv;
        # the script must therefore contain a single SELECT statement
        query = sql_file.read()
        query = "COPY (" + query + ") TO STDOUT WITH CSV HEADER"
        cmd = 'psql -c "' + query + '"'
        print(cmd)

        # run psql through the shell and capture its csv output
        data = subprocess.check_output(cmd, shell=True)

    with open(csv_fnm, 'wb …
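
The excerpt cuts off before the Excel part, but since xlwt is imported above, the csv-to-xls step presumably looks something like this sketch (not the post's actual code):

import csv
from xlwt import Workbook

def csv_to_xls(csv_fnm, xls_fnm):
    """Copy a csv file cell by cell into an xls workbook."""
    book = Workbook()
    sheet = book.add_sheet('data')
    with open(csv_fnm) as csv_file:
        for r, row in enumerate(csv.reader(csv_file)):
            for c, value in enumerate(row):
                sheet.write(r, c, value)
    book.save(xls_fnm)
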
Jan 27 · Data Posts

Retrieving your Google Scholar data

For my interactive CV I decided to try not only to automate the creation of a bibliography of my publications, but also to extend it with a citation count for each paper, which Google Scholar happens to keep track of. Unfortunately there is no Scholar API. But I figured that since my own profile is based on data I essentially donated to Google, it is only fair that I should have access to it too. Hence I wrote a little scraper that iterates over the publications in my Scholar profile, extracts all citations, and bins them per year. That way I …
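
The binning step at the end is simple once the citation years have been scraped; a sketch on made-up data:

from collections import Counter

# hypothetical output of the scraping step: one year per citing paper
citation_years = [2010, 2011, 2011, 2012, 2012, 2012, 2013]
for year, count in sorted(Counter(citation_years).items()):
    print(year, count)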

Jan 27 · Data Posts

Tag graph plugin for Pelican

On my front page I display a sort of sitemap for my blog. Since the structure of the site is not very hierarchical, I decided to show pages and posts as a graph along with their tags. To do so, I created a mini plugin for the Pelican static blog engine. The plugin is essentially a callback that gets executed once the engine has generated all posts and pages from their markdown files. I then simply take the results and write them out in a json format that d3.js understands (a list of nodes and a list …
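
A skeleton of such a plugin, assuming Pelican's all_generated signal (which passes the list of generators) and inventing the node/link extraction:

import json
from pelican import signals

def write_tag_graph(generators):
    """Dump articles and their tags as a d3-style node/link list."""
    nodes, links = {}, []
    for gen in generators:
        for article in getattr(gen, 'articles', []):
            nodes[article.title] = {'id': article.title, 'kind': 'article'}
            for tag in getattr(article, 'tags', []):
                nodes[str(tag)] = {'id': str(tag), 'kind': 'tag'}
                links.append({'source': article.title, 'target': str(tag)})
    with open('tag_graph.json', 'w') as out:
        json.dump({'nodes': list(nodes.values()), 'links': links}, out)

def register():
    # entry point pelican calls when loading the plugin
    signals.all_generated.connect(write_tag_graph)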