Articles with the R tag

Oct 07 · Reports

Interest rates on P2P loans

In this post I will use linear regression to model the process that determines the interest rate on peer-to-peer loans provided by the Lending Club. Like other peer-to-peer services, the Lending Club aims to directly connect producers and consumers, or in this case borrowers and lenders, by cutting out the middleman. Borrowers apply for loans online and provide details about the desired loan as well as their financial status (such as their FICO score). Lenders use the information provided to choose which loans to invest in. Finally, the Lending Club uses a proprietary algorithm to determine the interest charged on an applicant …

Sep 15 · Data Posts

Ideological Twitter communities

My current academic work revolves around the interactional “autonomy” of ideological communities in social networks. As part of my investigation I sometimes come across interesting little factoids. For example, I have been looking at the interaction between communities formed around the major political parties in Spain and the most important media outlets (newspapers and TV stations). One observation I thought was interesting has to do with the bias in media consumption exhibited by different communities. Without going into detail here about how I identified the individual communities, here is a graph showing the inequality in retweet activity exhibited …

Sep 13 · Data Posts

Exploration of voopter airfare data

I’ve recently started working as a data science freelancer for voopter.com.br, helping them analyze the data generated by airfare searches on their website. Voopter is a metasearch engine for flights to and from Brazil. The first thing I did was to create an interactive dashboard in R and Shiny for some exploratory statistics on the millions of searches performed by users of their website (which has already led to more specific business-driven questions).

The dashboard provides a quick and easy way to filter and aggregate the data, which is stored in an SQL database. The idea is …

Nov 06 · Reports

Categorisation of inertial activity data

The ubiquity of mobile phones equipped with a wide range of sensors presents interesting opportunities for data mining applications. In this report we aim to find out whether data from accelerometers and gyroscopes can be used to identify physical activities performed by subjects wearing mobile phones on their waist.

Methods

The data used in this analysis is based on the “Human activity recognition using smartphones” data set available from the UCI Machine Learning Repository [1]. A preprocessed version was downloaded from the Data Analysis online course [2]. The set contains data derived from 3-axial linear acceleration and 3-axial angular velocity …

Oct 23 · Reports

Titanic survival prediction

In this report I will provide an overview of my solution to Kaggle’s “Titanic” competition. The aim of this competition is to predict the survival of passengers aboard the Titanic using information such as a passenger’s gender, age or socio-economic status. I will explain my data munging process, explore the available predictor variables, and compare a number of different classification algorithms in terms of their prediction performance. All analysis presented here was performed in R. The corresponding source code is available on GitHub.

Data munging

The data set provided by Kaggle contains 1309 records of passengers aboard the …