Articles with the hadoop tag

Oct 07 · Data Posts

Reading from distributed cache in Hadoop

The distributed cache can be used to make small files (or jars etc.) available locally on each node to MapReduce tasks. This is useful, for example, when all mappers need a global stopword list for index creation. Here are two correct ways of reading a file from the distributed cache in Hadoop 2. The mechanism changed in the new API, and very few books and tutorials have updated examples.
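To make the stopword use case concrete: once a cached file is available locally on a node, loading it is ordinary Java I/O. The sketch below is only an illustration under stated assumptions. `StopwordLoader` and `load` are hypothetical names that are not part of Hadoop, and the file is assumed to hold one stopword per line in UTF-8. In a real job a helper like this would typically be called from the mapper's `setup` method with the local name of the cached file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: loads a stopword list from a local file.
final class StopwordLoader {

    // Reads one stopword per line, trimming whitespace, lower-casing,
    // and skipping blank lines. Assumes the file is UTF-8 encoded.
    static Set<String> load(String localPath) throws IOException {
        Set<String> stopwords = new HashSet<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    stopwords.add(word);
                }
            }
        }
        return stopwords;
    }

    // Optional command-line check: pass the path of a local stopword file.
    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(load(args[0]).size() + " stopwords loaded");
        }
    }
}
```

Keeping the list in a `Set` gives constant-time lookups per token, which matters because the check runs once for every word a mapper sees.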

Named File

In the driver:

// The "#filelabel" fragment names the symlink created in each task's working directory.
Job job = Job.getInstance(new Configuration());
job.addCacheFile(new URI("/path/to/file.csv#filelabel"));

In the mapper:

@Override
public void setup(Context context …
Sep 15 · Data Posts

Ideological Twitter communities

My current academic work revolves around the interactional “autonomy” of ideological communities in social networks. As part of my investigation I sometimes come across interesting little factoids. For example, I have been looking at the interaction between the communities formed around the major political parties in Spain and the most important media outlets (newspapers and TV stations). One observation I thought was interesting has to do with the bias in media consumption exhibited by different communities. Without going into detail here about how I identified the individual communities, here is a graph showing the inequality in retweet activity exhibited …