Oct 07 · Data Posts

Reading from distributed cache in Hadoop

The distributed cache makes small files (or jars etc.) available locally on each node, where MapReduce tasks can read them. This is useful, for example, when all mappers need a global stopword list for index creation. Here are two working ways of reading a file from the distributed cache in Hadoop 2. The way to do this changed with the new API, and very few books and tutorials have updated examples.

Named File

In the driver:

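Job job = Job.getInstance(new Configuration());
// The fragment after '#' is the name under which the file is linked in each task's working directory
job.addCacheFile(new URI("/path/to/file.csv" + "#filelabel"));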

In the mapper:

@Override
public void setup(Context context) throws IOException, InterruptedException {
  URI[] cacheFiles = context.getCacheFiles();
  if (cacheFiles != null && cacheFiles.length > 0) {
    // "filelabel" is a local link in the task's working directory, so plain java.io is enough
    BufferedReader reader = new BufferedReader(new FileReader("filelabel"));
  }
}
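The label after the # in the driver URI is what the mapper opens by name. As a rough illustration of how that name relates to the URI, here is a plain-JDK sketch that needs no Hadoop on the classpath. The class CacheLinkName and its linkName method are hypothetical helpers written for this post, not part of Hadoop. The fallback to the file's own name when no fragment is given reflects my understanding of how Hadoop 2 on YARN names the link, so treat it as an assumption and verify it on your cluster.

```java
import java.net.URI;

// Hypothetical helper, not a Hadoop class: derives the local name a cached
// file would be opened under, given the URI passed to job.addCacheFile().
final class CacheLinkName {

    static String linkName(URI uri) {
        // The part after '#' is the explicit label, if one was given.
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // Assumption: without a label, the link carries the file's own name.
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName(URI.create("/path/to/file.csv#filelabel"))); // filelabel
        System.out.println(linkName(URI.create("/path/to/file.csv")));           // file.csv
    }
}
```

A helper like this could replace the hard-coded "filelabel" string in the mapper, for example new FileReader(CacheLinkName.linkName(cacheFiles[0])), so that driver and mapper cannot drift apart.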

File system

In the driver:

Job job = Job.getInstance(new Configuration());
job.addCacheFile(new URI("/path/to/file.csv"));

In the mapper:

@Override
public void setup(Context context) throws IOException, InterruptedException {
  URI[] cacheFiles = context.getCacheFiles();
  if (cacheFiles != null && cacheFiles.length > 0) {
    // Opens the cached URI through the configured file system (e.g. HDFS) instead of a local link
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path path = new Path(cacheFiles[0].toString());
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
  }
}
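Both variants end with a reader over the cached file. For the stopword use case mentioned at the top, the usual next step is to read that file once in setup() into an in-memory set that map() can query. The following is a minimal sketch using only the JDK. StopwordLoader is a hypothetical name, and the one-word-per-line format and the lowercasing are assumptions about the file, not something the distributed cache requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns a one-word-per-line source into a lookup set.
final class StopwordLoader {

    static Set<String> load(Reader source) {
        Set<String> stopwords = new HashSet<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {   // skip blank lines
                    stopwords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stopwords;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the cached file here.
        Set<String> stopwords = load(new StringReader("the\nAnd\n\n of \n"));
        System.out.println(stopwords.contains("and")); // true
        System.out.println(stopwords.size());          // 3
    }
}
```

In a mapper, the set would be kept in a field and filled from either reader shown above, e.g. stopwords = StopwordLoader.load(new FileReader("filelabel")). Because the helper closes the reader it is given, the stream does not stay open for the lifetime of the task.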
