The distributed cache makes small files (or jars, archives, etc.) available locally on every node to MapReduce tasks. This is useful, for example, when all mappers need a global stopword list during index creation. Here are two correct ways of reading a file from the distributed cache in Hadoop 2. The API changed between Hadoop 1 and Hadoop 2, and very few books and tutorials have updated their examples.
Named File
In the driver:
Job job = Job.getInstance(new Configuration());
// The #fragment names the symlink Hadoop creates in each task's working directory.
job.addCacheFile(new URI("/path/to/file.csv#filelabel"));
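For context, a complete driver might look roughly like the sketch below. The class names (StopwordDriver, StopwordMapper), the job name, and the use of command-line arguments for the input and output paths are illustrative assumptions, not part of the original snippet:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopwordDriver
{
    public static void main(String[] args) throws Exception
    {
        Job job = Job.getInstance(new Configuration(), "index with stopwords");
        job.setJarByClass(StopwordDriver.class);
        job.setMapperClass(StopwordMapper.class); // hypothetical mapper, sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Ship the stopword list to every node; #filelabel names the local symlink.
        job.addCacheFile(new URI("/path/to/file.csv#filelabel"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}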
In the mapper:
@Override
public void setup(Context context) throws IOException, InterruptedException
{
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0)
    {
        // "filelabel" is the symlink created from the #fragment in the driver.
        try (BufferedReader reader = new BufferedReader(new FileReader("filelabel")))
        {
            ...
        }
    }
}
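Putting this together with the stopword use case from the introduction, a full mapper could look like the following sketch. The class name, the word-count output, and the assumption that the cached file holds one stopword per line are all illustrative, not taken from the original code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopwordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private final Set<String> stopwords = new HashSet<String>();
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void setup(Context context) throws IOException, InterruptedException
    {
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0)
        {
            // "filelabel" is the symlink named by the #fragment in the driver.
            try (BufferedReader reader = new BufferedReader(new FileReader("filelabel")))
            {
                String line;
                while ((line = reader.readLine()) != null)
                {
                    // Assumes one stopword per line in the cached file.
                    stopwords.add(line.trim().toLowerCase());
                }
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        for (String token : value.toString().toLowerCase().split("\\s+"))
        {
            if (!token.isEmpty() && !stopwords.contains(token))
            {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}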
File System
In the driver:
Job job = Job.getInstance(new Configuration());
// No #fragment here; the mapper reads the file back through the FileSystem API instead.
job.addCacheFile(new URI("/path/to/file.csv"));
...
In the mapper:
@Override
public void setup(Context context) throws IOException, InterruptedException
{
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0)
    {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path path = new Path(cacheFiles[0].toString());
        // Open the cached file directly from the file system it lives on (e.g. HDFS).
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path))))
        {
            ...
        }
    }
}
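The named-file variant reads the node-local copy through an ordinary FileReader, while the file-system variant re-opens the original URI from the underlying file system, so the first is usually what you want for per-task lookups. As an aside, the file does not have to be added in driver code at all: if the driver implements Tool and is launched through ToolRunner, the generic -files option ships a local file into the distributed cache for you, symlinked under its base name in the task's working directory. A minimal sketch, with StopwordDriver again a hypothetical name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class StopwordDriver extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        // getConf() already carries the cache entries that -files registered.
        Job job = Job.getInstance(getConf(), "index with stopwords");
        // ... remaining job setup as in the driver above ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        System.exit(ToolRunner.run(new Configuration(), new StopwordDriver(), args));
    }
}

Launched as, for example:

hadoop jar index.jar StopwordDriver -files /local/path/to/file.csv input output

the mapper can then open the file by its base name, i.e. new FileReader("file.csv").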