Oct 07 · Data Posts

Reading from distributed cache in Hadoop

The distributed cache makes small files (or jars, etc.) available locally on each node to MapReduce tasks. This is useful, for example, when every mapper needs a global stopword list for index creation. Below are two working ways to read a file from the distributed cache in Hadoop 2. The mechanism changed with the new API, and few books and tutorials have updated their examples.

Named File

In the driver:

Job job = Job.getInstance(new Configuration());
// The fragment after '#' is the label under which the file is exposed on each node.
job.addCacheFile(new URI("/path/to/file.csv" + "#filelabel"));
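
To make the role of the "#filelabel" part concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It shows how such a URI splits into a path and a fragment, and how a file could then be read by that label from a directory. The class CacheUriSketch and its two helper methods are hypothetical names made up for this illustration. The temporary directory only stands in for the task's working directory, where Hadoop 2 is expected to link the cached file under the fragment name.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only; not part of Hadoop.
final class CacheUriSketch {

    // Builds the kind of URI passed to job.addCacheFile: the file path plus "#label".
    static URI cacheUri(String path, String label) throws URISyntaxException {
        return new URI(path + "#" + label);
    }

    // Reads one entry per non-empty line from the file named `label` inside `dir`.
    // Assumption: `dir` plays the role of the task's working directory.
    static Set<String> readLabelledFile(Path dir, String label) throws IOException {
        Set<String> entries = new HashSet<>();
        for (String line : Files.readAllLines(dir.resolve(label), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        URI uri = cacheUri("/path/to/file.csv", "filelabel");
        System.out.println(uri.getPath());      // prints "/path/to/file.csv"
        System.out.println(uri.getFragment());  // prints "filelabel"

        // Simulate the node-local view: a file that is reachable under the label.
        Path dir = Files.createTempDirectory("cache-sketch");
        Files.write(dir.resolve(uri.getFragment()),
                "the\nand\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLabelledFile(dir, uri.getFragment()).size()); // prints 2
    }
}
```

The label is independent of the original file name, so the reading code does not need to know where the file lived before it was cached.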

In the mapper:

@Override
public void setup(Context context …