I am curious whether a MapReduce job uses multiple threads on a single machine. For example, I have 10 servers in the Hadoop cluster; by default, if the input file is large enough, there will be 10 mappers. Is a single mapper using multiple threads on a single machine?
Is a single mapper using multiple threads on a single machine?
Yes. A MapReduce job can use a multithreaded mapper (multiple threads, or a thread pool, running the map method).
I have used it for better CPU utilization in map-only HBase jobs.
MultithreadedMapper is a good fit if your operation is highly CPU intensive; it can increase speed.
The mapper class should extend org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of the regular org.apache.hadoop.mapreduce.Mapper.
MultithreadedMapper has a different implementation of the run() method, shown below:
run(org.apache.hadoop.mapreduce.Mapper.Context context)
Run the application's maps using a thread pool.
You can set the number of threads within a mapper via MultithreadedMapper.setNumberOfThreads(n), or you can set the property mapred.map.multithreadedrunner.threads = n when loading from a property file, and use the setter method (on a per-job basis) to control jobs that are less CPU intensive.
You can see the effect of doing this in the MapReduce counters, especially the CPU-related counters.
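For illustration, a minimal driver-side sketch of both approaches (the property key shown is the one quoted above; newer Hadoop releases expose it as mapreduce.mapper.multithreadedmapper.threads, so check your version):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class ThreadCountConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Option 1: set the configuration property (the key name varies across Hadoop versions)
        conf.setInt("mapred.map.multithreadedrunner.threads", 8);
        Job job = Job.getInstance(conf, "cpu-heavy-job");
        // Option 2: set it per job through the MultithreadedMapper helper
        MultithreadedMapper.setNumberOfThreads(job, 8);
    }
}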
Example code snippet of a MultithreadedMapper implementation:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
import java.util.regex.Pattern;
public class MultithreadedWordCount {
// class should be thread safe
public static class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
public static enum PREPOST { SETUP, CLEANUP }
@Override
protected void setup(Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws java.io.IOException, java.lang.InterruptedException {
// will be called several times
context.getCounter(PREPOST.SETUP).increment(1);
}
@Override
protected void map(LongWritable key, Text value,
Context context) throws IOException, InterruptedException {
String[] words = value.toString().toLowerCase().split("[\\p{Blank}[\\p{Punct}]]+");
for (String word : words) {
context.write(new Text(word), new LongWritable(1));
}
}
@Override
protected void cleanup(Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws java.io.IOException, InterruptedException {
// will be called several times
context.getCounter(PREPOST.CLEANUP).increment(1);
}
}
public static class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context
) throws IOException, InterruptedException {
long sum = 0;
for (LongWritable value: values) {
sum += value.get();
}
context.write(key, new LongWritable(sum));
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MultithreadedWordCount.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultithreadedMapper.setMapperClass(job, MultithreadedWordCount.WordCountMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 10);
job.setMapperClass(MultithreadedMapper.class);
job.setCombinerClass(MultithreadedWordCount.WordCountReducer.class);
job.setReducerClass(MultithreadedWordCount.WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
/* begin defaults */
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
/* end defaults */
job.waitForCompletion(true);
}
}
Please refer to https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/Mapper.html :
Applications may override the run(Context) method to exert greater control on map processing e.g. multi-threaded Mappers etc.
Moreover, there is also a MultithreadedMapper, though I have never used it myself.
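For reference, the default Mapper.run(Context) is just a sequential loop over the input split; this is the method a multi-threaded mapper overrides (a sketch paraphrasing the Hadoop source):
@Override
public void run(Context context) throws IOException, InterruptedException {
    // Default behavior: setup once, call map() for each key/value pair in order, cleanup once.
    // MultithreadedMapper replaces this loop with one that feeds a pool of mapper threads.
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}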
Related
We are currently exploring Apache Spark (with Hadoop) for performing large-scale data transformation (in Java).
We are using the new-looking (and experimental) DataSourceV2 interfaces to build our custom
output data files. A component of this is an implementation of the org.apache.spark.sql.sources.v2.writer.DataWriter
interface. It all works beautifully, except for one problem:
The org.apache.spark.sql.sources.v2.writer.DataWriter.write(record) method is often (but not always)
called twice for the same input record.
Here is what I hope is enough code for you to get the gist of what we're doing:
Basically we have many large sets of input data that we land via a Spark application
into Hadoop tables using code that looks something like:
final Dataset<Row> jdbcTableDataset = sparkSession.read()
.format("jdbc")
.option("url", sqlServerUrl)
.option("dbtable", tableName)
.option("user", jdbcUser)
.option("password", jdbcPassword)
.load();
final DataFrameWriter<Row> dataFrameWriter = jdbcTableDataset.write();
dataFrameWriter.save(hdfsDestination + "/" + tableName);
There are roughly fifty of these tables, for what it is worth. I know that there are no duplicates
in the data because the dataset's count() and distinct().count()
return the same value.
The transformation process involves performing join operations on these tables and writing
the result to files in the (shared) file system in a custom format. The resulting rows contain a unique key,
a dataGroup column, a dataSubGroup column and about 40 other columns. The selected records are
ordered by dataGroup, dataSubGroup and key.
Each output file is distinguished by the dataGroup column, which is used to partition the write operation:
final Dataset<Row> selectedData = dataSelector.selectData();
selectedData
.write()
.partitionBy("dataGroup")
.format("au.com.mycompany.myformat.DefaultSource")
.save("/path/to/shared/directory/");
To give you an idea of the scale, the resulting selected data consists of fifty to sixty million
records, unevenly split between roughly 3000 dataGroup files. Large, but not enormous.
The partitionBy("dataGroup") neatly ensures that each dataGroup file is processed by a
single executor. So far so good.
My data source implements the new-looking (and experimental) DataSourceV2 interface:
package au.com.mycompany.myformat;
import java.io.Serializable;
import java.util.Optional;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.WriteSupport;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class DefaultSource implements DataSourceRegister, WriteSupport, Serializable {
private static final Logger logger = LoggerFactory.getLogger(DefaultSource.class);
public DefaultSource() {
logger.info("created");
}
@Override
public String shortName() {
logger.info("shortName");
return "myformat";
}
@Override
public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
return Optional.of(new MyFormatSourceWriter(writeUUID, schema, mode, options));
}
}
There's a DataSourceWriter implementation:
public class MyFormatSourceWriter implements DataSourceWriter, Serializable {
...
}
and a DataWriterFactory implementation:
public class MyDataWriterFactory implements DataWriterFactory<InternalRow> {
...
}
and finally a DataWriter implementation. It seems that a DataWriter is created and sent to
each executor. Therefore each DataWriter will process many of the dataGroups.
Each record has a unique key column.
public class MyDataWriter implements DataWriter<InternalRow>, Serializable {
private static final Logger logger = LoggerFactory.getLogger(MyDataWriter.class);
...
MyDataWriter(File buildDirectory, StructType schema, int partitionId) {
this.buildDirectory = buildDirectory;
this.schema = schema;
this.partitionId = partitionId;
logger.debug("Created MyDataWriter for partition {}", partitionId);
}
private String getFieldByName(InternalRow row, String fieldName) {
return Optional.ofNullable(row.getUTF8String(schema.fieldIndex(fieldName)))
.orElse(UTF8String.EMPTY_UTF8)
.toString();
}
/**
* Rows are written here. Each row has a unique key column as well as a dataGroup
* column. Right now we are frequently getting called with the same record twice.
*/
@Override
public void write(InternalRow record) throws IOException {
String nextDataFileName = getFieldByName(record, "dataGroup") + ".myExt";
// some non-trivial logic for determining the right output file
...
// write the output record
outputWriter.append(getFieldByName(record, "key")).append(',')
.append(getFieldByName(record, "prodDate")).append(',')
.append(getFieldByName(record, "nation")).append(',')
.append(getFieldByName(record, "plant")).append(',')
...
}
@Override
public WriterCommitMessage commit() throws IOException {
...
outputWriter.close();
...
logger.debug("Committed partition {} with {} data files for zip file {} for a total of {} zip files",
partitionId, dataFileCount, dataFileName, dataFileCount);
return new MyWriterCommitMessage(partitionId, dataFileCount);
}
@Override
public void abort() throws IOException {
logger.error("Failed to collect data for schema: {}", schema);
...
}
}
Right now I'm working around this by keeping track of the last key that was processed and ignoring
duplicates.
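A minimal sketch of that workaround, slotted into the MyDataWriter class above ("key" is the unique key column; it assumes the duplicate call arrives immediately after the original, which matches what we observe):
private String lastWrittenKey = null;

@Override
public void write(InternalRow record) throws IOException {
    String key = getFieldByName(record, "key");
    if (key.equals(lastWrittenKey)) {
        return; // duplicate delivery of the record we just wrote
    }
    lastWrittenKey = key;
    // ... the existing file-selection and output logic from write() above ...
}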
package org.apache.spark.examples.kafkaToflink;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import com.microsoft.azure.datalake.store.ADLException;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;
public class App {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.72:9092");
properties.setProperty("group.id", "test");
DataStream<String> stream = env.addSource(
new FlinkKafkaConsumer010<String>("tenant", new SimpleStringSchema(), properties), "Kafka_Source");
stream.addSink(new ADLSink()).name("Custom_Sink").setParallelism(128);
env.execute("App");
}
}
class ADLSink extends RichSinkFunction<String> {
private String clientId = "***********";
private String authTokenEndpoint = "***************";
private String clientKey = "*****************";
private String accountFQDN = "****************";
private String filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK.json";
@Override
public void invoke(String value) {
AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
ADLStoreClient client = ADLStoreClient.createClient(accountFQDN, provider);
try {
client.setPermission(filename, "744");
ADLFileOutputStream stream = client.getAppendStream(filename);
System.out.println(value);
stream.write(value.getBytes());
stream.close();
} catch (ADLException e) {
System.out.println(e.requestId);
} catch (Exception e) {
System.out.println(e.getMessage());
System.out.println(e.getCause());
}
}
}
I am continuously trying to append to a file in Azure Data Lake Store using a while loop, but sometimes it fails with "Operation APPEND failed with HTTP500", either right at the start or sometimes after 10 minutes. I am using Java.
Anubhav, Azure Data Lake streams are single-writer streams - i.e., you cannot write to the same stream from multiple threads, unless you do some form of synchronization between these threads. This is because each write specifies the offset it is writing to, and with multiple threads, the offsets are not consistent.
You seem to be writing from multiple threads (the .setParallelism(128) call in your code).
In your case, you have two choices:
Write to a different file in each thread. I do not know your use case, but we have found that for a lot of cases that is the natural use of different threads - to write to different files (see the sketch after this list).
If it is important to have all the threads write to the same file, then you will need to refactor the sink a little so that all the instances have a reference to the same ADLFileOutputStream, and you will need to make sure the calls to write() and close() are synchronized.
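For option 1, here is a minimal sketch of a sink where each parallel subtask appends to its own file. The credentials and base path mirror the question; the per-subtask file name and the one-time client setup in open() are the only new ideas, and it keeps the single-argument invoke() used in the question:
import java.nio.charset.StandardCharsets;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;

class PerSubtaskADLSink extends RichSinkFunction<String> {
    private final String clientId = "***********";
    private final String authTokenEndpoint = "***************";
    private final String clientKey = "*****************";
    private final String accountFQDN = "****************";

    private transient ADLStoreClient client;
    private String filename;

    @Override
    public void open(Configuration parameters) {
        // Build the client once per subtask instead of once per record.
        AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
        client = ADLStoreClient.createClient(accountFQDN, provider);
        // One file per parallel subtask, so no two writers ever share a stream.
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK-" + subtask + ".json";
    }

    @Override
    public void invoke(String value) throws Exception {
        // Append and close; only this subtask ever touches this file.
        ADLFileOutputStream stream = client.getAppendStream(filename);
        stream.write(value.getBytes(StandardCharsets.UTF_8));
        stream.close();
    }
}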
Now, there is one more issue here - the error you got should have been an HTTP 4xx error (indicating a lease conflict, since ADLFileOutputStream acquires a lease), rather than HTTP 500, which says there was a server-side problem. To troubleshoot that, I will need to know your account name and time of access. That info is not safe to share on StackOverflow, so please open a support ticket for that and reference this SO question, so the issue eventually gets routed to me.
The Spring Batch docs say of the Map-backed job repository:
Note that the in-memory repository is volatile and so does not allow restart between JVM instances. It also cannot guarantee that two job instances with the same parameters are launched simultaneously, and is not suitable for use in a multi-threaded Job, or a locally partitioned Step. So use the database version of the repository wherever you need those features.
I would like to use a Map job repository, and I do not care about restarting, prevention of concurrent job executions, etc. but I do care about being able to use multi-threading and local partitioning.
My batch application has some partitioned steps, and at first glance it seems to run just fine with a Map-backed job repository.
Why is it said to be not possible with MapJobRepositoryFactoryBean? Looking at the implementation of the Map DAOs, they use ConcurrentHashMap. Is that not thread-safe?
I would advise you to follow the documentation rather than relying on implementation details. Even if the maps are individually thread-safe, there might be race conditions in changes that involve more than one of these maps.
You can use an in-memory database very easily. Example:
@Grapes([
@Grab('org.springframework:spring-jdbc:4.0.5.RELEASE'),
@Grab('com.h2database:h2:1.3.175'),
@Grab('org.springframework.batch:spring-batch-core:3.0.6.RELEASE'),
// must be passed with -cp, for whatever reason the GroovyClassLoader
// is not used for com.thoughtworks.xstream.io.json.JettisonMappedXmlDriver
//@Grab('org.codehaus.jettison:jettison:1.2'),
])
import org.h2.jdbcx.JdbcDataSource
import org.springframework.batch.core.Job
import org.springframework.batch.core.JobParameters
import org.springframework.batch.core.Step
import org.springframework.batch.core.StepContribution
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory
import org.springframework.batch.core.launch.JobLauncher
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.context.annotation.AnnotationConfigApplicationContext
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.core.io.ResourceLoader
import org.springframework.jdbc.datasource.init.DatabasePopulatorUtils
import org.springframework.jdbc.datasource.init.ResourceDatabasePopulator
import javax.annotation.PostConstruct
import javax.sql.DataSource
@Configuration
@EnableBatchProcessing
class AppConfig {
@Autowired
private JobBuilderFactory jobs
@Autowired
private StepBuilderFactory steps
@Bean
public Job job() {
return jobs.get("myJob").start(step1()).build()
}
@Bean
Step step1() {
this.steps.get('step1')
.tasklet(new MyTasklet())
.build()
}
@Bean
DataSource dataSource() {
new JdbcDataSource().with {
url = 'jdbc:h2:mem:temp_db;DB_CLOSE_DELAY=-1'
user = 'sa'
password = 'sa'
it
}
}
@Bean
BatchSchemaPopulator batchSchemaPopulator() {
new BatchSchemaPopulator()
}
}
class BatchSchemaPopulator {
@Autowired
ResourceLoader resourceLoader
@Autowired
DataSource dataSource
@PostConstruct
void init() {
def populator = new ResourceDatabasePopulator()
populator.addScript(
resourceLoader.getResource(
'classpath:/org/springframework/batch/core/schema-h2.sql'))
DatabasePopulatorUtils.execute populator, dataSource
}
}
class MyTasklet implements Tasklet {
@Override
RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
println 'TEST!'
return RepeatStatus.FINISHED
}
}
def ctx = new AnnotationConfigApplicationContext(AppConfig)
def launcher = ctx.getBean(JobLauncher)
def jobExecution = launcher.run(ctx.getBean(Job), new JobParameters([:]))
println "Status is: ${jobExecution.status}"
I am writing a graphics program; however, I am having trouble starting the thread from main and getting it to run in the separate class. My main class looks like the following:
package shooter;
import java.awt.Color;
import javax.swing.JFrame;
public class ShooterRunner extends JFrame {
public ShooterRunner()
{
super("Shooter Game");
setSize(600,600);
setBackground(Color.black);
setResizable(false);
setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
setLocationRelativeTo(null);
getContentPane().add(new Shooter());
setFocusable(true);
setVisible(true);
}
public static void main(String[] args) {
ShooterRunner test = new ShooterRunner();
Thread thread = new Thread(new Shooter());
}
}
I am aware that I am creating a new instance of Shooter when I create the thread; however, I get an error if I try to create a Shooter object above and use it both for getContentPane().add() and as the parameter when creating the thread.
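For illustration, a minimal sketch of sharing one Shooter instance between the content pane and the thread (this assumes Shooter extends a Swing component such as JPanel and implements Runnable, neither of which is shown here):
public ShooterRunner() {
    super("Shooter Game");
    setSize(600, 600);
    setBackground(Color.black);
    setResizable(false);
    setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    setLocationRelativeTo(null);
    Shooter shooter = new Shooter();  // one instance shared by the frame and the thread
    getContentPane().add(shooter);    // displayed in the frame...
    new Thread(shooter).start();      // ...and driven by its own thread
    setFocusable(true);
    setVisible(true);
}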
On trying to monitor object sizes, String objects are not shown in the profiled results.
Can anyone tell me what the procedure is to make them show up in the results?
public class calling extends called {
called myobj3 = new called();
called myobj4 = new called();
public void function() {
myobj3.d="Plz";
myobj3.e="Help";
myobj4.d="Thank";
myobj4.e="You";
myobj3.act();
myobj4.act();
}
public static void main(String [] args) {
System.out.println("calls an object from called.java");
calling obj = new calling();
obj.function();
}
}
class called {
public String d;
public String e;
public void act() {
System.out.println(d+e);
}
}
memory profile Retained size Shallow Size
[Unreachable] called 40 40
[Unreachable] called 40 40
Perhaps your objects have already been collected or scheduled for collection (unreachable). At which point do you capture the memory snapshot?
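For example, one way to check is to keep the instances reachable and pause the program while you capture the snapshot (a sketch against the code above):
public static void main(String[] args) throws Exception {
    System.out.println("calls an object from called.java");
    calling obj = new calling();
    obj.function();
    // Pause here and capture the memory snapshot from the profiler UI;
    // obj and the called instances it references are still strongly reachable.
    System.in.read();
    obj.function(); // keeps obj reachable past the pause
}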
Disclaimer: I'm a YourKit developer.