Increment column value in Spark - apache-spark

I have a Spark Streaming job that fetches data from RabbitMQ and saves it into HBase. The save is an Increment operation. I'm using saveAsNewAPIHadoopDataset, but I keep getting the exception below.
Code:
pairDStream.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
    @Override
    public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
        Configuration dbConf = HBaseConfiguration.create();
        dbConf.set("hbase.table.namespace.mappings", "tablename:/mapr/tablename");
        Job jobConf = Job.getInstance(dbConf);
        jobConf.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tablename");
        jobConf.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);

        JavaPairRDD<ImmutableBytesWritable, Increment> hbasePuts = arg0.mapToPair(
            new PairFunction<Tuple2<String, Integer>, ImmutableBytesWritable, Increment>() {
                @Override
                public Tuple2<ImmutableBytesWritable, Increment> call(Tuple2<String, Integer> arg0) throws Exception {
                    String[] keys = arg0._1.split("_");
                    Increment inc = new Increment(Bytes.toBytes(keys[0]));
                    inc.addColumn(Bytes.toBytes("data"), Bytes.toBytes(keys[1]), arg0._2);
                    return new Tuple2<ImmutableBytesWritable, Increment>(new ImmutableBytesWritable(), inc);
                }
            });

        // save to HBase - Spark built-in API method
        hbasePuts.saveAsNewAPIHadoopDataset(jobConf.getConfiguration());
    }
});
Exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 100, dev-arc-app036.vega.cloud.ironport.com): java.io.IOException: Pass a Delete or a Put
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:128)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:87)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
Is it possible to use the saveAsNewAPIHadoopDataset method with an Increment rather than a Put?
Any help is greatly appreciated.
Thanks
Akhila.
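TableOutputFormat's TableRecordWriter only accepts Put and Delete mutations, which is exactly what the "Pass a Delete or a Put" IOException is saying, so saveAsNewAPIHadoopDataset cannot write Increment objects through it. One possible workaround (a sketch under assumptions, not a verified answer) is to skip the Hadoop output format and call the HBase client's increment() directly from foreachPartition. The fragment below assumes the standard HBase client API (ConnectionFactory, Table, TableName) and would replace the mapToPair/saveAsNewAPIHadoopDataset pair inside call():

arg0.foreachPartition(partition -> {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.table.namespace.mappings", "tablename:/mapr/tablename");
    // open one connection per partition rather than per record
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("tablename"))) {
        while (partition.hasNext()) {
            Tuple2<String, Integer> record = partition.next();
            String[] keys = record._1.split("_");
            Increment inc = new Increment(Bytes.toBytes(keys[0]));
            inc.addColumn(Bytes.toBytes("data"), Bytes.toBytes(keys[1]), record._2);
            table.increment(inc);   // atomic server-side increment
        }
    }
});

Each Increment is applied atomically on the region server, so this keeps the increment semantics while avoiding the Put/Delete restriction of TableOutputFormat.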

Related

Abort the driver immediately when any of the executors fails

While loading data into the database, if a particular column contains a bad record, or if one executor fails, the driver needs to be notified with a message and the job has to be terminated.
I thought of doing this with an accumulator. Please give me a suggestion on how to do this.
My code is attached below:
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("loadSqlData").master("local[*]").getOrCreate();

    Properties connectionProperties = new Properties();
    connectionProperties.put("user", "postgres");
    connectionProperties.put("password", "root");

    Dataset<Row> personcsvdata = spark.read().option("header", "true").csv("C:\\Users\\Manasa\\Documents\\nulldata.csv");
    personcsvdata.show();

    LongAccumulator countErrors = spark.sparkContext().longAccumulator();

    try {
        personcsvdata.write().mode(SaveMode.Append).jdbc("jdbc:postgresql://localhost:5432/postgres", "public.employee", connectionProperties);
        countErrors.add(1);
    } catch (Exception e) {
    }
}
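As a rough direction (a sketch under assumptions, not a verified answer): accumulators are updated by tasks running on the executors, so adding to countErrors on the driver after a successful write does not count bad records. One option is to flag bad rows inside a filter, let any task or executor failure surface as the exception thrown by the write action, and then stop the SparkSession from the driver. The sketch below assumes a bad record means a null "name" column, which is a hypothetical rule:

// needs: org.apache.spark.api.java.function.FilterFunction, org.apache.spark.util.LongAccumulator
LongAccumulator badRecords = spark.sparkContext().longAccumulator("badRecords");

Dataset<Row> validated = personcsvdata.filter((FilterFunction<Row>) row -> {
    boolean bad = row.isNullAt(row.fieldIndex("name"));   // hypothetical "bad record" rule
    if (bad) {
        badRecords.add(1);   // updated on the executors; may be re-applied if a task is retried
    }
    return !bad;
});

try {
    validated.write().mode(SaveMode.Append)
            .jdbc("jdbc:postgresql://localhost:5432/postgres", "public.employee", connectionProperties);
} catch (Exception e) {
    // a failed task/executor ultimately surfaces here on the driver once retries are exhausted
    System.err.println("Load failed: " + e.getMessage());
    spark.stop();   // terminate the job
    System.exit(1);
}

System.out.println(badRecords.value() + " bad records were skipped");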

adding Cassandra as sink in Flink error : All host(s) tried for query failed

I was following the example at https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/cassandra.html to connect Cassandra as a sink in Flink.
My code is shown below:
public class writeToCassandra {

    private static final String CREATE_KEYSPACE_QUERY = "CREATE KEYSPACE test WITH replication= {'class':'SimpleStrategy', 'replication_factor':1};";
    private static final String createTable = "CREATE TABLE test.cassandraData(id varchar, heart_rate varchar, PRIMARY KEY(id));";

    private final static Collection<String> collection = new ArrayList<>(50);

    static {
        for (int i = 1; i <= 50; ++i) {
            collection.add("element " + i);
        }
    }

    public static void main(String[] args) throws Exception {
        // setting the env variable to local
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);

        DataStream<Tuple2<String, String>> dataStream = environment
                .fromCollection(collection)
                .map(new MapFunction<String, Tuple2<String, String>>() {

                    final String mapped = " mapped ";
                    String[] splitted;

                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        splitted = s.split("\\s+");
                        return Tuple2.of(
                                UUID.randomUUID().toString(),
                                splitted[0] + mapped + splitted[1]
                        );
                    }
                });

        CassandraSink.addSink(dataStream)
                .setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
                .setHost("127.0.0.1")
                .build();

        environment.execute();
    } // main
} // writeToCassandra
I am getting the following error
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
Not sure if this is always required, but the way that I set up my CassandraSink is like this:
CassandraSink
    .addSink(dataStream)
    .setClusterBuilder(new ClusterBuilder() {
        @Override
        protected Cluster buildCluster(Cluster.Builder builder) {
            return Cluster.builder()
                    .addContactPoints(myListOfCassandraUrlsString.split(","))
                    .withPort(portNumber)
                    .build();
        }
    })
    .build();
I have annotated POJOs that are returned by the dataStream so I don't need the query, but you would just include ".setQuery(...)" after the ".addSink(...)" line.
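For a Tuple2 stream like the one in the question, that combination would look roughly like this (a sketch reusing the question's insert query; the contact points and port are placeholders):

CassandraSink.addSink(dataStream)
        .setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
        .setClusterBuilder(new ClusterBuilder() {
            @Override
            protected Cluster buildCluster(Cluster.Builder builder) {
                return builder
                        .addContactPoints("cassandra-host-1", "cassandra-host-2")   // placeholder hosts
                        .withPort(9042)
                        .build();
            }
        })
        .build();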
The exception simply indicates that the example program cannot reach the C* database.
The flink-cassandra-connector offers a streaming API to connect to a designated C* database, so you need to have a C* instance running and reachable.
Each streaming job is serialized and pushed to the node where the Task Manager runs. In your example, you are assuming C* runs on the same node as the TM. An alternative is to change the C* address from 127.0.0.1 to a publicly reachable address.
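Or, keeping the simpler setHost() style from the question, point the sink at an address that is actually reachable from the Task Manager (the host below is a placeholder):

CassandraSink.addSink(dataStream)
        .setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
        .setHost("10.0.0.12", 9042)   // a reachable C* node instead of 127.0.0.1
        .build();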

how to count number of items per second in spark streaming?

I get a JSON stream and I want to compute the number of items that have a status of "Pending" every second. How do I do that? I have the code below so far, and 1) I am not sure if it is correct, and 2) it returns a DStream, but my objective is to store a number every second to Cassandra or a queue (you can imagine there is a function public void store(Long number) {}).
// #1
jsonMessagesDStream
    .filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String v1) throws Exception {
            JsonParser parser = new JsonParser();
            JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
            if (jsonObj != null && jsonObj.has("status")) {
                return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
            }
            return false;
        }
    })
    .countByValue()
    .foreachRDD(new VoidFunction<JavaPairRDD<String, Long>>() {
        @Override
        public void call(JavaPairRDD<String, Long> stringLongJavaPairRDD) throws Exception {
            store(stringLongJavaPairRDD.count());
        }
    });
I tried the following, but it still didn't work since it prints zero all the time; not sure if it is right?
// #2
jsonMessagesDStream
    .filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String v1) throws Exception {
            JsonParser parser = new JsonParser();
            JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
            if (jsonObj != null && jsonObj.has("status")) {
                return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
            }
            return false;
        }
    })
    .foreachRDD(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> stringJavaRDD) throws Exception {
            store(stringJavaRDD.count());
        }
    });
Part of the log output:
16/09/10 17:51:39 INFO SparkContext: Starting job: count at Consumer.java:88
16/09/10 17:51:39 INFO DAGScheduler: Got job 17 (count at Consumer.java:88) with 4 output partitions
16/09/10 17:51:39 INFO DAGScheduler: Final stage: ResultStage 17 (count at Consumer.java:88)
16/09/10 17:51:39 INFO DAGScheduler: Parents of final stage: List()
16/09/10 17:51:39 INFO DAGScheduler: Missing parents: List()
16/09/10 17:51:39 INFO DAGScheduler: Submitting ResultStage 17 (MapPartitionsRDD[35] at filter at Consumer.java:72), which has no missing parents
BAR gets printed but not FOO
// Debug code
jsonMessagesDStream
    .filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String v1) throws Exception {
            System.out.println("****************FOO******************");
            JsonParser parser = new JsonParser();
            JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
            if (jsonObj != null && jsonObj.has("status")) {
                return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
            }
            return false;
        }
    })
    .foreachRDD(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> stringJavaRDD) throws Exception {
            System.out.println("*****************BAR******************");
            store(stringJavaRDD.count());
        }
    });
Since you have already filtered the result set, you could just do a count() on the DStream/RDD.
Also, I don't think you need windowing here if you are reading from the source every second. Windowing is needed when the micro-batch interval doesn't match the aggregation frequency. Are you looking at a micro-batch frequency of less than a second?
It returns me a Dstream but my objective is to store a number every second to cassandra or queue
The way Spark works is that it gives you a new DStream every time you do a computation on an existing DStream, so you can easily chain functions together. You should also be aware of the distinction between transformations and actions in Spark. Functions like filter(), count(), etc. are transformations, in the sense that they operate on a DStream and give a new DStream. But if you need side effects (like printing, pushing to a DB, etc.), you should be looking at Spark actions.
If you need to push a DStream to Cassandra, you should look at the Cassandra connectors, which expose functions (actions in Spark terminology) that you can use to push data into Cassandra.
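For example, with the DataStax spark-cassandra-connector Java API (a sketch; the keyspace, table, and the PendingCount bean whose fields mirror the table's columns are all assumptions):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;

// counts is a JavaDStream<PendingCount>, e.g. produced by mapping each per-window count to a bean
CassandraStreamingJavaUtil.javaFunctions(counts)
        .writerBuilder("mykeyspace", "pending_counts", mapToRow(PendingCount.class))
        .saveToCassandra();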
You can use a sliding window with a 1-second slide interval along with a reduceByKey, as long as the window and slide durations are multiples of the batch interval. Once you choose the 1-second slide interval, you will receive an event for the store call every second.
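A sketch of that idea using countByWindow (assuming the streaming batch interval is 1 second or a divisor of it, since window and slide durations must be multiples of the batch interval; store(Long) is the callback from the question):

JavaDStream<String> pending = jsonMessagesDStream.filter(msg -> {
    JsonObject obj = new JsonParser().parse(msg).getAsJsonObject();
    return obj.has("status") && obj.get("status").getAsString().equalsIgnoreCase("Pending");
});

pending.countByWindow(Durations.seconds(1), Durations.seconds(1))
        .foreachRDD(rdd -> {
            for (Long count : rdd.collect()) {   // one element per window
                store(count);
            }
        });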

java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported

I get the exception below when I try to create a DStream within a Function call in Spark.
My call method:
@Override
public JavaRDD<Object> call(JavaRDD<Object> v1) throws Exception {
    Queue<JavaRDD<Object>> queue = new LinkedList<>();
    queue.add(v1);
    JavaDStream<Object> dStream = context.queueStream(queue);
    JavaDStream<Object> newDStream = dStream.map(AbstractProcessor.this);
    final JavaRDD<Object> rdd = context.sparkContext().emptyRDD();
    newDStream.foreachRDD(new SaxFunction<JavaRDD<Object>, Void>() {

        private static final long serialVersionUID = 672054140484217234L;

        @Override
        public Void execute(JavaRDD<Object> object) throws Exception {
            rdd.union(object);
            return null;
        }
    });
    return rdd;
}
Exception:
Caused by: java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
at org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:220)
at org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:64)
at org.apache.spark.streaming.dstream.InputDStream.<init>(InputDStream.scala:42)
at org.apache.spark.streaming.dstream.QueueInputDStream.<init>(QueueInputDStream.scala:29)
at org.apache.spark.streaming.StreamingContext.queueStream(StreamingContext.scala:513)
at org.apache.spark.streaming.StreamingContext.queueStream(StreamingContext.scala:492)
at org.apache.spark.streaming.api.java.JavaStreamingContext.queueStream(JavaStreamingContext.scala:436)
Is there any way I can create a DStream and do operations on it at runtime, or update the DAG after the context has started?
Thanks in advance.
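As a rough direction (a sketch, not a verified answer): once the StreamingContext has started, no new inputs, transformations, or output operations can be registered, which is exactly what the exception says. If the goal is simply to run AbstractProcessor over the incoming RDD, the same map can be applied to the RDD directly, without creating a new queueStream:

@Override
public JavaRDD<Object> call(JavaRDD<Object> v1) throws Exception {
    // assumes AbstractProcessor implements org.apache.spark.api.java.function.Function<Object, Object>,
    // as implied by the dStream.map(AbstractProcessor.this) call in the question
    return v1.map(AbstractProcessor.this);
}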

Large data processing using Spring Batch Multi-threaded Step and RepositoryItemWriter/ RepositoryItemReader

I am trying to write a batch-processing application using Spring Batch with a multi-threaded step. It is a simple application that reads data from one table and writes to another, but the data is large, around 2 million records.
I am using RepositoryItemReader and RepositoryItemWriter for reading and writing the data, but after processing some of it the job fails with "Unable to acquire JDBC Connection".
// Config.java
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    taskExecutor.setConcurrencyLimit(10);
    return taskExecutor;
}

@Bean(name = "personJob")
public Job personKeeperJob() {
    Step step = stepBuilderFactory.get("step-1")
            .<User, Person> chunk(1000)
            .reader(userReader)
            .processor(jpaProcessor)
            .writer(personWriter)
            .taskExecutor(taskExecutor())
            .throttleLimit(10)
            .build();
    Job job = jobBuilderFactory.get("person-job")
            .incrementer(new RunIdIncrementer())
            .listener(this)
            .start(step)
            .build();
    return job;
}

// Processor.java
@Override
public Person process(User user) throws Exception {
    Optional<User> userFromDb = userRepo.findById(user.getUserId());
    Person person = new Person();
    if (userFromDb.isPresent()) {
        person.setName(userFromDb.get().getName());
        person.setUserId(userFromDb.get().getUserId());
        person.setDept(userFromDb.get().getDept());
    }
    return person;
}

// Reader.java
@Autowired
public UserItemReader(final UserRepository repository) {
    super();
    this.repository = repository;
}

@PostConstruct
protected void init() {
    final Map<String, Sort.Direction> sorts = new HashMap<>();
    sorts.put("userId", Direction.ASC);
    this.setRepository(this.repository);
    this.setSort(sorts);
    this.setMethodName("findAll");
}

// Writer.java
@PostConstruct
protected void init() {
    this.setRepository(repository);
}

@Transactional
public void write(List<? extends Person> persons) throws Exception {
    repository.saveAll(persons);
}
application.properties
# Datasource
spring.datasource.platform=h2
spring.datasource.url=jdbc:h2:mem:batchdb
spring.main.allow-bean-definition-overriding=true
spring.datasource.hikari.maximum-pool-size=500
Error:
org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:447)
......................
Caused by: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
at org.hibernate.exception.internal.SQLExceptionTypeDelegate.convert(SQLExceptionTypeDelegate.java:48)
............................
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30927ms.
You are running out of connections.
Try setting the Hikari connection pool to a bigger size:
spring.datasource.hikari.maximum-pool-size=20
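Separately, and only as a suggestion that goes beyond the answer above: SimpleAsyncTaskExecutor does not reuse threads, so a fixed-size ThreadPoolTaskExecutor makes the number of concurrent chunk transactions (and therefore JDBC connections in use) easier to keep below the pool size. A sketch:

// org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor
@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);        // matches the step's throttleLimit(10)
    executor.setMaxPoolSize(10);
    executor.setQueueCapacity(1000);     // queue work instead of spawning extra threads
    executor.setThreadNamePrefix("batch-");
    executor.initialize();
    return executor;
}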
