Our application creates a Parquet file and we need to load this data into DynamoDB. I am not finding any Java examples that use emr-dynamodb-connector. How do we convert the Dataset data to a Hadoop RDD of DynamoDBItemWritable? Appreciate the help. I'm not finding good documentation on emr-dynamodb-connector either.
//Read file
Dataset<Row> parquetFileDF = spark.read().parquet("/src/main/resources/account.parquet");
//Create a pair RDD of (Text, DynamoDBItemWritable) from the Dataset above (need to figure out this part)
JavaPairRDD<Text, DynamoDBItemWritable> hadoopRDD = sc.hadoopRDD(jobConf,
DynamoDBInputFormat.class, Text.class, DynamoDBItemWritable.class);
//Save
hadoopRDD.saveAsHadoopDataset(jobConf);
private static JobConf getDynamoDbJobConf (JavaSparkContext sc, String tableNameForWrite){
final JobConf jobConf = new JobConf(sc.hadoopConfiguration());
jobConf.set("dynamodb.servicename", "dynamodb");
jobConf.set("dynamodb.input.tableName", tableNameForWrite);
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");
jobConf.set("dynamodb.regionid", "us-east-1");
jobConf.set("dynamodb.proxy.hostname", "");
jobConf.set("dynamodb.proxy.port", "");
jobConf.set("mapred.output.format.class",
"org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
jobConf.set("mapred.input.format.class",
"org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
//jobConf.set("dynamodb.customAWSCredentialsProvider", profile);
return jobConf;
}
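For the missing conversion step, here is a minimal sketch of one way to do it, not a definitive recipe: it assumes every column should be written as a DynamoDB string attribute (adapt the AttributeValue mapping to your actual schema) and reuses the jobConf built by getDynamoDbJobConf above.
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import scala.Tuple2;

//Map each Row of the parquet Dataset to a (Text, DynamoDBItemWritable) pair
String[] columns = parquetFileDF.columns();
JavaPairRDD<Text, DynamoDBItemWritable> writableRDD = parquetFileDF.javaRDD().mapToPair(row -> {
    Map<String, AttributeValue> attributes = new HashMap<>();
    for (int i = 0; i < columns.length; i++) {
        Object value = row.get(i);
        if (value != null) {
            //assumption: everything is written as a string attribute; use withN(...) etc. for other types
            attributes.put(columns[i], new AttributeValue().withS(value.toString()));
        }
    }
    DynamoDBItemWritable item = new DynamoDBItemWritable();
    item.setItem(attributes);
    //the Text key is not read by the DynamoDB output path, so an empty key is fine
    return new Tuple2<>(new Text(""), item);
});
//Save using the JobConf from getDynamoDbJobConf (dynamodb.output.tableName must be set)
writableRDD.saveAsHadoopDataset(jobConf);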
I am new to Spark.
I want to keep consuming messages from Kafka and then save them to S3 once the message count goes over 100000.
I implemented it with Dataset.collectAsList(), but it threw an error: Total size of serialized results of 3 tasks (1389.3 MiB) is bigger than spark.driver.maxResultSize.
So I switched to foreach, and it threw a NullPointerException when the SparkSession was used in createDataFrame.
Any idea about it? Thanks.
---Code---
SparkSession spark = generateSparkSession();
registerUdf4AddPartition(spark);
Dataset<Row> dataset = spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", args[0])
.option("kafka.group.id", args[1])
.option("subscribe", args[2])
.option("kafka.security.protocol", SecurityProtocol.SASL_PLAINTEXT.name)
.load();
DataStreamWriter<Row> console = dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
Dataset<Row> rowDataset = rawDataset.selectExpr("CAST(value AS STRING)");
//using foreach
rowDataset.foreach(row -> {
List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
spans.addAll(rawDataList);
batchSave(spark);
});
// using collectAsList
List<Row> rows = rowDataset.collectAsList();
for (Row row : rows) {
List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
spans.addAll(rawDataList);
batchSave(spark);
}
});
StreamingQuery start = console.start();
start.awaitTermination();
public static void batchSave(SparkSession spark){
synchronized (spans){
if(spans.size() == 100000){
System.out.println(spans.isEmpty());
Dataset<Row> spanDataSet = spark.createDataFrame(spans, Span.class);
Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDataSet);
StringBuilder pathBuilder = new StringBuilder("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
finalResult.repartition(1).write().partitionBy("year","month","day","hour").format("csv").mode("append").save(pathBuilder.toString());
spans.clear();
}
}
}
Since the main SparkSession lives in the driver while the tasks inside foreach(...) run distributed on the executors, the spark reference is not available (null) on the executors, which is what causes the NullPointerException in createDataFrame.
BTW, there is no point using synchronized inside the foreach task, since the work is distributed across executors and a JVM lock only guards one of them.
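A minimal sketch of one way around both problems: the foreachBatch lambda already runs on the driver, so spark can be used there while the data stays distributed (no collectAsList, no per-row SparkSession use, and each micro-batch is written as it arrives instead of being buffered to 100000 rows). spanSchema is an assumed StructType describing the Span columns; addCustomizedTimeByUdf and the S3 path are taken from the question.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

dataset.writeStream().foreachBatch((rawDataset, batchId) -> {
    //Kafka value bytes -> CSV lines (still distributed on the executors)
    Dataset<String> csvLines = rawDataset
            .selectExpr("CAST(value AS STRING)")
            .as(Encoders.STRING());
    //parse the CSV with the assumed spanSchema (a StructType of the Span fields)
    Dataset<Row> spanDf = spark.read().schema(spanSchema).csv(csvLines);
    //add the partition columns and append directly to S3; no in-memory buffer on the driver
    addCustomizedTimeByUdf(spanDf)
            .write()
            .partitionBy("year", "month", "day", "hour")
            .format("csv")
            .mode("append")
            .save("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
}).start().awaitTermination();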
Before sending an Avro GenericRecord to Kafka, a Header is inserted like so.
ProducerRecord<String, byte[]> record = new ProducerRecord<>(topicName, key, message);
record.headers().add("schema", schema);
Consuming the record.
When using Spark Streaming, the header from the ConsumerRecord is intact.
KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)).foreachRDD(rdd -> {
rdd.foreach(record -> {
System.out.println(new String(record.headers().headers("schema").iterator().next().value()));
});
});
But when using Spark SQL Streaming, the header seems to be missing.
StreamingQuery query = dataset.writeStream().foreach(new ForeachWriter<>() {
...
@Override
public void process(Row row) {
String topic = (String) row.get(2);
int partition = (int) row.get(3);
long offset = (long) row.get(4);
String key = new String((byte[]) row.get(0));
byte[] value = (byte[]) row.get(1);
ConsumerRecord<String, byte[]> record = new ConsumerRecord<String, byte[]>(topic, partition, offset, key,
value);
//I need the schema to decode the Avro!
}
}).start();
Where can I find the custom header value when using Spark SQL Streaming approach?
Version:
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
UPDATE
I tried 3.0.0-preview2 of spark-sql_2.12 and spark-sql-kafka-0-10_2.12. I added
.option("includeHeaders", true)
But I still only get these columns from the Row.
+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
Kafka headers in Structured Streaming are supported only from Spark 3.0: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
Please look for includeHeaders there for more details.
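A minimal sketch of what that looks like on Spark 3.0+ with spark-sql-kafka-0-10 (the broker and topic names are placeholders): with includeHeaders enabled, the source adds a headers column, an array of structs with a string key and a binary value, alongside key, value, topic, partition, offset and timestamp.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")   //placeholder broker
        .option("subscribe", "topic1")                      //placeholder topic
        .option("includeHeaders", "true")
        .load();

df.selectExpr(
        "CAST(key AS STRING)",
        "value",
        //pick out the header whose key is "schema" using the filter() higher-order function
        "filter(headers, h -> h.key = 'schema')[0].value AS schema_header")
  .writeStream()
  .format("console")
  .start();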
I am working on a Spark module where I need to load collections from multiple sources (databases), but I can't get the collection from the second database.
Databases
DB1
L_coll1
DB2
L_coll2
Logic code
String mst ="local[*]";
String host= "localhost";
String port = "27017";
String DB1 = "DB1";
String DB2 = "DB2";
SparkConf conf = new SparkConf().setAppName("cust data").setMaster(mst);
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.mongodb.input.uri", "mongodb://"+host+":"+port+"/")
.config("spark.mongodb.input.database",DB1)
.config("spark.mongodb.input.collection","coll1")
.getOrCreate();
SparkSession spark1 = SparkSession
.builder()
.config(conf)
.config("spark.mongodb.input.uri", "mongodb://"+host+":"+port+"/")
.config("spark.mongodb.input.database",DB2)
.config("spark.mongodb.input.collection","coll2")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaSparkContext jsc1 = new JavaSparkContext(spark1.sparkContext());
Reading configurations
ReadConfig readConfig = ReadConfig.create(spark);
Dataset<Row> MongoDatset = MongoSpark.load(jsc,readConfig).toDF();
MongoDatset.show();
ReadConfig readConfig1 = ReadConfig.create(spark1);
Dataset<Row> MongoDatset1 = MongoSpark.load(jsc1,readConfig1).toDF();
MongoDatset1.show();
After running the above code, I am getting the first dataset multiple times. If I comment out the first SparkSession spark instance, then I only get the collection from the second database DB2.
Instead of using multiple Spark sessions, use ReadConfig's withOptions overrides to read from multiple databases and collections. SparkSession.builder().getOrCreate() returns the already-created session, so both of your builders end up with the same underlying session and the same spark.mongodb.input.* settings, which is why you keep getting the first collection.
Creating spark session
String DB = "DB1";
String DB1 = "DB2";
String Coll1 ="Coll1";
String Coll2 ="Coll2";
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
Get database function
private static Dataset<Row> getDB(JavaSparkContext jsc_, String DB, String Coll1) {
// Create a custom ReadConfig
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("database",DB );
readOverrides.put("collection", Coll1);
readOverrides.put("readPreference.name", "secondaryPreferred");
System.out.println(readOverrides);
ReadConfig readConfig = ReadConfig.create(jsc_).withOptions(readOverrides);
return MongoSpark.load(jsc_,readConfig).toDF();
}
Using getDB to read from multiple databases
Dataset<Row> MongoDatset1 = getDB(jsc, DB, Coll1);
Dataset<Row> MongoDatset2 = getDB(jsc, DB1, Coll2);
MongoDatset1.show(1);
MongoDatset2.show(1);
I want to create an API which looks like this
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
where
topic - the Kafka topic name from which the data is going to be consumed
schema - the schema information for the Dataset
so my function contains the following code:
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
jsc, String.class, String.class,
StringDecoder.class, StringDecoder.class,
kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
RDD<Row> rows = rdd.values().map(value -> {
//get type of message from value
Row row = null;
if (END == msg) {
stopStreaming.add(1);
row = null;
} else {
row = new GenericRow(/*row data created from values*/);
}
return row;
}).filter(row -> row != null).rdd();
holder.union(sqlContext.createDataFrame(rows, schema));
holder.get().count();
});
jsc.start();
//stop the stream if the stopStreaming value is greater than 0; this check runs in a separately spawned thread
return holder.get();
Here DataSetHolder is a wrapper class around Dataset<Row> that unions the results of all the RDDs.
class DataSetHolder {
private Dataset<Row> df = null;
public DataSetHolder(Dataset<Row> df) {
this.df = df;
}
public void union(Dataset<Row> frame) {
this.df = df.union(frame);
}
public Dataset<Row> get() {
return df;
}
}
This doesn't look good at all, but I had to do it. I am wondering what the right way to do this is, or whether Spark provides anything for it.
Update
So, after consuming all the data from the stream, i.e. from the Kafka topic, we create a dataframe out of it so that a data analyst can register it as a temp table and fire any query against it to get meaningful results.
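For that end goal, a minimal sketch of what Structured Streaming (Spark 2.x, rather than the DStream API used above) already provides: the memory sink keeps the streamed rows in a named in-memory table that can be queried with spark.sql(), which is essentially the temp-table use case described here. The broker address is a placeholder.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<Row> kafkaDf = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")   //placeholder broker
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS value");

StreamingQuery query = kafkaDf.writeStream()
        .format("memory")
        .queryName("kafka_data")   //registered as an in-memory temp table named kafka_data
        .outputMode("append")
        .start();

//analysts can now run ad-hoc SQL against the continuously growing table
spark.sql("SELECT count(*) FROM kafka_data").show();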
I am learning Spark and would like to ask about the best approach for solving the problem below.
I have two datasets, users and transactions, as below and would like to join them to find the unique locations per item sold.
The headers for the files are as below
id,email,language,location ----------- USER HEADERS
txid,productid,userid,price,desc -------------------- TRANSACTION HEADERS
Below is my approach
/*
* Load user data set into userDataFrame
* Load transaction data set into transactionDataFrame
* join both on user id - userTransactionFrame
* select productid and location columns from the joined dataset into a new dataframe - productIdLocationDataFrame
* convert the new dataframe into a javardd - productIdLocationJavaRDD
* make the javardd a pair rdd - productIdLocationJavaPairRDD
* group the pair rdd by key - productLocationList
* apply mapvalues on the grouped key to convert the list of values to a set of values for duplicate filtering - productUniqLocations
*
* */
I am not very sure that I have done this the right way and still feel it "can be done better, differently".
I am doubtful about the part where I do the duplicate filtering from the JavaPairRDD.
Please evaluate the approach and code and let me know of better solutions.
Code
SparkConf conf = new SparkConf();
conf.setAppName("Sample App - Uniq Location per item");
JavaSparkContext jsc = new JavaSparkContext("local[*]","A 1");
//JavaSparkContext jsc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(jsc);
//id email language location ----------- USER HEADERS
DataFrame userDataFrame = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "\t")
.load("user");
//txid pid uid price desc -------------------- TRANSACTION HEADERS
DataFrame transactionDataFrame = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "\t")
.load("transactions");
Column joinColumn = userDataFrame.col("id").equalTo(transactionDataFrame.col("uid"));
DataFrame userTransactionFrame = userDataFrame.join(transactionDataFrame,joinColumn,"rightouter");
DataFrame productIdLocationDataFrame = userTransactionFrame.select(userTransactionFrame.col("pid"),userTransactionFrame.col("location"));
JavaRDD<Row> productIdLocationJavaRDD = productIdLocationDataFrame.toJavaRDD();
JavaPairRDD<String, String> productIdLocationJavaPairRDD = productIdLocationJavaRDD.mapToPair(new PairFunction<Row, String, String>() {
public Tuple2<String, String> call(Row inRow) throws Exception {
return new Tuple2(inRow.get(0),inRow.get(1));
}
});
JavaPairRDD<String, Iterable<String>> productLocationList = productIdLocationJavaPairRDD.groupByKey();
JavaPairRDD<String, Iterable<String>> productUniqLocations = productLocationList.mapValues(new Function<Iterable<String>, Iterable<String>>() {
public Iterable<String> call(Iterable<String> inputValues) throws Exception {
return new HashSet<String>((Collection<? extends String>) inputValues);
}
});
productUniqLocations.saveAsTextFile("uniq");
The good part is that the code runs and generates the output that I expect.
The lowest-hanging fruit is getting rid of groupByKey.
aggregateByKey should do the job, since the output value type differs from the input type (we want a set per key).
Code in Scala :
pairRDD.aggregateByKey(new java.util.HashSet[String])
((locationSet, location) => {locationSet.add(location); locationSet},
(locSet1, locSet2) => {locSet1.addAll(locSet2); locSet1}
)
Java Equivalent:
Function2<HashSet<String>, String, HashSet<String>> sequenceFunction = new Function2<HashSet<String>, String, HashSet<String>>() {
public HashSet<String> call(HashSet<String> aSet, String arg1) throws Exception {
aSet.add(arg1);
return aSet;
}
};
Function2<HashSet<String>, HashSet<String>, HashSet<String>> combineFunc = new Function2<HashSet<String>, HashSet<String>, HashSet<String>>() {
public HashSet<String> call(HashSet<String> arg0, HashSet<String> arg1) throws Exception {
arg0.addAll(arg1);
return arg0;
}
};
JavaPairRDD<String, HashSet<String>> byKey = productIdLocationJavaPairRDD.aggregateByKey(new HashSet<String>(), sequenceFunction, combineFunc );
Secondly, joins work best when the datasets are co-partitioned.
Since you are dealing with dataframes, out-of-the-box partitioning is not possible if you are on Spark < 1.6. Thus, you may want to read the data into RDDs, partition them, and then create dataframes. For your use case, it might be better not to involve dataframes at all.
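A minimal sketch of that co-partitioning idea (the column positions follow the headers given above; the partition count is an assumption to tune): key both sides by user id, give them the same partitioner, and then join.
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

HashPartitioner partitioner = new HashPartitioner(8);   //assumption: pick a count that fits your data

//users: id,email,language,location  ->  (id, location)
JavaPairRDD<String, String> usersById = userDataFrame.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(String.valueOf(row.get(0)), String.valueOf(row.get(3))))
        .partitionBy(partitioner);

//transactions: txid,productid,userid,price,desc  ->  (userid, productid)
JavaPairRDD<String, String> txByUser = transactionDataFrame.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(String.valueOf(row.get(2)), String.valueOf(row.get(1))))
        .partitionBy(partitioner);

//both sides share the same partitioner, so the join avoids reshuffling both datasets
JavaPairRDD<String, Tuple2<String, String>> joined = usersById.join(txByUser);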