Persisting data to DynamoDB using Apache Spark

I have an application where:
1. I read JSON files from S3 into a DataFrame using SqlContext.read.json.
2. I then do some transformations on the DataFrame.
3. Finally, I want to persist the records to DynamoDB, using one of the record values as the key and the rest of the JSON parameters as values/columns.
I am trying something like:
JobConf jobConf = new JobConf(sc.hadoopConfiguration());
jobConf.set("dynamodb.servicename", "dynamodb");
jobConf.set("dynamodb.input.tableName", "my-dynamo-table"); // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");
jobConf.set("dynamodb.regionid", "us-east-1");
jobConf.set("dynamodb.throughput.read", "1");
jobConf.set("dynamodb.throughput.read.percent", "1");
jobConf.set("dynamodb.version", "2011-12-05");
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
DataFrame df = sqlContext.read().json("s3n://mybucket/abc.json");
RDD<String> jsonRDD = df.toJSON();
JavaRDD<String> jsonJavaRDD = jsonRDD.toJavaRDD();
PairFunction<String, Text, DynamoDBItemWritable> keyData = new PairFunction<String, Text, DynamoDBItemWritable>() {
    @Override
    public Tuple2<Text, DynamoDBItemWritable> call(String row) {
        DynamoDBItemWritable writeable = new DynamoDBItemWritable();
        try {
            System.out.println("JSON : " + row);
            JSONObject jsonObject = new JSONObject(row);
            System.out.println("JSON Object: " + jsonObject);
            Map<String, AttributeValue> attributes = new HashMap<String, AttributeValue>();

            // full JSON stored as a column
            AttributeValue attributeValue = new AttributeValue();
            attributeValue.setS(row);
            attributes.put("values", attributeValue);

            // key attribute taken from external_id
            AttributeValue attributeKeyValue = new AttributeValue();
            attributeKeyValue.setS(jsonObject.getString("external_id"));
            attributes.put("primary_key", attributeKeyValue);

            AttributeValue attributeSecValue = new AttributeValue();
            attributeSecValue.setS(jsonObject.getString("123434335"));
            attributes.put("creation_date", attributeSecValue);

            writeable.setItem(attributes);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return new Tuple2<Text, DynamoDBItemWritable>(new Text(row), writeable);
    }
};
JavaPairRDD<Text, DynamoDBItemWritable> pairs = jsonJavaRDD
.mapToPair(keyData);
Map<Text, DynamoDBItemWritable> map = pairs.collectAsMap();
System.out.println("Results : " + map);
pairs.saveAsHadoopDataset(jobConf);
However, I do not see any data being written to DynamoDB, nor do I get any error messages.

I'm not sure, but yours seems more complex than it needs to be.
I've used the following to write an RDD to DynamoDB successfully:
val ddbInsertFormattedRDD = inputRDD.map { case (skey, svalue) =>
  val ddbMap = new util.HashMap[String, AttributeValue]()

  val key = new AttributeValue()
  key.setS(skey.toString)
  ddbMap.put("DynamoDbKey", key)

  val value = new AttributeValue()
  value.setS(svalue.toString)
  ddbMap.put("DynamoDbValue", value) // store the value under its own attribute name, not the key's

  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
}
val ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "my-dynamo-table")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
Also, have you checked that you have provisioned the write capacity correctly?
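In case it helps, here is a rough Java translation of that write configuration, as a sketch only (untested, assuming the same org.apache.hadoop.dynamodb classes your code already imports). Note that the write path is configured through dynamodb.output.tableName, whereas your JobConf only sets dynamodb.input.tableName:
// Sketch: same JobConf approach, but configured for writing (not reading).
JobConf ddbWriteConf = new JobConf(sc.hadoopConfiguration());
ddbWriteConf.set("dynamodb.output.tableName", "my-dynamo-table");  // output, not input
ddbWriteConf.set("dynamodb.servicename", "dynamodb");
ddbWriteConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");
ddbWriteConf.set("dynamodb.regionid", "us-east-1");
ddbWriteConf.set("dynamodb.throughput.write.percent", "0.5");
ddbWriteConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
// "pairs" is the JavaPairRDD<Text, DynamoDBItemWritable> built in the question above
pairs.saveAsHadoopDataset(ddbWriteConf);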

Related

Is it possible to connect to a Java jOOQ DB?

I discovered an interesting new library and I'm trying to understand how it works. Could you please explain how to connect to my jOOQ database from another program?
MockDataProvider provider = new MyProvider();
MockConnection connection = new MockConnection(provider);
DSLContext create = DSL.using(connection, SQLDialect.H2);
Field<Integer> id = field(name("BOOK", "ID"), SQLDataType.INTEGER);
Field<String> book = field(name("BOOK", "NAME"), SQLDataType.VARCHAR);
So, I can create it, but how do I connect to it?
Here I have added your code, Lukas.
try (Statement s = connection.createStatement();
ResultSet rs = s.executeQuery("SELECT ...")
) {
while (rs.next())
System.out.println(rs.getString(1));
}
This example was found here
https://www.jooq.org/doc/3.7/manual/tools/jdbc-mocking/
public class MyProvider implements MockDataProvider {
@Override
public MockResult[] execute(MockExecuteContext ctx) throws SQLException {
// You might need a DSLContext to create org.jooq.Result and org.jooq.Record objects
//DSLContext create = DSL.using(SQLDialect.ORACLE);
DSLContext create = DSL.using(SQLDialect.H2);
MockResult[] mock = new MockResult[1];
// The execute context contains SQL string(s), bind values, and other meta-data
String sql = ctx.sql();
// Dynamic field creation
Field<Integer> id = field(name("AUTHOR", "ID"), SQLDataType.INTEGER);
Field<String> lastName = field(name("AUTHOR", "LAST_NAME"), SQLDataType.VARCHAR);
// Exceptions are propagated through the JDBC and jOOQ APIs
if (sql.toUpperCase().startsWith("DROP")) {
throw new SQLException("Statement not supported: " + sql);
}
// You decide, whether any given statement returns results, and how many
else if (sql.toUpperCase().startsWith("SELECT")) {
// Always return one record
Result<Record2<Integer, String>> result = create.newResult(id, lastName);
result.add(create
.newRecord(id, lastName)
.values(1, "Orwell"));
mock[0] = new MockResult(1, result);
}
// You can detect batch statements easily
else if (ctx.batch()) {
// [...]
}
return mock;
}
}
I'm not sure what lines 3-5 of your example are supposed to do, but if you implement your MockDataProvider and put that into a MockConnection, you just use that like any other JDBC connection, e.g.
try (Statement s = connection.createStatement();
ResultSet rs = s.executeQuery("SELECT ...")
) {
while (rs.next())
System.out.println(rs.getString(1));
}
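If you would rather stay in the jOOQ API instead of raw JDBC, here is a minimal sketch (assuming jOOQ 3.7 as in the linked manual page, and the same imports as above) that wraps the same MockConnection in a DSLContext and fetches through it; the mock provider above answers any SELECT with one AUTHOR record:
// Sketch: query the mock connection through jOOQ itself instead of plain JDBC.
// Assumes: import org.jooq.*; import org.jooq.impl.DSL;
DSLContext ctx = DSL.using(connection, SQLDialect.H2);
Result<Record> authors = ctx.fetch("SELECT ID, LAST_NAME FROM AUTHOR");
System.out.println(authors);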

Converting UnixTimestamp to TIMEUUID for Cassandra

I'm learning all about Apache Cassandra 3.x.x and I'm trying to develop some stuff to play around with. The problem is that I want to store data into a Cassandra table which contains these columns:
id (UUID - Primary Key) | Message (TEXT) | REQ_Timestamp (TIMEUUID) | Now_Timestamp (TIMEUUID)
REQ_Timestamp has the time when the message left the client at frontend level. Now_Timestamp, on the other hand, is the time when the message is finally stored in Cassandra. I need both timestamps because I want to measure the amount of time it takes to handle the request from its origin until the data is safely stored.
Creating the Now_Timestamp is easy, I just use the now() function and it generates the TIMEUUID automatically. The problem arises with REQ_Timestamp. How can I convert that Unix Timestamp to a TIMEUUID so Cassandra can store it? Is this even possible?
The architecture of my backend is this: I get the data in a JSON from the frontend to a web service that process it and stores it in Kafka. Then, a Spark Streaming job takes that Kafka log and puts it in Cassandra.
This is my WebService that puts the data in Kafka.
#Path("/")
public class MemoIn {
#POST
#Path("/in")
#Consumes(MediaType.APPLICATION_JSON)
#Produces(MediaType.TEXT_PLAIN)
public Response goInKafka(InputStream incomingData){
StringBuilder bld = new StringBuilder();
try {
BufferedReader in = new BufferedReader(new InputStreamReader(incomingData));
String line = null;
while ((line = in.readLine()) != null) {
bld.append(line);
}
} catch (Exception e) {
System.out.println("Error Parsing: - ");
}
System.out.println("Data Received: " + bld.toString());
JSONObject obj = new JSONObject(bld.toString());
String line = obj.getString("id_memo") + "|" + obj.getString("id_writer") +
"|" + obj.getString("id_diseased")
+ "|" + obj.getString("memo") + "|" + obj.getLong("req_timestamp");
try {
KafkaLogWriter.addToLog(line);
} catch (Exception e) {
e.printStackTrace();
}
return Response.status(200).entity(line).build();
}
}
Here's my Kafka Writer
package main.java.vcemetery.webservice;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
import org.apache.kafka.clients.producer.Producer;
public class KafkaLogWriter {
public static void addToLog(String memo)throws Exception {
// private static Scanner in;
String topicName = "MemosLog";
/*
First, we set the properties of the Kafka Log
*/
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// We create the producer
Producer<String, String> producer = new KafkaProducer<>(props);
// We send the line into the producer
producer.send(new ProducerRecord<>(topicName, memo));
// We close the producer
producer.close();
}
}
And finally here's what I have of my Spark Streaming job
public class MemoStream {
public static void main(String[] args) throws Exception {
Logger.getLogger("org").setLevel(Level.ERROR);
Logger.getLogger("akka").setLevel(Level.ERROR);
// Create the context with a 1 second batch size
SparkConf sparkConf = new SparkConf().setAppName("KafkaSparkExample").setMaster("local[2]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "group1");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
/* Create a collection with the topics to subscribe to; in this case, only one topic */
Collection<String> topics = Arrays.asList("MemosLog");
final JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
KafkaUtils.createDirectStream(
ssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
kafkaStream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));
// Split each batch of Kafka data into memos, as a splittable stream
JavaDStream<String> stream = kafkaStream.map(record -> (record.value().toString()));
// Then, we split each stream into lines or memos
JavaDStream<String> memos = stream.flatMap(x -> Arrays.asList(x.split("\n")).iterator());
/*
To split each memo into sections of ids and messages, we have to escape the pipe character with \\
*/
JavaDStream<String> sections = memos.flatMap(y -> Arrays.asList(y.split("\\|")).iterator());
sections.print();
sections.foreachRDD(rdd -> {
rdd.foreachPartition(partitionOfRecords -> {
//We establish the connection with Cassandra
Cluster cluster = null;
try {
cluster = Cluster.builder()
.withClusterName("VCemeteryMemos") // ClusterName
.addContactPoint("127.0.0.1") // Host IP
.build();
} finally {
if (cluster != null) cluster.close();
}
while(partitionOfRecords.hasNext()){
}
});
});
ssc.start();
ssc.awaitTermination();
}
}
Thank you in advance.
Cassandra has no built-in function to convert from a UNIX timestamp; you have to do the conversion on the client side.
Ref: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
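To make the client-side conversion concrete, here is a minimal Java sketch using the DataStax driver's UUIDs helper (the same driver the Spark job above already uses for its Cluster). It assumes req_timestamp arrives as Unix time in milliseconds; note that UUIDs.startOf() produces a deterministic, non-unique TIMEUUID, which is fine for a measurement column such as REQ_Timestamp but must not be relied on for uniqueness:
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class TimeuuidConversion {
    public static void main(String[] args) {
        // e.g. obj.getLong("req_timestamp"), assumed here to be in milliseconds
        long reqTimestampMillis = System.currentTimeMillis();

        // TIMEUUID carrying the client-side request time
        // (clock-sequence/node bits are zeroed, so it is not unique)
        UUID reqTimeuuid = UUIDs.startOf(reqTimestampMillis);

        // Unique TIMEUUID for "now", equivalent to CQL's now(); usable for Now_Timestamp
        UUID nowTimeuuid = UUIDs.timeBased();

        System.out.println(reqTimeuuid + " / " + nowTimeuuid);
    }
}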

Spark : cleaner way to build Dataset out of Spark streaming

I want to create an API which looks like this
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
where:
topic - the Kafka topic name from which the data is going to be consumed
schema - the schema information for the Dataset
So my function contains the following code:
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
jsc, String.class, String.class,
StringDecoder.class, StringDecoder.class,
kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
RDD<Row> rows = rdd.values().map(value -> {
//get type of message from value
Row row = null;
if (END == msg) {
stopStreaming.add(1);
row = null;
} else {
row = new GenericRow(/*row data created from values*/);
}
return row;
}).filter(row -> row != null).rdd();
holder.union(sqlContext.createDataFrame(rows, schema));
holder.get().count();
});
jsc.start();
// stop the stream if the stopStreaming value is greater than 0; this check is spawned as a new thread
return holder.get();
Here, DataSetHolder is a wrapper class around Dataset that combines the results of all the RDDs.
class DataSetHolder {
private Dataset<Row> df = null;
public DataSetHolder(Dataset<Row> df) {
this.df = df;
}
public void union(Dataset<Row> frame) {
this.df = df.union(frame);
}
public Dataset<Row> get() {
return df;
}
}
This doesn't look good at all, but I had to do it. I am wondering what the right way to do this is, or whether Spark provides anything for it.
Update
So, after consuming all the data from the stream, i.e. from the Kafka topic, we create a DataFrame out of it so that a data analyst can register it as a temp table and fire any query against it to get meaningful results.
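As an illustration of that last step only (not of the streaming design itself): once holder.get() has been populated, exposing it for ad-hoc SQL is a one-liner. This is a sketch assuming Spark 2.x, where createOrReplaceTempView is available; "memos" is just a hypothetical view name:
// Sketch: register the accumulated Dataset so analysts can query it with SQL.
Dataset<Row> result = holder.get();
result.createOrReplaceTempView("memos");  // "memos" is a hypothetical view name
Dataset<Row> counts = sqlContext.sql("SELECT count(*) FROM memos");
counts.show();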

Lucene - Sorting Date as NumericField

While trying to sort datetime (long) numeric fields I always get a FormatException whose message reads:
"When converting a string to DateTime, parse the string to take the date before putting each variable into the DateTime object."
Adding the numeric field:
doc.Add(new NumericField("creationDate", Field.Store.YES, true)
    .SetLongValue(DateTime.UtcNow.Ticks));
Add sorting:
// boolean query
var sortField = new SortField("creationDate", SortField.LONG, true);
var inverseSort = new Sort(sortField);
var results = searcher.Search(query, null, 100, inverseSort); // exception thrown here
Inspecting the index, I can verify that the 'creationDate' field is storing "long" values. What could be causing this exception?
EDIT:
Query
var query = new BooleanQuery();
foreach (var termQuery in incomingProps.Select(kvp => new TermQuery(new Term(kvp.Key, kvp.Value.ToLowerInvariant()))))
{
    query.Add(new BooleanClause(termQuery, Occur.MUST));
}
return query;
Version: Lucene.Net 3.0.3
UPDATE:
This issue is occurring again, now with INT values.
I downloaded Lucene.Net source code and debugged the issue.
So it's somewhere in the FieldCache, when trying to parse the value "`\b\0\0\0" to Integer, which seems a bit odd.
I'm adding these values as numeric fields:
doc.Add(new NumericField(VersionNum, int.MaxValue, Field.Store.YES,
true).SetIntValue(VersionValue));
I get the exception when I'm supposed to get at least 1 hit back.
After inspecting the index, I can see the field's term and the field text (screenshots omitted).
EDIT:
I've hardcoded an int value and added a few segments:
doc.Add(new Field(VersionNum, NumericUtils.IntToPrefixCoded(1), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
This resulted in the version field being stored as shown in the (omitted) screenshot. And still, when I try to sort, I get the parsing error:
var sortVersion = new SortField(VersionNum, SortField.INT, true);
For every exception, Lucene is trying to parse " \b\0\0\0 ".
Looking at the prefix-coded value stored as a string, I'm guessing 1 would translate to " \b\0\0\0\1 "?
Is Lucene perhaps leaving some garbage behind in the FieldCache?
Here's a unit test that tries to capture what you're asking. The test passes. Can you explain what the difference with your code is? (posting a full failing test would help us understand what you're doing :-) )
using System;
using System.Linq;
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Lucene.Net.Search;
using Lucene.Net.Index;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Documents;
using Lucene.Net.Store;
namespace SO_answers
{
[TestClass]
public class UnitTest1
{
[TestMethod]
public void TestShopping()
{
var item = new Dictionary<string, string>
{
{"field1", "value1" },
{"field2", "value2" },
{"field3", "value3" }
};
var writer = CreateIndex();
Add(writer, item);
writer.Flush(true, true, true);
var searcher = new IndexSearcher(writer.GetReader());
var result = Search(searcher, item);
Assert.AreEqual(1, result.Count);
writer.Dispose();
}
private List<string> Search(IndexSearcher searcher, Dictionary<string, string> values)
{
var query = new BooleanQuery();
foreach (var termQuery in values.Select(kvp => new TermQuery(new Term(kvp.Key, kvp.Value.ToLowerInvariant()))))
query.Add(new BooleanClause(termQuery, Occur.MUST));
return Search(searcher, query);
}
private List<string> Search(IndexSearcher searcher, Query query)
{
var sortField = new SortField("creationDate", SortField.LONG, true);
var inverseSort = new Sort(sortField);
var results = searcher.Search(query, null, 100, inverseSort); // exception thrown here
var result = new List<string>();
var matches = results.ScoreDocs;
foreach (var item in matches)
{
var id = item.Doc;
var doc = searcher.Doc(id);
result.Add(doc.GetField("creationDate").StringValue);
}
return result;
}
IndexWriter CreateIndex()
{
var directory = new RAMDirectory();
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var writer = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(1000));
return writer;
}
void Add(IndexWriter writer, IDictionary<string, string> values)
{
var document = new Document();
foreach (var kvp in values)
document.Add(new Field(kvp.Key, kvp.Value.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED));
document.Add(new NumericField("creationDate", Field.Store.YES, true).SetLongValue(DateTime.UtcNow.Ticks));
writer.AddDocument(document);
}
}
}

How to Insert/Update into Azure Table using Windows Azure SDK 2.0

I have multiple entities to be stored in the same physical Azure table. I'm trying to Insert/Merge the table entries from a file, and I'm trying to find a way to do this without serializing each property or, for that matter, creating custom entities.
While trying the following code, I thought maybe I could use the generic DynamicTableEntity. However, I'm not sure whether it helps in an insert operation (most documentation covers replace/merge operations).
The error I get is
HResult=-2146233088
Message=Unexpected response code for operation : 0
Source=Microsoft.WindowsAzure.Storage
Any help is appreciated.
Here's an excerpt of my code
_tableClient = storageAccount.CreateCloudTableClient();
_table = _tableClient.GetTableReference("CloudlyPilot");
_table.CreateIfNotExists();
TableBatchOperation batch = new TableBatchOperation();
....
foreach (var pkGroup in result.Elements("PartitionGroup"))
{
foreach (var entity in pkGroup.Elements())
{
DynamicTableEntity tableEntity = new DynamicTableEntity();
string partitionKey = entity.Elements("PartitionKey").FirstOrDefault().Value;
string rowKey = entity.Elements("RowKey").FirstOrDefault().Value;
Dictionary<string, EntityProperty> props = new Dictionary<string, EntityProperty>();
//if (pkGroup.Attribute("name").Value == "CloudServices Page")
//{
// tableEntity = new CloudServicesGroupEntity (partitionKey, rowKey);
//}
//else
//{
// tableEntity = new CloudServiceDetailsEntity(partitionKey,rowKey);
//}
foreach (var element in entity.Elements())
{
tableEntity.Properties[element.Name.ToString()] = new EntityProperty(element.Value.ToString());
}
tableEntity.ETag = Guid.NewGuid().ToString();
tableEntity.Timestamp = new DateTimeOffset(DateTime.Now.ToUniversalTime());
//tableEntity.WriteEntity(/*WHERE TO GET AN OPERATION CONTEXT FROM?*/)
batch.InsertOrMerge(tableEntity);
}
_table.ExecuteBatch(batch);
batch.Clear();
}
Have you tried using DictionaryTableEntity? This class allows you to dynamically fill the entity as if it were a dictionary (similar to DynamicTableEntity). I tried something like your code and it works:
var batch = new TableBatchOperation();
var entity1 = new DictionaryTableEntity();
entity1.PartitionKey = "abc";
entity1.RowKey = Guid.NewGuid().ToString();
entity1.Add("name", "Steve");
batch.InsertOrMerge(entity1);
var entity2 = new DictionaryTableEntity();
entity2.PartitionKey = "abc";
entity2.RowKey = Guid.NewGuid().ToString();
entity2.Add("name", "Scott");
batch.InsertOrMerge(entity2);
table.ExecuteBatch(batch);
var entities = table.ExecuteQuery<DictionaryTableEntity>(new TableQuery<DictionaryTableEntity>());
One last thing, I see that you're setting the Timestamp and ETag yourself. Remove these two lines and try again.
