Saving protobuf in Hbase/HDFS using Spark streaming - apache-spark

I am looking to store the protobuf messages in Hbase/HDFS using spark streaming. And I have below two questions
What is the efficient way of storing huge number of protobuf
messages and the efficient way of retrieving them to do some
analytics? For example, should they be stored as Strings/byte[] in Hbase or Should they be stored in parquet files in HDFS etc.
How should the hierarchical structure of a protobuf
messages be stored? I mean, should the nested elements be flattened
out before storage, or is there any mechanism to store them as is?
If the nested elements are collections or maps should they be
exploded and stored as multiple rows?
The sample structure of Protobuf message is shown below
> +--MsgNode-1
> +--Attribute1 - String
> +--Attribute2 - Int
> +--MsgNode-2
> +--Attribute1 - String
> +--Attribute2 - Double
> +--MsgNode-3 - List of MsgNode-3's
> +--Attribute1 - Int
I am planning to use Spark streaming to collect the protobuf messages as bytes and store them in Hbase/HDFS.

Question 1 :
What is the efficient way of storing huge number of protobuf messages
and the efficient way of retrieving them to do some analytics? For
example, should they be stored as Strings/byte[] in Hbase or Should
they be stored in parquet files in HDFS etc.
I would recommend
- store Proto-buf as Parquet AVRO files(splitting in to meaningful message with AVRO schema).
This can be achieved using dataframes api spark 1.5 and above (PartiotionBy with SaveMode.Append)
see this a-powerful-big-data-trio
If you store as string or byte array you cant do data analytics directly (query on raw data ) is not possible.
If you are using cloudera, impala(which supports parquet-avro) can be used to query rawdata.
Question 2:
How should the hierarchical structure of a protobuf messages be
stored? I mean, should the nested elements be flattened out before
storage, or is there any mechanism to store them as is? If the nested
elements are collections or maps should they be exploded and stored as
multiple rows?
If you store data in a raw format from spark streaming, How will you query if business wants to query and know what kind of data they received(this requirement is very common).
In the first place, You have to understand your data (i.e. relation between different messages with in protobuf so that single row or multiple rows you can decide) then develop protobuf parser to parse message structure of protobuf. based on your data, convert it to avro generic record to save as parquet file.
TIP :
protobuf parsers can be developed in different ways based on your requirements.
one of the generic way is like below example.
public SortedMap<String, Object> convertProtoMessageToMap(GeneratedMessage src) {
final SortedMap<String, Object> finalMap = new TreeMap<String, Object>();
final Map<FieldDescriptor, Object> fields = src.getAllFields();
for (final Map.Entry<FieldDescriptor, Object> fieldPair : fields.entrySet()) {
final FieldDescriptor desc = fieldPair.getKey();
if (desc.isRepeated()) {
final List<?> fieldList = (List<?>) fieldPair.getValue();
if (fieldList.size() != 0) {
final List<String> arrayListOfElements = new ArrayList<String>();
for (final Object o : fieldList) {
arrayListOfElements.add(o.toString());
}
finalMap.put(desc.getName(), arrayListOfElements);
}
} else {
finalMap.put(desc.getName(), fieldPair.getValue().toString());
}
}
return finalMap;
}

Related

Apache Beam - Write BigQuery TableRow to Cassandra

I'm trying to read data from BigQuery (using TableRow) and write the output to Cassandra. How to do that?
Here's what I've tried. This works:
/* Read BQ */
PCollection<CxCpmMapProfile> data = p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, CxCpmMapProfile>() {
public CxCpmMapProfile apply(SchemaAndRecord record) {
GenericRecord r = record.getRecord();
return new CxCpmMapProfile((String) r.get("channel_no").toString(), (String) r.get("channel_name").toString());
}
}).fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`").usingStandardSql().withoutValidation());
/* Write to Cassandra */
data.apply(CassandraIO.<CxCpmMapProfile>write()
.withHosts(Arrays.asList("<IP addr1>", "<IP addr2>"))
.withPort(9042)
.withUsername("cassandra_user").withPassword("cassandra_password").withKeyspace("cassandra_keyspace")
.withEntity(CxCpmMapProfile.class));
But when I changed Read BQ part using TableRow like this:
/* Read from BQ using readTableRow */
PCollection<TableRow> data = p.apply(BigQueryIO.readTableRows()
.fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`")
.usingStandardSql().withoutValidation());
In Write to Cassandra I got the following error
The method apply(PTransform<? super PCollection<TableRow>,OutputT>) in the type PCollection<TableRow> is not applicable for the arguments (CassandraIO.Write<CxCpmMacProfile>)
The error is due to the input PCollection containing TableRow elements, while the CassandraIO read is expecting CxCpmMacProfile elements. You need to read the elements from BigQuery as CxCpmMacProfile elements. The BigQueryIO documentation has an example of reading rows from a table and parsing them into a custom type, done through the read(SerializableFunction) method.

Apache Spark write to multiple outputs [different parquet schemas] without caching

I want to transform my input data (XML files) and produce 3 different outputs.
Each output will be in parquet format and will have a different schema/number of columns.
Currently in my solution, the data is stored in RDD[Row], where each Row belongs to one of three types and has a different number of fields. What I'm doing now is caching the RDD, then filtering it (using the field telling me about the record type) and saving the data using the following method:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
...
// the same for filtered_data_2 and filtered_data_3
Is there any way to do it better, for example do not cache entire data in memory?
In MapReduce we have MultipleOutputs class and we can do it this way:
MultipleOutputs.addNamedOutput(job, "data_type_1", DataType1OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_2", DataType2OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_3", DataType3OutputFormat.class, Void.class, Group.class);
...
MultipleOutputs<Void, Group> mos = new MultipleOutputs<>(context);
mos.write("data_type_1", null, myRecordGroup1, filePath1);
mos.write("data_type_2", null, myRecordGroup2, filePath2);
...
We had exactly this problem, to re-iterate: we read 1000s of datasets into one RDD, all of different schemas (we used a nested Map[String, Any]) and wanted to write those 1000s of datasets to different Parquet partitions in their respective schemas. All in a single embarrassingly parallel Spark Stage.
Our initial approach indeed did the hacky thing of caching, but this meant (a) 1000 passes of the cached data (b) hitting a lot of memory issues!
For a long time now I've wanted to bypass the Spark's provided .parquet methods and go to lower level underlying libraries, and wrap that in a nice functional signature. Finally recently we did exactly this!
The code is too much to copy and paste all of it here, so I will just paste the main crux of the code to explain how it works. We intend on making this code Open Source in the next year or two.
val successFiles: List[String] = successFilePaths(tableKeyToSchema, tableKeyToOutputKey, tableKeyToOutputKeyNprs)
// MUST happen first
info("Deleting success files")
successFiles.foreach(S3Utils.deleteObject(bucket, _))
if (saveMode == SaveMode.Overwrite) {
info("Deleting past files as in Overwrite mode")
parDeleteDirContents(bucket, allDirectories(tableKeyToOutputKey, tableKeyToOutputKeyNprs, partitions, continuallyRunStartTime))
} else {
info("Not deleting past files as in Append mode")
}
rdd.mapPartitionsWithIndex {
case (index, records) =>
records.toList.groupBy(_._1).mapValues(_.map(_._2)).foreach {
case (regularKey: RegularKey, data: List[NotProcessableRecord Either UntypedStruct]) =>
val (nprs: List[NotProcessableRecord], successes: List[UntypedStruct]) =
Foldable[List].partitionEither(data)(identity)
val filename = s"part-by-partition-index-$index.snappy.parquet"
Parquet.writeUntypedStruct(
data = successes,
schema = toMessageType(tableKeyToSchema(regularKey.tableKey)),
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKey(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
Parquet.writeNPRs(
nprs = nprs,
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKeyNprs(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
} pipe Iterator.single
}.count() // Just some action to force execution
info("Writing _SUCCESS files")
successFiles.foreach(S3Utils.uploadFileContent(bucket, "", _))
Of course this code cannot be copy and pasted as many methods and values are not provided. The key points are:
We hand crank the deleting of _SUCCESS files and previous files when overwriting
Each spark partition will result in one-or-many output files (many when multiple data schemas are in the same partition)
We hand crank the writing of _SUCCESS files
Notes:
UntypedStruct is our nested representation of arbitrary schema. It's a little bit like Row in Spark but much better, as it's based on Map[String, Any].
NotProcessableRecord are essentially just dead letters
Parquet.writeUntypedStruct is the crux of the logic of writing a parquet file, so we'll explain this in more detail. Firstly
val toMessageType: StructType => MessageType = new org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter().convert
Should be self explanatory. Next fsMode contains within it the com.amazonaws.auth.AWSCredentials, then inside writeUntypedStruct we use that to construct org.apache.hadoop.conf.Configuration setting fs.s3a.access.key and fs.s3a.secret.key.
writeUntypedStruct basically just calls out to:
def writeRaw(
data: List[UntypedStruct],
schema: MessageType,
config: Configuration,
path: Path,
compression: CompressionCodecName = CompressionCodecName.SNAPPY
): Unit =
Using.resource(
ExampleParquetWriter.builder(path)
.withType(schema)
.withConf(config)
.withCompressionCodec(compression)
.withValidation(true)
.build()
)(writer => data.foreach(data => writer.write(transpose(data, new SimpleGroup(schema)))))
where SimpleGroup comes from org.apache.parquet.example.data.simple, and ExampleParquetWriter extends ParquetWriter<Group>. The method transpose is a very tedious self writing recursion through the UntypedStruct populating a Group (some ugly Java mutable low level thing).
Credit must go to https://github.com/davidainslie for figuring out how these underlying libraries work, and labouring out the code, which like I said, we intend on making Open Source soon!
AFAIK, there is no way to split one RDD into multiple RDD per se. This is just how the way Spark's DAG works: only child RDDs pulling data from parent RDDs.
We can, however, have multiple child RDDs reading from the same parent RDD. To avoid recomputing the parent RDD, there is no other way but to cache it. I assume that you want to avoid caching because you're afraid of insufficient memory. We can avoid Out Of Memory (OOM) issue by persisting the RDD to MEMORY_AND_DISK so that large RDD will spill to disk if and when needed.
Let's begin with your original data:
val allDataRDD = sc.parallelize(Seq(Row(1,1,1),Row(2,2,2),Row(3,3,3)))
We can persist this in memory first, but allow it to spill over to disk in case of insufficient memory:
allDataRDD.persist(StorageLevel.MEMORY_AND_DISK)
We then create the 3 RDD outputs:
filtered_data_1 = allDataRDD.filter(_.get(1)==1) // //
filtered_data_2 = allDataRDD.filter(_.get(2)==1) // use your own filter funcs here
filtered_data_3 = allDataRDD.filter(_.get(3)==1) // //
We then write the outputs:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
var resultDF_2 = sqlContext.createDataFrame(filtered_data_2, schema_2)
resultDF_2.write.parquet(output_path_2)
var resultDF_3 = sqlContext.createDataFrame(filtered_data_3, schema_3)
resultDF_3.write.parquet(output_path_3)
If you truly really want to avoid multiple passes, there is a workaround using a custom partitioner. You can repartition your data into 3 partitions and each partition will have its own task and hence its own output file/part. The caveat is that parallelism will be heavily reduced to 3 threads/tasks, and there's also the risk of >2GB of data stored in a single partition (Spark has a 2GB limit per partition). I am not providing detailed code for this method because I don't think it can write parquet files with different schema.

Spark: Read multiple AVRO files with different schema in parallel

I have many (relatively small) AVRO files with different schema, each in one location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
Where the 'Data' field contains a string JSON of the original record.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
file_path = get_file_path(object_name)
df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
all_df.append(df)
df_o = all_df.drop()
for df in all_df:
df_o = df_o.union(df)
# write df_o to the output
However I'm not sure if the read operations are going to be parallelized.
I also came across the sc.textFile() function to read all the AVRO files in one shot as string, but couldn't make it work.
So I have 2 questions:
Would the multiple read statements in a loop be parallelized by
Spark? Or is there a more efficient way to achieve this?
Can sc.textFile() be used to read the AVRO files as a string JSON in one column?
I'd appreciate your thoughts and suggestions.

How to create Spark broadcast variable from Java String array?

I have Java String array which contains 45 string which is basically column names
String[] fieldNames = {"colname1","colname2",...};
Currently I am storing above array of String in a Spark driver in a static field. My job is running slow so trying to refactor code. I am using above String array while creating a DataFrame
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using broadcast variable to that it don't ship huge string array to every executor. I believe we can do something like the following to create broadcast
String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error
DataFrame df = sourceFrame.toDF(???);//how do I use above broadcast can I use it as is by passing brArray
I am new to Spark.
This is a bit old question, however, I hope my solution would help somebody.
In order to broadcast any object (could be a single POJO or a collection) with Spark 2+ you first need to have the following method that creates a classTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
next you use a JavaSparkContext from a SparkSession to broadcast your object as previously:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(YourObject.class)
)
In case of a collection, say, java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(List.class)
)
The return variable of sc.broadcast is of type Broadcast<String[]> and not String[]. When you want to access the value, you simply call value() on the variable. From your example it would be like:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames)
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value())
Note, that if you are writing this in Java, you probably want to wrap the SparkContext within the JavaSparkContext. It makes everything easier and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more on broadcasting variables on http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
ArrayList<String> dataToBroadcast = new ArrayList();
dataToBroadcast .add("string1");
...
dataToBroadcast .add("stringn");
//Creating the broadcast variable
//No need to write classTag code by hand use akka.japi.Util which is available
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
dataToBroadcast,
akka.japi.Util.classTag(ArrayList.class));
//Here is the catch.When you are iterating over a Dataset,
//Spark will actally run it in distributed mode. So if you try to accees
//Your object directly (e.g. dataToBroadcast) it would be null .
//Cause you didn't ask spark to explicitly send tha outside variable to each
//machine where you are running this for each parallelly.
//So you need to use Broadcast variable.(Most common use of Broadcast)
someSparkDataSetWhere.foreach((row) -> {
ArrayList<String> stringlist = strngBrdCast.value();
...
...
})

How to update multiple rows using Hector

Is there a way I can update multiple rows in cassandra database using column family template like supply a list of keys.
currently I am using updater columnFamilyTemplate to loop through a list of a keys and do an update for each row. I have seen queries like multigetSliceQuery but I don't know their equivalence in doing updates.
There is no utility method in ColumnFamilyTemplate that allow you to just pass a list of keys with a list of mutation in one call.
You can implement your own using mutators.
This is the basic code on how to do it in hector
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;
Set<HColumn<String, String>> colums = new HashSet<HColumn<String,String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
colums.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}
Mutator<String> mutator = template.createMutator();
String column_family_name = template.getColumnFamily();
for (String key : keys) {
for (HColumn<String, String> column : colums) {
mutator.addInsertion(key, BASIC_COLUMN_FAMILY, column);
}
}
mutator.execute();
Well it should look like that. This is an example for insertion, be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since this ones will execute right away without waiting for the mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: A mutator allows you to batch mutations on multiple rows on multiple column families at once ... which is why I generally prefer to use them instead of CF templates. I have a lot of denormalization for functionalities that use the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.

Resources