How to read/write protocol buffer messages with Apache Spark? - apache-spark

I want to read/write protocol buffer messages from/to HDFS with Apache Spark. I found these suggested ways:
1) Convert protobuf messages to JSON with Google's Gson library and then read/write them with Spark SQL. This solution is explained in this link, but I think the conversion to JSON is an extra step.
2) Convert to Parquet files. There are the parquet-mr and sparksql-protobuf GitHub projects for this, but I don't want Parquet files because I always work with all columns (not a subset), so the Parquet format gives me no gain (at least I think so).
3) ScalaPB. Maybe it's what I am looking for, but it is in Scala, which I don't know at all. I am looking for a Java-based solution. This YouTube video introduces ScalaPB and explains how to use it (for Scala developers).
4) Use sequence files. This is what I am looking for, but I found nothing about it. So, my question is: how can I write protobuf messages to a sequence file on HDFS and read them back? Any other suggestion will be useful.
5) Use Twitter's Elephant-bird library.

Though a bit hidden between the points, you seem to be asking how to write to a sequence file in Spark. I found an example here.
// Importing package
import org.apache.hadoop.io._

// As we need data in sequence file format to read, let us see how to write first
// Reading data from text file format
val dataRDD = sc.textFile("/public/retail_db/orders")

// Using null as key; the value will be of type Text while saving in sequence file format
// For Int and String, we do not need to convert types into IntWritable and Text
// But for others we need to convert to a Writable object
// For example, if the key/value is of type Long, we might have to
// wrap it by saying new LongWritable(object)
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/<output_dir>")
// Make sure to replace `whoami` with the appropriate OS user id
// (<output_dir> is a placeholder for the target directory)

// Saving in sequence file with key of type Int and value of type String
dataRDD.
  map(x => (x.split(",")(0).toInt, x.split(",")(1))).
  saveAsSequenceFile("/user/`whoami`/<output_dir>")
// Make sure to replace `whoami` with the appropriate OS user id


Is dataframe.columns a Spark action?

If not: there is no action method in the following code, yet "./demo.json" is read once.
val x = spark.read.json("./demo.json")
x.columns
dataframe.columns is not an action per se, but it needs the schema of your dataframe. Depending on the file format, getting the schema requires a file scan (JSON, CSV). With other file formats like Parquet, the columns can be extracted from the metadata, so no actual file scan is needed.
Reading JSON is effectively an action that reads all your data to infer the schema (unless you specify it manually). Hence x.columns will not trigger any action.
According to the latest documentation (click on json):
This function goes through the input once to determine the input
schema. If you know the schema in advance, use the version that
specifies the schema to avoid the extra scan.

Converting 2TB of gzipped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. Most books and tutorials assume you are working with some newline-delimited file.
What is the best way to go about converting this data?
I've tried launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO, but the delimiter is still the newline.
For multiline JSON you can read each complete file in parallel, convert the JSON string to a POJO, and finally reshuffle the data to make use of parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gs://my_bucket").metadata().
Read individual files with Compression.detect(fileResouceId.getFilename()).readDecompressed(...).
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
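If you do want NDJSON files anyway, the per-file conversion itself is small. The following is a minimal sketch using only the Python standard library, not a tuned solution. It assumes each input file fits in memory and holds either one JSON document or a top-level array of records; the function name to_ndjson is made up for illustration.

```python
import gzip
import json
from pathlib import Path


def to_ndjson(src: Path, dst: Path) -> int:
    """Rewrite one gzipped multiline JSON file as gzipped NDJSON.

    Returns the number of records written.
    """
    with gzip.open(src, "rt", encoding="utf-8") as fin:
        doc = json.load(fin)  # loads the whole file into memory

    # Assumption: a top-level array holds many records;
    # anything else is treated as a single record.
    records = doc if isinstance(doc, list) else [doc]

    with gzip.open(dst, "wt", encoding="utf-8") as fout:
        for rec in records:
            # json.dumps without `indent` escapes newlines inside strings,
            # so every record ends up on exactly one line.
            fout.write(json.dumps(rec) + "\n")
    return len(records)
```

Since every file is converted independently, a function like this can be run over the file list by whatever gives you parallelism, for example a process pool on one large machine or a per-file map step in Spark or Beam.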

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
some parts of the data have '\xa0' prefixed to every word, and other parts of the data don't have that special character. I am attaching 2 pictures, one with '\xa0' and another without '\xa0'. The content shown in the 2 pictures belongs to the same file. Only some parts of the data from the same file are read that way by Spark. I have checked the original data file in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried several approaches, such as using replace in flatMap, like flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none worked for me. This question might sound dumb, but I am a newbie with Apache Spark, and I need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to load your file with sc.binaryFiles and then apply the expected encoding.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is the file content as bytes.
If you only need the text, apply a decoding function to the value:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt").map(lambda x: x[1].decode('utf-8')) # or another encoding, depending on your file
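The decoding step can be tried outside Spark first. Below is a minimal sketch in plain Python 3 (the question uses Python 2 style u'' literals, so this is an assumption about your environment). The helper name decode_and_split is hypothetical. '\xa0' is the non-breaking space U+00A0; the sketch decodes the raw bytes with an explicit encoding and then replaces that character with an ordinary space before splitting.

```python
NBSP = "\u00a0"  # non-breaking space, displayed as '\xa0' in a repr


def decode_and_split(raw, encoding="utf-8"):
    """Decode raw file bytes and split them into words.

    Non-breaking spaces are replaced by plain spaces first, so they
    never stay attached to a word.
    """
    text = raw.decode(encoding)
    return text.replace(NBSP, " ").split()


# U+00A0 is two bytes (C2 A0) in UTF-8 but a single byte (A0) in Latin-1,
# so the encoding passed in has to match how the file was written.
utf8_bytes = "hello\u00a0world again".encode("utf-8")
latin1_bytes = "hello\u00a0world".encode("latin-1")

print(decode_and_split(utf8_bytes))
print(decode_and_split(latin1_bytes, encoding="latin-1"))
```

In PySpark the same function could be applied to the value of each pair returned by sc.binaryFiles, for example with flatMap(lambda kv: decode_and_split(kv[1])).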

How to read sequence files exported from HBase

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
Output files are binary files. If I use pyspark to read the file folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
It shows:
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
I can tell that
'7-2010' in the 2nd line is the Rowkey,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where the 3rd and 5th lines came from. It seems that the HBase export follows its own rules to generate the file; if I decode it my own way, the data might get corrupted.
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass=''
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
, minSplits=None
, batchSize=100)
No luck, the code did not work, ERROR:
Caused by: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by going away from sc.sequenceFile, and instead using sc.newAPIHadoopFile (or just hadoopFile if you're on the old API). The Spark SequenceFile-reader appears to only handle keys/values that are Writable types (it's stated in the docs).
If you use newAPIHadoopFile it uses the Hadoop deserialization logic, and you can specify which Serialization types you need in the config-dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
Note that the value in hadoop_conf for "io.serializations" is a comma separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to be able to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.
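Put together, the call could look like the sketch below. It is a sketch under assumptions, not a tested recipe: the helper name read_hbase_export is made up, sc is assumed to be an existing SparkContext, and the HBase jars as well as the Spark examples jar (which provides the pythonconverters classes used in the question) are assumed to be on the classpath. The input format is assumed to be the new-API SequenceFileInputFormat, since the export consists of sequence files.

```python
# Serializers Hadoop needs: WritableSerialization for the
# ImmutableBytesWritable keys, ResultSerialization for the Result values.
HADOOP_CONF = {
    "io.serializations": ",".join([
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    ])
}


def read_hbase_export(sc, path):
    """Read an HBase Export directory as an RDD of (row key, result) strings.

    `sc` is an existing SparkContext; the helper name is hypothetical.
    """
    return sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        keyClass="org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        valueClass="org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=HADOOP_CONF,
    )
```

With the path from the question this would be called as read_hbase_export(sc, 'hdfs://nameservice1/user/ken/data/exportTable1').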

Store large text file to DB using oData

I need to store a very large string into the backend table under one field which is of type string.
The string I am storing is more than 10 million (1 crore) characters long. It takes a long time to store it in and retrieve it from the backend.
I tried compression algorithms, but they failed to compress such a large string.
So what is the best way to handle this situation and improve the performance?
Technologies used:
front end - SAP UI5,
gateway - oData,
backend - SAP ABAP.
Compressing methods tried:
the above compressing methods weren't able to solve my problem.
Well, Marc is right in stating that transferring XLSX is definitely better/faster than JSON.
ABAP JSON tools are not very rich, but they are sufficient for most manipulations. More unusual tasks can be done via internal tables and transformations. So it is highly recommended to perform your operations (XLSX >> JSON) on the backend server.
As for the backend DB table, I agree with Chris N that inserting a 10M-character string into a STRING field is about the worst idea imaginable. The recommended way of storing big files in transparent tables is to use the XSTRING type. This is a kind of BLOB for ABAP, which is much faster at handling binary data.
I've made some SAT performance tests on my sample 14-million file and that's what I got.
INSERT into XSTRING field:
INSERT into STRING field:
As you can see, the net time of the DB operations differs significantly, and not in favour of STRING.
Your upload code can look like this:
DATA: len          TYPE i,
      lt_content   TYPE STANDARD TABLE OF tdline,
      ws_store_tab TYPE zstore_tab.

"Upload the file to Internal Table
CALL FUNCTION 'GUI_UPLOAD'
  EXPORTING
    filename   = '/TEMP/FILE.XLSX'
    filetype   = 'BIN'
  IMPORTING
    filelength = len
  TABLES
    data_tab   = lt_content
  EXCEPTIONS
    OTHERS     = 1.
IF sy-subrc <> 0.
  MESSAGE 'Unable to upload file' TYPE 'E'.
ENDIF.

"Convert binary itab to xstring
CALL FUNCTION 'SCMS_BINARY_TO_XSTRING'
  EXPORTING
    input_length = len
  IMPORTING
    buffer       = ws_store_tab-file "should be of type XSTRING!
  TABLES
    binary_tab   = lt_content
  EXCEPTIONS
    failed       = 1
    OTHERS       = 2.
IF sy-subrc <> 0.
  MESSAGE 'Unable to convert binary to xstring' TYPE 'E'.
ENDIF.

"Store the XSTRING content in the database table
INSERT zstore_tab FROM ws_store_tab.
IF sy-subrc IS INITIAL.
  MESSAGE 'Successfully uploaded' TYPE 'S'.
ELSE.
  MESSAGE 'Failed to upload' TYPE 'E'.
ENDIF.
For parsing and manipulating XLSX, multiple AS ABAP wrappers already exist; examples are here, here and here.
All of this is about backend-side optimization. Suggestions for frontend optimization are welcome from UI5 experts (to whom I don't belong); however, the general SAP recommendation is to move all heavy data manipulation to the application server.
