How to read sequence files exported from HBase - apache-spark

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
Output files are binary files. If I use pyspark to read the file folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that:
'7-2010' in the 2nd line is the row key,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where the 3rd and 5th lines come from. It seems HBase Export follows its own rules to generate the file, and if I decode it my own way the data might get corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.take(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1'
, keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
, valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
, keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
, valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
, minSplits=None
, batchSize=100)
No luck; the code did not work. ERROR:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!

I had this problem recently myself. I solved it by moving away from sc.sequenceFile and instead using sc.newAPIHadoopFile (or just hadoopFile if you're on the old API). The Spark SequenceFile reader appears to handle only keys/values that are Writable types (as stated in the docs).
If you use newAPIHadoopFile it uses the Hadoop deserialization logic, and you can specify which Serialization types you need in the config dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}
sc.newAPIHadoopFile(
<input_path>,
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
valueClass='org.apache.hadoop.hbase.client.Result',
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma-separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary, as sketched below.
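For reference, here is a minimal sketch of the newAPIHadoopRDD variant, reusing the classes and path from above; treat it as untested and adapt it to your environment:
hadoop_conf = {
    # tells Hadoop how to deserialize the Writable key and the HBase Result value
    "io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,"
                         "org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    # with newAPIHadoopRDD the input path goes into the config instead of an argument
    "mapreduce.input.fileinputformat.inputdir": "hdfs://nameservice1/user/ken/data/exportTable1"
}
rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    valueClass='org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)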

Related

Invalid date: Error while importing CSV to Cassandra using pySpark

I'm using Jupyter Notebook to run pySpark code to import a CSV file into Cassandra v3.11.3. I am getting the below error.
... 1 more
[The full stack trace and the pySpark code were attached as images in the original post.]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "mode" - "DROPMALFORMED" is not a valid C* connector option. DataFrame writer and reader options are source specific, so you unfortunately cannot mix and match them.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
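A rough sketch of that idea in pyspark, assuming the DataFrame API and the spark-cassandra-connector; the column names, date format, keyspace and table below are placeholders:
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField("id", StringType()),          # placeholder columns
    StructField("event_date", DateType())
])

# Cast the date while reading the CSV; DROPMALFORMED drops rows that fail to parse
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .option("dateFormat", "yyyy-MM-dd")
      .schema(schema)
      .csv("input.csv"))

# Write only the clean rows to Cassandra; no reader options on the writer side
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="my_keyspace", table="my_table")
   .mode("append")
   .save())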

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column), and I want to perform some data quality checks (NOT NULL, UNIQUE) as well as datatype validation (i.e. check whether the CSV file contains data matching the JSON schema).
JSON File:
{
"id":"1",
"name":"employee",
"source":"local",
"file_type":"text",
"sub_file_type":"csv",
"delimeter":",",
"path":"/user/all/dqdata/data/emp.txt",
"columns":[
{"column_name":"empid","datatype":"integer","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"empname","datatype":"string","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"salary","datatype":"double","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"doj","datatype":"date","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"location","string":"number","constraints":["not null","unique"],"values_permitted":["1","2"]}
]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and empname fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I ensure that the input data file contains only records that conform to the datatypes given in the JSON file?
We have tried the following:
1) If we try to load the data from the CSV file into a data frame by applying an external schema, the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want to continue the execution flow and log a specific error such as "Datatype mismatch error for column empId".
The above scenario only works when we call some RDD action on the data frame, which feels like a weird way to validate the schema.
Please guide me: how can we achieve this in Spark?
I don't think there is a free lunch here; you have to write this process yourself, but the approach you can take is the following (a rough sketch follows the list):
Read the CSV file as a Dataset of Strings so that every row is read successfully
Parse the Dataset using the map function, checking for nulls or datatype problems per column
Add two extra columns: a boolean called something like validRow and a String called something like message or description
In the parser mentioned in step 2, use some sort of try/catch or Try/Success/Failure for each value in each column, catch the exception, and set the validRow and description columns accordingly
Do a filter and write the successful DataFrame/Dataset (validRow set to true) to a success location, and write the error DataFrame/Dataset to an error location
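Here is a rough pyspark sketch of those steps against the sample above; the checks shown (integer empid, non-empty empname) and the output paths are just illustrations, not a full implementation of the JSON-driven rules:
from pyspark.sql import Row

def validate(line):
    # Check one CSV line; return the raw fields plus a validRow flag and a message.
    fields = line.split(",")
    valid, message = True, ""
    if len(fields) != 5:
        valid, message = False, "Wrong number of columns"
    else:
        try:
            int(fields[0])                   # empid must be an integer
        except ValueError:
            valid, message = False, "Datatype mismatch error for column empId"
        if valid and fields[1] == "":
            valid, message = False, "Null constraint violated for column empname"
    fields += [""] * (5 - len(fields))       # pad so the Row always has 5 fields
    return Row(empid=fields[0], empname=fields[1], salary=fields[2],
               doj=fields[3], location=fields[4],
               validRow=valid, message=message)

lines = spark.read.text("/user/all/dqdata/data/emp.txt").rdd.map(lambda r: r[0])
header = lines.first()
checked = spark.createDataFrame(lines.filter(lambda l: l != header).map(validate))

# Write good rows and bad rows to separate locations (placeholder paths)
checked.filter("validRow = true").drop("validRow", "message") \
       .write.mode("overwrite").csv("/tmp/emp_good")
checked.filter("validRow = false") \
       .write.mode("overwrite").csv("/tmp/emp_errors")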

How to read/write protocol buffer messages with Apache Spark?

I want to read/write protocol buffer messages from/to HDFS with Apache Spark. I found these suggested ways:
1) Convert protobuf messages to JSON with Google's Gson library and then read/write them with SparkSQL. This solution is explained in this link, but I think converting to JSON is an extra task.
2) Convert to Parquet files. There are the parquet-mr and sparksql-protobuf GitHub projects for this approach, but I don't want Parquet files because I always work with all columns (not just some), so the Parquet format doesn't give me any gain (at least I think so).
3) ScalaPB. Maybe it's what I am looking for, but it's in the Scala language, which I don't know anything about. I am looking for a Java-based solution. This YouTube video introduces ScalaPB and explains how to use it (for Scala developers).
4) Through the use of sequence files, which is what I'm looking for, but I found nothing about it. So, my question is: how can I write protobuf messages to a sequence file on HDFS and read them back from it? Any other suggestion will be useful.
5) Through Twitter's Elephant Bird library.
Though a bit hidden among the points, you seem to be asking how to write to a SequenceFile in Spark. I found an example here.
// Import the Hadoop Writable types
import org.apache.hadoop.io._
// To read data in sequence file format we first need to write some, so let us see how to write first
// Reading data from a plain text file
val dataRDD = sc.textFile("/public/retail_db/orders")
// Use NullWritable as the key; the String value is converted to Text automatically when saving.
// For Int and String we do not need to convert types into IntWritable and Text ourselves,
// but for other types we do need to wrap them in a Writable. For example, if the key/value
// is of type Long, we have to wrap it with new LongWritable(value).
dataRDD.
map(x => (NullWritable.get(), x)).
saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id
// Saving a sequence file with a key of type Int and a value of type String
// (note the different output directory, since the previous one already exists)
dataRDD.
map(x => (x.split(",")(0).toInt, x.split(",")(1))).
saveAsSequenceFile("/user/`whoami`/orders_seq_int")
// Make sure to replace `whoami` with the appropriate OS user id
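If you later want to read such a file back, e.g. from pyspark, sc.sequenceFile converts the common Writable key/value types back to Python objects automatically. A small sketch, assuming the Int/String output directory written above:
# Read the (IntWritable, Text) sequence file back; Writables are auto-converted
pairs = sc.sequenceFile("/user/`whoami`/orders_seq_int")
pairs.take(5)   # returns a list of (int, str) tuples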

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some parts of the data have '\xa0' prefixed to every word, and other parts of the data don't have that special character. I am attaching 2 pictures, one with '\xa0' and another without it. The content shown in the 2 pictures belongs to the same file; only some parts of the data from the same file are read that way by Spark. I have checked the original data file present in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried methods like using the replace option in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none worked for me. This question might sound dumb, but I am a newbie to Apache Spark and I need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to acquire your file with sc.binaryFiles and then apply the expected encoding.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is the content as bytes.
If you need to keep only the text and apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8')) # or another encoding depending on your file
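To get back to a line-oriented RDD after decoding, a small follow-up sketch (continuing from the code above):
# binaryFiles yields (path, bytes); decode each file, then split it into lines
files = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
lines = files.flatMap(lambda kv: kv[1].decode('utf-8').splitlines())
lines.take(5)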

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows. The original post included a sample showing how the file looks when opened with Notepad++.
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the CSV that allows me to read these two lines as only one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read all file and handle multiline CSV.
There is no such option in Spark 2.1. You can read file using sparkContext.wholeTextFile or just use newer verison
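As a rough sketch of that Spark 2.1 workaround with wholeTextFiles (it assumes each file fits in memory, and the column names below are hypothetical):
import csv
import io

# wholeTextFiles yields (path, full file content); Python's csv module
# correctly handles quoted fields that contain CRLF line breaks (RFC 4180)
raw = sc.wholeTextFiles("file.csv")
rows = raw.flatMap(lambda kv: list(csv.reader(io.StringIO(kv[1]))))
df = spark.createDataFrame(rows, ["col1", "col2"])   # hypothetical column names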
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
