Spark: Japanese letters are garbled in Parquet files created in HDFS - apache-spark

I have a Spark job which reads CSV files from S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.
When I run this job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese letters look fine.
But when I run it on my Spark cluster, reading the same S3 CSV file and writing the Parquet to HDFS, all the Japanese letters are garbled.
Run on the Spark cluster (data is garbled):
spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
Run locally (data looks fine):
spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.
Reading CSV:
def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
This is how I write to parquet:
finalDf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .option("encoding", "UTF-8")
  .option("charset", "UTF-8")
  .partitionBy(parCols: _*)
  .save()
This is what the data on HDFS looks like: the Japanese characters come out garbled.
Any tips on how to fix this?
Does the input CSV file have to be in UTF-8 encoding?
**Update**
Found out it's not related to Parquet but rather to CSV loading. Asked a separate question here:
Spark CSV reader : garbled Japanese text and handling multilines

The Parquet format has no option for encoding or charset; cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
Hence these options in your code have no effect:
finalDf.write
  .format("parquet")
  .option("encoding", "UTF-8")
  .option("charset", "UTF-8")
  (...)
These options apply only to CSV; you should set them (or rather ONE of them, since they are synonyms) when reading the source file.
This assumes you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
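For illustration, a minimal sketch with the charset set only on the CSV reader (variable names schema, path, parCols and hdfsTablePath are the ones from the question; assumes a SparkSession named spark):
import org.apache.spark.sql.SaveMode

// Set the charset when READING the CSV; "encoding" and "charset" are synonyms, one is enough.
val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("charset", "UTF-8")
  .schema(schema)
  .csv(path)

// No charset option on the Parquet writer: Parquet stores strings as UTF-8 regardless.
df.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .partitionBy(parCols: _*)
  .save()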

Related

What configuration setting should I be changing to handle this error relating to buffer length when decompressing snappy?

I'm running a simple test in EMR on a JSON file that has been compressed with Snappy.
I'm getting this error:
java.lang.InternalError: Could not decompress data. Buffer length is too small.
I'm running:
df = oSpark.session.read.options(mode='FAILFAST', \
                                 primitivesAsString='true', \
                                 multiLine='true', \
                                 compression='snappy', \
                                 encoding='UTF-8') \
    .json(file)
df.printSchema()
print(df.head(1))
df.show(truncate=False)
I've tried playing around with spark.buffer.size, spark.kryoserializer.buffer.max, and io.file.buffer.size, but I'm not getting any improvement.
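For reference, those properties generally need to be in place before the session (and its executors) start, either via spark-submit --conf or on the session builder. A minimal Scala sketch with placeholder values (the PySpark builder takes the same .config(...) calls; the app name is made up):
import org.apache.spark.sql.SparkSession

// Placeholder values only, not recommendations.
val spark = SparkSession.builder()
  .appName("snappy-json-test")                           // hypothetical app name
  .config("spark.buffer.size", "65536")
  .config("spark.kryoserializer.buffer.max", "512m")
  .config("spark.hadoop.io.file.buffer.size", "131072")  // Hadoop conf, passed through the spark.hadoop. prefix
  .getOrCreate()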

Read data from mount in Databricks (using Autoloader)

I am using Azure Blob Storage to store data and feeding this data to Autoloader through a mount. I was looking for a way to allow Autoloader to load new files from any mount. Let's say I have these folders in my mount:
mnt/
├─ blob_container_1
├─ blob_container_2
When I use .load('/mnt/'), no new files are detected, but when I point at a folder individually, e.g. .load('/mnt/blob_container_1'), it works fine.
I want to load files from both mount paths using Autoloader (running continuously).
You can use the path to provide prefix patterns, for example:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", <format>) \
    .schema(schema) \
    .load("<base_path>/*/files")
For example, if you would like to parse only png files within a directory that contains files with different suffixes, you can do:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "binaryFile") \
    .option("pathGlobFilter", "*.png") \
    .load(<base_path>)
Refer to https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns

PySpark writeStream not showing output to console in Jupyter Lab

I am trying to display some streaming data (Twitter feeds) on the screen.
This is being done so I can better follow what is going on in Spark (debugging to a certain extent), but I am not getting any output.
Writing to a CSV file works fine for the same query, but nothing comes out on the console.
I am using Jupyter Lab.
The query is:
tweets_query = tweets \
    .selectExpr("cast(value as string)") \
    .select(f.from_json(f.col("value").cast("string"), schema).alias("tweets")) \
    .select("tweets.id", "tweets.text", "tweets.createdOnDate", "tweets.lang", "tweets.loc")
The part that writes to the screen:
query = tweets_query \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()
What am I missing?
You are missing the await. Add the following line after you start the query:
sparkSession.streams.awaitAnyTermination()
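For context, a sketch of what the answer suggests, shown in Scala (the PySpark equivalent is spark.streams.awaitAnyTermination(); tweetsQuery stands in for the question's tweets_query, and a session named spark is assumed):
val query = tweetsQuery.writeStream
  .format("console")
  .outputMode("append")
  .option("truncate", "false")
  .start()

// Block the driver so the console sink keeps printing micro-batches until the query stops.
spark.streams.awaitAnyTermination()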

Spark submit to pass unicode character

How can I pass a unicode character via the spark-submit config?
While passing the unicode character \u001D as the CSV delimiter via spark-submit, it throws the error below:
Unsupported special character for delimiter: \u001D. null()
spark-submit \
  --conf spark.csv.delimeter="\u001D" \
The code below works in spark-shell:
val df = spark.read.option("sep","\u001D").option("header", "false").csv("PATH")
Is there any option to pass a unicode character via spark-submit?
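One possible workaround, sketched below (not from the original post): since the conf value arrives as the literal six-character text \u001D rather than the actual control character, read the custom conf in code and unescape it before handing it to the CSV reader.
// "spark.csv.delimeter" is the custom conf name used above, passed with --conf at submit time.
val raw = spark.sparkContext.getConf.get("spark.csv.delimeter")       // the delimiter text as passed on the command line
val sep = Integer.parseInt(raw.stripPrefix("\\u"), 16).toChar.toString

val df = spark.read
  .option("sep", sep)
  .option("header", "false")
  .csv("PATH")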

Found nothing in _spark_metadata

I am trying to read CSV files from a specific folder and write the same contents to another CSV file in a different location on the local PC, for learning purposes. I can read the file and show the contents on the console. However, when I try to write it to another CSV file in the specified output directory, I only get a folder named "_spark_metadata" which contains nothing.
I paste the whole code here, step by step:
Creating the Spark session:
spark = SparkSession \
    .builder \
    .appName('csv01') \
    .master('local[*]') \
    .getOrCreate()
spark.conf.set("spark.sql.streaming.checkpointLocation", <String path to checkpoint location directory>)
userSchema = StructType().add("name", "string").add("age", "integer")
Read from the CSV file:
df = spark \
    .readStream \
    .schema(userSchema) \
    .option("sep", ",") \
    .csv(<String path to local input directory containing CSV file>)
Write to the CSV file:
df.writeStream \
    .format("csv") \
    .option("path", <String path to local output directory containing CSV file>) \
    .start()
In "String path to local output directory containing CSV file" I only get a folder _spark_metadata which contains no CSV file.
Any help on this is highly appreciated
You don't use readStream to read static data. You use it to read from a directory into which files are being added.
You only need spark.read.csv.
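A minimal batch (non-streaming) sketch of the same copy, in Scala for brevity (the PySpark calls mirror it; the placeholder paths are the question's, and an existing session named spark is assumed):
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")

val df = spark.read
  .schema(userSchema)
  .option("sep", ",")
  .csv("<String path to local input directory containing CSV file>")

// Batch write: no checkpoint location and no _spark_metadata folder are involved.
df.write
  .mode("overwrite")   // so a rerun does not fail on an existing directory
  .csv("<String path to local output directory>")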
