How to pass unicode character via spark-submit config?
while passing unicode character \u001D as csv delimeter via spark-submit, it throws below error:
Unsupported special character for delimiter: \u001D. null()
spark-submit
--conf spark.csv.delimeter="\u001D" \
below code works in spark-shell
val df = spark.read.option("sep","\u001D").option("header", "false").csv("PATH")
any option to pass unicode character via spark-submit
Related
I am a beginner in coding. Currently trying to read a file (which was imported to HDFS using sqoop) with the help of pyspark. The spark job is not progressing and my jupyter pyspark kernel is like stuck. I am not sure whether I used the correct way to import the file to hdfs and whether the code used to read the file with spark is correct or not.
The sqoop import code I used is as follows
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --target-dir /user/root/Spar_Nord -m 1
The pyspark code I used is
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000", header = False, inferSchema = True)
Also please advice how we can know the file type that we imported with sqoop? I just assumed .csv and wrote the pyspark code.
Appreciate a quick help.
When pulling data into HDFS via sqoop, the default delimiter is the tab character. Sqoop creates a generic delimited text file based on the parameters passed into the sqoop command. To make the file output with a comma delimiter to match a generic csv format, you should add:
--fields-terminated-by <char>
So your sqoop command would look like:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --fields-terminated-by ',' --target-dir /user/root/Spar_Nord -m 1
I have a Spark job which reads some CSV file on S3 ,process and save the result as parquet files.These CSV contains Japanese text.
When I run this job on local, reading the S3 CSV file and write to parquet files into local folder, the japanese letters looks fine.
But when I ran this on my spark cluster, reading the same S3 CSV file and write parquet to HDFS , all the Japanese letters are garbled.
run on spark-cluster (data is garbled)
spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
run locally (data looks fine)
spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
As can be seen above, both spark-submit jobs points to the same S3 file, only different is when running on Spark cluster, the result is written to HDFS.
Reading CSV:
def readTeradataCSV(schema: StructType, path: String) : DataFrame = {
dataFrameReader.option("delimiter", "\u0001")
.option("header", "false")
.option("inferSchema", "false")
.option("multiLine","true")
.option("encoding", "UTF-8")
.option("charset", "UTF-8")
.schema(schema)
.csv(path)
}
This is how I write to parquet:
finalDf.write
.format("parquet")
.mode(SaveMode.Append)
.option("path", hdfsTablePath)
.option("encoding", "UTF-8")
.option("charset", "UTF-8")
.partitionBy(parCols: _*)
.save()
This is how data on HDFS looks like:
Any tips on how to fix this ?
Does the input CSV file has to be in UTF-8 encoding ?
** Update **
Found out its not related to Parquet, rather CSV loading. Asked a seperate question here :
Spark CSV reader : garbled Japanese text and handling multilines
Parquet format has no option for encoding or charset cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
Hence your code has no effect:
finalDf.write
.format("parquet")
.option("encoding", "UTF-8")
.option("charset", "UTF-8")
(...)
These options apply only for CSV, you should set them (or rather ONE of them since they are synonyms) when reading the source file.
Assuming you are using the Spark dataframe API to read the CSV; otherwise you are on your own.
I am running this line from scala shell
scala> spark-sql --jars /usr/local/spark/jars/sqlite-jdbc-3.23.1.jar;
My session
spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#421f447f
Strange problem happens
<console>:1: error: ';' expected but double literal found.
spark-sql --jars /usr/local/spark/jars/sqlite-jdbc-3.23.1.jar;
If I put quotes
scala> spark-sql --jars "/usr/local/spark/jars/sqlite-jdbc-3.23.1.jar";
<console>:1: error: ';' expected but string literal found.
spark-sql --jars "/usr/local/spark/jars/sqlite-jdbc-3.23.1.jar";
^
Why?
You are trying to access spark-sql cli from scala terminal,
Exit from scala terminal using (:q + enter),
Then from bash terminal access spark-sql cli
bash$ spark-sql --jars "/usr/local/spark/jars/sqlite-jdbc-3.23.1.jar"
(or)
You can initialize spark-shell with the jars then use spark.sql(...) to run your commands.
bash$ spark-shell --jars "/usr/local/spark/jars/sqlite-jdbc-3.23.1.jar"
scala> spark.sql("<sql_query>")
I am getting following error while executing sparkling-shell2.cmd bat file. I walked through and I am getting this error while executing spark-shell.cmd with following paramters
cd %TOPDIR%
%SPARK_HOME%/bin/spark-shell.cmd --jars %TOPDIR%/assembly/build/libs/%FAT_JAR% --driver-memory %DRIVER_MEMORY% --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m" %*
Error: The input line is too long.
How do I solve this issue?
Thanks
In Windows the maximum length of command line is 260 characters and you are hitting the limit.
Your options are to change multiple of %DEF% operators into one single operator and reduce the overall length with some experimentation.
I am trying to execute hive command using java code. My hive is installed on linux virtual machine and the java code is on a remote windows machine which acts as client. I am able to successfully call the hive commands like:
hive -e 'Select * from mytable;'
But when I tried using load command with syntax as :
hive -e 'LOAD DATA LOCAL INPATH '/home/mapr/file.csv' INTO TABLE mytable;'
It throws me an error saying "FAILED: ParseException line 1:23 mismatched input '/' expecting StringLiteral near 'INPATH' in load statement"
This seems to be a syntax error near the file path probable an escape character issue, because I am able to execute "Select * from mytable" without error.
Can anyone help me with the syntax for hive load command using hive -e ?
By looking at your error message, it is clear that you are using single quote escape character twice and massing up your hive command.
So now use single and double quote to distinguish escape character and it will work.
New hive statement can be given below:
hive -e 'LOAD DATA LOCAL INPATH "/home/mapr/file.csv" INTO TABLE mytable;'
Hope this help you!!!