zeppelin spark read parquet mysql write sql exception - apache-spark

After reading a parquet file in Apache Spark,
I selected the fields with Spark SQL.
Even though the fields are set up to match the database table I created (types included), an SQLException occurs in the write.jdbc part.
The questionable part of the message is: use near '"URL" TEXT , "GOD_NO" TEXT , "BRND_NM" TEXT , "CTGRY_NM" TEXT , "EVSP_ID" TEX...' at line 1,
so it seems a table is being created. I'd like to know what I did wrong, thanks.
[screenshots: spark result, zeppelin interpreter]
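For reference, a minimal sketch of the flow the question describes, assuming PySpark; the column names come from the error message, while the parquet path, view name, JDBC URL, table name and credentials are placeholders, not values from the question:

df = spark.read.parquet("/path/to/parquet")       # placeholder path
df.createOrReplaceTempView("events")              # hypothetical view name
selected = spark.sql(
    "SELECT URL, GOD_NO, BRND_NM, CTGRY_NM, EVSP_ID FROM events")
selected.write.jdbc(
    url="jdbc:mysql://host:3306/mydb",            # placeholder MySQL URL
    table="my_table",                             # placeholder, pre-created table
    mode="append",                                # write into the existing table
    properties={"driver": "com.mysql.jdbc.Driver",
                "user": "user", "password": "password"})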

Related

CassandraBeamIO conversion into a PCollection of Rows

I am trying to read data from a Cassandra DB using Apache Beam's CassandraIO. My requirement is creating a PCollection of Rows from the Cassandra DB; currently my code looks like this:
PTransform<PBegin, PCollection<Row>> transform = CassandraIO.<Row>read()
        .withHosts(Collections.singletonList("127.0.0.1"))
        .withPort(9042)
        .withKeyspace("\"testDb\"")
        .withMapperFactoryFn(new CassandraRowMapper())
        .withQuery(q)
        .withTable("student")
        .withEntity(Row.class)
        .withCoder(SerializableCoder.of(Row.class));
Any help will be appreciated.

jdbc doesn't insert the data after the first run in Overwrite mode

I am trying to write data to MySQL from my Spark job. While writing the data to the MySQL table, Spark does nothing when the table is already present in the MySQL DB. If I drop the table and then run my job, the table is created and the data is inserted as expected. However, when I re-run the job, the data is not inserted, and I get no Spark error message on the console.
df = spark.read.parquet("xxx/xxx/xxx")
df.write.jdbc(url=connection_str, table="employee", mode="overwrite",
              properties={"driver": "com.mysql.jdbc.Driver"})
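As a hedged sketch only: if the intent is to keep the existing MySQL table definition and merely replace its rows, Spark's JDBC writer also accepts a truncate option so that overwrite truncates the table instead of dropping and recreating it. The path and connection string below are the question's placeholders, and whether this changes the observed behaviour depends on the driver and permissions:

df = spark.read.parquet("xxx/xxx/xxx")
(df.write
   .option("truncate", "true")     # truncate instead of drop + recreate on overwrite
   .jdbc(url=connection_str,
         table="employee",
         mode="overwrite",
         properties={"driver": "com.mysql.jdbc.Driver"}))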

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive Orc table with a definition similar to the following definition
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a dataframe to this table using "df.write.insertInto("example.example_table")". When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is the 'path' property you are defining in SERDEPROPERTIES that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so you end up with that default location plus the path you specified, which is what causes the linked code to fail.
If you want to pick a specific path other than the default, then try using LOCATION instead.
Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be the Detailed Table Information, which contains a lot of useful detail:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
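The same metadata can also be pulled from PySpark rather than the Hive CLI; a small sketch using the table name from the question:

# Shows the table location and the serde 'path' property discussed above.
spark.sql("DESCRIBE EXTENDED example.example_table").show(100, truncate=False)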
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solved the issue by setting the following Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or alternatively
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
where spark is the SparkSession object. The explanation of this is currently in the Spark documentation.
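If you would rather set it once when the session is created instead of per query, a sketch in PySpark (the app name is just a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-parquet-write")   # placeholder app name
         .enableHiveSupport()
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .getOrCreate())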

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive external table from Spark SQL.
I created the Hive external table with the following command:
CREATE EXTERNAL TABLE tab1 ( col1 type,col2 type ,col3 type) CLUSTERED BY (col1,col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my Spark job, I have written the following code:
Dataset df = session.read().option("header","true").csv(csvInput);
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.saveAsTable(hiveTableName);
Each time I run this code, I get the following exception:
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should specify a save mode when saving the data into Hive.
df.write.mode(SaveMode.Append)
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);
Spark provides the following save modes (sketched in PySpark after the list):
ErrorIfExists: throws an exception if the target already exists; if the target doesn't exist, writes the data out.
Append: if the target already exists, appends the data to it; otherwise writes the data out.
Overwrite: if the target already exists, deletes the target first, then writes the data out.
Ignore: if the target already exists, silently skips the write; otherwise writes the data out.
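For reference, a sketch of how the same modes are usually passed as strings through the PySpark writer; each line is an independent example, and the table name is a placeholder:

df.write.mode("errorifexists").saveAsTable("db.tab1")  # ErrorIfExists (default)
df.write.mode("append").saveAsTable("db.tab1")         # Append
df.write.mode("overwrite").saveAsTable("db.tab1")      # Overwrite
df.write.mode("ignore").saveAsTable("db.tab1")         # Ignore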
You are using the saveAsTable API, which creates the table in Hive. Since you have already created the Hive table through the DDL command, the table tab1 already exists, so when the Spark API tries to create it, it throws an error saying the table already exists: org.apache.spark.sql.AnalysisException: Table `tab1` already exists.
Either drop the table and let the Spark saveAsTable API create the table itself, or use the insertInto API to insert into the existing Hive table:
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table which is stored as Parquet. I am not the owner of the schema in which this Hive/Parquet table lives, so I don't have much info about it.
The problem here is that when I try to query that table from the spark-sql> shell prompt (not via Scala like spark.read.parquet("path")), I get 0 records and the message "Unable to infer schema". But when I created a managed table using CTAS in my personal schema, just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong between the external Hive table, Parquet, and spark-sql (shell).
If locating the schema were the issue, then it should behave the same when accessing it through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0.
Please suggest what the issue could be.
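For context, the two access paths being compared in the question, written out as a PySpark sketch (the table name is a placeholder; the file path is the one from the question):

# Path 1: through the Hive metastore table definition (the one returning 0 records).
spark.sql("SELECT * FROM some_schema.parquet_table LIMIT 10").show()   # placeholder names

# Path 2: reading the underlying parquet file directly (the one that shows the data).
spark.read.parquet("../../00000_0").show(10)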
