Error while overwriting Cassandra table from PySpark - cassandra

I am attempting to overwrite data in Cassandra with a PySpark DataFrame, and I get this error: keyword can't be an expression
I am able to append the data with:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable").mode("append").save()
However, overwriting throws an error:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable", confirm.truncate="true").mode("overwrite").save()
Error: keyword can't be an expression

I found the solution: confirm.truncate contains a dot, so it cannot be passed as a Python keyword argument to options(); pass it through option() instead.
(df.write.format("org.apache.spark.sql.cassandra")
    .mode("overwrite").option("confirm.truncate", "true")
    .options(keyspace="ks", table="testtable")
    .save())
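If you prefer to keep everything in a single options() call, dictionary unpacking is another way to pass the dotted key, since it never appears as a literal keyword argument in the source; a minimal sketch with the same keyspace and table names:
(df.write.format("org.apache.spark.sql.cassandra")
    .mode("overwrite")
    .options(**{"keyspace": "ks", "table": "testtable", "confirm.truncate": "true"})  # dotted key passed via dict
    .save())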

Related

Databricks Error: AnalysisException: Incompatible format detected with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
This error usually occurs when the folder already contains data in another format, for example Parquet or CSV files that were written into it before. Remove the folder completely and try again.
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
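If the existing files in the folder are just stale output that can be discarded, another option is to remove the folder and rewrite it as Delta; a minimal sketch using the path from the question (dbutils is only available on Databricks):
dbutils.fs.rm("/mnt/lake/BASE/flights/Full/", True)  # recursively delete the old non-Delta files
fulldf.write.format("delta").mode("overwrite").save("/mnt/lake/BASE/flights/Full/")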

Writing spark.sql dataframe result to parquet file

I created the following Spark session:
from pyspark.sql import SparkSession
# creating the Spark session with Hive support
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and I am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting DataFrame from this query to HDFS as a Parquet file, it fails with an error.
I am able to save the resulting DataFrame of a simpler version of this query to the same path. The problem appears when I add functions such as count(), year(), etc.
What is the problem, and how can I save the results to HDFS?
The error is caused by the '(' characters in the generated column name 'year(CAST(plt_date AS DATE))'; Parquet does not allow characters such as parentheses in column names.
Rename the column before writing:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Refer to: Rename Spark Column
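Putting it together, a minimal sketch (the output path is a hypothetical example): give every computed column a plain name with toDF before writing the result to HDFS as Parquet.
result = spark.sql("select year(plt_date), month(plt_date), count(build), count(product) "
                   "from first_table full outer join second_table on key1=CONCAT('SS',key_2) "
                   "group by year(plt_date), month(plt_date)")
result = result.toDF("Year", "Month", "B_Count", "P_Count")  # plain column names that Parquet accepts
result.write.mode("overwrite").parquet("hdfs:///tmp/plt_summary")  # hypothetical output path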

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into a Hive table, I am getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains the same columns as the Hive table, and the column for which I am getting the error has the Date datatype both in my Java code and in Hive.
Java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(DimPolicy.class);
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder); // spark is the Spark session
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, namely [SPARK-26379] (Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp), which was fixed in 2.3.3, 2.4.1, and 3.0.0.
A solution is to switch to a version of Spark that is unaffected by the bug (or wait for a new release that includes the fix).
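To check whether the running cluster already includes the fix, you can print the Spark version; a trivial sketch, assuming an active SparkSession named spark:
print(spark.version)  # should be 2.3.3+, 2.4.1+, or 3.0.0+ to include the SPARK-26379 fix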

How to write into a Microsoft SQL Server table even if the table exists using PySpark

I have PySpark code which writes into a SQL Server database like this:
df.write.jdbc(url=url, table="AdventureWorks2012.dbo.people", properties=properties)
However, the problem is that I want to keep writing into the table people even if it already exists. I see in the Spark documentation that the possible mode options are error, append, overwrite, and ignore, yet all of them throw the error that the object already exists when the table is already in the database.
Spark throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.jdbc.
com.microsoft.sqlserver.jdbc.SQLServerException: There is already an object named 'people' in the database
Is there a way to write data into the table even if it already exists?
Please let me know if you need more explanation.
For me the issue was with Spark 1.5.2. The way it checks whether the table exists is by running SELECT 1 FROM $table LIMIT 1; if that query fails, Spark assumes the table doesn't exist and tries to create it. That query failed even when the table was there, because SQL Server does not support the LIMIT syntax.
This was changed to SELECT * FROM $table WHERE 1=0 in Spark 1.6.0, which is valid on SQL Server.
So the append and overwrite modes will not throw an error when the table already exists. From the Spark documentation (http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes): with SaveMode.Append, "when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data", and with SaveMode.Overwrite, "existing data is expected to be overwritten by the contents of the DataFrame". Depending on how you want to handle the existing table, one of these two should meet your needs.
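For example, a minimal sketch using the names already defined in the question (url and properties), switching to append mode:
# Append to the existing table instead of failing when it already exists.
df.write.jdbc(url=url, table="AdventureWorks2012.dbo.people", mode="append", properties=properties)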

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though the field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
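An equivalent way that avoids attribute access entirely is pyspark.sql.functions.col; a minimal sketch with the same DataFrame and column name:
from pyspark.sql.functions import col
records.filter(col("field_i") == 3)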
What I did was upgrade Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering approach now works, although I still can't explain why 1.3.0 has problems with it.
