Spark Dataframe join: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

I have a dataframe:
As can be seen, 'origin' is one of the columns.
I am trying to join this dataframe with another one:
myDF.join(
  anotherDF,
  anotherDF.col("IATA") === $"origin"
).select("City", "State", "date", "delay", "distance", "destination").show()
But I get error:
Exception in thread "main" java.lang.RuntimeException: Unsupported
literal type class
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute 'origin
The error message is pretty clear that 'origin' can't be resolved.
What am I missing here, as 'origin' is very much present in the dataframe?
Edit: Spark version is 3.0.0
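One thing worth checking is whether the condition resolves cleanly when both sides are qualified through their parent DataFrames. A minimal sketch using the column names from the question (not a verified fix for this exact stack trace, just the unambiguous way to spell the condition):
// Qualify each side of the join condition through the DataFrame it belongs to,
// so 'origin' is resolved against myDF instead of being left unresolved.
val joined = myDF.join(
  anotherDF,
  myDF("origin") === anotherDF("IATA")
)

joined
  .select("City", "State", "date", "delay", "distance", "destination")
  .show()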

Left Join errors out: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans

df_joint = df_raw.join(df_items, on='x', how='left')
The titled exception occurred in Apache Spark 2.4.5.
df_raw has data in two columns, "x" and "y", and df_items is an empty dataframe whose schema has some other columns.
The left join therefore matches values against nothing, which should return all the data from the first dataframe with null columns from the second dataframe.
It works completely fine when "x" is a float; however, when I cast "x" to string it throws the implicit cartesian product error.
Why is this happening, and how can it be resolved without enabling the Spark cross join:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
Might be a bug in Spark, but if you just want to add columns, you can do the following:
import pyspark.sql.functions as F

df_joint = df_raw.select(
    '*',
    *[F.lit(None).alias(c) for c in df_items.columns if c not in df_raw.columns]
)

Handling corrupted data in Pyspark dataframe

I have data which I need to handle using a PySpark dataframe even when it is corrupted. I tried using PERMISSIVE mode but I am still getting an error. The same code can read the file when account_id has some data.
The data I have, where account_id (an integer) has no value:
{
    "Name:"
    "account_id":,
    "phone_number":1234567890,
    "transactions":[
        {
            "Spent":1000,
        },
        {
            "spent":1100,
        }
    ]
}
The code I tried:
df=spark.read.option("mode","PERMISSIVE").json("path\complex.json",multiLine=True)
df.show()
The error and warning I get:
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
How can I read corrupted data in Pyspark Dataframe?
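Since the error itself suggests caching the parsed result before querying the corrupt-record column, one approach is to supply an explicit schema that includes _corrupt_record, read in PERMISSIVE mode, cache, and then inspect the bad rows. A minimal sketch (shown in Scala; the same reader options exist in PySpark), with the schema below assumed from the sample document:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumed schema based on the sample JSON above, plus the corrupt-record column.
val schema = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("account_id", IntegerType, nullable = true),
  StructField("phone_number", LongType, nullable = true),
  StructField("transactions", ArrayType(StructType(Seq(
    StructField("Spent", IntegerType, nullable = true)
  ))), nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("multiLine", true)
  .json("path/complex.json")   // path from the question
  .cache()                     // cache before querying _corrupt_record, as the error advises

df.show()
df.filter(col("_corrupt_record").isNotNull).show(false)   // rows that failed to parse
Note that with multiLine=true a malformed document usually ends up as a single row whose typed fields are null and whose _corrupt_record holds the raw text.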

SPARK How to get column name in case of error (FAILFAST mode)?

I have the following code:
spark.read()
    .schema(REPORT_OUTPUT_SCHEMA)
    .option("header", "true")
    .option("mode", "FAILFAST")
    .csv(filePath)
    .show();
and I get the following error:
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. Caused by: java.lang.NumberFormatException: For input string: "0.00"
It's not possible to determine which column causes the error when we have a big schema.
How does one usually find the offending column in that case?
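The NumberFormatException on "0.00" already hints that a column declared with an integral type in REPORT_OUTPUT_SCHEMA is receiving decimal values. To find which one, a common tactic is to re-read the file in PERMISSIVE mode with a corrupt-record column, so the rows that fail under FAILFAST are kept together with their raw text. A sketch (in Scala, assuming the same REPORT_OUTPUT_SCHEMA StructType and filePath as above):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same schema plus a column to capture the raw text of malformed rows.
val permissiveSchema = StructType(
  REPORT_OUTPUT_SCHEMA.fields :+ StructField("_corrupt_record", StringType, nullable = true)
)

val badRows = spark.read
  .schema(permissiveSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv(filePath)
  .cache()                                      // cache before filtering on _corrupt_record
  .filter(col("_corrupt_record").isNotNull)

badRows.show(false)   // fields that are null here but populated in the raw record point at the offending column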

org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;

Below is my Streaming Data Frame created from a weblog file:
val finalDf = joinedDf
  .groupBy(window($"dateTime", "10 seconds"))
  .agg(
    max(col("datetime")).as("visitdate"),
    count(col("ipaddress")).as("number_of_records"),
    collect_list("ipaddress").as("ipaddress")
  )
  .select(col("window"), col("visitdate"), col("number_of_records"), explode(col("ipaddress")).as("ipaddress"))
  .join(joinedDf, Seq("ipaddress"))
  .select(
    col("window"),
    col("category").as("category_page_category"),
    col("category"),
    col("calculation1"),
    hour(col("dateTime")).as("hour_label").cast("String"),
    col("dateTime").as("date_label").cast("String"),
    minute(col("dateTime")).as("minute_label").cast("String"),
    col("demography"),
    col("fullname").as("full_name"),
    col("ipaddress"),
    col("number_of_records"),
    col("endpoint").as("pageurl"),
    col("pageurl").as("page_url"),
    col("username"),
    col("visitdate"),
    col("productname").as("product_name")
  )
  .dropDuplicates()
  .toDF()
No aggregation is performed on this DataFrame earlier in the pipeline.
I have applied aggregation only once, but I still get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets;
There are indeed two aggregations here. The first one is explicit:
.groupBy(...).agg(...)
the second one is required for
.dropDuplicates()
which is implemented as
.groupBy(...).agg(first(...), ...)
You'll have to redesign your pipeline.
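One possible direction, sketched below with assumed deduplication keys and watermark (whether chaining a streaming deduplication with an aggregation is acceptable depends on your Spark version and correctness requirements), is to deduplicate the input stream before the single groupBy/agg and drop the trailing .dropDuplicates():
import org.apache.spark.sql.functions._

// Deduplicate the raw events first; the key columns and the watermark threshold are assumptions.
val dedupedDf = joinedDf
  .withWatermark("dateTime", "10 minutes")
  .dropDuplicates("ipaddress", "dateTime", "endpoint")

val finalDf = dedupedDf
  .groupBy(window(col("dateTime"), "10 seconds"))
  .agg(
    max(col("dateTime")).as("visitdate"),
    count(col("ipaddress")).as("number_of_records"),
    collect_list("ipaddress").as("ipaddress")
  )
// ...the rest of the select/join stays as before, minus the final .dropDuplicates()
Alternatively, if the aggregated output contains no genuine duplicates, simply removing the trailing .dropDuplicates() eliminates the second aggregation.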

Spark SQL - Cast to UUID of the Dataset Column throws Parse Exception

Dataset<Row> finalResult = df.selectExpr("cast(col1 as uuid())", "col2");
When we try to cast the column in the dataset to UUID and persist it to Postgres, I see the following exception. Please suggest an alternative way to convert a dataset column to UUID.
java.lang.RuntimeException: org.apache.spark.sql.catalyst.parser.ParseException:
DataType uuid() is not supported.(line 1, pos 21)
== SQL ==
cast(col1 as UUID)
---------------------^^^
Spark has no uuid type, so casting to one is just not going to work.
You can try to use the database.column.type metadata property, as explained in Custom Data Types for DataFrame columns when using Spark JDBC and in SPARK-10849.
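A sketch of what attaching that metadata before the JDBC write could look like (the "database.column.type" key follows the linked question and SPARK-10849; whether it is honored depends on the Spark version and JDBC dialect discussed there, and the connection details below are placeholders):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Tag the column with the desired database-side type.
val uuidMeta = new MetadataBuilder()
  .putString("database.column.type", "uuid")
  .build()

val withMeta = df.withColumn("col1", col("col1").as("col1", uuidMeta))

withMeta.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder
  .option("dbtable", "target_table")                 // placeholder
  .option("user", "username")                        // placeholder
  .option("password", "secret")                      // placeholder
  .save()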
