Got an error when using DataFrame.schema.fields.update - apache-spark

I want to cast two columns in my DataFrame. Here is my code:
val session = SparkSession
.builder
.master("local")
.appName("UDTransform").getOrCreate()
var df: DataFrame = session.createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
.toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this ERROR happens and how I can solve it.
I would appreciate that very much!

You cannot update the schema of a DataFrame, since DataFrames are immutable.
But you can create a new DataFrame with the updated schema.
Here is how you can do it:
val newDF = df.withColumn("CID", col("CID").cast("string"))
.withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code,
df.schema.fields returns an Array[StructField].
If you then call
df.schema.fields.update(0, StructField("CID", StringType))
it only replaces the StructField at index 0 of that array, which is not what you want.
DataFrame.schema.fields.update does not update the DataFrame's schema; it only mutates the array of StructField returned by DataFrame.schema.fields. The columns themselves still hold integers, which is why code generation fails at runtime even though printSchema shows the new types.
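As an alternative to withColumn, the same cast can be written as a single select (just a sketch, assuming the column names from your question; castedDF is an illustrative name):
import org.apache.spark.sql.functions.col

val castedDF = df.select(
  col("CID").cast("string"),  // integer -> string
  col("Name"),                // already a string, left as is
  col("STD").cast("string")   // integer -> string
)
castedDF.printSchema()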
Hope this helps

Related

Convert DataFrame Format

I have my dataframe in the below format:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and I want to convert it into the following format, with one row per key/value pair:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Example:
From:
1,12345, [pq -> r, ab -> c]
To:
1,12345, pq ,r
1,12345, ab ,c
I am trying this code but it doesn't work:
val array2Df = array1Df.flatMap(line =>
line.getMap[String, String](2).map(
(line.getString(0),line.getString(1),_)
))
Try the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructType}
val arrayData = Seq(
Row("1","epoch_1",Map("epoch_1_key1"->"epoch_1_val1","epoch_1_key2"->"epoch_1_Val2")),
Row("2","epoch_2",Map("epoch_2_key1"->"epoch_2_val1","epoch_2_key2"->"epoch_2_Val2"))
)
val arraySchema = new StructType()
.add("Id",StringType)
.add("epoch", StringType)
.add("data", MapType(StringType,StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.printSchema()
df.show(false)
After that you need to explode the data column. Don't forget to
import org.apache.spark.sql.functions.explode
df.select($"Id",explode($"data")).show(false)
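If you also want to keep the epoch column and match the exact output shape from the question (id, epoch, key, value), here is a sketch building on the df above (explode on a map column yields key and value columns; the renamed columns are just for illustration):
df.select($"Id", $"epoch", explode($"data"))
  .toDF("id", "epoch", "key", "value")
  .show(false)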

Incorrect nullability of column after saving pyspark dataframe

When saving a pyspark dataframe with a new column added with the 'withColumn' function, the nullability changes from false to true.
Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = false)
>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')
>>> spark.sql("select * from default.withcolTest").printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = true)
Why does the nullable flag of the newCol column, added with the withColumn function, change when the dataframe is persisted?

Handle string to array conversion in pyspark dataframe

I have a file (csv) which, when read into a spark dataframe, has the below value for printSchema:
-- list_values: string (nullable = true)
the values in the column list_values are something like:
[[[167, 109, 80, ...]]]
Is it possible to convert this to array type instead of string?
I tried splitting it and using code available online for similar problems:
df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))
but if I run the above code, the array I get back skips a lot of values from the original array, i.e. the output of the above code is:
[, 109, 80, 69, 5...
which is different from the original array (the 167 is missing):
[[[167, 109, 80, ...]]]
Since I am new to spark, I don't have much knowledge of how this is done (for Python I could have used ast.literal_eval, but spark has no provision for this).
So I'll repeat the question again:
How can I convert/cast an array stored as a string to an actual array, i.e.
'[]' to [] conversion?
Suppose your DataFrame was the following:
df.show()
#+----+------------------+
#|col1| col2|
#+----+------------------+
#| a|[[[167, 109, 80]]]|
#+----+------------------+
df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":
from pyspark.sql.functions import split, regexp_replace
df2 = df.withColumn(
"col3",
split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()
#+----+------------------+--------------+
#|col1| col2| col3|
#+----+------------------+--------------+
#| a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: string (containsNull = true)
If you wanted the column as an array of integers, you could use cast:
from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: integer (containsNull = true)

How to save a non-nullable dataframe to a hive table?

I have a dataframe in spark which has a non-nullable column in it. When I save it to Hive and then read it back from Hive, the non-nullable column is nullable. What might be wrong?
For some context, I am taking an existing dataframe and changing its schema to include that non-nullable property.
df = spark.table("myhive_table")
df.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
from pyspark.sql import types as spark_types
from pyspark.sql.types import StructField, StructType
schema = StructType([StructField("col1", spark_types.StringType(), True),
                     StructField("col2", spark_types.DoubleType(), False),
                     ])
df2 = spark.createDataFrame(df.rdd,schema)
df2.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = false)
spark.sql('drop table myhive_table')
df.write.saveAsTable("myhive_table",overwrite = True)
spark.table("myhive_table").printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)

How to convert Timestamp to Date format in DataFrame?

I have a DataFrame with a Timestamp column, which I need to convert to Date format.
Is there any Spark SQL function available for this?
You can cast the column to date:
Scala:
import org.apache.spark.sql.types.DateType
val newDF = df.withColumn("dateColumn", df("timestampColumn").cast(DateType))
Pyspark:
df = df.withColumn('dateColumn', df['timestampColumn'].cast('date'))
In SparkSQL:
SELECT
CAST(the_ts AS DATE) AS the_date
FROM the_table
Imagine the following input:
import org.apache.spark.sql.functions.current_timestamp

val dataIn = spark.createDataFrame(Seq(
  (1, "some data"),
  (2, "more data")))
  .toDF("id", "stuff")
  .withColumn("ts", current_timestamp())
dataIn.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
You can use the to_date function:
import org.apache.spark.sql.functions.to_date
val dataOut = dataIn.withColumn("date", to_date($"ts"))
dataOut.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
|-- date: date (nullable = false)
dataOut.show(false)
+---+---------+-----------------------+----------+
|id |stuff |ts |date |
+---+---------+-----------------------+----------+
|1 |some data|2017-11-21 16:37:15.828|2017-11-21|
|2 |more data|2017-11-21 16:37:15.828|2017-11-21|
+---+---------+-----------------------+----------+
I would recommend preferring these methods over casting and plain SQL.
For Spark 2.4+:
import org.apache.spark.sql.types.DateType
import spark.implicits._
val newDF = df.withColumn("dateColumn", $"timestampColumn".cast(DateType))
or
import org.apache.spark.sql.functions.col
val newDF = df.withColumn("dateColumn", col("timestampColumn").cast(DateType))
Best thing to use, tried and tested:
df_join_result.withColumn('order_date', df_join_result['order_date'].cast('date'))
