How to save a non-nullable dataframe to a Hive table? - apache-spark

I have a dataframe in Spark with a non-nullable column. When I save it to Hive and then read it back, the non-nullable column has become nullable. What might be wrong?
For some context, I am taking an existing dataframe and changing its schema to include the non-nullable property.
df = spark.table("myhive_table")
df.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
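from pyspark.sql import types as spark_types
from pyspark.sql.types import StructType, StructField
# col2 stays a string, matching the table data; only its nullable flag changes
schema = StructType([StructField("col1", spark_types.StringType(), True),
StructField("col2", spark_types.StringType(), False),
])
df2 = spark.createDataFrame(df.rdd, schema)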
df2.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = false)
spark.sql('drop table myhive_table')
# save the dataframe with the new schema (df2); the write mode is passed as mode=
df2.write.saveAsTable("myhive_table", mode="overwrite")
spark.table("myhive_table").printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
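To pinpoint which columns lost their non-nullable flag in a round trip like this, the two schemas can be compared field by field. The sketch below is plain Python and does not call Spark: the Field tuple is a hypothetical stand-in for a schema field, and nullability_changes is an illustrative helper name, not a Spark API. Spark's documentation notes that Parquet columns are converted to nullable for compatibility reasons, which would be consistent with the output above, although the question does not state the table's storage format.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field; only .name and .nullable are used.
Field = namedtuple("Field", ["name", "nullable"])


def nullability_changes(before_fields, after_fields):
    """Return the names of columns whose nullable flag differs between two schemas."""
    after = {f.name: f.nullable for f in after_fields}
    return [
        f.name
        for f in before_fields
        if f.name in after and after[f.name] != f.nullable
    ]


# Values copied from the printSchema() output in the question.
before = [Field("col1", True), Field("col2", False)]  # df2, before saving
after = [Field("col1", True), Field("col2", True)]    # table read back from Hive

changed = nullability_changes(before, after)
print(changed)
```

With real dataframes, the same helper should accept df2.schema.fields and spark.table("myhive_table").schema.fields, since PySpark's StructField objects expose name and nullable attributes.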

Related

Convert DataFrame Format

I have my dataframe in the format below:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and I want to convert it so that each map entry becomes its own row:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Example:
From:
1,12345, [pq -> r, ab -> c]
To:
1,12345, pq ,r
1,12345, ab ,c
I am trying this code, but it doesn't work:
val array2Df = array1Df.flatMap(line =>
line.getMap[String, String](2).map(
(line.getString(0),line.getString(1),_)
))
Try the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val arrayData = Seq(
Row("1","epoch_1",Map("epoch_1_key1"->"epoch_1_val1","epoch_1_key2"->"epoch_1_Val2")),
Row("2","epoch_2",Map("epoch_2_key1"->"epoch_2_val1","epoch_2_key2"->"epoch_2_Val2"))
)
val arraySchema = new StructType()
.add("Id",StringType)
.add("epoch", StringType)
.add("data", MapType(StringType,StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.printSchema()
df.show(false)
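After that, you need to explode the data column, which produces one row per map entry with key and value columns. Don't forget the imports:
import org.apache.spark.sql.functions.explode
import spark.implicits._ // enables the $"column" syntax
df.select($"Id", $"epoch", explode($"data")).show(false)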
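The row-level effect of exploding a map column can be modelled without Spark. The sketch below is plain Python rather than Spark code, and explode_map_rows is a hypothetical helper name; it only illustrates the From/To transformation asked for in the question, using the sample row given there.

```python
def explode_map_rows(rows):
    """Yield one (id, epoch, key, value) tuple per entry of each row's map."""
    for row_id, epoch, data in rows:
        for key, value in data.items():
            yield (row_id, epoch, key, value)


# Sample row from the question: 1,12345, [pq -> r, ab -> c]
rows = [("1", "12345", {"pq": "r", "ab": "c"})]

exploded = list(explode_map_rows(rows))
for row in exploded:
    print(row)
```

Each input row yields as many output rows as its map has entries, with id and epoch repeated on every one.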

Incorrect nullability of column after saving pyspark dataframe

When saving a PySpark dataframe that has a new column added with the withColumn function, the column's nullability changes from false to true.
Version info : Python 3.7.3/Spark2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = false)
>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')
>>> spark.sql("select * from default.withcolTest").printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = true)
Why does the nullable flag of the column newCol added with withColumn function change when the dataframe is persisted?

could not find implicit value for parameter sparkSession

I have a notebook with the code below, which throws this error:
could not find implicit value for parameter sparkSession
import org.apache.spark.sql.{SparkSession, Row, DataFrame}
import org.apache.spark.ml.clustering.KMeans
def createBalancedDataframe(df:DataFrame, reductionCount:Int)(implicit sparkSession:SparkSession) = {
val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
val kMeansModel = kMeans.fit(df)
import sparkSession.implicits._
kMeansModel.clusterCenters.toList.map(v => (v, 0)).toDF("features", "label")
}
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
Error:
Name: Compile Error
Message: <console>:82: error: could not find implicit value for parameter sparkSession: org.apache.spark.sql.SparkSession
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
^
StackTrace:
It would be greatly appreciated if anyone can offer any help, thank you very much in advance.
UPDATE:
Thanks to Reddy's input, after I changed it to
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
I receive the following error:
Name: java.lang.IllegalArgumentException
Message: Field "features" does not exist.
Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
StackTrace: Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:93)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:254)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:340)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:305)
at createBalancedDataframe(<console>:45)
UPDATE2:
featureDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- distance: double (nullable = true)
|-- amt: integer (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
|-- category_indexed: double (nullable = false)
|-- category_encoded: vector (nullable = true)
|-- merchant_indexed: double (nullable = false)
|-- merchant_encoded: vector (nullable = true)
|-- features: vector (nullable = true)
val fraudDF = featureDF
.filter($"is_fraud" === 1)
.withColumnRenamed("is_fraud", "label")
.select("features", "label")
fraudDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- trans_num: string (nullable = true)
|-- trans_time: string (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- amt: integer (nullable = true)
|-- merch_lat: double (nullable = true)
|-- merch_long: double (nullable = true)
|-- distance: double (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
Why is the features column gone?
Assuming your SparkSession is named spark, you can pass it explicitly this way:
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
Or create an implicit reference (spark2, or any other name) in the calling scope. Example:
implicit val spark2: SparkSession = spark
//some calls
// others
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)

org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);

I am using Spark SQL to select a column along with the sum of another column.
Below is my query:
scala> spark.sql("select distinct _c3,sum(_c9).as(sumAadhar) from aadhar group by _c3 order by _c9 desc LIMIT 3").show
And my schema is:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: double (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
And I am getting the error below:
org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:613)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
Any idea what I am doing wrong, or is there another way to sum the values of a column?
Here is a version tested on a reduced schema:
scala> val df = Seq(("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)).toDF("_c3", "_c9")
df: org.apache.spark.sql.DataFrame = [_c3: string, _c9: int]
scala> df.createOrReplaceTempView("aadhar")
scala> spark.sql("select _c3,sum(_c9) as sumAadhar from aadhar group by _c3 order by sumAadhar desc LIMIT 3").show
+---+---------+
|_c3|sumAadhar|
+---+---------+
| c| 100|
| a| 14|
| b| 5|
+---+---------+
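Removed distinct, since it is not necessary: your original query already groups by _c3.
Changed sum(_c9).as(sumAadhar) to sum(_c9) as sumAadhar. The .as(...) form is DataFrame API syntax; inside a SQL string the dot after sum(_c9) is treated as a field access on the aggregate's result, which is what the "Can't extract value" error reports.
The ORDER BY now uses the alias sumAadhar instead of the raw _c9 column, which is not part of the grouped output.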
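The corrected query is ordinary SQL, so its shape can be checked without a Spark session. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption for illustration; SQLite is not Spark SQL and the two differ in other respects) with the same reduced data as above.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view "aadhar".
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE aadhar ("_c3" TEXT, "_c9" INTEGER)')
conn.executemany(
    "INSERT INTO aadhar VALUES (?, ?)",
    [("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)],
)

# Alias the aggregate with AS, then order by that alias.
top3 = conn.execute(
    "SELECT _c3, SUM(_c9) AS sumAadhar "
    "FROM aadhar GROUP BY _c3 "
    "ORDER BY sumAadhar DESC LIMIT 3"
).fetchall()
conn.close()

print(top3)  # [('c', 100), ('a', 14), ('b', 5)]
```

The result matches the table printed by spark.sql(...).show above: grouping by _c3 already yields one row per key, so DISTINCT adds nothing.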

Got an error when using DataFrame.schema.fields.update

I want to cast two columns in my DataFrame. Here is my code:
val session = SparkSession
.builder
.master("local")
.appName("UDTransform").getOrCreate()
var df: DataFrame = session.createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
.toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this error happens and how I can solve it.
I would appreciate any help!
You cannot update the schema of a dataframe in place, since dataframes are immutable.
But you can apply the change and assign the result to a new dataframe.
Here is how you can do it:
import org.apache.spark.sql.functions.col
val newDF = df.withColumn("CID", col("CID").cast("string"))
.withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code,
df.schema.fields returns an array of StructField, that is,
Array[StructField]
so when you call
df.schema.fields.update(0, StructField("CID", StringType))
this replaces the element at position 0 of that Array[StructField], which I think is not what you wanted.
DataFrame.schema.fields.update does not cast the dataframe's columns; it only mutates the array of StructField returned by DataFrame.schema.fields. As the second printSchema() in the question shows, the reported schema then no longer matches the underlying integer data, which is likely why code generation fails.
Hope this helps
