How to save a non-nullable dataframe to a Hive table?


I have a dataframe in Spark that contains a non-nullable column. When I save it to Hive and then read it back from Hive, the non-nullable column has become nullable. What might be wrong?
For some context, I am taking an existing dataframe and changing its schema to include the non-nullable property.
df = spark.table("myhive_table")
df.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
from pyspark.sql import types as spark_types
schema = spark_types.StructType([
    spark_types.StructField("col1", spark_types.StringType(), True),
    spark_types.StructField("col2", spark_types.StringType(), False),
])
df2 = spark.createDataFrame(df.rdd,schema)
df2.printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = false)
spark.sql('drop table myhive_table')
df2.write.saveAsTable("myhive_table", mode="overwrite")
spark.table("myhive_table").printSchema()
=> root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)

Related

Convert DataFrame Format

I have a dataframe in the format below:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and I want to convert it so that each map entry becomes its own row:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Example:
From:
1,12345, [pq -> r, ab -> c]
To:
1,12345, pq ,r
1,12345, ab ,c
I am trying this code, but it doesn't work:
val array2Df = array1Df.flatMap(line =>
line.getMap[String, String](2).map(
(line.getString(0),line.getString(1),_)
))

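Try the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructType}
val arrayData = Seq(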
Row("1","epoch_1",Map("epoch_1_key1"->"epoch_1_val1","epoch_1_key2"->"epoch_1_Val2")),
Row("2","epoch_2",Map("epoch_2_key1"->"epoch_2_val1","epoch_2_key2"->"epoch_2_Val2"))
)
val arraySchema = new StructType()
.add("Id",StringType)
.add("epoch", StringType)
.add("data", MapType(StringType,StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.printSchema()
df.show(false)
After that, explode the data column; each map entry becomes its own row with key and value columns. Don't forget the import:
import org.apache.spark.sql.functions.explode
df.select($"Id", $"epoch", explode($"data")).show(false)
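To tie this back to the layout asked for in the question, here is a minimal sketch that keeps both id and epoch next to the exploded map. It assumes a throwaway local session named session (in spark-shell or a notebook the existing spark would do), and the names source and exploded are only illustrative. The sample row is the one from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a throwaway local session; reuse your own `spark` if you already have one.
val session = SparkSession.builder().master("local[*]").appName("explode-map-sketch").getOrCreate()

// The example row from the question: 1,12345,[pq -> r, ab -> c]
val source = session
  .createDataFrame(Seq(("1", "12345", Map("pq" -> "r", "ab" -> "c"))))
  .toDF("id", "epoch", "data")

// explode on a map column emits one row per entry, in columns named `key` and `value`.
val exploded = source.select(col("id"), col("epoch"), explode(col("data")))

exploded.printSchema()
exploded.show(false)
```

Using col(...) instead of the $ syntax avoids the need for import spark.implicits._, which is a common reason for snippets like the one in the question failing to compile.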

Incorrect nullability of column after saving pyspark dataframe

When saving a pyspark dataframe with a new column added with 'withColumn' function, the nullability changes from false to true.
Version info : Python 3.7.3/Spark2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = false)
>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')
>>> spark.sql("select * from default.withcolTest").printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = true)
Why does the nullable flag of the column newCol added with withColumn function change when the dataframe is persisted?
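Spark's documentation for Parquet notes that columns are automatically converted to nullable for compatibility reasons, which matches the behaviour shown above if the table is stored in a file format such as Parquet. The sketch below is in Scala rather than PySpark and makes several assumptions: a local session named session, and a Parquet directory under a temp folder standing in for the Hive table. It reproduces the round trip and then re-applies the original schema after reading.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a throwaway local session; reuse your own `spark` if you already have one.
val session = SparkSession.builder().master("local[*]").appName("nullable-sketch").getOrCreate()

// Same shape as the example above: lit(...) of a non-null value yields a non-nullable column.
val original = session
  .createDataFrame(Seq(("Alice", 1L)))
  .toDF("_1", "_2")
  .withColumn("newCol", lit("newVal"))

// Assumption: a Parquet directory in a temp folder stands in for the Hive table.
val path = Files.createTempDirectory("nullable-sketch").resolve("table").toString
original.write.mode("overwrite").parquet(path)

// After the round trip every column is reported as nullable.
val readBack = session.read.parquet(path)

// Re-apply the original schema. Spark does not verify that the data really contains no nulls.
val restored = session.createDataFrame(readBack.rdd, original.schema)

readBack.printSchema()
restored.printSchema()
```

Re-applying the schema only changes the metadata, so it is safe only when you already know the column contains no nulls.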

could not find implicit value for parameter sparkSession

I have a notebook with the code below, which throws this error:
could not find implicit value for parameter sparkSession
import org.apache.spark.sql.{SparkSession, Row, DataFrame}
import org.apache.spark.ml.clustering.KMeans
def createBalancedDataframe(df:DataFrame, reductionCount:Int)(implicit sparkSession:SparkSession) = {
val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
val kMeansModel = kMeans.fit(df)
import sparkSession.implicits._
kMeansModel.clusterCenters.toList.map(v => (v, 0)).toDF("features", "label")
}
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
Error:
Name: Compile Error
Message: <console>:82: error: could not find implicit value for parameter sparkSession: org.apache.spark.sql.SparkSession
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
^
StackTrace:
It would be greatly appreciated if anyone can offer any help, thank you very much in advance.
UPDATE:
Thanks to Reddy's input, after I changed it to
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
I receive the following error:
Name: java.lang.IllegalArgumentException
Message: Field "features" does not exist.
Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
StackTrace: Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:93)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:254)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:340)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:305)
at createBalancedDataframe(<console>:45)
UPDATE2:
featureDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- distance: double (nullable = true)
|-- amt: integer (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
|-- category_indexed: double (nullable = false)
|-- category_encoded: vector (nullable = true)
|-- merchant_indexed: double (nullable = false)
|-- merchant_encoded: vector (nullable = true)
|-- features: vector (nullable = true)
val fraudDF = featureDF
.filter($"is_fraud" === 1)
.withColumnRenamed("is_fraud", "label")
.select("features", "label")
fraudDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- trans_num: string (nullable = true)
|-- trans_time: string (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- amt: integer (nullable = true)
|-- merch_lat: double (nullable = true)
|-- merch_long: double (nullable = true)
|-- distance: double (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
Why is the features column gone?

Assuming your SparkSession is named spark, you can pass it explicitly like this:
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
Alternatively, create an implicit reference (spark2 or any other name) in the calling scope. Example:
implicit val spark2 = spark
//some calls
// others
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
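The two options can be seen in isolation in the plain Scala sketch below. Session is a hypothetical stand-in for SparkSession so that the snippet runs without Spark, and describe is an illustrative function with the same two-parameter-list shape as createBalancedDataframe.

```scala
// Hypothetical stand-in for SparkSession so the sketch runs without Spark.
final case class Session(name: String)

// The second parameter list is implicit, like createBalancedDataframe in the question.
def describe(label: String)(implicit session: Session): String =
  s"$label via ${session.name}"

val spark = Session("spark")

// Option 1: pass the value explicitly in the second parameter list.
val explicitCall = describe("explicit")(spark)

// Option 2: put an implicit value of the right type in scope; the compiler supplies it.
implicit val implicitSession: Session = spark
val implicitCall = describe("implicit")

println(explicitCall)
println(implicitCall)
```

A plain val spark is not picked up on its own: the compiler only considers values that are themselves marked implicit, which is why the original call failed even though spark existed in the notebook.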

org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);

I am using Spark SQL to select a column along with the sum of another column.
Below is my query:
scala> spark.sql("select distinct _c3,sum(_c9).as(sumAadhar) from aadhar group by _c3 order by _c9 desc LIMIT 3").show
And my schema is :
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: double (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
And I am getting the error below:
org.apache.spark.sql.AnalysisException: Can't extract value from sum(_c9#30);
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:613)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
Any idea what I am doing wrong, or is there another way to sum the values of a column?

Check the example below, which was tried on a reduced schema:
scala> val df = Seq(("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)).toDF("_c3", "_c9")
df: org.apache.spark.sql.DataFrame = [_c3: string, _c9: int]
scala> df.createOrReplaceTempView("aadhar")
scala> spark.sql("select _c3,sum(_c9) as sumAadhar from aadhar group by _c3 order by sumAadhar desc LIMIT 3").show
+---+---------+
|_c3|sumAadhar|
+---+---------+
| c| 100|
| a| 14|
| b| 5|
+---+---------+
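Removed distinct, since it is not necessary: your original query already groups by _c3.
Ordered by the alias sumAadhar instead of the raw column _c9.
Changed sum(_c9).as(sumAadhar) to sum(_c9) as sumAadhar. The .as(...) form is DataFrame API syntax, not SQL; the ExtractValue frame in the stack trace suggests that Spark SQL read the dot as an attempt to extract a field from the result of sum(_c9).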
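For comparison, the .as(...) style is valid in the DataFrame API. The sketch below expresses the same aggregation that way on the reduced data from this answer. It assumes a throwaway local session named session, and the name top3 is only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

// Assumption: a throwaway local session; reuse your own `spark` if you already have one.
val session = SparkSession.builder().master("local[*]").appName("sum-alias-sketch").getOrCreate()

// The reduced data set used in the answer above.
val aadhar = session
  .createDataFrame(Seq(("a", 2), ("a", 3), ("b", 4), ("a", 9), ("b", 1), ("c", 100)))
  .toDF("_c3", "_c9")

// Here .as("sumAadhar") is a method on Column, so the alias works as intended.
val top3 = aadhar
  .groupBy("_c3")
  .agg(sum("_c9").as("sumAadhar"))
  .orderBy(desc("sumAadhar"))
  .limit(3)

top3.show()
```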

Got an error when using DataFrame.schema.fields.update

I want to cast two columns in my DataFrame. Here is my code:
val session = SparkSession
.builder
.master("local")
.appName("UDTransform").getOrCreate()
var df: DataFrame = session.createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
.toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this error happens and how I can solve it.
I would appreciate any help!

You cannot change the schema of a DataFrame in place, because DataFrames are immutable.
You can, however, cast the columns and assign the result to a new DataFrame.
Here is how you can do it (col comes from org.apache.spark.sql.functions):
val newDF = df.withColumn("CID", col("CID").cast("string"))
.withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code, df.schema.fields returns an array of StructField:
Array[StructField]
If you then call
df.schema.fields.update(0, StructField("CID", StringType))
this replaces the element at position 0 of that array, which I think is not what you wanted.
DataFrame.schema.fields.update does not cast any data; it only overwrites entries in the array of StructField returned by DataFrame.schema.fields. As the second printSchema output in the question shows, the reported types change to string while the rows still hold integers, and that mismatch is the likely cause of the toString compile error.
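When several columns need the same cast, the withColumn approach can be folded over a list of names. This is a minimal sketch under stated assumptions: a throwaway local session named session, a shortened copy of the question's data, and the illustrative names toCast and casted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

// Assumption: a throwaway local session, as in the question.
val session = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()

val df = session
  .createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112)))
  .toDF("CID", "Name", "STD")

// Hypothetical list of the columns to convert.
val toCast = Seq("CID", "STD")

// Each withColumn returns a new DataFrame; foldLeft threads it through the list.
val casted = toCast.foldLeft(df)((acc, name) => acc.withColumn(name, col(name).cast(StringType)))

df.printSchema()     // the original DataFrame is unchanged
casted.printSchema() // CID and STD are now strings
casted.show()
```

Because the cast goes through the query plan rather than the schema array, the data and the reported types stay consistent.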
Hope this helps
