I am new to Databricks. I have the following Scala code that adds a new column to an existing DataFrame: it checks a specific column value and, if it is null, assigns 0, otherwise 1.
val sv_string = "select * from temp_tbl_ASBReports r left join temp_tbl_ASBInvestigations i on r.nh_parentasbcase = i.nh_asbcaseid"
val df_asbreportssv = spark.sql(sv_string)
df_asbreportssv.withColumn("InInvestigation",when((df_asbreportssv("nh_parentasbcase").isNull), "0").otherwise("1"))
It throws the following error:
error: not found: value when
df_asbreportssv.withColumn("InInvestigation",when((df_asbreportssv("nh_parentasbcase").isNull), "1").otherwise("0"))
Could you please help?
Sharing the answer as per the comment by @Daniel; the original poster @user13033419 has confirmed that the issue is resolved.
You need to add "import org.apache.spark.sql.functions._" before the code.
Example:
import org.apache.spark.sql.functions._
val df = Seq(2, 3, 4).toDF("number")
df
.withColumn("number_factorial", factorial(col("number")))
.show()
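For a quick sanity check, show() on that example should print something like this (sketched from the input above):
+------+----------------+
|number|number_factorial|
+------+----------------+
|     2|               2|
|     3|               6|
|     4|              24|
+------+----------------+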
Modify your code accordingly:
import org.apache.spark.sql.functions._
val sv_string = "select * from temp_tbl_ASBReports r left join temp_tbl_ASBInvestigations i on r.nh_parentasbcase = i.nh_asbcaseid"
val df_asbreportssv = spark.sql(sv_string)
df_asbreportssv.withColumn("InInvestigation",when((df_asbreportssv("nh_parentasbcase").isNull), "0").otherwise("1"))
Can anyone please explain transform() and filter() in Spark SQL 2.4 with some advanced real-world use-case examples?
In a SQL query, is this only to be used with array columns, or can it also be applied to any column type in general? It would be great if anyone could demonstrate with a SQL query for an advanced application.
Thanks in advance.
I'm not going down the .filter road, as I cannot see the focus there.
For .transform, the following cases are covered:
dataframe transform at DF-level
transform on an array of a DF in v 2.4
transform on an array of a DF in v 3
Dataframe transform at DF-level
According to the official docs (https://kb.databricks.com/data/chained-transformations.html), transform on a DF can end up like spaghetti. Opinions can differ here.
This, they say, is messy:
...
def inc(i: Int) = i + 1
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
.transform(func1(1))
.transform(func2(2))
.withColumn("col3", expr("col2 + 3"))
Transform with a lambda function on an array of a DF in v 2.4, which needs the select and expr combination:
import org.apache.spark.sql.functions._
val df = Seq(Seq(Array(1,999),Array(2,9999)),
Seq(Array(10,888),Array(20,8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
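As a quick check of what this produces (output sketched as comments, assuming the exact input above):
// Each inner array is reduced to its second element:
// last_vals = [999, 9999]  for the first row
// last_vals = [888, 8888]  for the second row
df2.show(false)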
Transform with a lambda function (the new array function) on a DF in v 3, using withColumn:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = Seq(
(Array("New York", "Seattle")),
(Array("Barcelona", "Bangalore"))
).toDF("cities")
val df2 = df.withColumn("fun_cities", transform(col("cities"),
(col: Column) => concat(col, lit(" is fun!"))))
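Again as a quick check (output sketched as comments, assuming the input above):
// fun_cities = [New York is fun!, Seattle is fun!]
// fun_cities = [Barcelona is fun!, Bangalore is fun!]
df2.show(false)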
Try them.
Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
transform works similar to the map function in Scala. I’m not sure why
they chose to name this function transform… I think array_map would
have been a better name, especially because the Dataset#transform
function is commonly used to chain DataFrame transformations.
Update
If you want to use the %sql or display approach for higher-order functions, consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
I have a table in Cassandra, tfm.foehis, that contains data.
When I did the first load of data from Spark to Cassandra, I used this set of commands:
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val wkdir="/home/adminbigdata/tablas/"
val fileIn= "originales/22_FOEHIS2.csv"
val fileOut= "22_FOEHIS_PRE2"
val fileCQL= "22_FOEHISCQL"
val data = sc.textFile(wkdir + fileIn)
  .filter(!_.contains("----"))
  .map(_.trim.replaceAll(" +", ""))
  .map(_.dropRight(1))
  .map(_.drop(1))
  .map(_.replaceAll(",", ""))
  .filter(array => array(6) != "MOBIDI")
  .filter(array => array(17) != "")
  .saveAsTextFile(wkdir + fileOut)
val firstDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("mode", "DROPMALFORMED").option("delimiter", "|").load(wkdir + fileOut)
val columns: Array[String] = firstDF.columns
val reorderedColumnNames: Array[String] = Array("hoclic","hodtac","hohrac","hotpac","honrac","hocdan","hocdrs","hocdsl","hocol","hocpny","hodesf","hodtcl","hodtcm","hodtea","hodtra","hodtrc","hodtto","hodtua","hohrcl","hohrcm","hohrea","hohrra","hohrrc","hohrua","holinh","holinr","honumr","hoobs","hooe","hotdsc","hotour","housca","houscl","houscm","housea","houser","housra","housrc")
val secondDF= firstDF.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
secondDF.write.cassandraFormat("foehis", "tfm").save()
But when I load new data using the same script, I get errors and I don't know what's wrong.
This is the message:
java.lang.UnsupportedOperationException: 'SaveMode is set to ErrorIfExists and Table
tfm.foehis already exists and contains data.
Perhaps you meant to set the DataFrame write mode to Append?
Example: df.write.format.options.mode(SaveMode.Append).save()" '
The error message clearly tells you that you need to use Append mode and shows what you can do with it. In your case it happens because the destination table already exists and the write mode is set to "error if exists". If you still want to write the data, the code should be the following:
import org.apache.spark.sql.SaveMode
secondDF.write.cassandraFormat("foehis", "tfm").mode(SaveMode.Append).save()
Looking to try to do something like this:
I have a dataframe that is one column of IDs called ID_LIST. With that column of IDs I would like to pass each one into a Spark SQL call, looping through ID_LIST using foreach and returning the result to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First: I'm not sure how to pull the individual value out; I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later?
Thanks, any help is appreciated!
Answer To First Question
When you perform the "foreach" Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints the Row, the first row being "[123]". It is boxing [] the elements in the row. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc... try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously... read the documentation I have linked about Row in Spark. This will transform your code to:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
items.foreach(i => println(i.getString(0)))
})
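One caveat, not part of the original answer: foreach on a DataFrame runs on the executors, where sqlContext is generally not usable, so on a cluster this pattern can fail. A common workaround, sketched here in the same string-concatenation style and assuming getString(0) works for the id column, is to collect the IDs to the driver first and loop there:
// Bring the distinct IDs back to the driver, then issue one query per ID
val ids = id_list.collect().map(_.getString(0))
ids.foreach { id =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + id)
  items.show()
}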
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do, but I'll answer your question as I have interpreted it.
Let's create an empty dataframe to which we will union everything in a loop over the distinct items from the first dataframe.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
.add("col1", StringType, true)
// ss here is assumed to be your SparkSession
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)
// Collect the IDs to the driver, then loop, select, and union into the empty df.
// (The collect is needed because sqlContext cannot be called from inside an
// executor-side foreach, and the var df must be updated on the driver.)
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
// Assuming another_items_orc has been loaded as a DataFrame, e.g.
// val another_items_orc = sqlContext.table("another_items_orc")
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()
Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried to add a value to the dataframe using withColumn(), but it returns this error:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val dataframe = sqlContext.read.json(rdd.map(_._2))
  val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
})
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As I came to know, DataFrames cannot be altered since they are immutable, but RDDs are immutable as well.
Then what is the best way to do it?
How do I add a value to the RDD (adding a timestamp to an RDD dynamically)?
Try the current_timestamp function:
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant value like a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble getting the timestamp as a string. Here is a way to do that using the Spark 3 datetime format:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
I'm trying to use Spark 1.4 window functions in pyspark 1.4.1, but I'm getting mostly errors or unexpected results.
Here is a very simple example that I think should work:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
l = [(1,101),(2,202),(3,303),(4,404),(5,505)]
df = sqlContext.createDataFrame(l,["a","b"])
wSpec = Window.orderBy(df.a).rowsBetween(-1,1)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
==> Failure org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
===> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next")).collect()
[Row(a=1, prev=None, b=101, next=None), Row(a=2, prev=None, b=202, next=None), Row(a=3, prev=None, b=303, next=None)]
As you can see, if I add the rowsBetween frame specification, neither rank() nor lag()/lead() window functions recognize it: "Window function does not take a frame specification".
If I omit the rowsBetween frame specification, at least lag()/lead() do not throw exceptions but return unexpected (to me) results: always None. And rank() still doesn't work, failing with a different exception.
Can anybody help me to get my window functions right?
UPDATE
All right, this starts to look like a pyspark bug.
I have prepared the same test in pure Spark (Scala, spark-shell):
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val l: List[Tuple2[Int,Int]] = List((1,101),(2,202),(3,303),(4,404),(5,505))
val rdd = sc.parallelize(l).map(i => Row(i._1,i._2))
val schemaString = "a b"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, IntegerType, true)))
val df = sqlContext.createDataFrame(rdd, schema)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec = Window.orderBy("a").rowsBetween(-1,1)
df.select(df("a"), rank().over(wSpec).alias("rank"))
==> org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.;
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
val wSpec = Window.orderBy("a")
df.select(df("a"), rank().over(wSpec).alias("rank")).collect()
====> res10: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
====> res12: Array[org.apache.spark.sql.Row] = Array([1,null,101,202], [2,101,202,303], [3,202,303,404], [4,303,404,505], [5,404,505,null])
Even though rowsBetween cannot be applied in Scala either, both rank() and lag()/lead() work as I expect when rowsBetween is omitted.
As far as I can tell there are two different problems. Window frame definitions are simply not supported by Hive GenericUDAFRank, GenericUDAFLag and GenericUDAFLead, so the errors you see are expected behavior.
Regarding the issue with the following PySpark code:
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
it looks like it is related to my question https://stackoverflow.com/q/31948194/1560062 and should be addressed by SPARK-9978. For now you can make it work by changing the window definition to this:
wSpec = Window.partitionBy().orderBy(df.a)