
Spark sql how to execute sql command in a loop for every record in input DataFrame
I have a DataFrame with the following schema:
%> input.printSchema
root
|-- _c0: string (nullable = true)
|-- id: string (nullable = true)
I have another DataFrame on which I need to execute a SQL command:
val testtable = testDf.registerTempTable("mytable")
%>testDf.printSchema
root
|-- _1: integer (nullable = true)
sqlContext.sql(s"SELECT * from mytable WHERE _1=$id").show()
$id should come from the input DataFrame, and the SQL command should execute for every id in the input table.

Assuming you can work with a single new DataFrame containing all the rows of testDf that match the values present in the id column of input, you can do an inner join operation, as stated by Alberto:
val result = input.join(testDf, input("id") === testDf("_1"))
result.show()
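If you only want the mytable columns back (as in the SELECT * of your query), you can keep just the testDf side of the join result, for example:
// Drop the columns that came from input; only _1 from testDf (mytable) remains.
val matches = result.select(testDf("_1"))
matches.show()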
Now, if you want a new, different DataFrame for each distinct id present in input, the problem is considerably harder. If this is the case, I would suggest you make sure the data in your lookup table can be collected as a local list, so you can loop through its values and create a new DataFrame for each one, as you already thought (this is not recommended):
import org.apache.spark.sql.Row

// id is a string in the input schema, so parse it before comparing against _1.
val localArray: Array[Int] = input.map { case Row(_, id: String) => id.toInt }.collect
val result: Array[DataFrame] = localArray.map { i =>
  testDf.where(testDf("_1") === i)
}
Anyway, unless the lookup table is very small, I suggest that you adapt your logic to work with the single joined DataFrame of my first example.
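If the lookup set of ids really is small and you only need the matching rows (rather than one DataFrame per id), a hedged middle ground is to collect the ids and filter testDf once with isin (assuming the string ids are numeric, since _1 is an integer):
// Collect the ids to the driver, then filter testDf in a single pass.
val ids: Array[Int] = input.select("id").collect().map(_.getString(0).toInt)
val filtered = testDf.where(testDf("_1").isin(ids: _*))
filtered.show()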

Related

Can IF statement work correctly to build spark dataframe?

I have the following code, which uses an IF statement to build a dataframe conditionally.
Does this work as I expect?
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

df = sqlContext.read.option("badRecordsPath", badRecordsPath).json([data_path_1, s3_prefix + "batch_01/2/2019-04-28/15723921/15723921_15.json"])
if "scrape_date" not in df.columns:
    df = df.withColumn("scrape_date", lit(None).cast(StringType()))
Is this what you are trying to do?
val result = <SOME Dataframe I previously created>
scala> result.printSchema
root
|-- VAR1: string (nullable = true)
|-- VAR2: double (nullable = true)
|-- VAR3: string (nullable = true)
|-- VAR4: string (nullable = true)
scala> result.columns.contains("VAR3")
res13: Boolean = true
scala> result.columns.contains("VAR9")
res14: Boolean = false
So the "result" dataframe has columns "VAR1", "VAR2" and so on.
The next line shows that it contains "VAR3" (the result of the expression is "true"), but it does not contain a column called "VAR9" (the result of the expression is "false").
The above is scala, but you should be able to do the same in Python (sorry I did not notice you were asking about python when I replied).
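For completeness, here is a minimal Scala sketch of the conditional column logic itself, using the result dataframe above (the lit(null) cast mirrors the lit(None) in your Python snippet):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Add a null scrape_date column only if it is missing.
val withScrapeDate =
  if (result.columns.contains("scrape_date")) result
  else result.withColumn("scrape_date", lit(null).cast(StringType))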
In terms of execution, the if statement will execute locally on the driver node. As a rule of thumb, if something returns an RDD, DataFrame or DataSet, it will be executed in parallel on the executor(s). Since DataFrame.columns returns an Array, any processing of the list of columns will be done on the driver node (because an Array is neither an RDD, a DataFrame nor a DataSet).
Also note that RDDs, DataFrames and DataSets are evaluated lazily. That is, Spark "accumulates" the operations that generate these objects and only executes them when you do something that doesn't generate an RDD, DataFrame or DataSet, for example a show, a count or a collect. Part of the reason for doing this is so Spark can optimise the execution of the process. Another is so it only does what is actually needed to generate the answer.
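As a small illustration using the result dataframe above:
// No Spark job runs here: the filter is only recorded in the query plan.
val onlyPositive = result.filter(result("VAR2") > 0.0)
// count() is an action, so Spark now executes the accumulated plan.
val n = onlyPositive.count()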

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema:
|-- ts: long (nullable = true)
|-- id: integer (nullable = true)
|-- managers: array (nullable = true)
| |-- element: string (containsNull = true)
|-- projects: array (nullable = true)
| |-- element: string (containsNull = true)
df1 is created from an avro file while df2 is created from an equivalent parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to maintain the exact same ordering of the fields in both dataframes in order to make it work.
This is old and there are already some answers lying around but I just faced this problem while trying to make a union of two dataframes like in...
//Join 2 dataframes
val df = left.unionAll(right)
As others have mentioned, order matters. So just select the right dataframe's columns in the same order as the left dataframe's columns:
//Join 2 dataframes, but take columns in the same order
val df = left.unionAll(right.select(left.columns.map(col):_*))
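As a side note, if you are on Spark 2.3 or later, unionByName resolves columns by name rather than by position, which avoids the ordering issue entirely (not applicable to the 1.5/1.6 versions discussed here):
// Union by column name instead of by position (Spark 2.3+ only).
val dfByName = left.unionByName(right)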
I found the following PR on github
https://github.com/apache/spark/pull/11333.
That relates to UDF (user defined function) columns, which were not correctly handled during the union and would thus cause the union to fail. The PR fixes it, but it hasn't made it into Spark 1.6.2; I haven't checked on Spark 2.x yet.
If you're stuck on 1.6.x there's a stupid workaround: map the DataFrame to an RDD and back to a DataFrame.
// for a DF with 2 columns (Long, Array[Long])
val simple = dfWithUDFColumn
.map{ r => (r.getLong(0), r.getAs[Array[Long]](1))} // DF --> RDD[(Long, Array[Long])]
.toDF("id", "tags") // RDD --> back to DF but now without UDF column
// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()

Spark parquet nested value flatten

I have a parquet file that I loaded using Spark, and one of the values is a nested set of key/value pairs. How do I flatten it?
df.printSchema
root
|-- location: string (nullable = true)
|-- properties: string (nullable = true)
texas,{"key":{"key1":"value1","key2":"value2"}}
thanks,
You can use explode on your dataframe and pass it a function that reads the JSON column using json4s. json4s has an easy parsing API; for your case it will look like this:
import org.json4s._
import org.json4s.jackson.JsonMethods._

val list = for {
  JArray(keys) <- parse(json) \\ "key"
  JObject(key) <- keys
  JField("key1", JString(key1)) <- key
  JField("key2", JString(key2)) <- key
} yield Seq(key1, key2)
This flattens your dataframe.
If you also want to add a column for the key, you can use withColumn after explode (keeping the key in the new column).
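If you are on a newer Spark (2.1+), a hedged alternative to the hand-written parser is the built-in from_json, assuming properties always holds a JSON object whose "key" field maps strings to strings, as in the sample row:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructType}

// Parse the JSON string into a struct with a "key" map, then explode that map
// into one (prop_key, prop_value) row per entry.
val propsSchema = new StructType().add("key", MapType(StringType, StringType))
val flattened = df
  .withColumn("props", from_json(col("properties"), propsSchema))
  .select(col("location"), explode(col("props")("key")).as(Seq("prop_key", "prop_value")))
flattened.show()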

SparkSQL DataFrame: sql query does not work when using caching

I'm starting to use Spark for learning purposes. I made a simple program based on this document.
My program reads payment logs from a file (on an HDFS cluster), transforms them into a dataframe and uses this dataframe in some SQL queries. I ran my program in two cases: with and without the cache() method. I encountered a weird problem, as described below:
Not using cache():
I tried to run some queries and everything was fine. (log_zw is my table name)
val num_records = sqlContext.sql("select * from log_zw").count
val num_acc1 = sqlContext.sql("select * from log_zw where ACN = 'acc1' ").count
Using cache()
I also used the two queries above. The first query returned the correct value, but the second did not; it returned 0.
However, when I queried it in another approach:
val num_acc1 = log_zw.filter(log_zw("ACN").contains("acc1")).count
it returned the correct result.
I'm very new to Spark and cluster computing systems, and I don't have any idea why it worked like that. Could anyone please explain this problem to me, especially the difference between using the SQL query and the DataFrame method?
Edit: Here is the schema, it's very simple.
root
|-- PRODUCT_ID: string (nullable = true)
|-- CHANNEL: string (nullable = true)
|-- ACN: string (nullable = true)
|-- AMOUNT_VND: double (nullable = false)
|-- TRANS_ID: string (nullable = true)
Edit2: This is my code when using cache() (I ran some queries and the results are shown in comments in the code):
// read tsv files
case class LogZW(
  PRODUCT_ID: String,
  PLATFORM: String,
  CHANNEL: String,
  ACN: String,
  AMOUNT_VND: Double,
  TRANS_ID: String)

def loadLog(filename: String): DataFrame = {
  sc.textFile(filename).map(line => line.split("\t")).map(p =>
    LogZW(p(1), p(3), p(4), p(5), p(9).toDouble, p(10).substring(0, 8))).toDF()
}
// generate schema
val schemaString = "PRODUCT_ID PLATFORM CHANNEL ACN AMOUNT_VND TRANS_ID"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// read all files
val HDFSFolder = "hdfs://master:54310/user/lqthang/data/*"
val log = loadLog(HDFSFolder)
// register table
log.registerTempTable("log")
log.show()
// select a subset of log table
val log_zw = sqlContext.sql("select PRODUCT_ID, CHANNEL, ACN, AMOUNT_VND, TRANS_ID from log where PLATFORM = 'zingwallet' and CHANNEL not in ('CBZINGDEAL', 'VNPT') and PRODUCT_ID not in ('ZingCredit', 'zingcreditdbg') ")
// register new table
log_zw.show()
log_zw.registerTempTable("log_zw")
// cache table
log_zw.cache()
// this query returns incorrect value!!
val num_acc1 = sqlContext.sql("select * from log_zw where ACN = 'acc1' ").count
// this query returns correct value!
val num_acc2 = sqlContext.sql("select * from log_zw where trim(ACN) = 'acc1' ").count
// uncache data and try another query
log_zw.unpersist()
// this query also returns the correct value!!!
val num_acc2 = sqlContext.sql("select * from log_zw where ACN = 'acc1' ").count
Edit3: I tried to add another cache() method to log dataframe:
// register table
log.registerTempTable("log")
log.show()
log.cache()
The following code is the same as above (with log_zw.cache()). So the important result is:
// this query returns the CORRECT value!!
val num_acc1 = sqlContext.sql("select * from log_zw where ACN = 'acc1' ").count
We don't have a lot of details about what the data is, but I notice that your two code sections do different things.
In the first, you do ACN = 'acc1' but in the second you check if ACN contains 'acc1'.
So the second bit (with the filter) will match if ACN is ' acc1', or 'acc1 ', or 'acc1'.
In other words, I bet if you add a trim to your SQL query you would get a different result.
So try this:
val num_records = sqlContext.sql("select * from log_zw").count
val num_acc1 = sqlContext.sql("select * from log_zw where trim(ACN) = 'acc1' ").count
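If leading or trailing spaces in ACN turn out to be the culprit, one hedged option is to trim the column once right after loading, so the SQL path and the DataFrame path agree on the values (log_zw_clean is just an illustrative name):
import org.apache.spark.sql.functions.trim

// Normalize ACN once; re-register the temp table so the SQL queries see the trimmed values.
val log_zw_clean = log_zw.withColumn("ACN", trim(log_zw("ACN")))
log_zw_clean.registerTempTable("log_zw")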

How to pass whole Row to UDF - Spark DataFrame filter

I'm writing a filter function for a complex JSON dataset with lots of inner structures. Passing individual columns is too cumbersome.
So I declared the following UDF:
val records: DataFrame = sqlContext.jsonFile("...")
def myFilterFunction(r:Row):Boolean=???
sqlc.udf.register("myFilter", (r:Row)=>myFilterFunction(r))
Intuitively I'm thinking it will work like this:
records.filter("myFilter(*)=true")
What is the actual syntax?
You have to use the struct() function to construct the row when making the call to the function. Follow these steps.
Import Row:
import org.apache.spark.sql._
Define the UDF
def myFilterFunction(r:Row) = {r.get(0)==r.get(1)}
Register the UDF
sqlContext.udf.register("myFilterFunction", myFilterFunction _)
Create the dataFrame
val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")
Use the UDF
records.filter(callUdf("myFilterFunction",struct($"text",$"text2"))).show
When you want all columns to be passed to the UDF:
records.filter(callUdf("myFilterFunction",struct(records.columns.map(records(_)) : _*))).show
Result:
+------+------+
| text| text2|
+------+------+
|sachin|sachin|
+------+------+
scala> inputDF
res40: org.apache.spark.sql.DataFrame = [email: string, first_name: string ... 3 more fields]
scala> inputDF.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
Now, I would like to filter the rows based on the gender field. I can accomplish that by using .filter($"gender" === "Male"), but I would like to do it with .filter(function).
So, I defined my anonymous functions:
val isMaleRow = (r:Row) => {r.getAs("gender") == "Male"}
val isFemaleRow = (r:Row) => { r.getAs("gender") == "Female" }
inputDF.filter(isMaleRow).show()
inputDF.filter(isFemaleRow).show()
I felt this requirement could be met in a better way, i.e. without declaring a UDF and invoking it.
In addition to the first answer: when we want all columns to be passed to the UDF, we can use
struct("*")
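For example, reusing the records dataframe and the registered myFilterFunction from the first answer (sketch):
// Pass every column as a single struct using the "*" shorthand.
records.filter(callUdf("myFilterFunction", struct("*"))).show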
If you want to take an action over the whole row and process it in a distributed way, take the row from the DataFrame, send it to a function as a struct, and then convert it to a dictionary to execute the specific action. It is very important to execute an action such as collect on the final DataFrame, because Spark evaluates lazily and will not process the full data unless you tell it to explicitly.
In my case I needed to send each row of a DataFrame to be indexed as a dictionary object:
Import the libraries.
Declare the UDF; the lambda must receive the row structure.
Execute the specific function, in this case sending a dictionary (the row structure converted to a dict) to the index.
On the source DataFrame, call withColumn, which tells Spark to execute this on each row before the call to collect; this allows the function to be executed in a distributed way. Don't forget to assign the result to another DataFrame variable.
Execute the collect method to run the process and distribute the function.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
myUdf = udf(lambda row: sendToES(row.asDict()), IntegerType())
dfWithControlCol = df.withColumn("control_col", myUdf(struct([df[x] for x in df.columns])))
dfWithControlCol.collect()
