How to do left outer join in spark sql? - apache-spark

I am trying to do a left outer join in Spark (1.6.2) and it doesn't work. My SQL query is like this:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016").show()
The result is like this:
+--------------------+--------------------+--------------------+
| type| uuid| uuid|
+--------------------+--------------------+--------------------+
| tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
| swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
I got the same result whether I used LEFT JOIN or LEFT OUTER JOIN (the second uuid is never null).
I would expect the second uuid column to contain nulls for the rows with no match. How do I do a left outer join correctly?
=== Additional information ===
If I use the DataFrame API to do the left outer join, I get the correct result.
s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')
s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()
I got a result like this:
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| type| s_uuid| p_uuid| created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
Thanks,

I don't see any issues in your code. Both "left join" and "left outer join" will work fine. Please check the data again; the rows you are showing are the ones that matched.
You can also perform the join with the DataFrame API:
# explicit left outer join
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
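To see the expected null behaviour in isolation, here is a tiny self-contained check with hypothetical toy data (the sqlContext variable from your question is assumed):
# hypothetical data, only to demonstrate left_outer behaviour
left = sqlContext.createDataFrame([("a", 1), ("b", 2)], ["k", "v"])
right = sqlContext.createDataFrame([("a", 10)], ["k", "w"])

# the row with k = "b" has no match, so its right-side column w comes back as null
left.join(right, left.k == right.k, "left_outer").show()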

You are filtering out null values for p.created_year (and for p.uuid) with
where t.created_year = 2016
and p.created_year = 2016
The way to avoid this is to move the filtering clause for p into the ON clause:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016").show()
This is correct but inefficient, because we also want the filter on t.created_year applied before the join happens. So it is recommended to use subqueries:
sqlContext.sql("select t.type, t.uuid, p.uuid
from (
SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid").show()
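For reference, the same ON-clause fix can be expressed with the DataFrame API; this is only a sketch, with sqlContext.table and the column names assumed from the question:
s = sqlContext.table("symptom_type").where("created_year = 2016")
p = sqlContext.table("plugin")

# the plugin filter is part of the join condition rather than a WHERE clause,
# so symptom_type rows without a matching plugin row are kept with a null uuid
result = s.join(p, (s.uuid == p.uuid) & (p.created_year == 2016), "left_outer")
result.select(s.type, s.uuid.alias("s_uuid"), p.uuid.alias("p_uuid")).show()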

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN for what you want. For more information, look at the Spark documentation.

Related

How does Spark SQL implement the group by aggregate

How does Spark SQL implement the group by aggregate? I want to group by the name field and get the latest salary based on the latest date. How do I write the SQL?
The data is:
+------+------+--------+
| name |salary|date    |
+------+------+--------+
| AA   |  3000|2022-01 |
| AA   |  4500|2022-02 |
| BB   |  3500|2022-01 |
| BB   |  4000|2022-02 |
+------+------+--------+
The expected result is:
+------+------+
| name |salary|
+------+------+
| AA   |  4500|
| BB   |  4000|
+------+------+
Assuming that the dataframe is registered as a temporary view named tmp, first use the row_number window function to assign a row number (rn) within each group (name), ordered by date in descending order, and then take all the rows with rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
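For reference, a sketch of the same row_number logic with the DataFrame API, assuming the dataframe behind the tmp view is available as df:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# keep the row with the most recent date within each name
w = Window.partitionBy("name").orderBy(F.col("date").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .where(F.col("rn") == 1)
            .select("name", "salary"))
latest.show(truncate=False)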
First convert your string to a date.
Convert the date to a Unix timestamp (a numeric representation of a date, so you can use max).
Use first as an aggregate function to retrieve a value from your aggregated results. (It takes the first result, so if there is a date tie, it could pull either one.)
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.functions import unix_timestamp, to_date, col, max, first

(df.withColumn(
     "dateUpdated",
     unix_timestamp(
         to_date(
             col("updated"),
             "yyyy-MM-dd"
         )
     )
 )
 .groupBy("name")
 .agg(
     max("dateUpdated"),
     first("salary").alias("Salary")
 ).show())
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (depending on which you want to sort by first):
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with the Spark built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
/// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
Here is another way; please check the performance. (Note that array_join and the filter higher-order function require Spark 2.4 or later.)
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+

How to filter based on the value(tuples) in a pair RDD in terms of key

The sample RDD looks like:
(key1,(111,222,1))
(key1,(113,224,1))
(key1,(114,225,0))
(key1,(115,226,0))
(key1,(113,226,0))
(key1,(116,227,1))
(key1,(117,228,1))
(key2,(118,229,1))
I am currently working on a Spark project. Based on the keys, I want to filter the first and last elements where the third position of the tuple value is '1' or '0'.
Is it possible to do this with reduceByKey? After my research I did not find a good way to achieve what I want. I want my result in the same order as the output shown below.
Expected output:
(key1,(111,222,1))
(key1,(114,225,0))
(key1,(113,226,0))
(key1,(116,227,1))
(key2,(118,229,1))
Much appreciated.
If I understand correctly, you want the first "1", the first "0", the last "1" and the last "0" for each key, and maintain the order. If I were you, I would use the SparkSQL API to do that.
First, let's build your RDD (by the way, providing sample data is very nice; giving enough code so that we can reproduce what you did is even better):
val seq = Seq(("key1",(111,222,1)),
("key1",(113,224,1)),
("key1",(114,225,0)),
("key1",(115,226,0)),
("key1",(113,226,0)),
("key1",(116,227,1)),
("key1",(117,228,1)),
("key2",(118,229,1)))
val rdd = sc.parallelize(seq)
// then I switch to dataframes, and add an id to be able to go back to
// the previous order
import spark.implicits._
import org.apache.spark.sql.functions._

val df = rdd.toDF("key", "value").withColumn("id", monotonically_increasing_id())
df.show()
+----+-----------+------------+
| key| value| id|
+----+-----------+------------+
|key1|[111,222,1]| 8589934592|
|key1|[113,224,1]| 25769803776|
|key1|[114,225,0]| 42949672960|
|key1|[115,226,0]| 60129542144|
|key1|[113,226,0]| 77309411328|
|key1|[116,227,1]| 94489280512|
|key1|[117,228,1]|111669149696|
|key2|[118,229,1]|128849018880|
+----+-----------+------------+
Now, we could group by "key" and "value._3", keep the min(id) and the max(id), and join back to the data. With a window, however, we can do it in a simpler way. Let's define the following window:
import org.apache.spark.sql.expressions.Window

val win = Window.partitionBy("key", "value._3").orderBy("id")
// now we compute the previous and next element of each id using resp. lag and lead
val big_df = df
.withColumn("lag", lag('id, 1) over win)
.withColumn("lead", lead('id, 1) over win)
big_df.show
+----+-----------+------------+-----------+------------+
| key| value| id| lag| lead|
+----+-----------+------------+-----------+------------+
|key1|[111,222,1]| 8589934592| null| 25769803776|
|key1|[113,224,1]| 25769803776| 8589934592| 94489280512|
|key1|[116,227,1]| 94489280512|25769803776|111669149696|
|key1|[117,228,1]|111669149696|94489280512| null|
|key1|[114,225,0]| 42949672960| null| 60129542144|
|key1|[115,226,0]| 60129542144|42949672960| 77309411328|
|key1|[113,226,0]| 77309411328|60129542144| null|
|key2|[118,229,1]|128849018880| null| null|
+----+-----------+------------+-----------+------------+
Now we see that the rows you are after are the ones with either a lag equal to null (first element) or a lead equal to null (last element). Therefore, let's filter, sort back to the previous order using the id and select the columns you need:
val result = big_df
.where(('lag isNull) || ('lead isNull))
.orderBy('id)
.select("key", "value")
result.show
+----+-----------+
| key| value|
+----+-----------+
|key1|[111,222,1]|
|key1|[114,225,0]|
|key1|[113,226,0]|
|key1|[117,228,1]|
|key2|[118,229,1]|
+----+-----------+
Finally, if you really need an RDD, you can convert the dataframe with:
result.rdd.map(row =>
  row.getString(0) -> { val v = row.getStruct(1); (v.getInt(0), v.getInt(1), v.getInt(2)) })

How to use a filter in subselect

I want to perform a subselect on a related set of data. That subdata needs to be filtered using data from the main query:
customEvents
| extend envId = tostring(customDimensions.EnvironmentId)
| extend organisation = tostring(customDimensions.OrganisationName)
| extend version = tostring(customDimensions.Version)
| extend app = tostring(customDimensions.Appname)
| where customDimensions.EventName contains "ApiSessionStartStart"
| extend dbInfo = toscalar(
customEvents
| extend dbInfo = tostring(customDimensions.dbInfo)
| extend serverEnvId = tostring(customDimensions.EnvironmentId)
| where customDimensions.EventName == "ServiceSessionStart" or customDimensions.EventName == "ServiceSessionContinuation"
| where serverEnvId = envId // This gives an error
| project dbInfo
| take 1)
| order by timestamp desc
| project timestamp, customDimensions.OrganisationName, customDimensions.Version, customDimensions.onBehalfOf, customDimensions.userId, customDimensions.Appname, customDimensions.apiKey, customDimensions.remoteIp, session_Id , dbInfo, envId
The above query results in an error:
Failed to resolve entity 'envId'
How can I filter the data in the subselect based on the field envId in the main query?
I believe you'd need to use join instead, where you'd join to get that value from the second query.
Docs for join: https://docs.loganalytics.io/docs/Language-Reference/Tabular-operators/join-operator
The left-hand side of the join is your "outer" query, and the right-hand side of the join would be that "inner" query, though instead of doing take 1 you'd probably do a simpler query that just gets the distinct values of serverEnvId, dbInfo.

Pyspark - Join timestamp window against timestamp values

Is there any effective way of joining a list of timestamp windows against a list of timestamp values?
The dataframe A has these values:
+------------------------------------+---------------------------------------------+----------------------+
|userid | window |total_unique_locations|
+------------------------------------+---------------------------------------------+----------------------+
|da24a375-962a|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2 |
|0fd2b419-d6ec|[2017-06-04 03:50:00.0,2017-06-04 03:55:00.0]|2 |
|c8159400-fe0a|[2017-06-04 03:10:00.0,2017-06-04 03:15:00.0]|2 |
|a4336494-3a10|[2017-06-04 03:00:00.0,2017-06-04 03:05:00.0]|3 |
|b4590016-1af2|[2017-06-04 03:45:00.0,2017-06-04 03:50:00.0]|2                     |
|03b33b0a-e94e|[2017-06-04 03:30:00.0,2017-06-04 03:35:00.0]|2 |
|e5e4c972-6599|[2017-06-04 03:25:00.0,2017-06-04 03:30:00.0]|5 |
|345e81fb-5e12|[2017-06-04 03:50:00.0,2017-06-04 03:55:00.0]|2 |
|bedd88f1-3751|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2 |
|da401dab-e7f3|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2 |
+------------------------------------+---------------------------------------------+----------------------+
where the data type of window is struct<start:timestamp,end:timestamp>
And the dataframe B has these values:
+-------------+---------------------+-------------------+
|userid       |eventtime            |distance           |
+-------------+---------------------+-------------------+
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.17218625176420413|
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.11145767867097957|
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.14064932728588236|
|a3fac437-efcc|2017-06-04 03:00:00.0|0.08328915597349452|
|a3fac437-efcc|2017-06-04 03:00:00.0|0.07079054693441306|
+-------------+---------------------+-------------------+
I tried to use the regular join but it does not work as the window and eventtime have different data types.
A.join(B, A.userid == B.userid, A.window == B.eventtime).select("*")
Any suggestions?
The less efficient solution is to join or crossJoin with between:
a.join(b, col("eventtime").between(col("window.start"), col("window.end")))
The more efficient solution is to convert eventtime to a struct with the same definition as the one used for the existing window column. For example:
from pyspark.sql.functions import col, window

(b
 .withColumn("event_window", window(col("eventtime"), "5 minutes"))
 .join(a, col("event_window") == col("window")))
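If the userid also has to match (as in the join you attempted), the equality condition can be extended; a sketch that uses aliases to disambiguate the two userid columns:
from pyspark.sql.functions import col, window

# bucket each event into the same 5-minute window definition used by dataframe A
b_windowed = b.withColumn("event_window", window(col("eventtime"), "5 minutes")).alias("b")
a_aliased = a.alias("a")

joined = a_aliased.join(
    b_windowed,
    (col("a.window") == col("b.event_window")) & (col("a.userid") == col("b.userid")))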
You cannot join these two columns directly since the data types of window and eventtime are different.
val result = A.join(B,
  A("userid") === B("userid") &&
  (A("window.start") === B("eventtime") ||
   A("window.end") === B("eventtime")), "left")
Hope this helps!
