Let's assume I have the following DataFrame:
students = spark.createDataFrame(
    [
        ("amit",),
        ("amit",),
        ("itay",),
    ],
    ["student"],
)
I want to create a lot of columns based on the value in the student column.
I know for sure that there will be only 2 distinct values in this data frame.
For example:
students = students.withColumn(
    "address", f.when(f.col("student") == "amit", f.lit("berlin")).otherwise(f.lit("paris"))
).withColumn(
    "studies", f.when(f.col("student") == "amit", f.lit("CS")).otherwise(f.lit("physics"))
).withColumn(
    "age", f.when(f.col("student") == "amit", f.lit("25")).otherwise(f.lit("27"))
)
Can I do this more cleanly, without repeating f.when(f.col("student") == "amit") every time, or create these columns together? Any suggestion would be good.
You could create a list of 3-tuples with all the information that's necessary to create your columns:
values = [
    ("address", "berlin", "paris"),
    ("studies", "CS", "physics"),
    ("age", "25", "27")
]
Then, you can create your spark columns by iterating over values:
cols = [
    f.when(f.col('student') == "amit", f.lit(val1))
     .otherwise(f.lit(val2)).alias(col_name)
    for (col_name, val1, val2) in values
]
students.select("*", *cols).show()
+-------+-------+-------+---+
|student|address|studies|age|
+-------+-------+-------+---+
|   amit| berlin|     CS| 25|
|   amit| berlin|     CS| 25|
|   itay|  paris|physics| 27|
+-------+-------+-------+---+
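As a side note (not part of the original answer), if you are on Spark 3.3 or later you can build a dict of expressions from the same values list and add all the columns in a single call with DataFrame.withColumns; a minimal sketch:

import pyspark.sql.functions as f

# {column name: Column expression}, built from the same `values` list as above.
exprs = {
    col_name: f.when(f.col("student") == "amit", f.lit(val1)).otherwise(f.lit(val2))
    for (col_name, val1, val2) in values
}

# withColumns (available since Spark 3.3) adds all the columns at once.
students.withColumns(exprs).show()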
I am processing my data in Spark, and the problem is similar to one I fixed in SQL with:
SUM(DATEDIFF(MINUTE, '0:00:00', targetcolumn))
But I am wondering whether there is any way to do this in PySpark, especially since there is only a time column.
My dataframe is like:
df_temp.show()
|record_date| Tag| time|
+-----------+----+-----+
| 2012-05-05| A |13:14:07.000000|
| 2012-05-05| A |13:54:08.000000|
...................
| 2013-01-01| B |14:40:26.000000|
| 2013-01-01| B |14:48:27.000000|
..................
| 2014-04-03| C |17:17:30.000000|
| 2014-04-03| C |17:47:31.000000|
Is it possible to group by record_date and Tag, and then sum up the time differences in minutes?
So it will turn out like:
|record_date| Tag| time|
+-----------+----+-----+
| 2012-05-05| A |00:41:01.000000|
| 2013-01-01| B |00:08:01.000000|
| 2014-04-03| C |00:30:01.000000|
The resulting time column could be in any format, e.g. 40 minutes or 0.4 hours.
Thank you
If only the two latest rows have to be compared, then the Window "lead" function can be used, in Scala:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  ("2012-05-05", "A", "13:14:07.000000"),
  ("2012-05-05", "A", "13:54:08.000000"),
  ("2013-01-01", "B", "14:40:26.000000"),
  ("2013-01-01", "B", "14:48:27.000000"),
  ("2014-04-03", "C", "17:17:30.000000"),
  ("2014-04-03", "C", "17:47:31.000000")
).toDF("record_date", "Tag", "time")

val recordTagWindow = Window.partitionBy("record_date", "Tag").orderBy(desc("time"))

df
  .withColumn("time", substring($"time", 1, 8))
  .withColumn("unixTimestamp", unix_timestamp($"time", "HH:mm:ss"))
  .withColumn("timeDiffSeconds", $"unixTimestamp" - lead($"unixTimestamp", 1, 0).over(recordTagWindow))
  .withColumn("timeDiffFormatted", date_format($"timeDiffSeconds".cast(TimestampType), "HH:mm:ss"))
  .withColumn("rownum", row_number().over(recordTagWindow))
  .where($"rownum" === 1)
  .drop("rownum", "timeDiffSeconds", "time", "unixTimestamp")
Output (it looks like your example is incorrect for the first row):
+-----------+---+-----------------+
|record_date|Tag|timeDiffFormatted|
+-----------+---+-----------------+
|2012-05-05 |A  |00:40:01         |
|2013-01-01 |B  |00:08:01         |
|2014-04-03 |C  |00:30:01         |
+-----------+---+-----------------+
For more than two rows, the functions "first" and "last" can be used, with the Window modified to include all values (rowsBetween):
val recordTagWindow = Window.partitionBy("record_date", "Tag").orderBy(desc("time"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df
  .withColumn("time", substring($"time", 1, 8))
  .withColumn("unixTimestamp", unix_timestamp($"time", "HH:mm:ss"))
  .withColumn("timeDiffSeconds", first($"unixTimestamp").over(recordTagWindow) - last($"unixTimestamp").over(recordTagWindow))
  .withColumn("timeDiffFormatted", date_format($"timeDiffSeconds".cast(TimestampType), "HH:mm:ss"))
  .withColumn("rownum", row_number().over(Window.partitionBy("record_date", "Tag").orderBy(desc("time"))))
  .where($"rownum" === 1)
  .drop("rownum", "timeDiffSeconds", "time", "unixTimestamp")
  .show(false)
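Since the question asked about PySpark, here is a rough PySpark sketch of the same idea. Instead of window functions it simply takes max(time) - min(time) per group (which equals the first/last span above) and reports the result in minutes, which the question says is an acceptable format:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("2012-05-05", "A", "13:14:07.000000"),
        ("2012-05-05", "A", "13:54:08.000000"),
        ("2013-01-01", "B", "14:40:26.000000"),
        ("2013-01-01", "B", "14:48:27.000000"),
        ("2014-04-03", "C", "17:17:30.000000"),
        ("2014-04-03", "C", "17:47:31.000000"),
    ],
    ["record_date", "Tag", "time"],
)

# Seconds since midnight, parsed from the HH:mm:ss part of the time string
# (any timezone offset cancels out in the max - min difference).
secs = F.unix_timestamp(F.substring("time", 1, 8), "HH:mm:ss")

result = (
    df.withColumn("secs", secs)
    .groupBy("record_date", "Tag")
    .agg(((F.max("secs") - F.min("secs")) / 60).alias("diff_minutes"))
)
result.show()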
I am working with the IBM attrition data set on Kaggle. What I am trying to do is count occurrences of each categorical variable for Attrition == 'Yes' and Attrition == 'No', and take the simple ratio to see which level of the categorical variable is more likely to attrite. I can do this in Pandas like this:
def cal_ratio(x):
    n_1 = sum(x['Attrition'].values == 'Yes')
    n_0 = sum(x['Attrition'].values == 'No')
    return n_1 / n_0
Or I could easily enough write a spark.sql query that does it, and re-write it for each categorical variable I want to compare. A function like this one for Pandas would make my life easier, but I can't find any real guidance on how to create this sort of UDF nor how to register it.
EDIT: It may be helpful if I also ask how this would work in PySpark with a UDF:
b = data.groupby('BusinessTravel').apply(cal_ratio)
I am not sure it is the best solution, but you can try this:
# My sample dataframe
df.show()
+---------+
|Attrition|
+---------+
|      Yes|
|      Yes|
|      Yes|
|      Yes|
|      Yes|
|       No|
|       No|
+---------+
from pyspark.sql import functions as F

result = (
    df.agg(
        F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
        F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
    )
    .select((F.col("Yes") / F.col("No")).alias("ratio"))
    .first()
)

print(result.ratio)
> 2.5
You can, of course, turn this into a function by replacing the hard-coded values with variables.
def cal_ratio(df):
    result = (
        df.agg(
            F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
            F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
        )
        .select((F.col("Yes") / F.col("No")).alias("ratio"))
        .first()
    )
    return result.ratio
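A quick usage check, assuming the small sample DataFrame shown above:

print(cal_ratio(df))  # 2.5 for the sample data above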
EDIT: If you need to group by a column, then you need to replace the first() with a collect():
def cal_ratio(df):
    result = (
        df.groupBy("BusinessTravel")
        .agg(
            F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
            F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
        )
        # keep the group column so you can tell which ratio belongs to which group
        .select("BusinessTravel", (F.col("Yes") / F.col("No")).alias("ratio"))
        .collect()
    )
    return result
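A short usage sketch for the grouped version, assuming the attrition data is loaded in a DataFrame named data with BusinessTravel and Attrition columns:

# One Row per BusinessTravel level, each carrying its Yes/No ratio.
for row in cal_ratio(data):
    print(row["BusinessTravel"], row["ratio"])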
I have 100+ columns in total in my dataframes.
I am trying to compare two data frames and find the unmatched records along with the column name.
The code below gives me the output I want, but when I run it for 100+ columns the job gets aborted.
I am doing this to find errors in an SCD Type 2 delta process.
from pyspark.sql.types import *
from pyspark.sql.functions import *

d2 = sc.parallelize([("A1", 500, 1005), ("A2", 700, 10007)])
dataFrame1 = sqlContext.createDataFrame(d2, ["ID", "VALUE1", "VALUE2"])

d2 = sc.parallelize([("A1", 600, 1005), ("A2", 700, 10007)])
dataFrame2 = sqlContext.createDataFrame(d2, ["ID", "VALUE1", "VALUE2"])

key_id_col_name = "ID"
key_id_value = "A1"

dataFrame1.select("ID", "VALUE1").subtract(dataFrame2.select("ID", col("VALUE1").alias("value"))).show()

def unequalColumnValuesTwoDF(dataFrame1, dataFrame2, key_id_col_name, key_id_value):
    chk_fst = True
    dataFrame1 = dataFrame1.where(dataFrame1[key_id_col_name] == key_id_value)
    dataFrame2 = dataFrame2.where(dataFrame2[key_id_col_name] == key_id_value)
    col_names = list(set(dataFrame1.columns).intersection(dataFrame2.columns))
    col_names.remove(key_id_col_name)
    for col_name in col_names:
        if chk_fst == True:
            df_tmp = dataFrame1.select(col(key_id_col_name).alias("KEY_ID"), col(col_name).alias("VALUE")) \
                .subtract(dataFrame2.select(col(key_id_col_name).alias("KEY_ID"), col(col_name).alias("VALUE"))) \
                .withColumn("COL_NAME", lit(col_name))
            chk_fst = False
        else:
            df_tmp = df_tmp.unionAll(
                dataFrame1.select(col(key_id_col_name).alias("KEY_ID"), col(col_name).alias("VALUE"))
                .subtract(dataFrame2.select(col(key_id_col_name).alias("KEY_ID"), col(col_name).alias("VALUE")))
                .withColumn("COL_NAME", lit(col_name))
            )
    return df_tmp

res_df = unequalColumnValuesTwoDF(dataFrame1, dataFrame2, key_id_col_name, key_id_value)
res_df.show()
>>> dataFrame1.show()
+---+------+------+
| ID|VALUE1|VALUE2|
+---+------+------+
| A1|   500|  1005|
| A2|   700| 10007|
+---+------+------+
>>> dataFrame2.show()
+---+------+------+
| ID|VALUE1|VALUE2|
+---+------+------+
| A1|   600|  1005|
| A2|   700| 10007|
+---+------+------+
>>> res_df.show()
+------+-----+--------+
|KEY_ID|VALUE|COL_NAME|
+------+-----+--------+
|    A1|  500|  VALUE1|
+------+-----+--------+
Please suggest any other way to do this.
Here is another approach:
Join the two DataFrames using the ID column.
Then for each row, create a new column which contains the columns for which there is a difference.
Create this new column as a key-value pair map using pyspark.sql.functions.create_map().1
The key for the map will be the column name.
Using pyspark.sql.functions.when(), set the value to the corresponding value in dataFrame1 (as it seems like that is what you want from your example) if there is a difference between the two DataFrames. Otherwise, we set the value to None.
Use pyspark.sql.functions.explode() on the map column, and filter out any rows where the difference is null using pyspark.sql.functions.isnull().
Select the columns you want and rename using alias().
Example:
from functools import reduce
import pyspark.sql.functions as f

columns = [c for c in dataFrame1.columns if c != 'ID']

dataFrame1.alias('r').join(dataFrame2.alias('l'), on='ID')\
    .withColumn(
        'diffs',
        f.create_map(
            *reduce(
                list.__add__,
                [
                    [
                        f.lit(c),
                        f.when(
                            f.col('r.'+c) != f.col('l.'+c),
                            f.col('r.'+c)
                        ).otherwise(None)
                    ]
                    for c in columns
                ]
            )
        )
    )\
    .select([f.col('ID'), f.explode('diffs')])\
    .where(~f.isnull(f.col('value')))\
    .select(
        f.col('ID').alias('KEY_ID'),
        f.col('value').alias('VALUE'),
        f.col('key').alias('COL_NAME')
    )\
    .show(truncate=False)
#+------+-----+--------+
#|KEY_ID|VALUE|COL_NAME|
#+------+-----+--------+
#|A1    |500  |VALUE1  |
#+------+-----+--------+
Notes
1 The syntax *reduce(list.__add__, [[f.lit(c), ...] for c in columns]) as the argument to create_map() is some python-fu that helps create the map dynamically.
create_map() expects an even number of arguments- it assumes that the first argument in every pair is the key and the second is the value. In order to put the arguments in that order, the list comprehension yields a list for each iteration. We reduce this list of lists into a flat list using list.__add__.
Finally the * operator is used to unpack the list.
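To see just the flattening mechanics in plain Python (with placeholder strings standing in for the Column expressions):

from functools import reduce

# Placeholder pairs standing in for [f.lit(c), f.when(...).otherwise(None)] per column.
pairs = [["VALUE1", "<diff expr 1>"], ["VALUE2", "<diff expr 2>"]]

flat = reduce(list.__add__, pairs)
print(flat)  # ['VALUE1', '<diff expr 1>', 'VALUE2', '<diff expr 2>']

# create_map(*flat) therefore receives its arguments as
# create_map('VALUE1', <diff expr 1>, 'VALUE2', <diff expr 2>)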
Here is the intermediate output, which may make the logic clearer:
dataFrame1.alias('r').join(dataFrame2.alias('l'), on='ID')\
    .withColumn(
        'diffs',
        f.create_map(
            *reduce(
                list.__add__,
                [
                    [
                        f.lit(c),
                        f.when(
                            f.col('r.'+c) != f.col('l.'+c),
                            f.col('r.'+c)
                        ).otherwise(None)
                    ]
                    for c in columns
                ]
            )
        )
    )\
    .select('ID', 'diffs').show(truncate=False)
#+---+-----------------------------------+
#|ID |diffs                              |
#+---+-----------------------------------+
#|A2 |Map(VALUE1 -> null, VALUE2 -> null)|
#|A1 |Map(VALUE1 -> 500, VALUE2 -> null) |
#+---+-----------------------------------+
I am facing an issue while trying to pass the join columns as variables to the PySpark DataFrame join function. I read the primary key fields from a file, and when I pass them as a variable in the join statement it throws a "cannot resolve the column name" error, since the condition is passed as a string. Please assist me with this.
pr_str = ""
for i in range(len(pr_list)):
    if i != len(pr_list) - 1:
        pr_str += " (df_a." + pr_list[i] + " == df_b." + pr_list[i] + ") & "
    else:
        pr_str += "(df_a." + pr_list[i] + " == df_b." + pr_list[i] + ")"

print(pr_str)

df1_with_db2 = df_a.join(df_b, pr_str, 'inner').select('df_a.*')
The reason this error shows up is that you are passing the join condition as a string, while join accepts either a single column name, a list of column names, or a condition built from column expressions. You just need a minor change in the code:
df1_with_db2 = df_a.alias("df_a").join(df_b, eval(pr_str) ,'inner').select('df_a.*')
Looking at your error, it seems that either pr_list contains columns that are not present in either of the two DataFrames, or you didn't alias your dataframes before joining, like:
df1_with_db2 = df_a.alias("df_a").join(df_b.alias("df_b"), pr_str ,'inner').select('df_a.*')
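As a side note, the same condition can also be built directly as Column expressions, which avoids eval(); a sketch, assuming pr_list already holds the key column names read from the file:

from functools import reduce
import pyspark.sql.functions as F

# One equality expression per key column, combined with logical AND.
conds = [F.col("df_a." + c) == F.col("df_b." + c) for c in pr_list]
cond = reduce(lambda a, b: a & b, conds)

df1_with_db2 = (
    df_a.alias("df_a")
    .join(df_b.alias("df_b"), cond, "inner")
    .select("df_a.*")
)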
Below is my way to solve this problem:
In your code, I found that both dataframes have the same column names, and those names are in the list pr_list.
So you can just pass this list as the join condition, like below (by default the join is inner):
df1_with_db2 = df_a.join(
    df_b,
    pr_list
)
You will get the common columns only once, so there is no need to call select.
Here is an example:
df1 = sqlContext.createDataFrame([
    [1, 2],
    [3, 4],
    [9, 8]
], ['a', 'b'])

df2 = sqlContext.createDataFrame([
    [1, 2],
    [3, 4],
    [18, 19]
], ['a', 'b'])

jlist = ['a', 'b']
df1.join(df2, jlist).show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
Problem: I have a data set A {field1, field2, field3...}, and I would like to first group A by, say, field1, then within each of the resulting groups I would like to run a bunch of subqueries, for example, count the number of rows that have field2 == true, or count the number of distinct field3 values that have field4 == "some_value" and field5 == false, etc.
Some alternatives I can think of: I can write a customized user-defined aggregate function that takes a function computing the filter condition, but this way I have to create an instance of it for every query condition. I've also looked at the countDistinct function, which can achieve some of the operations, but I can't figure out how to use it to implement the filter-distinct-count semantics.
In Pig, I can do:
FOREACH (GROUP A by field1) {
    field_a = FILTER A by field2 == TRUE;
    field_b = FILTER A by field4 == 'some_value' AND field5 == FALSE;
    field_c = DISTINCT field_b.field3;
    GENERATE FLATTEN(group),
             COUNT(field_a) as fa,
             COUNT(field_b) as fb,
             COUNT(field_c) as fc;
}
Is there a way to do this in Spark SQL?
Excluding the distinct count, this can be solved by a simple sum over a condition:
import org.apache.spark.sql.functions.{count, sum}

val df = sc.parallelize(Seq(
  (1L, true, "x", "foo", true), (1L, true, "y", "bar", false),
  (1L, true, "z", "foo", true), (2L, false, "y", "bar", false),
  (2L, true, "x", "foo", false)
)).toDF("field1", "field2", "field3", "field4", "field5")

val left = df.groupBy($"field1").agg(
  sum($"field2".cast("int")).alias("fa"),
  sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb")
)

left.show
// +------+---+---+
// |field1| fa| fb|
// +------+---+---+
// |     1|  3|  0|
// |     2|  1|  1|
// +------+---+---+
Unfortunately, the distinct count is much more tricky. The GROUP BY clause in Spark SQL doesn't physically group the data, not to mention that finding distinct elements is quite expensive. Probably the best thing you can do is to compute the distinct counts separately and simply join the results:
val right = df.where($"field4" === "foo" && ! $"field5")
  .select($"field1".alias("field1_"), $"field3")
  .distinct
  .groupBy($"field1_")
  .agg(count("*").alias("fc"))

val joined = left
  .join(right, $"field1" === $"field1_", "leftouter")
  .na.fill(0)
Using a UDAF to count distinct values per condition is definitely an option, but an efficient implementation will be rather tricky. Converting from the internal representation is rather expensive, and implementing a fast UDAF with collection storage is not cheap either. If you can accept an approximate solution, you can use a bloom filter there.
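For readers working in PySpark, here is a rough sketch of the same idea (conditional sums, plus a separate distinct count joined back); it assumes a DataFrame df with the same columns as in the Scala example above:

from pyspark.sql import functions as F

# Conditional counts via sums over boolean conditions cast to int.
left = df.groupBy("field1").agg(
    F.sum(F.col("field2").cast("int")).alias("fa"),
    F.sum(((F.col("field4") == "foo") & ~F.col("field5")).cast("int")).alias("fb"),
)

# Distinct count computed separately, then joined back.
right = (
    df.where((F.col("field4") == "foo") & ~F.col("field5"))
    .select(F.col("field1").alias("field1_"), "field3")
    .distinct()
    .groupBy("field1_")
    .agg(F.count("*").alias("fc"))
)

joined = left.join(right, F.col("field1") == F.col("field1_"), "left_outer").na.fill(0)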