pyspark when condition is true, insert some words with column variable - apache-spark

I have a csv as described:
s_table | s_name | t_cast | t_d   |
aaaaaa  | juuoo  | TRUE   | float |
aaaaaa  | juueo  | TRUE   | float |
aaaaaa  | ju4oo  |        |       |
aaaaaa  | juuoo  |        |       |
aaaaaa  | thyoo  |        |       |
aaaaaa  | juioo  |        |       |
aaaaaa  | rtyoo  |        |       |
I am trying to use a PySpark when condition to check t_cast together with s_table and, if it is TRUE, return a statement in a new column.
What I've tried is:
filters = filters.withColumn("p3", f.when((f.col("s_table") == "aaaaaa") & (f.col("t_cast").isNull()), f.col("s_name")).
                             when((f.col("s_table") == "aaaaaa") & (f.col("t_cast") == True),
                                  f"CAST({f.col('s_table')} AS {f.col('t_d')}) AS {f.col('s_table')}"))
What I am trying to achieve is for the column p3 to return this:
s_table | s_name | t_cast | t_d   | p_3                                |
aaaaaa  | juuoo  | TRUE   | float | cast ('juuoo' as float) as 'juuoo' |
aaaaaa  | juueo  | TRUE   | float | cast ('juueo' as float) as 'juuoo' |
aaaaaa  | ju4oo  |        |       | ju4oo                              |
aaaaaa  | juuoo  |        |       | juuoo                              |
aaaaaa  | thyoo  |        |       | thyoo                              |
aaaaaa  | juioo  |        |       | juioo                              |
aaaaaa  | rtyoo  |        |       | rtyoo                              |
But the result that I get is:
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
I feel like I am almost there, but I can't quite figure it out.

You need to use the Spark concat function instead of Python string formatting to get the expected string. Something like:
import pyspark.sql.functions as F

filters = filters.withColumn(
    "p3",
    (F.when((F.col("s_table") == "aaaaaa") & (F.col("t_cast").isNull()), F.col("s_name"))
      .when((F.col("s_table") == "aaaaaa") & F.col("t_cast"),
            F.expr(r"concat('CAST(\'', s_name, '\' AS ', t_d, ') AS \'', s_table, '\'')")))
)
filters.show(truncate=False)
#+-------+------+------+-----+----------------------------------+
#|s_table|s_name|t_cast|t_d |p3 |
#+-------+------+------+-----+----------------------------------+
#|aaaaaa |juuoo |true |float|CAST('juuoo' AS float) AS 'aaaaaa'|
#|aaaaaa |juueo |true |float|CAST('juueo' AS float) AS 'aaaaaa'|
#|aaaaaa |ju4oo |null |null |ju4oo |
#|aaaaaa |juuoo |null |null |juuoo |
#|aaaaaa |thyoo |null |null |thyoo |
#|aaaaaa |juioo |null |null |juioo |
#|aaaaaa |rtyoo |null |null |rtyoo |
#+-------+------+------+-----+----------------------------------+
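If you'd rather stay in the DataFrame API than write a SQL expression, here is a minimal sketch of the same idea using F.concat and F.lit (column names assumed as above):
import pyspark.sql.functions as F
# Build the CAST(...) string by concatenating literals with the column values
cast_str = F.concat(
    F.lit("CAST('"), F.col("s_name"),
    F.lit("' AS "), F.col("t_d"),
    F.lit(") AS '"), F.col("s_table"), F.lit("'")
)
filters = filters.withColumn(
    "p3",
    F.when((F.col("s_table") == "aaaaaa") & F.col("t_cast").isNull(), F.col("s_name"))
     .when((F.col("s_table") == "aaaaaa") & F.col("t_cast"), cast_str)
)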

Related

How to convert Spark map keys to individual columns

I'm using Spark 2.3 and Scala 2.11.8.
I have a Dataframe like below,
--------------------------------------------------------
| ID | Name | Desc_map |
--------------------------------------------------------
| 1 | abcd | "Company" -> "aa" , "Salary" -> "1" ....|
| 2 | efgh | "Company" -> "bb" , "Salary" -> "2" ....|
| 3 | ijkl | "Company" -> "cc" , "Salary" -> "3" ....|
| 4 | mnop | "Company" -> "dd" , "Salary" -> "4" ....|
--------------------------------------------------------
Expected Dataframe,
----------------------------------------
| ID | Name | Company | Salary | .... |
----------------------------------------
| 1 | abcd | aa | 1 | .... |
| 2 | efgh | bb | 2 | .... |
| 3 | ijkl | cc | 3 | .... |
| 4 | mnop | dd | 4 | .... |
----------------------------------------
Any help is appreciated.
If data is your dataset that contains:
+---+----+----------------------------+
|ID |Name|Map |
+---+----+----------------------------+
|1 |abcd|{Company -> aa, Salary -> 1}|
|2 |efgh|{Company -> bb, Salary -> 2}|
|3 |ijkl|{Company -> cc, Salary -> 3}|
|4 |mnop|{Company -> aa, Salary -> 4}|
+---+----+----------------------------+
You can get your desired output through:
data = data.selectExpr(
    "ID",
    "Name",
    "Map.Company",
    "Map.Salary"
)
Final output:
+---+----+-------+------+
|ID |Name|Company|Salary|
+---+----+-------+------+
|1 |abcd|aa |1 |
|2 |efgh|bb |2 |
|3 |ijkl|cc |3 |
|4 |mnop|aa |4 |
+---+----+-------+------+
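The selectExpr call above assumes the key names are known up front. If they are not, one common pattern (a PySpark sketch, assuming the map column is named Map as above) is to collect the distinct keys first and build the select dynamically:
import pyspark.sql.functions as F
# Collect the distinct map keys present in the data (triggers one extra job)
keys = [r["key"] for r in
        data.select(F.explode(F.map_keys("Map")).alias("key")).distinct().collect()]
# Turn each key into its own column
data = data.select("ID", "Name",
                   *[F.col("Map").getItem(k).alias(k) for k in keys])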
Good luck!

Extract values from column in spark dataframe and to two new columns

I have a spark dataframe that looks like this:
+----+------+-------------+
|user| level|value_pair |
+----+------+-------------+
| A | 25 |(23.52,25.12)|
| A | 6 |(0,0) |
| A | 2 |(11,12.12) |
| A | 32 |(17,16.12) |
| B | 22 |(19,57.12) |
| B | 42 |(10,3.2) |
| B | 43 |(32,21.0) |
| C | 33 |(12,0) |
| D | 32 |(265.21,19.2)|
| D | 62 |(57.12,50.12)|
| D | 32 |(75.12,57.12)|
| E | 63 |(0,0) |
+----+------+-------------+
How do I extract the values in the value_pair column and add them to two new columns called value1 and value2, using the comma as the separator?
+----+------+-------------+-------+
|user| level|value1 |value2 |
+----+------+-------------+-------+
| A | 25 |23.52 |25.12 |
| A | 6 |0 |0 |
| A | 2 |11 |12.12 |
| A | 32 |17 |16.12 |
| B | 22 |19 |57.12 |
| B | 42 |10 |3.2 |
| B | 43 |32 |21.0 |
| C | 33 |12 |0 |
| D | 32 |265.21 |19.2 |
| D | 62 |57.12 |50.12 |
| D | 32 |75.12 |57.12 |
| E | 63 |0 |0 |
+----+------+-------------+-------+
I know I can separate the values like so:
df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])
But how do I also get rid of the parentheses?
For the parentheses, as shown in the comments you can use regexp_replace, but you need to escape them with a backslash \, because ( and ) are special characters in regular expressions.
Also, I believe you need to remove the parentheses first, and then split the column.
from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(", ""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)", ""))
df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
       .withColumn('value2', split(df['value_pair'], ',').getItem(1))
>>> df.show(truncate=False)
+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2 |
+----+-----+-----------+------+---------+
| A |25 |23.52,25.12|23.52 |25.12 |
| A |6 |0,0 |0 |0 |
| A |2 |11,12.12 |11 |12.12 |
| A |32 |17,16.12 |17 |16.12 |
| B |22 |19,57.12 |19 |57.12 |
| B |42 |10,3.2 |10 |3.2 |
| B |43 |32,21.0 |32 |21.0 |
| C |33 |12,0 |12 |0 |
| D |32 |265.21,19.2|265.21|19.2 |
| D |62 |57.12,50.12|57.12 |50.12 |
| D |32 |75.12,57.12|75.12 |57.12 |
| E |63 |0,0 |0 |0 |
+----+-----+-----------+------+---------+
As you may have noticed, I slightly changed how you grab the two items.
More information can be found here
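If you prefer, the two regexp_replace calls can be collapsed into one by matching both parentheses with a character class; a sketch under the same column names:
from pyspark.sql.functions import col, regexp_replace, split
# Strip both parentheses in one pass, then split on the comma
clean = regexp_replace(col("value_pair"), r"[()]", "")
df = (df
      .withColumn("value1", split(clean, ",").getItem(0))
      .withColumn("value2", split(clean, ",").getItem(1)))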

Pandas sort not sorting data properly

I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_
I have the following function:
def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    importances.sort_values(by='Gini-importance')
    return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_FACTOR | 0.033572 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns a new, sorted DataFrame, which your function ignores.
You want return importances.sort_values(by='Gini-importance').
Or you could make the sort in place:
importances.sort_values(by='Gini-importance', inplace=True)
return importances
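Putting it together, a corrected version of the function from the question might look like this (a sketch; it sorts ascending, pass ascending=False if you want the most important features first):
import pandas as pd
def get_feature_importances(cols, importances):
    feats = dict(zip(cols, importances))
    df = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    # sort_values returns a new DataFrame, so return that (or use inplace=True)
    return df.sort_values(by='Gini-importance')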

drop rows matching string from an excel file using pandas

I have an Excel file users.xls containing all users from the enterprise departments (203 lines), with three header lines before the data.
I'd like to drop all users from the compta department in the file.
I tried to write a script using pandas to do it.
I can list all compta users in the file, but when I try to drop them, it fails.
import pandas as pd
cu = 'users.xls'
users = pd.read_excel(cu, skiprows=3)
sorted_by_username = users.sort_values(['Department'], ascending=True)
df = pd.DataFrame(sorted_by_username)
comptausers = df['Department'].str.contains('compta')
dfu = df[comptausers]
print(dfu)
df.drop(index=dfu, axis=0, inplace=True)
Actual users.xls file:
cat users.xls
+--+----------+----------+-------------+
|  |List Of users                      |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|ID|User Name |Department|User Location|
+--+----------+----------+-------------+
|2 |smith     |Sales     |Paris        |
+--+----------+----------+-------------+
|10|Foo       |Compta    |Paris        |
+--+----------+----------+-------------+
|1 |Targaryan |CEO       |London       |
+--+----------+----------+-------------+
|21|Cooper    |Compta    |Berlin       |
+--+----------+----------+-------------+
What i want:
cat users.xls
+--+----------+----------+-------------+
|  |List Of users                      |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|ID|User Name |Department|User Location|
+--+----------+----------+-------------+
|2 |smith     |Sales     |Paris        |
+--+----------+----------+-------------+
|1 |Targaryan |CEO       |London       |
+--+----------+----------+-------------+
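For what it's worth, the usual pandas idiom here is to filter with a boolean mask (or pass the matching index labels to drop) rather than passing a filtered DataFrame to drop. A minimal sketch under the same column names; the case-insensitive match is an assumption:
import pandas as pd
users = pd.read_excel('users.xls', skiprows=3)
# Keep every row whose Department does NOT contain 'compta' (case-insensitive)
mask = users['Department'].str.contains('compta', case=False, na=False)
users = users[~mask]
# Equivalent with drop: pass the index labels of the matching rows
# users = users.drop(index=users[mask].index)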

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using UDFs or functional programming methods such as flatMap in Spark SQL? (I'm talking about a solution using the codegen-optimal functions in org.apache.spark.sql.functions._)
Technically it is possible but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
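For completeness, the same approach in PySpark (a sketch, assuming a SparkSession named spark; sequence requires Spark 2.4+):
import pyspark.sql.functions as F
df = spark.createDataFrame([("chair",), ("lamp",), ("table",)], ["word"])
# One row per prefix length, then take the leading substring of that length
prefixes = (df
            .withColumn("len", F.explode(F.sequence(F.lit(1), F.length("word"))))
            .select(F.col("word").substr(F.lit(1), F.col("len")).alias("prefix")))
prefixes.show()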