Split string column into multiple columns in pyspark - python-3.x

I have a table as below -
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm |pur_region |purchase_dt |pur_details |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121 |Johnny |USA |2023-01-12 |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
+------+---------------------------------+---------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The schema of this table is -
root
|-- cus_id: string (nullable = true)
|-- cus_nm: string (nullable = true)
|-- pur_region: string (nullable = true)
|-- purchase_dt: string (nullable = true)
|-- pur_details: string (nullable = true)
I would like to split the column pur_details and extract check and sale_price_gap as separate columns.
Note that the pur_details may or may not have check and sale_price_gap, so if it's not present in pur_details then the new column values should be null.
Sample expected output -
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cus_id|cus_nm |pur_region |purchase_dt |check |sale_price_gap |pur_details |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0121 |Johnny |USA |2023-01-12 |sale_item |upto 30% on_sale |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |sale_item |upto 30% on_sale |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |sale_item |upto 30% on_sale |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |sale_item |upto 30% on_sale |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |sale_item |BOGO 20% off |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
+------+---------------------------------+---------------+-------------+----------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Can someone please help me with the best and most efficient way to do this? I have millions of rows.
Thanks in advance.

With a few assumptions, such as no commas appearing inside the values themselves (commas only separate fields), I hope this code gives you an idea of how to achieve what you want:
df.withColumn("pur_details_split", split(col("pur_details"), ","))
.withColumn("check", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("check")), 1), "="), 2))
.withColumn("sale_price_gap", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("sale_price_gap")), 1), "="), 2))
.select("check", "sale_price_gap")
.show(false)
+---------+----------------+
|check |sale_price_gap |
+---------+----------------+
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|null |
+---------+----------------+
Ps 1. I can't remember the Python equivalents of the Spark API, but I'm pretty sure they are similar to the Scala ones.
Ps 2. I removed sale_price_gap from the last record of the original dataframe to cover the scenario of a non-existent value.
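A rough PySpark equivalent of the snippet above could look like the sketch below; it assumes Spark 3.1+, where F.filter accepts a Python lambda over array elements:

from pyspark.sql import functions as F

result = (
    df.withColumn("pur_details_split", F.split(F.col("pur_details"), ","))
    # keep the first array element that starts with the wanted key, then take
    # the part after "="; with ANSI mode off, element_at on an empty filtered
    # array returns null, which covers rows where the key is absent
    .withColumn("check", F.element_at(F.split(F.element_at(
        F.filter("pur_details_split", lambda x: F.trim(x).startswith("check")), 1), "="), 2))
    .withColumn("sale_price_gap", F.element_at(F.split(F.element_at(
        F.filter("pur_details_split", lambda x: F.trim(x).startswith("sale_price_gap")), 1), "="), 2))
)
result.select("check", "sale_price_gap").show(truncate=False)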

I tried to use regexp_extract, which reduces the code and might be faster.
df.withColumn("check",F.regexp_extract("pur_details", "check=([^,}]*)(.*)" ,1))\
.withColumn("sale_price_gap",F.regexp_extract("pur_details", "sale_price_gap=([^,}]*)(.*)" ,1))\
.show()
You can test the regex with an online regex tester.
+--------------------+-------+---------+----------------+
| pur_details|cust_id| check| sale_price_gap|
+--------------------+-------+---------+----------------+
|[{product_id=XA80...| 0121|sale_item|upto 30% on_sale|
|[{product_id=XA14...| 0137|sale_item|upto 30% on_sale|
|[{product_id=XAHJ...| 0168|sale_item|upto 30% on_sale|
|[{product_id=XAOP...| 0365|sale_item|upto 30% on_sale|
|[{product_id=XA09...| 2600|sale_item| BOGO 20% off|
+--------------------+-------+---------+----------------+
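One caveat: regexp_extract returns an empty string, not null, when the pattern does not match. If you need nulls as the question asks, a small wrapper like this sketch (the helper name extract_or_null is mine) should work:

from pyspark.sql import functions as F

def extract_or_null(key):
    # regexp_extract yields '' on no match; when() without otherwise() gives null
    extracted = F.regexp_extract("pur_details", key + "=([^,}]*)", 1)
    return F.when(extracted != "", extracted)

df.withColumn("check", extract_or_null("check"))\
  .withColumn("sale_price_gap", extract_or_null("sale_price_gap"))\
  .show()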

Related

read percentage values in spark

I have an xlsx file which has a single column:
percentage
30%
40%
50%
-10%
0.00%
0%
0.10%
110%
99.99%
99.98%
-99.99%
-99.98%
When I read this using Apache Spark, the output I get is:
+----------+
|percentage|
+----------+
| 0.3|
| 0.4|
| 0.5|
| -0.1|
| 0.0|
| 0.0|
| 0.001|
| 1.1|
| 0.9999|
| 0.9998|
+----------+
The expected output is:
+----------+
|percentage|
+----------+
| 30%|
| 40%|
| 50%|
| -10%|
| 0.00%|
| 0%|
| 0.10%|
| 110%|
| 99.99%|
| 99.98%|
+----------+
My code -
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("trimTest")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .option("inferSchema", "true")
  .load("data/percentage.xlsx")
df.printSchema()
df.show(10)
I don't want to use casting or turn inferSchema to false; I want a way to read a percentage value as a percentage, not as a double or a string.
Well, percentages ARE doubles: 30% = 0.3.
The only difference is the way they are displayed and, as @Artem_Aliev wrote in a comment, there is no percentage type in Spark that would print out as you expect. But once again: percentages are doubles; same thing, different notation.
The question is: what do you want to do with those percentages?
To "apply" them to something else, i.e. multiply, just use the double-typed column directly.
To get a nice printout, convert to a suitable string before printing:
import org.apache.spark.sql.functions.{col, format_string}

val percentString = format_string("%.2f%%", col("percentage") * 100)
df.withColumn("percentage", percentString).show()
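If you want the same in PySpark, here is a minimal sketch (assuming the same column name as above):

from pyspark.sql import functions as F

# Render the double as a percent string, e.g. 0.3 -> "30.00%"
df.withColumn("percentage", F.format_string("%.2f%%", F.col("percentage") * 100)).show()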

select top max rows which has sum = 50 in dataframe

This is my dataframe
+--------------+-----------+------------------+
| _c3|sum(number)| perc|
+--------------+-----------+------------------+
| France| 5170305|1.3201573334529797|
| Germany| 9912088|2.5308982087190754|
| Vietnam| 14729566| 3.760966630301244|
|United Kingdom| 19435674| 4.962598446648971|
| Philippines| 21994132| 5.615861086093151|
| Japan| 35204549| 8.988936539189615|
| China| 39453426|10.073821498682275|
| Hong Kong| 39666589| 10.1282493704753|
| Thailand| 57202857|14.605863902228613|
| Malaysia| 72364309| 18.47710593603423|
| Indonesia| 76509597|19.535541048174547|
+--------------+-----------+------------------+
I want to select only the top countries that together sum up to 50 percent of passengers (country, number of passengers, percentage of passengers). How can I do it?
You can use a running total to store cumulative percentage and then filter by it. So, assuming your dataframe is small enough, something like this should do it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val result = df.withColumn("cumulativepercentage",
  sum("perc").over(Window.orderBy(col("perc").desc))
).where(col("cumulativepercentage").leq(50))
result.show(false)
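A hedged PySpark version of the same idea (note the unpartitioned window pulls all rows into a single task, which is fine only for small data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running total of perc in descending order, then keep rows until 50% is reached
w = Window.orderBy(F.col("perc").desc())
result = (df.withColumn("cumulativepercentage", F.sum("perc").over(w))
            .where(F.col("cumulativepercentage") <= 50))
result.show(truncate=False)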

Subtract in set intervals across 2 values in Excel

I am trying to write a PTO calculator in Excel and need help subtracting paid time off. The part I can't get to work is the subtraction in quarter-hour increments: instead of an HH:MM format, the times are scaled to 100 (decimal hours, so a quarter hour is 0.25).
Example:
  |      A       |   B
--+--------------+-------
 1| Bank         | 50.00
 2| CY           |  6.40
 3| Used         |  9.50
 4| Bank Remain  |
 5| CY Remain    |
Used should be subtracted from CY first, until what's left of CY is less than 0.25; the remainder of Used should then be subtracted from Bank.
Thanks for the help. I did look for similar questions, but the only ones I saw calculated time in standard format (15 min intervals) not time proportioned to a 100 scale.
I'm not sure if this should go in its own answer or not, since the principle is the same as the answer from @ReyJuna, but the execution is somewhat different. Using the table from the question:
  |      A       |   B
--+--------------+-------
 1| Bank         | 50.00
 2| CY           |  6.40
 3| Used         |  9.50
 4| Bank Remain  | 46.75
 5| CY Remain    |  0.15
Starting with finding CY Remain, if CY is greater than Used you don't have to worry about running out of 0.25 increment segments, so you can directly subtract using B2-B3 in the function below.
When CY is less than Used, you only want groups of exactly 0.25, so you can use the TRUNC() function to drop the decimals. The function is:
=IF(B2<B3,B2-(TRUNC(B2/0.25)*0.25),B2-B3)
Since you want to stop subtracting at CY less than 0.25, all you have to do is get the number of 0.25 groups in CY and subtract that. The rest that needs to be subtracted from Bank will be handled by this function:
=IF(B2<B3,B1-(B3-TRUNC(B2/0.25)*0.25),B1)
Notice the TRUNC(B2/0.25)*0.25 is the same, but is now subtracted from the value in Used. This gives you what is left over after the correct amount has been subtracted from CY first. Finally, when the Bank Remain function is false, meaning Bank PTO is not needed, then B1 is returned as it went unchanged.
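As a quick check with the sample numbers: TRUNC(6.40/0.25) = 25 whole 0.25 groups, and 25*0.25 = 6.25, so CY Remain = 6.40 - 6.25 = 0.15 and Bank Remain = 50.00 - (9.50 - 6.25) = 46.75, matching the table above.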
This assumes using formulas instead of VBA; I have laid out the data in a simple table (see below), where the CY Old Balance is in cell B2.
| | A | B | C | D |
| |-------------|------|-------|----------|
|1| | CY | BANK | PTO Used |
|2| Old Balance | 6.4 | 50 | 9.5 |
|3| New Balance | 0.15 | 46.75 | |
The formula for the CY New Balance cell (B3) is:
=IF(B2>=D2,B2-D2,B2-0.25*INT(B2/0.25))
If the CY Old Balance is greater than or equal to the PTO Used, then I have assumed that the entire used amount is subtracted.
The B2-0.25*INT(B2/0.25) portion calculates how many whole portions of .25 are in your Old Balance and then subtracts that from your Old Balance giving you the CY New Balance.
The formula for the BANK New Balance cell (C3) is similar:
=IF(B2>=D2,C2,C2-(D2-0.25*INT(B2/0.25)))
The first part of the IF is based on the same assumption as above.
The C2-(D2-0.25*INT(B2/0.25)) uses the same "whole portions of .25" approach, then subtracts that from PTO Used to get what remains. That is then subtracted from BANK Old Balance to get the New Balance.
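Plugging the sample data into these formulas as a check: INT(6.4/0.25) = 25, so B3 = 6.4 - 0.25*25 = 0.15 and C3 = 50 - (9.5 - 6.25) = 46.75, which agrees with the New Balance row.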

Cash Flow Timeline Graph

I have an excel sheet with 6 columns:
3 different cash flows (30%,60% and 10% of the project value)
3 columns with their respective dates
As an example, suppose the total contract value is 100 USD: I receive USD 30 on 15.02.2019, USD 60 on 15.03.2020 and USD 10 on 15.03.2021. This is one row and 6 columns.
I want to present this information in one single chart/visualization. There are about 200 rows, and the dates are not in any particular order; it's random.
When I try to combine the data with dates on the X axis and all 3 cash flows on the Y axis, it doesn't make sense: it gets chaotic, and moreover the dates only come up for the 30% cash flow.
I want the X axis to show all the dates and the Y axis to show the cash flows, with 3 legend entries (30%, 60% and 10%) on their respective dates. So in a nutshell, as an example, the graph could show that on 1st January 2019 I had a total cash flow of 10 USD from the 30% cash flow, 5 from the 60% and 2 from the 10%. I am not an advanced user in Excel, so I would appreciate your help! If I need to format my data in some particular way, I can do that as well. What graph should I use?
I am ready to use Power BI or any other free solution as well, if it's easy there!
PS: I tried doing a combo chart and then making changes to the data under Design > Select Data as well (but it still doesn't work!); I've tried everything!
You have to transform, a.k.a. normalize, your data first.
What you have is 6 columns:
| Cash 30% | Cash 60% | Cash 10% | Date 30% | Date 60% | Date 10% |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 30.00 | 60.00 | 10.00 | 20190215 | 20200315 | 20210315 |
What you want is the following structure, containing the same information as above, but in a normalized way:
| Cash flow pct | Date | Amount |
| ------------- | -------- | ------ |
| 10% | 20210315 | 10.00 |
| 30% | 20190215 | 30.00 |
| 60% | 20200315 | 60.00 |
Once your data is structured this way, visualizing it the way you're describing is straightforward.
Transforming the data is very easy to do using the Power Query editor of Power BI (or Excel for that matter). Post a new question tagged with "powerquery" if you need assistance in how to make a transformation such as this.
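If you would rather script the reshape than use Power Query, here is a hedged pandas sketch (the column names are assumptions taken from the example tables above):

import pandas as pd

# Column names are hypothetical, taken from the example tables above.
wide = pd.DataFrame({
    "Cash 30%": [30.00], "Cash 60%": [60.00], "Cash 10%": [10.00],
    "Date 30%": ["20190215"], "Date 60%": ["20200315"], "Date 10%": ["20210315"],
})

# Stack each (Cash pct, Date pct) pair into rows of (Cash flow pct, Date, Amount).
long = pd.concat(
    [
        wide[["Cash " + pct, "Date " + pct]]
        .rename(columns={"Cash " + pct: "Amount", "Date " + pct: "Date"})
        .assign(**{"Cash flow pct": pct})
        for pct in ("30%", "60%", "10%")
    ],
    ignore_index=True,
)[["Cash flow pct", "Date", "Amount"]]

print(long)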

Multiple column categories in MS Excel pivot table

Is it possible to design a pivot table in such a way that I have multiple column categories? Please see examples below:
| Group   | Category 1          | Category 2          |
|         | good | bad  | total | good | bad  | total |
|---------|------|------|-------|------|------|-------|
| Group 1 | 40%  | 60%  | 100%  | 60%  | 40%  | 100%  |
| Group 2 | 30%  | 70%  | 100%  | 20%  | 80%  | 100%  |
...
I can get the Category 1 part or the Category 2 part, but not both. If I put both as my column input, I get the combined version (i.e. good/good, good/bad, bad/good, and bad/bad).
Thanks
Yes, it is possible; see the attached image: [pivot table screenshot]
