Remove specific words from a DataFrame column with PySpark

I have a DataFrame:
+-----+------+--------+---------------+
|   id|titulo|tipo    |formacion      |
+-----+------+--------+---------------+
|32084|A     |Material|VION00001 TRADE|
|32350|B     |Curso   |CUS11222 LEADER|
|32362|C     |Curso   |ITIN9876 EVALUA|
|32347|D     |Curso   |CUMPLI VION1234|
|32036|E     |Curso   |EVAN1111 INFORM|
+-----+------+--------+---------------+
I need to remove, from the formacion column, the tokens that start with VION, CUS, ITIN, or EVAN, so the DataFrame looks like this:
+-----+------+--------+---------+
|   id|titulo|tipo    |formacion|
+-----+------+--------+---------+
|32084|A     |Material|TRADE    |
|32350|B     |Curso   |LEADER   |
|32362|C     |Curso   |EVALUA   |
|32347|D     |Curso   |CUMPLI   |
|32036|E     |Curso   |INFORM   |
+-----+------+--------+---------+
Thank you for your help

Use the split function to split the column on whitespace, then take the last element of the resulting array.
For Spark 2.4+ use the element_at function; for Spark < 2.4 use reverse(split(...))[0].
from pyspark.sql.functions import col, element_at, reverse, split

# using element_at (Spark 2.4+): index -1 selects the last array element
df.withColumn("formacion", element_at(split(col("formacion"), "\\s"), -1)).show()

# or index into the array directly (assumes exactly two tokens per value)
df.withColumn("formacion", split(col("formacion"), "\\s")[1]).show()

# split, reverse, and take the first element
df.withColumn("formacion", reverse(split(col("formacion"), "\\s"))[0]).show()
#+-----+------+--------+---------+
#|   id|titulo|tipo    |formacion|
#+-----+------+--------+---------+
#|32084|A     |Material|TRADE    |
#|32350|B     |Curso   |LEADER   |
#|32362|C     |Curso   |EVALUA   |
#|32347|D     |Curso   |CUMPLI   |
#|32036|E     |Curso   |INFORM   |
#+-----+------+--------+---------+

Sorry guys, this is the original column from the DataFrame:
formacion = [VION00001 TRADE, CUS11222 LEADER, ITIN9876 EVALUA, VION1234 CUMPLI, EVAN11 FR]
This is the expected result:
formacion = [TRADE, LEADER, EVALUA, CUMPLI, FR]
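Since formacion is an array column, the same split idea can be applied per element with the transform higher-order function. This is a sketch, assuming Spark 2.4+ and that the keep-word is always the last token of each element:

from pyspark.sql.functions import expr

# transform() applies a lambda to every element of the array (Spark 2.4+);
# each element like 'VION00001 TRADE' is split on the space and the last
# token is kept
df.withColumn(
    "formacion",
    expr("transform(formacion, x -> element_at(split(x, ' '), -1))")
).show(truncate=False)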

Related

Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that store arrays in columns, and I'm looking for help mapping a column based on two PySpark DataFrames, one of which is a reference df.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size               | Type              |
| ----- | -------- | ------------------ | ----------------- |
| A     | A1       | ['Small','Medium'] | ['A','B']         |
| A     | A2       | ['Small','Medium'] | ['C','D']         |
| B     | B1       | ['Small']          | ['A','B','C','D'] |
Source Dataframe:
| ID     | Size     | Type |
| ------ | -------- | ---- |
| ID_001 | 'Small'  | 'A'  |
| ID_002 | 'Medium' | 'B'  |
| ID_003 | 'Small'  | 'D'  |
In the result, each ID belongs to every Group, but is exclusive to its subgroups based on the reference df, with the result looking something like this:
| ID     | Size     | Type | A_Subgroup | B_Subgroup |
| ------ | -------- | ---- | ---------- | ---------- |
| ID_001 | 'Small'  | 'A'  | 'A1'       | 'B1'       |
| ID_002 | 'Medium' | 'B'  | 'A1'       | Null       |
| ID_003 | 'Small'  | 'D'  | 'A2'       | 'B1'       |
You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F

result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(
    F.first('Subgroup')
)

result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+

Remove groups from pandas where {condition}

I have a dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid').
I need to remove all groups from the dataframe and keep only the groups whose messages are (CMN.00002) or (CMN.00002 and CMN.00004) only.
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it is not working properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message == 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
So I figured it out.
The resulting dataframe gets only the groups that have just the CMN.00002 message, or CMN.00002 and CMN.00004. This is what I need.
I used filter instead of transform.
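For reference, a set-based variant of the filter approach (a sketch; unlike the tuple comparison above, it does not depend on the row order within each group):

# compare each group's set of messages against the allowed message sets
# (df is the dataframe from the question)
allowed = [{'CMN.00002'}, {'CMN.00002', 'CMN.00004'}]
outdf = df.groupby('envelopeid').filter(lambda g: set(g['message']) in allowed)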

Pandas sort not sorting data properly

I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_
I have the following function:
import pandas as pd

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    importances.sort_values(by='Gini-importance')
    return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_FACTOR | 0.033572 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns a new, sorted DataFrame, which your function computes and then discards.
You want return importances.sort_values(by='Gini-importance').
Or you could make sort_values inplace:
importances.sort_values(by='Gini-importance', inplace=True)
return importances
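Putting it together, a corrected version of the function (a minimal sketch of the fix described above):

import pandas as pd

def get_feature_importances(cols, importances):
    # build {feature: importance}, then sort by importance before returning
    feats = dict(zip(cols, importances))
    result = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return result.sort_values(by='Gini-importance')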

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using udfs, or functional programming methods such as flatMap in spark sql? (I'm talking about a solution using the codegen optimal functions in org.apache.spark.sql.functions._)
Technically it is possible, but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap). Note that sequence requires Spark 2.4+:
import org.apache.spark.sql.functions._

val df = Seq("chair", "lamp", "table").toDF("word")

df.withColumn("len", explode(sequence(lit(1), length($"word"))))
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
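For reference, the same approach in PySpark (a sketch, assuming Spark 2.4+ for sequence and an existing SparkSession named spark):

from pyspark.sql.functions import col, explode, length, lit, sequence

df = spark.createDataFrame([("chair",), ("lamp",), ("table",)], ["word"])

# sequence(1, length(word)) builds [1..n]; exploding it gives one row per
# prefix length, and substr slices the word down to that length
prefixes = (
    df.withColumn("len", explode(sequence(lit(1), length(col("word")))))
      .select(col("word").substr(lit(1), col("len")).alias("prefix"))
)
prefixes.show()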

How to transpose repeating groups in excel?

I have an Excel sheet with the following data structure:
+-------------+-------------------------------+--+
| title | Dr. | |
| first_name | Adam | |
| last_name | Meyer | |
| email | adam.meyer#my-company.com | |
| phone | +49 (0)931-32187-0 | |
| fax | | |
| room | | |
| position | Direktor | |
| title | | |
| first_name | Judith | |
| last_name | Schmidt | |
| email | judith.schmidt#my-company.com | |
| phone | +49 (0)444-32131-1 | |
| fax | | |
| room | | |
| position | | |
| title | | |
| first_name | Claus | |
| last_name | Niemes | |
| email | claus.niemes#my-company.com | |
| phone | +49 (0)444-32131-2 | |
| fax | | |
| room | | |
| position | Verkäufer | |
| bio | xxxxxxxxxx | |
| title | Dr. | |
| first_name | András | |
| last_name | Cloon | |
| email | andrás.cloon#my-company.de | |
| phone | +49 (0)444-32131-1 | |
+-------------+-------------------------------+--+
Not all 450 persons have values for all rows. Sometimes the first row with the title is missing, for example.
I want to import this data into a MySQL database.
I need this result:
I'm not an Excel expert, which is why I'm happy about every hint on how to ease the preparation of the data for the import.
I know how to do the import.
Oh, I see what you want now.
Sub SortItOut()
    Dim t As Range
    Dim r As Range
    Set t = Sheets(2).Range("a2") 'assume a blank second sheet to collect the data
    Set r = Sheets(1).Range("a1") 'assume the data starts in sheet 1
    Do
        Dim x As Long
        For x = 0 To 7
            r.Offset(x, 1).Copy t.Offset(0, x) 'copy and transpose the 8 fields
        Next x
        Set t = t.Offset(1, 0) 'next output row
        Set r = r.Offset(8, 0) 'next 8-row block of input
    Loop Until r = ""
End Sub
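If you would rather reshape the data outside Excel, here is a pandas sketch of the same idea. Assumptions: the sheet is exported as two columns (field name and value), the file name contacts.xlsx is hypothetical, and a new person starts whenever a field name repeats:

import pandas as pd

# read the two-column sheet: field name, value (contacts.xlsx is hypothetical)
raw = pd.read_excel("contacts.xlsx", header=None, names=["field", "value"])

records, current = [], {}
for field, value in raw.itertuples(index=False):
    if field in current:          # a repeated field name starts a new person
        records.append(current)
        current = {}
    current[field] = value
records.append(current)           # flush the last person

people = pd.DataFrame(records)    # one row per person, ready for the MySQL import
people.to_csv("contacts_flat.csv", index=False)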
