Remove specific words from a DataFrame column with PySpark

I have a DataFrame
| id| titulo |tipo | formacion |
|32084|A | Material | VION00001 TRADE |
|32350|B | Curso | CUS11222 LEADER|
|32362|C | Curso | ITIN9876 EVALUA|
|32347|D | Curso | CUMPLI VION1234 |
|32036|E | Curso | EVAN1111 INFORM|
In the formacion column, I need to remove the words that start with VION, CUS, ITIN or EVAN, so that the DataFrame looks like this:
| id| titulo |tipo | formacion |
|32084|A | Material | TRADE |
|32350|B | Curso | LEADER |
|32362|C | Curso | EVALUA |
|32347|D | Curso | CUMPLI |
|32036|E | Curso | INFORM |
Thank you for your help

Use the split function to split the column on whitespace, then take the last element of the resulting array.
From Spark 2.4 onward, use the element_at function.
For Spark < 2.4, use reverse(split(array))[0].
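from pyspark.sql.functions import col, element_at, reverse, split
#using element_at
df.withColumn("formacion", element_at(split(col("formacion"), "\\s+"), -1)).show()
#or using array_index
df.withColumn("formacion", split(col("formacion"), "\\s+")[1]).show()
#split reverse and get first index value
df.withColumn("formacion", reverse(split(col("formacion"), "\\s+"))[0]).show()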
#| id|titulo |tipo | formacion |
#|32084|A | Material | TRADE |
#|32350|B | Curso | LEADER |
#|32362|C | Curso | EVALUA |
#|32347|D | Curso | CUMPLI |
#|32036|E | Curso | INFORM |
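Taking the last token depends on the code always coming first. In the table as printed in the question, row 32347 has it last (CUMPLI VION1234), so a pattern-based removal may be more robust. Below is a minimal sketch of such a pattern in plain Python; the helper name strip_codes and the PREFIXES list are illustrative assumptions. The same regex could probably be passed to pyspark.sql.functions.regexp_replace, though that is not tested here.

```python
import re

# Prefixes taken from the question; a token is dropped when it starts with one of them.
PREFIXES = ["VION", "CUS", "ITIN", "EVAN"]
pattern = re.compile(r"\b(?:" + "|".join(PREFIXES) + r")\S*")

def strip_codes(text):
    # Remove matching tokens, then collapse the whitespace they leave behind.
    return " ".join(pattern.sub("", text).split())

rows = [
    "VION00001 TRADE",
    "CUS11222 LEADER",
    "ITIN9876 EVALUA",
    "CUMPLI VION1234",
    "EVAN1111 INFORM",
]
cleaned = [strip_codes(r) for r in rows]
print(cleaned)
```

Note that EVALUA and CUMPLI survive because they only share the first few letters with a prefix, not the whole prefix.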

Sorry guys, this is the original column from the DataFrame:
formacion = [VION00001 TRADE, CUS11222 LEADER, ITIN9876 EVALUA, VION1234 CUMPLI, EVAN11 FR]
This is expected


Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that store arrays in columns, and I'm looking for help mapping a column using two PySpark DataFrames, one of which is a reference df.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size | Type |
| ---- | -------- | ------------------| --------------- |
|A | A1 |['Small','Medium'] | ['A','B'] |
|A | A2 |['Small','Medium'] | ['C','D'] |
|B | B1 |['Small'] | ['A','B','C','D']|
Source Dataframe:
| ID | Size | Type |
| ---- | -------- | ---------|
|ID_001 | 'Small' |'A' |
|ID_002 | 'Medium' |'B' |
|ID_003 | 'Small' |'D' |
In the result, each ID is matched against every Group, but belongs to at most one Subgroup within each Group, based on the reference df. The result should look something like this:
| ID | Size | Type | A_Subgroup | B_Subgroup |
| ---- | -------- | ---------| ---------- | ------------- |
|ID_001 | 'Small' |'A' | 'A1' | 'B1' |
|ID_002 | 'Medium' |'B' | 'A1' | Null |
|ID_003 | 'Small' |'D' | 'A2' | 'B1' |
You can do a join using array_contains conditions, and pivot the result:
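import pyspark.sql.functions as F
result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'  # keep IDs even when no reference row matches
).groupBy('ID', source['Size'], source['Type']).pivot('Group').agg(F.first('Subgroup'))
result.show()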
| ID| Size|Type| A| B|
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
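To make it clearer what the join condition and the pivot compute, here is the same matching logic as a plain-Python sketch over the sample rows from the question. It is only an illustration of the semantics, not a replacement for the Spark code, and the dict-based layout is an assumption made for the example.

```python
# Reference rows: each subgroup lists the sizes and types it accepts.
ref = [
    {"Group": "A", "Subgroup": "A1", "Size": ["Small", "Medium"], "Type": ["A", "B"]},
    {"Group": "A", "Subgroup": "A2", "Size": ["Small", "Medium"], "Type": ["C", "D"]},
    {"Group": "B", "Subgroup": "B1", "Size": ["Small"], "Type": ["A", "B", "C", "D"]},
]
source = [
    {"ID": "ID_001", "Size": "Small", "Type": "A"},
    {"ID": "ID_002", "Size": "Medium", "Type": "B"},
    {"ID": "ID_003", "Size": "Small", "Type": "D"},
]

groups = sorted({r["Group"] for r in ref})  # these become the pivoted columns
result = []
for row in source:
    out = dict(row, **{g: None for g in groups})  # unmatched groups stay None
    for r in ref:
        # Equivalent of array_contains(ref.Size, ...) and array_contains(ref.Type, ...)
        if row["Size"] in r["Size"] and row["Type"] in r["Type"]:
            out[r["Group"]] = r["Subgroup"]
    result.append(out)

for out in result:
    print(out)
```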

Remove groups from pandas where {condition}

I have dataframe like this:
| | envelopeid | message |
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
I've grouped it with grouped = df.groupby('envelopeid').
I need to drop every group from the dataframe except those whose messages are only CMN.00002, or only CMN.00002 and CMN.00004.
Desired dataframe:
| | envelopeid | message |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
I tried this, but it does not work properly:
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and'CMN.00002' or 'CMN.00004').any()) or'CMN.00002').all()))
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message== 'CMN.00002']
So I figured it out. I used filter instead of transform:
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
The resulting dataframe contains only the groups whose messages are CMN.00002 alone, or CMN.00002 and CMN.00004. This is what I need.
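One caveat with comparing tuples is that it depends on the row order inside each group: a group listing CMN.00004 before CMN.00002 would be dropped. A possible variant, sketched below, compares sets instead. The short envelope ids (e1 to e4) are made up for the example, and note that a set comparison also ignores duplicate messages within a group, which may or may not be what you want.

```python
import pandas as pd

# Small stand-in for the question's data; e4 lists its messages in reverse order.
df = pd.DataFrame({
    "envelopeid": ["e1", "e1", "e1", "e2", "e2", "e3", "e4", "e4"],
    "message": [
        "CMN.00002", "CMN.00004", "CMN.11001",
        "CMN.00002", "CMN.00001",
        "CMN.00002",
        "CMN.00004", "CMN.00002",
    ],
})

# The only message combinations a group may consist of.
allowed = [{"CMN.00002"}, {"CMN.00002", "CMN.00004"}]

# filter keeps every row of a group when the function returns True for that group.
outdf = df.groupby("envelopeid").filter(lambda g: set(g["message"]) in allowed)
print(outdf)
```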

Pandas sort not sorting data properly

I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_
I have the following function:
def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    importances.sort_values(by='Gini-importance')
    return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns a new, sorted dataframe. Your function discards that return value and returns the original, unsorted one.
You want return importances.sort_values(by='Gini-importance').
Or you could make sort_values work in place:
importances.sort_values(by='Gini-importance', inplace=True)
return importances
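As a quick check, here is a minimal sketch of the first option with three of the values from the question. Building the frame directly from the two sequences and sorting in descending order are choices made for this example, not part of the original function.

```python
import pandas as pd

def get_feature_importances(cols, importances):
    # One row per feature, indexed by feature name.
    frame = pd.DataFrame({"Gini-importance": list(importances)}, index=list(cols))
    # sort_values returns a new frame, so return that result directly.
    return frame.sort_values(by="Gini-importance", ascending=False)

ranked = get_feature_importances(
    ["PART", "DAYS", "FFX1"],
    [0.035034, 0.047603, 0.128367],
)
print(ranked)
```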

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
| word |
| chair |
| lamp |
| table |
I want to explode out all the prefixes like so:
| prefix |
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
Is there a good way to do this WITHOUT using udfs, or functional programming methods such as flatMap in spark sql? (I'm talking about a solution using the codegen optimal functions in org.apache.spark.sql.functions._)
Technically it is possible but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
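import org.apache.spark.sql.functions._  // explode, sequence, lit, length (sequence needs Spark 2.4+)
import spark.implicits._  // toDF and the $"col" syntax; assumes a SparkSession named spark
val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))  // one row per prefix length
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()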
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|

How to transpose repeating groups in excel?

I have an excel sheet with the following data structure
| title | Dr. | |
| first_name | Adam | |
| last_name | Meyer | |
| email | | |
| phone | +49 (0)931-32187-0 | |
| fax | | |
| room | | |
| position | Direktor | |
| title | | |
| first_name | Judith | |
| last_name | Schmidt | |
| email | | |
| phone | +49 (0)444-32131-1 | |
| fax | | |
| room | | |
| position | | |
| title | | |
| first_name | Claus | |
| last_name | Niemes | |
| email | | |
| phone | +49 (0)444-32131-2 | |
| fax | | |
| room | | |
| position | Verkäufer | |
| bio | xxxxxxxxxx | |
| title | Dr. | |
| first_name | András | |
| last_name | Cloon | |
| email | andrá | |
| phone | +49 (0)444-32131-1 | |
Not all 450 persons have values for every row. Sometimes the first row with the title is missing, for example.
I want to import this data into a MySQL database.
I need this result:
I'm not an Excel expert, so I'd be glad of any hint on how to make preparing the data for the import easier.
I know how to do the import.
Oh I see what you want now
Sub SortItOUt()
    Dim t As Range
    Dim r As Range
    Set t = Sheets(2).Range("a2") 'I assume a blank second sheet to collect the data
    Set r = Sheets(1).Range("a1") 'assume data starts in sheet 1
    Dim x As Long
    Do 'one pass per person; assumes every person occupies exactly 8 rows
        For x = 0 To 7
            r.Offset(x, 1).Copy t.Offset(0, x) 'copy and transpose
        Next x
        Set t = t.Offset(1, 0) 'next output row
        Set r = r.Offset(8, 0) 'next block of 8 input rows
    Loop Until r = ""
End Sub
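Since the question says some persons are missing rows (and at least one has an extra bio row), a fixed stride of 8 may drift out of step. One possible alternative, sketched here in Python rather than VBA, is to start a new record whenever the field name moves backwards in the expected field order. The FIELDS order is an assumption read off the sample data, the function name group_records is hypothetical, and getting the (label, value) pairs out of the sheet (for example via a CSV export) is left out.

```python
# Assumed canonical order of the row labels, taken from the sample in the question.
FIELDS = ["title", "first_name", "last_name", "email", "phone",
          "fax", "room", "position", "bio"]
ORDER = {name: i for i, name in enumerate(FIELDS)}

def group_records(rows):
    """Split a flat list of (label, value) pairs into one dict per person."""
    records, current, last = [], {}, -1
    for key, value in rows:
        pos = ORDER[key]
        if pos <= last:  # label went backwards: a new person starts here
            records.append(current)
            current = {}
        current[key] = value
        last = pos
    if current:
        records.append(current)
    return records

# Abbreviated sample: the second person has no title row, the third has a bio row.
rows = [
    ("title", "Dr."), ("first_name", "Adam"), ("last_name", "Meyer"), ("position", "Direktor"),
    ("first_name", "Judith"), ("last_name", "Schmidt"),
    ("title", ""), ("first_name", "Claus"), ("last_name", "Niemes"), ("bio", "xxxxxxxxxx"),
]
people = group_records(rows)
for person in people:
    print(person)
```

Each dict can then be written out as one row with a column per field, leaving missing fields empty.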
