Replace partial string values in pandas dataframes by using dictionary - python-3.x

I have a dataframe like this:
Date     Name
11-01-19 Craig-TX
22-10-23 Lucy-AR
and a dictionary:
data = {'TX': 'Texas', 'AR': 'ARIZONA'}
I would like to replace the partial string values TX --> Texas and AR --> Arizona.
The resulting dataframe should be:
Date     Name
11-01-19 Craig-Texas
22-10-23 Lucy-Arizona
Is there a specific function to replace the values in each row?

Add regex=True:
df = df.replace(data, regex=True)
       Date          Name
0  11-01-19   Craig-Texas
1  22-10-23  Lucy-ARIZONA
Safer, in case a name itself contains 'TX': split on the dash, replace only the state part, then join back:
df.Name = df.Name.str.split('-', expand=True).replace({1: data}).agg('-'.join, axis=1)
0     Craig-Texas
1    Lucy-ARIZONA
dtype: object
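For completeness, here is a minimal runnable sketch of the regex-based approach; the DataFrame construction is an assumption based on the question's sample data:
import pandas as pd

df = pd.DataFrame({'Date': ['11-01-19', '22-10-23'],
                   'Name': ['Craig-TX', 'Lucy-AR']})
data = {'TX': 'Texas', 'AR': 'ARIZONA'}

# regex=True makes replace() match substrings instead of whole cell values
df = df.replace(data, regex=True)
print(df)
#        Date          Name
# 0  11-01-19   Craig-Texas
# 1  22-10-23  Lucy-ARIZONA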

Related

Regex remove specific part of a string in a dataframe

In a pandas df column, all my records look like this:
column
strunk 0 somestring
strunk 0 anotherstring
strunk 0 string
How can I remove the strunk 0 part and keep only the rest?
to get:
column
somestring
anotherstring
string
If the part you want to remove has the same (fixed) length in every row, you can slice it off:
df['new_col'] = df['old_col'].str[9:]  # 'strunk 0 ' is 9 characters
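If you would rather remove the prefix by what it says than by its length, a small sketch using str.replace with an anchored regex (the column name follows the question; pandas 1.4+ also offers Series.str.removeprefix):
import pandas as pd

df = pd.DataFrame({'column': ['strunk 0 somestring',
                              'strunk 0 anotherstring',
                              'strunk 0 string']})

# Anchor the pattern at the start so only a leading 'strunk 0 ' is removed
df['column'] = df['column'].str.replace(r'^strunk 0 ', '', regex=True)
print(df['column'].tolist())
# ['somestring', 'anotherstring', 'string']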

How to replace text in column by the value contained in the columns named in this text

In pyspark, I'm trying to replace multiple text values in a column with the values of the columns whose names appear in the calc column (a formula).
To be clear, here is an example:
Input:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1
Output needed:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3
In the calc column, the default value is a formula. It can be as simple as the ones above, or something like "2*(param_8-param_4)/param_2-(param_3/param_7)".
What I'm looking for is a way to substitute every param_x with the value of the correspondingly named column.
I've tried a lot of things but nothing works; most of the time when I use replace or regexp_replace with a column as the replacement value, I get a "Column is not iterable" error.
Moreover, the columns param_1, param_2, ..., param_x are generated dynamically, and a calc value may reference some of these columns but not necessarily all of them.
Could you help me with a dynamic solution?
Thank you so much.
Update: It turned out I misunderstood the requirement. This works:
for exp in ["regexp_replace(calc, '" + col + "', " + col + ")" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
Yet another update: to handle null values, add coalesce:
for exp in ["coalesce(regexp_replace(calc, '" + col + "', " + col + "), calc)" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
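A self-contained sketch of that loop, using the question's sample data (the session setup and import lines are assumptions, not part of the original answer):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Cell 1", "Cell 2", "param_1-param_2"),
     ("Cell 3", "Cell 4", "param_2/param_1")],
    ["param_1", "param_2", "calc"])

# Each pass rewrites calc, replacing one column's name with that column's value;
# coalesce() keeps the original calc when the replacement yields null
for exp in ["coalesce(regexp_replace(calc, '" + col + "', " + col + "), calc)"
            for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))

df.show()  # calc becomes 'Cell 1-Cell 2' and 'Cell 4/Cell 3'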
------- Keeping the below section for a while just for reference -------
You can't do that directly, since you can't use a column's value in the expression unless you collect it into a Python object (which is obviously not recommended). This would work for the same input:
from pyspark.sql import Window, functions as F

df = spark.createDataFrame([["1", "2", "param_1 - param_2"], ["3", "4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc")
df.show()
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))
as_dict = {row.asDict()["row_num"]: row.asDict()["calc"] for row in df.select("row_num", "calc").collect()}
expression = f"""CASE {' '.join([f"WHEN row_num = '{k}' THEN ({v})" for k, v in as_dict.items()])} ELSE NULL END"""
df.withColumn("Result", F.expr(expression)).show()
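Note the design here: each row's stored formula string is spliced into one big CASE expression and evaluated per row, so this only works when every calc value is valid Spark SQL over the existing columns.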

How can I search for a prefix in a pandas column, then if found, return that prefix+the next 11 characters in a new column?

I have a dataframe that contains invoice numbers in a variety of formats from different payments. I need to search for the prefix 'SIN' in column INVOICE NUMBER, and then if found, return SIN+the next 11 characters to a new column. The original data is:
Payer Amount INVOICE NUMBER
0 Client A 345.34 SINDE19-000032
1 Client B 450.00 48372HNFFSINNL18-003421SINNL18-012374
2 Client C 2403.34 SINGB09584
3 Client D 1492.33 KSKH97444 SI3232
If there are multiple versions of SINxxx..., I would like to return the two invoice numbers in the new column, separated by a comma.
The final dataframe should look like:
Payer Amount INVOICE NUMBER TIDY
0 Client A 345.34 SINDE19-000032 SINDE19-000032
1 Client B 450.00 48372HNFFSINNL18-003421SINNL18-012374 SINNL18-003421,SINNL18-012374
2 Client C 2403.34 SINGB09584 NaN
3 Client D 1492.33 KSKH97444 SI3232 NaN
You have two options to do this. Either you can use the map function with a regex (this needs import re):
df['TIDY'] = df['INVOICE NUMBER'].map(lambda x: ','.join(re.findall(r'SIN.{11}', x)))
This uses map to extract the regex matches and then join them with ',' as the delimiter for the whole column. Alternatively, you can use the Series.str.extractall function to do the same:
df['TIDY'] = df['INVOICE NUMBER'].str.extractall(r'(SIN.{11})').unstack(fill_value='').apply(','.join, axis=1)
This will give you two columns, one for the first SIN and one for the second. Is that what you want?
# Extract the first and last SINNL-style invoice numbers from the string
df['SIN1'] = df['INVOICE NUMBER'].str.extract(r'(SINNL.\d+.\d+)', expand=True)
df['SIN2'] = df['INVOICE NUMBER'].str.extract(r'(SINNL.\d+.\d+)$', expand=True)  # notice the $ anchor here
df
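For reference, a minimal end-to-end sketch of the findall approach, with empty matches converted to missing values to match the expected output (the DataFrame construction and the pd.NA step, which needs pandas 1.0+, are assumptions):
import re
import pandas as pd

df = pd.DataFrame({
    'Payer': ['Client A', 'Client B', 'Client C', 'Client D'],
    'Amount': [345.34, 450.00, 2403.34, 1492.33],
    'INVOICE NUMBER': ['SINDE19-000032',
                       '48372HNFFSINNL18-003421SINNL18-012374',
                       'SINGB09584',
                       'KSKH97444 SI3232']})

# 'SIN' plus the next 11 characters; rows without a full match yield ''
df['TIDY'] = df['INVOICE NUMBER'].map(lambda x: ','.join(re.findall(r'SIN.{11}', x)))
# Turn empty strings into missing values to match the expected output
df['TIDY'] = df['TIDY'].replace('', pd.NA)
print(df)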

get label of an index in pandas multiindex dataframe

I have a dataframe df
                c1   c2
name    sample
person1 a1     aaa  AAA
        b1     bbb  BBB
        c1     ccc  CCC
person2 d1     ...
I want to iterate through the dataframe, one person at a time, and check if values in columns match a criterion. If I get a match, I'd like to extract the label for that index (at level 1, as a string) and create a set of all such indices. So say my criterion is column_value == bbb; then I'd like to get "b1".
The following produces almost what I want, but it returns a set of generator objects rather than the names of the labels as strings.
index_set = set()
for person, new_df in df.groupby(level=0):
    idx = new_df.index.get_level_values(1).tolist()
    index_set.add(x for x in idx)
which produces something like {<generator object <genexpr> at 0x0000022F6F05D200>, <generator object <genexpr> at 0x0000022F6F05D410>, ...}
So how to make it produce something like {"b1", "f1", "h1",...} instead?
And another question: when iterating through df by creating new_df the index names don't seem to transfer to new_df. Can this be avoided somehow? It would make the code more readable if I could refer to the index as get_level_values('sample') rather than get_level_values(1)
The add method of a set adds one element; in your case it adds a generator. You could use a comprehension to add each item, [index_set.add(x) for x in idx], but the correct way is the update method:
index_set.update(idx)
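A short sketch putting this together with the value filter from the question (the criterion and column are illustrative; in recent pandas versions the index level names carry through groupby, so naming them when building the frame lets you write get_level_values('sample') instead of get_level_values(1)):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('person1', 'a1'), ('person1', 'b1'), ('person1', 'c1')],
    names=['name', 'sample'])
df = pd.DataFrame({'c1': ['aaa', 'bbb', 'ccc'],
                   'c2': ['AAA', 'BBB', 'CCC']}, index=idx)

index_set = set()
for person, new_df in df.groupby(level=0):
    # keep rows matching the criterion, then collect their level-1 labels
    matches = new_df[new_df['c1'] == 'bbb']
    index_set.update(matches.index.get_level_values('sample'))

print(index_set)  # {'b1'}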

Generating test data in Excel for an EAV table

This is a pretty complicated question, so be prepared! I want to generate some test data in Excel for my EAV table. The columns I have are:
user_id, attribute, value
Each user_id will repeat a random number of times between 1 and 4, and for each entry I want to pick a random attribute from a list and then a random value that attribute can take on. Lastly, I want the attributes for each id to be unique, i.e. I do not want more than one entry with the same id and attribute. Below is an example of what I mean:
user_id attribute value
100001 gender male
100001 religion jewish
100001 university imperial
100002 gender female
100002 course physics
Possible values:
attribute   value
gender      male
            female
course      maths
            physics
            chemistry
university  imperial
            cambridge
            oxford
            ucl
religion    jewish
            hindu
            christian
            muslim
How can I do this? In the past I have generated random data using a random number generator and a VLOOKUP, but this is a bit out of my league.
My approach is to create a table with all four attributes for each ID and then filter that table randomly to get between one and four filtered rows per ID. I assigned a random value to each attribute. In the basic setup (screenshot not reproduced here), the randomized EAV table is on the left and the lookup table used for the randomized values is on the right. Here are the formulas. Enter them and copy down:
Column A - Establishes a random number for each ID's group of four rows. This determines the one attribute that must be selected:
=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)
Column B - Uses the number in Column A to determine whether the row is included:
=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)
Column C - Creates the IDs, starting with 100,001:
=(INT((ROW()-2)/4)+100000)+1
Column D - Repeats the four attributes:
=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")
Column E - Finds the first occurrence of the Column D attribute in the lookup table and selects a randomly offset value:
=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
When you filter on the TRUEs in Column B you'll get your list of one to four Attributes per ID. Disconcertingly, the filtering forces a recalculation, so the filtered list will no longer say TRUE for every cell in column B.
If this was mine I'd automate it a little more, perhaps by putting the "magic number" 4 (the count of attributes) in its own cell.
There are a number of ways to do this. You could use either Perl or Python; both have modules for working with spreadsheets. In this case, I used Python and the openpyxl module.
# File: datagen.py
# Usage: datagen.py <excel (.xlsx) filename to store data>
# Example: datagen.py myfile.xlsx
import sys
import random
from openpyxl import Workbook

# verify that the user specified an argument
if len(sys.argv) < 2:
    print("Specify an excel filename to save the data, e.g. myfile.xlsx")
    sys.exit(1)

# get the excel workbook and worksheet objects
wb = Workbook()
ws = wb.active

# Modify this line to specify the range of user ids
ids = range(100001, 100100)

# data structure for the attributes and values
data = {'gender': ['male', 'female'],
        'course': ['maths', 'physics', 'chemistry'],
        'university': ['imperial', 'cambridge', 'oxford', 'ucl'],
        'religion': ['jewish', 'hindu', 'christian', 'muslim']}

# Write column headers in the spreadsheet
ws['A1'] = 'user_id'
ws['B1'] = 'attribute'
ws['C1'] = 'value'

row = 1
# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, 4)
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select an attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1).value = user_id
        ws.cell(row=row, column=2).value = attr
        ws.cell(row=row, column=3).value = value

# save the spreadsheet using the filename specified on the cmd line
wb.save(filename=sys.argv[1])
print("Done!")
