In a pandas DataFrame column I have records like this:
column
strunk 0 somestring
strunk 0 anotherstring
strunk 0 string
How can I remove the strunk 0 part and keep only the rest?
to get:
column
somestring
anotherstring
string
If the part you want to remove has the same length in every row, you can slice it off:
df['new_col'] = df['old_col'].str[9:]  # 'strunk 0 ' is 9 characters, including the trailing space
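If the prefix is the same literal text (not just the same length), stripping it by name is less brittle than counting characters. A minimal sketch; `Series.str.removeprefix` needs pandas >= 1.4, while the anchored regex replace works on older versions too:

```python
import pandas as pd

df = pd.DataFrame({'column': ['strunk 0 somestring',
                              'strunk 0 anotherstring',
                              'strunk 0 string']})

# pandas >= 1.4: remove the literal prefix only where it is present
df['new_col'] = df['column'].str.removeprefix('strunk 0 ')

# older pandas: an anchored regex replace does the same
df['new_col'] = df['column'].str.replace(r'^strunk 0 ', '', regex=True)
```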
I have a dataframe where some columns contain long strings (e.g. 30000 characters). I would like to split these columns every 4000 characters so that I end up with a range of new columns containing strings of length at most 4000. I have an upper bound on the string lengths so I know there should be at most 9 new columns. I would like there to always be 9 new columns, having None/NaN in columns where the string is shorter.
As an example (with n = 10 instead of 4000 and 3 columns instead of 9), let's say I have the dataframe:
df_test = pd.DataFrame({'id': [1, 2, 3],
                        'str_1': ['This is a long string', 'This is an even longer string', 'This is the longest string of them all'],
                        'str_2': ['This is also a long string', 'a short string', 'mini_str']})
   id                                   str_1                       str_2
0   1                   This is a long string  This is also a long string
1   2           This is an even longer string              a short string
2   3  This is the longest string of them all                    mini_str
In this case I want to get the result
   id     str_1_1     str_1_2    str_1_3   str_1_4     str_2_1     str_2_2  str_2_3
0   1  This is a   long strin          g       NaN  This is al  so a long   string
1   2  This is an   even long  er string       NaN  a short st        ring      NaN
2   3  This is th  e longest   string of  them all    mini_str         NaN      NaN
Here, I want e.g. first row, column str_1_3 to be a string of length 1.
I tried using
df_test['str_1'].str.split(r".{10}", expand=True, n=10)
but that didn't work. It gave this result:
   0  1          2         3
0         g                None
1         er string        None
2                      them all
where the first columns aren't filled.
I also tried looping through every row and inserting '|' every 10 characters and then splitting on '|' but that seems tedious and slow.
Any help is appreciated.
The answer is quite simple: insert a delimiter and then split on it.
For example, use | as the delimiter and let n = 4:
series = pd.Series(['This is an even longer string',
                    'This is the longest string of them all'], name='str1')
name = series.name
cols = series.str.replace('(.{10})', r'\1|', regex=True).str.split('|', n=4, expand=True).add_prefix(f'{name}_')
That is, use str.replace to insert the delimiter, str.split to split the chunks apart, and add_prefix to add the column prefixes.
The output will be:
       str1_0      str1_1     str1_2    str1_3
0  This is an   even long  er string      None
1  This is th  e longest   string of  them all
The reason str.split(r'.{10}') doesn't work is that the pat parameter of str.split is a pattern matching the split delimiters, not the pieces that should appear in the split result. With str.split(r'.{10}'), every 10-character chunk is consumed as a delimiter, so only the leftover characters between matches survive.
UPDATE: According to the suggestion from @AKX, \x1F (the ASCII unit-separator control character) is a safer delimiter, since it cannot appear in ordinary text:
cols = series.str.replace('(.{10})', '\\1\x1F', regex=True).str.split('\x1F', n=4, expand=True).add_prefix(f'{name}_')
Note the absence of the r string flags.
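Putting it together for the original question, here is a sketch that avoids regex entirely (the helper name `split_fixed` and the pad-to-n_cols behavior are my own, not from the answer above): cutting each string with a plain slice makes padding to a fixed number of columns straightforward:

```python
import pandas as pd

def split_fixed(series, width, n_cols):
    """Cut each string into width-character chunks, padded with None
    so the result always has exactly n_cols columns."""
    def chunks(s):
        parts = [s[i:i + width] for i in range(0, len(s), width)]
        return parts + [None] * (n_cols - len(parts))
    out = pd.DataFrame(series.map(chunks).tolist(), index=series.index)
    return out.add_prefix(f'{series.name}_')

series = pd.Series(['This is an even longer string',
                    'This is the longest string of them all'], name='str1')
cols = split_fixed(series, width=10, n_cols=4)
```

Shorter strings get None in the trailing columns, which matches the "always 9 columns" requirement from the question.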
Is there any way to replace the dot in a float with a comma and keep a precision of 2 decimal places?
Example 1 : 105 ---> 105,00
Example 2 : 99.2 ---> 99,20
I used a lambda function df['abc']= df['abc'].apply(lambda x: f"{x:.2f}".replace('.', ',')). But then I have an invalid format in Excel.
I'm updating a specific sheet in Excel, so I'm using:
wb = load_workbook(filename)
ws = wb["FULL"]
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
Let us try:
out = (s // 1).astype(int).astype(str) + ',' + (s % 1 * 100).round().astype(int).astype(str).str.zfill(2)
0    105,00
1     99,20
dtype: object
Input data:
s = pd.Series([105, 99.2])
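One caveat with the integer arithmetic (my own note, not part of the answer): binary floating point can put the fractional part just under the next integer, so truncating with astype(int) silently drops a cent; rounding with .round() before the cast avoids that:

```python
import pandas as pd

s = pd.Series([0.29])

# 0.29 % 1 * 100 is 28.999999999999996 in binary floating point
truncated = (s % 1 * 100).astype(int)        # truncates, losing a cent
rounded = (s % 1 * 100).round().astype(int)  # rounds to the correct value
```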
s = pd.Series([105, 99.22]).apply(lambda x: f"{x:.2f}".replace('.', ','))
First, .apply takes a function and applies it to every element.
The f-string f"{x:.2f}" turns the float into a string with 2 decimal places and a '.'.
After that, .replace('.', ',') just replaces the '.' with ','.
You can change pd.Series([105, 99.22]) to match your dataframe column.
I think you're mixing something up here. In Excel you can set the display format, i.e. the format in which numbers are printed (the icon with +/-0).
But that is not the format of the cell's value; the cell is numeric either way. Your approach changes only the cell's value, not its formatting: you save it as a string, so Excel reads it back as a string.
That said, don't format the value; upgrade your pandas (if you haven't done so already) and try something along these lines: https://stackoverflow.com/a/51072652/11610186
To elaborate, try replacing your for loop with:
i = 1
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
    # replace D with the letter of the column you want formatted:
    ws[f'D{i}'].number_format = '#,##0.00'
    i += 1
Well, I found another way to specify the float format directly in Excel, using this code:
for col_cell in ws['S':'CP']:
    for i in col_cell:
        i.number_format = '0.00'
I have a dataframe like this:
Date Name
11-01-19 Craig-TX
22-10-23 Lucy-AR
I have a dictionary:
data = {'TX': 'Texas', 'AR': 'ARIZONA'}
I would like to replace partial string value from TX --> Texas and AR --> Arizona.
Resultant dataframe should be
Date Name
11-01-19 Craig-Texas
22-10-23 Lucy-Arizona
Do we have any specific function to replace the values in each row?
Add regex=True:
df = df.replace(data, regex=True)
Date Name
0 11-01-19 Craig-Texas
1 22-10-23 Lucy-ARIZONA
Safer: if a Name itself happened to contain 'TX', a plain replace would corrupt it, so split off the state code, map only that column, and join back row-wise:
df.Name = df.Name.str.split('-', expand=True).replace({1: data}).agg('-'.join, axis=1)
0     Craig-Texas
1    Lucy-ARIZONA
dtype: object
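A middle ground, sketched below (the pattern dictionary is my own construction, not from either answer): anchoring each code to the end of the string keeps the one-liner replace while still not touching a 'TX' that happens to appear inside the name itself:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['11-01-19', '22-10-23'],
                   'Name': ['Craig-TX', 'Lucy-AR']})
data = {'TX': 'Texas', 'AR': 'ARIZONA'}

# Turn each code into an end-anchored regex pattern: '-TX$' -> '-Texas'
patterns = {rf'-{k}$': f'-{v}' for k, v in data.items()}
df['Name'] = df['Name'].replace(patterns, regex=True)
```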
I am trying to read an Excel file with the values below:
I need a generic way to convert all 'NA' values to 'Not Available', and empty cells to either 0 or "" based on the column's type.
Example:
I used below code to do this:
df = pd.read_excel(price_excel, sheet_name=0, na_values='NA', keep_default_na=False)
df = df.replace(np.nan, 'Not Available', regex=True)
# changing NaN values of int and float to 0 and string to ""
col_with_data_type = {}
for j, i in zip(df.columns, df.dtypes):
    if i == 'int64' or i == 'float64':
        col_with_data_type[j] = 0
    else:
        col_with_data_type[j] = ""
df.fillna(value=col_with_data_type, inplace=True)
I am able to convert 'NA' to 'Not Available', but not the empty values to 0 or "".
Currently my dataframe looks like this:
Please help me to resolve this issue.
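A likely cause, sketched on a toy frame (the column names here are made up): with keep_default_na=False, blank cells come back as empty strings rather than NaN, so the later fillna never sees them. Converting '' back to NaN before the type-based fill makes the dictionary approach work:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [1.5, np.nan], 'desc': ['x', '']})

# blank cells arrive as '' when keep_default_na=False; make them NaN again
df = df.replace('', np.nan)

# then fill by dtype: 0 for numeric columns, "" for everything else
fill = {c: 0 if pd.api.types.is_numeric_dtype(df[c]) else ''
        for c in df.columns}
df = df.fillna(fill)
```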
I have a table with two character columns:
First_Column  Second Column
aaa           123aaa123
bbb           cdsbbbsxd
ccc           098fdsccd
I want to label a row 1 if the Second Column string contains the string from the First Column, and 0 otherwise.
I could not find a way to do that in SAS EG. Is there a function for this?
Thanks
You can use functions like find, index or count.
count('Second Column'n, First_Column)
index('Second Column'n, First_Column)
find('Second Column'n, First_Column)
In Query Builder you have to add a new column with an expression like below:
case count('Second Column'n, First_Column)
    when 0 then 0
    else 1
end