I have a pandas dataframe column like below:
| ColumnA |
+-------------+
| ABCD(!) |
| <DEFG>(23) |
| (MNPQ. ) |
| 32.JHGF |
| "QWERT" |
The aim is to remove the special characters and produce the output below:
| ColumnA |
+------------+
| ABCD |
| DEFG |
| MNPQ |
| JHGF |
| QWERT |
I tried using the replace method like below, but without success:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_]+", "", regex=True)
print(df)
So, how can I replace the special characters using replace method in pandas?
Your pattern also keeps digits (\d) and underscores (_), so only the other special characters are removed. Drop them from the character class:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
The regex should be r'[^a-zA-Z]+'; it means keep only the characters in the ranges A-Z and a-z.
import pandas as pd
# | ColumnA |
# +-------------+
# | ABCD(!) |
# | <DEFG>(23) |
# | (MNPQ. ) |
# | 32.JHGF |
# | "QWERT" |
# create a dataframe from a list
df = pd.DataFrame(['ABCD(!)', '<DEFG>(23)', '(MNPQ. )', '32.JHGF', '"QWERT"'], columns=['ColumnA'])
# | ColumnA |
# +------------+
# | ABCD |
# | DEFG |
# | MNPQ |
# | JHGF |
# | QWERT |
# keep only the characters that are from A to Z, a-z
df['ColumnB'] = df['ColumnA'].str.replace(r'[^a-zA-Z]+', '', regex=True)
print(df['ColumnB'])
Result:
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
Your suggested code works on my installation except that it leaves extra digits, so you need to update your regex to r"[^a-zA-Z]+". If this doesn't work, then maybe try updating your pandas:
import pandas as pd
d = {'ColumnA': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF', '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
Output
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
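If you would rather pull the letters out than delete everything else, `str.extract` with a capture group is a possible alternative. A sketch using the sample data from the question; note it keeps only the first run of letters per row:

```python
import pandas as pd

df = pd.DataFrame({'ColumnA': ['ABCD(!)', '<DEFG>(23)', '(MNPQ. )', '32.JHGF', '"QWERT"']})
# Capture the first run of letters instead of removing unwanted characters
df['ColumnA'] = df['ColumnA'].str.extract(r'([A-Za-z]+)', expand=False)
print(df['ColumnA'].tolist())
```

This behaves differently from the replace approach when a value contains several separate runs of letters, so pick whichever matches your data.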
Why are all dots stripped from strings that consist of numbers and dots, only when engine='python', and in the face of dtype being defined?
The unexpected behaviour is experienced when processing a csv file that:
has strings that solely consist of numbers and single dots spread throughout the string
the read_csv parameters are set: engine='python' and thousands='.'
Sample of testcode:
import pandas as pd # version 1.5.2
import io
data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 output: column a as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 output: column a not as expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str}, it seems that engine='python' handles it differently from engine='c'. dtype={'a': object} yields the same result.
I have spent quite some time getting to know all settings from the pandas read_csv and I can't see any other option I can set to alter this behaviour.
Is there anything I missed or is this behaviour 'normal'?
Looks like a bug (wasn't reported yet, so I filed it). I was only able to create a clumsy workaround:
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
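Put together, a self-contained version of the workaround (using the sample data from the question, with an explicit regex=False so the dot is treated literally) might look like:

```python
import io
import pandas as pd

data = "a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"
# Read everything as str so the python engine cannot touch column a,
# then strip the thousands separators only from the integer columns
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
print(df)
```

Column a keeps its dots because the thousands conversion is never applied to it.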
The data frame I am working on has a column named "Phone" and I want to split in on / or , in a way such that I get the data frame as shown below in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns having values as 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
Here I came up with a solution: take the lengths of the strings on both sides of the separator (/), take their difference, and copy the substring of the first column from character position [:difference-1] into the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only NaN values after I run this. Please help me out here.
Assuming you're only going to have at most one '/' in the 'Phone' column, here's what you can do:
def split_phone_number(row):
    '''
    This function takes in a row of the dataframe as input and returns
    the row with appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row
# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
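The answer above leaves the short second part as-is (e.g. 101), while the question's desired output borrows the missing prefix from the first number. A hedged sketch of that length-difference idea, on sample rows taken from the question:

```python
import pandas as pd

d = {'Phone': ['0674-2537100/101', '0674-2725627', '2586198/2583361',
               '0663-2542855/2405168', '0671-6520579/3200479']}
df = pd.DataFrame(d)

# Split once on '/'; rows without a slash get None in the second column
parts = df['Phone'].str.split('/', expand=True)
df['Phone'], df['Phone1'] = parts[0], parts[1]

# Where the second number is shorter than the first, prepend the missing
# prefix taken from the start of the first number
short = df['Phone1'].notna() & (df['Phone1'].str.len() < df['Phone'].str.len())
df.loc[short, 'Phone1'] = df.loc[short].apply(
    lambda r: r['Phone'][:len(r['Phone']) - len(r['Phone1'])] + r['Phone1'], axis=1)
print(df)
```

Equal-length pairs like 2586198/2583361 are left untouched, matching the desired output.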
I have a pandas DataFrame with a lot of text data. I want to remove all lines starting with "*" mark. Therefore, I tried a small example as the following.
string1 = '''* This needs to be gone
But this line should stay
*remove
* this too
End'''
string2 = '''* This needs to be gone
But this line should stay
*remove
* this too
End'''
df = pd.DataFrame({'a':[string1,string2]})
df['a'] = df['a'].map(lambda a: (re.sub(r'(?m)^\*.*\n?', '', a, flags=re.MULTILINE)))
It could perfectly do the job. However, when I applied the same function to my original DataFrame it is not working. Can you help me to identify the issue?
df2['NewsText'] = df2['NewsText'].map(lambda a: (re.sub(r'(?m)^\*.*\n?', '', a, flags=re.MULTILINE)))
df2.head()
Please see the attached image of my original DataFrame.
Given your example data
.str.split('\n') creates a list of each section
.apply(lambda x: '\n'.join([y for y in x if '*' not in y])) uses a list comprehension to remove each line containing * (anywhere in the line, not just at the start) and then joins the rest back into a string.
You can join with ' '.join or ''.join
.apply(lambda x: [y for y in x if '*' not in y]) if you want to have a list instead of a long string.
| | a |
|---:|:--------------------------|
| 0 | * This needs to be gone |
| | But this line should stay |
| | *remove |
| | * this too |
| | End |
| 1 | * This needs to be gone |
| | But this line should stay |
| | *remove |
| | * this too |
| | End |
# remove sections with '*'
df['a'] = df['a'].str.split('\n').apply(lambda x: '\n'.join([y for y in x if '*' not in y]))
# final
| | a |
|---:|:--------------------------|
| 0 | But this line should stay |
| | End |
| 1 | But this line should stay |
| | End |
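For completeness, the question's regex also works directly on the Series via .str.replace, without splitting (a sketch using string1 from above):

```python
import pandas as pd

string1 = '''* This needs to be gone
But this line should stay
*remove
* this too
End'''

df = pd.DataFrame({'a': [string1]})
# (?m) makes ^ match at the start of every line; \n? also drops the newline
df['a'] = df['a'].str.replace(r'(?m)^\*.*\n?', '', regex=True)
print(df.loc[0, 'a'])
```

Unlike the split approach, this only removes lines that start with *, not lines containing * elsewhere.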
I have a dataframe which contains some products, a date and a value. The dates have different gaps between recorded values that I want to fill out, such that I have a recorded value for every hour from the first time the product was seen to the last; if there is no record for an hour, I want to use the latest value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]
spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))
df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
.withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
.withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
.over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
So after some searching and experimenting I found a solution. I defined a udf that returns a date range between two dates with 1-hour intervals, and then I do a forward fill.
I fixed the issue with the following code:
from datetime import timedelta

from pyspark.sql import Window
from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType

def missing_hours(t1, t2):
    return [t1 + timedelta(hours=x) for x in range(0, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
window = Window.partitionBy("ProductId").orderBy("Date")
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
.filter(col("prev_timestamp").isNotNull()) \
.withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
.withColumn("Value", lit(None)) \
.drop("prev_timestamp")
df = df.union(df_missing)
window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(Window.unboundedPreceding, 0)
# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)
# do the fill
df = df.withColumn('Value', filled_values_column)
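For comparison, the same per-product hourly gap fill can be sketched in plain pandas (a hypothetical sample mirroring the question's data), using groupby plus resample and a forward fill:

```python
import pandas as pd

df = pd.DataFrame({
    'ProductId': [1, 1, 2, 2, 1, 3],
    'Date': pd.to_datetime(['2020-03-12 00:00', '2020-03-12 01:00',
                            '2020-03-12 01:00', '2020-03-12 03:00',
                            '2020-03-12 05:00', '2020-03-12 05:00']),
    'Value': [4, 2, 3, 4, 4, 2],
})

# Per product: build an hourly grid from the first to the last timestamp,
# forward-filling the gaps with the latest known value
filled = (df.set_index('Date')
            .groupby('ProductId')['Value']
            .resample('h')
            .ffill()
            .reset_index())
print(filled)
```

Each product only gets rows between its own first and last observation, matching the desired output.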
I’ve two dataframes df_1 and df_2
df_1 is my master dataframe and df_2 is a lookup dataframe.
I want to test if the value in df_1[‘col_c1’] contains any of the values from df_2[‘col_a2’].
If this is true (can be multiple matches !);
add the value(s) from df_2[‘col_b2’] to df_1[‘col_d1’]
add the value(s) from df_2[‘col_c2’] to df_1[‘col_e1’]
How can i achieve this?
I’ve really no idea and therefore I can’t share any code for this.
Sample df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
----------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | |
1_002 | zzzzz | ggggjjjjjkkkkk | |
1_003 | pppp | qqqqffffgggg | |
1_004 | sss | wwwcccyyy | |
1_005 | eeeeee | eecccffffll | |
1_006 | tttt | hhggeeuuuuu | |
Sample df_2
col_a2 | col_b2 | col_c2
------------------------------
ccc | 2_001 | some_data_c
jjj | 2_002 | some_data_j
fff | 2_003 | some_data_f
Desired output df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
------------------------------------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | 2_001 | some_data_c
1_002 | zzzzz | ggggjjjjjkkkkk | 2_002 | some_data_j
1_003 | pppp | qqqqffffgggg | 2_003 | some_data_f
1_004 | sss | wwwcccyyy | 2_001 | some_data_c
1_005 | eeeeee | eecccffffll | 2_001;2_003 | some_data_c; some_data_f
1_006 | tttt | hhggeeuuuuu | |
df_1 has approx. 45,000 rows and df_2 approx. 16,000 rows. (I also added a non-matching row.)
I've been struggling with this for hours, but I really have no idea.
I don't think merging is an option because there's no exact match.
Your help is greatly appreciated.
Use:
# extract values matching df_2["col_a2"] into a new column
s = (df_1['col_c1'].str.extractall(f'({"|".join(df_2["col_a2"])})')[0].rename('new')
     .reset_index(level=1, drop=True))
# repeat rows that have multiple matches
df_1 = df_1.join(s)
# add new columns by map
df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
# aggregate join
cols = df_1.columns.difference(['new', 'col_d1', 'col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
print(df)
col_a1 col_b1 col_c1 col_d1 col_e1
0 1_001 aaaaaa bbbbccccdddd 2_001 some_data_c
1 1_002 zzzzz ggggjjjjjkkkkk 2_002 some_data_j
2 1_003 pppp qqqqffffgggg 2_003 some_data_f
3 1_004 sss wwwcccyyy 2_001 some_data_c
4 1_005 eeeeee eecccffffll 2_001,2_003 some_data_c,some_data_f
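A self-contained version of the steps above, using the sample data (the non-matching 1_006 row is left out here, since the plain ','.join aggregation would fail on its NaN):

```python
import pandas as pd

df_1 = pd.DataFrame({
    'col_a1': ['1_001', '1_002', '1_003', '1_004', '1_005'],
    'col_b1': ['aaaaaa', 'zzzzz', 'pppp', 'sss', 'eeeeee'],
    'col_c1': ['bbbbccccdddd', 'ggggjjjjjkkkkk', 'qqqqffffgggg', 'wwwcccyyy', 'eecccffffll'],
})
df_2 = pd.DataFrame({
    'col_a2': ['ccc', 'jjj', 'fff'],
    'col_b2': ['2_001', '2_002', '2_003'],
    'col_c2': ['some_data_c', 'some_data_j', 'some_data_f'],
})

# one row per match of any lookup key, indexed by the original row
s = (df_1['col_c1'].str.extractall(f'({"|".join(df_2["col_a2"])})')[0]
     .rename('new').reset_index(level=1, drop=True))
df_1 = df_1.join(s)
# translate each matched key via the lookup table
df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
# collapse the repeated rows back to one row per original record
cols = df_1.columns.difference(['new', 'col_d1', 'col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
print(df)
```

With the non-matching row included you would need to fill or drop the NaN before aggregating, e.g. with a fillna('').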
This will solve it:
df['col_d1'] = df.apply(lambda x: ','.join([df2['col_b2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
df['col_e1'] = df.apply(lambda x: ','.join([df2['col_c2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
Output
col_a1 col_b1 col_c1 col_d1 \
0 1_001 aaaaaa bbbbccccdddd 2_001
1 1_002 zzzzz ggggjjjjjkkkkk 2_002
2 1_003 pppp qqqqffffgggg 2_003
3 1_004 sss wwwcccyyy 2_001
4 1_005 eeeeee eecccffffll 2_001,2_003
col_e1
0 some_data_c
1 some_data_j
2 some_data_f
3 some_data_c
4 some_data_c,some_data_f