I have a dataframe with 2 columns containing audio filenames and corresponding texts, looking like this:
data = {'Audio_Filename': ['3e2bd3d1-b9fc-095728a4d05b',
                           '8248bf61-a66d-81f33aa7212d',
                           '81051730-8a18-6bf476d919a4'],
        'Text': ['On a trip to America, he saw people filling his noodles into paper cups.',
                 'When the young officers were told they were going to the front,',
                 'Yeah, unbelievable, I had not even thought of that.']}
df = pd.DataFrame(data, columns=['Audio_Filename', 'Text'])
Now I want to add a string prefix (the speaker ID: sp1, sp2, sp3) with an underscore _ to all audio filename strings according to this pattern:
sp2_3e2bd3d1-b9fc-095728a4d05b.
My difficulty: The prefix/speaker ID is not fixed but varies depending on the audio filenames. Because of this, I have zipped the audio filenames and the speaker IDs and iterated over those and the audio filename rows via for-loops. This is my code:
zipped = list(zip(audio_filenames, speaker_ids))
for audio, speaker_id in zipped:
    for index, row in df.iterrows():
        audio_row = row['Audio_Filename']
        if audio == audio_row:
            df['Audio_Filename'] = f'{speaker_id}_' + audio_row
df.to_csv('/home/user/file.csv')
I also tried apply with lambda after the if statement:
df['Audio_Filename'] = df['Audio_Filename'].apply(lambda x: '{}_{}'.format(speaker_id, audio_row))
But nothing works so far.
Can anyone please give me a hint on how to do this?
The resulting dataframe should look like this:
Audio_Filename Text
sp2_3e2bd3d1-b9fc-095728a4d05b On a trip to America, he saw people filling hi...
sp1_8248bf61-a66d-81f33aa7212d When the young officers were told they were go...
sp3_81051730-8a18-6bf476d919a4 Yeah, unbelievable, I had not even thought of ...
(Of course, I have much more audio filenames and corresponding texts in the dataframe).
I appreciate any help, thank you!
If you have the audio_filenames and speaker_ids lists, you can use the Series.map function. For example:
audio_filenames = [
    "3e2bd3d1-b9fc-095728a4d05b",
    "8248bf61-a66d-81f33aa7212d",
    "81051730-8a18-6bf476d919a4",
]
speaker_ids = ["sp2", "sp1", "sp3"]
mapper = {k: "{}_{}".format(v, k) for k, v in zip(audio_filenames, speaker_ids)}
df["Audio_Filename"] = df["Audio_Filename"].map(mapper)
print(df)
Prints:
Audio_Filename Text
0 sp2_3e2bd3d1-b9fc-095728a4d05b On a trip to America, he saw people filling his noodles into paper cups.
1 sp1_8248bf61-a66d-81f33aa7212d When the young officers were told they were going to the front,
2 sp3_81051730-8a18-6bf476d919a4 Yeah, unbelievable, I had not even thought of that.
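Note that Series.map returns NaN for any filename that is missing from the mapper. If the dataframe may contain filenames that are not in audio_filenames, a hedged variant keeps those rows unchanged:
# fall back to the original value where no mapping exists
df["Audio_Filename"] = df["Audio_Filename"].map(mapper).fillna(df["Audio_Filename"])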
I am trying to remove all
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n, \xe2\x80\xa6, \xe2\x80\x99
type character sequences from the strings below in a Python pandas column. Although the text starts with b', it's a string, not a bytes object.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is in a pandas dataframe as a column.
I am trying
string.encode('ascii', errors='ignore')
and regex, but without luck. It would be helpful if I could get some suggestions.
Your strings look like byte strings but are not, so encode/decode doesn't work. Try something like this:
>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
0 b'Hello! End Climate Silence is looking for v...
1 b'I doubt if climate emergency 8s real, I thin...
2 b'No, thankfully it doesnt. Cant see how cheap...
3 b'Climate Change Poses a WidelllThreat to Nati...
4 b""This doesn't feel like targeted propaganda ...
5 b'berates climate change activist who confront...
Name: text, dtype: object
Note you have to clean your unbalanced single/double quotes and remove the first 'b' character.
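A minimal sketch of that extra cleanup, assuming the leading b plus surrounding quote characters are the only leftovers (the column name text is taken from the example above):
# remove the \xNN escape sequences first
cleaned = df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
# then drop a leading b with surrounding quotes, and any trailing quotes
cleaned = cleaned.str.replace(r'^["\']*b["\']*|["\']*$', '', regex=True)
df['text'] = cleaned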
You could go through your strings and keep only ascii characters:
my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
Note that .encode('ascii', errors='ignore') doesn't change the string it's applied to but returns a new, encoded value (a bytes object). This should work:
new_str = my_str.encode('ascii', errors='ignore')
print(new_str)
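To run either variant over the whole column, a minimal sketch using the pandas string accessor:
# encode to ASCII bytes, dropping non-ASCII, then decode back to str
df['text'] = df['text'].str.encode('ascii', errors='ignore').str.decode('ascii')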
I have a problem similar to this question.
I am importing a large .csv file into pandas for a project. One column in the dataframe ultimately contains 4 columns of concatenated data (I can't control the data I receive): a brand name (what I want to remove), a product description, a product size, and a UPC. Please note that the brand description in Item_UPC does not always == Brand.
for example
import pandas as pd
df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
                                'xxx stuff coll tides 20 oz 09980-66-832',
                                'hel world sambucus elder 60 chw 0392-67-491',
                                'northern cold ultimate 180 sg 06580-66-832',
                                'ancient nuts boogs 16oz 58532-42-123 '],
                   'Brand': ['FUBAR OF BAZ',
                             'XXX STUFF',
                             'HELLO WORLD',
                             'NORTHERN COLDNITES',
                             'ANCIENT NUTS']})
I want to remove the brand name from the Item_UPC column, as this is redundant information, among other issues. Currently I have a function that takes the new df, pulls out the UPC, and cleans it up to match what one finds on bottles and in another database I have for a single brand, minus the last checksum digit.
def clean_upc(df):
    #take in a dataframe, expand the number of columns into a temp
    #dataframe
    temp = df["Item_UPC"].str.rsplit(" ", n=1, expand=True)
    #add columns to main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])
    #drop original combined column
    df.drop(columns=["Item_UPC"], inplace=True)
    #remove leading zero and hyphens in UPC
    df["UPC"] = df["UPC"].apply(lambda x: x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].apply(lambda x: x.replace('-', ''))
    col_names = df.columns
    #make all columns lower case to ease searching
    for cols in col_names:
        df[cols] = df[cols].apply(lambda x: x.lower() if type(x) == str else x)
After running this I have a dataframe with three columns:
UPC, Item, Brand
The data frame has over 300k rows and 2300 unique brands in it. There is also no consistent manner in which they shorten names. When I run the following code
temp = df["Item"].str.rsplit(" ", expand = True)
temp has a shape of
temp.shape
(329868, 13)
which makes manual curating a pain when most of columns 9-13 are empty.
Currently my logic is to first split Brand into two while dropping the first column in temp:
brand = df["Brand"].str.rsplit(" ", n=1, expand=True) #produce a dataframe of two columns
temp.drop(columns=[0], inplace=True)
and then do a string replace on temp[1] to see if it contains the regex in brand[1], replace it with " " (or vice versa), then concatenate temp back together:
temp["combined"] = temp[1] + temp[2] + ... + temp[13]
and replace the existing Item column with the combined column:
df["Item"] = temp["combined"]
Or is there a better way all around? There are many brands that only have one name, which may make everything faster. I have been struggling with regex, and logically it seems like this would be faster; I just have a hard time thinking of the syntax to make it work.
Because the input does not follow any well-defined rules, this looks like more of an optimization problem. You can start by stripping exact matches:
df["Item_cleaned"] = df.apply(lambda x: x.Item_UPC.lstrip(x.Brand.lower()), axis=1)
output:
Item_UPC Brand Item_cleaned
0 fubar baz dr frm prob onc dly wmn ogc 30vcp 06... FUBAR OF BAZ dr frm prob onc dly wmn ogc 30vcp 06580-66-832
1 xxx stuff coll tides 20 oz 09980-66-832 XXX STUFF coll tides 20 oz 09980-66-832
2 hel world sambucus elder 60 chw 0392-67-491 HELLO WORLD sambucus elder 60 chw 0392-67-491
3 northern cold ultimate 180 sg 06580-66-832 NORTHERN COLDNITES ultimate 180 sg 06580-66-832
4 ancient nuts boogs 16oz 58532-42-123 ANCIENT NUTS boogs 16oz 58532-42-123
This method will strip any exact matches and write the result to a new column, Item_cleaned. If your input is abbreviated, you would need a more complex fuzzy string matching algorithm, which may be prohibitively slow. In that case, I would recommend a two-step method: save all rows that were cleaned by the approach above, and do a second pass for more complicated cleaning as needed.
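One caveat: str.lstrip treats its argument as a set of characters rather than a literal prefix, so it can occasionally strip too much. On Python 3.9+, str.removeprefix avoids that; a minimal sketch of the same exact-match pass:
df["Item_cleaned"] = df.apply(
    lambda x: x.Item_UPC.removeprefix(x.Brand.lower()).strip(), axis=1
)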
The dataset has 14k rows and has many titles, etc.
I am a beginner in Pandas and Python and I'd like to know how to proceed with getting the output of first name and last name from this dataset.
Dataset:
0 Pr.Doz.Dr. Klaus Semmler Facharzt für Frauenhe...
1 Dr. univ. (Budapest) Dalia Lax
2 Dr. med. Jovan Stojilkovic
3 Dr. med. Dirk Schneider
4 Marc Scheuermann
14083 Bag Kinderarztpraxis
14084 Herr Ulrich Bromig
14085 Sohn Heinrich
14086 Herr Dr. sc. med. Amadeus Hartwig
14087 Jasmin Rieche
for name in dataset:
    first = name.split()[-2]
    last = name.split()[-1]
    # save here
This will work for most names, but not all. For reliability you may need a list of titles such as (dr., md., univ.) to skip over.
As it doesn't contain any structure, you're out of luck. An ad-hoc solution could be to just write down a list of all locations/titles/conjunctions and other noise you've identified and then strip those from the rows. Then, if you notice some other things you'd like to exclude, just add them to your list.
This will not solve the issue of certain rows having their name in reverse order. So it'll require you to manually go over everything and check if the row is valid, but it might be quicker than editing each row by hand.
A simple, brute-force example would be:
excludes = {'dr.', 'herr', 'budapest', 'med.', 'für', ... }
new_entries = []
for title in all_entries:
cleaned_result = []
parts = title.split(' ')
for part in parts:
if part.lowercase() not in excludes:
cleaned_result.append(part)
new_entries.append(' '.join(cleaned_result))
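A sketch of the same idea applied directly to a pandas column, assuming the names live in a column called name (the column name here is hypothetical):
def strip_titles(s):
    # keep only the tokens that are not in the excludes set
    return ' '.join(p for p in s.split() if p.lower() not in excludes)

dataset['name'] = dataset['name'].apply(strip_titles)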
I have a document that consists of many compound (or sometimes combined) words, like:
document.csv
index text
0 my first java code was helloworld
1 my cardoor is totally broken
2 I will buy a screwdriver to fix my bike
As seen above, some words are combined or compound, and I am using the compound word splitter from here to fix this issue; however, I have trouble applying it to each row of my document (as a pandas Series) and converting the document into a clean form:
cleanDocument.csv
index text
0 my first java code was hello world
1 my car door is totally broken
2 I will buy a screw driver to fix my bike
(I am aware that a word such as screwdriver should stay together, but my goal is cleaning the document.) If you have a better idea for splitting only combined words, please let me know.
The splitter code may work as:
import pandas as pd
import splitter  ## This uses an enchant dict (pip install enchant required)
data = pd.read_csv('document.csv')
then it should use something like:
splitter.split(data)  ## ???
I already looked into something like this, but it does not work in my case. Thanks.
You can use apply with axis=1. Can you try the following (this assumes splitter.split returns an empty list when a word cannot be split):
data['text'] = data.apply(
    lambda row: ' '.join(' '.join(splitter.split(w) or [w]) for w in row['text'].split()),
    axis=1
)
I do not have splitter installed on my system. By looking at the link you have provided, I have the following code. Can you try:
def handle_list(m):
    ret_lst = []
    L = m['text'].split()
    for wrd in L:
        g = splitter.split(wrd)
        if g:
            ret_lst.extend(g)
        else:
            ret_lst.append(wrd)
    return ' '.join(ret_lst)

data['text'] = data.apply(handle_list, axis=1)
I am reading an Excel file which has free text in a column. After loading the file with pandas, I want to restrict the text column to just the first N words of each row. I tried everything but was not able to make it work.
data["text"] = "I am going to school and I bought something from the market."
But I just want to keep the starting 5 words, so that it looks like below.
data["text"] = "I am going to school."
And I want this same operation to be done for each row of the data["text"] column.
Your help will be highly appreciated.
def first_k(s: str, k=5) -> str:
    s = str(s)  # just in case something like NaN tries to sneak in there
    first_words = s.split()[:k]
    return ' '.join(first_words)
Then, apply the function:
data['text'] = data['text'].apply(first_k)
data["text"] = [' '.join(s.split(' ')[:5]) for s in data["text"].values]