Here is an example of the RDD:
joinRDD = (productdict.join(result))
# [('B000002KXA', ('Music', ['AVM91SKZ9M58T', "This remix and single version of Madonna's song Resume Me [Immaculate Collection Album] is one of Madonna's best. Not only does it show the true ability of Madonna's vocal ability but the power this song brings to your heart. Madonna's voice in this single is unlike any other song she has sang, beautifully put together and mastered. A song everyone remembers from Madonna's long list of credits. This CD is one not to miss either you love Madonna or love-to-hate her, you must have it, a collection item!!! END", '5.0', 97]))]
Because of the nested ( ) structure, I'm not sure how to do a count based on the second column ('Music').
Doing the following does not work:
joinRDD2 = joinRDD.mapValues(lambda x: (x[1], 1)).reduceByKey(add)
The thing is, your second "column" is not ('Music'); it's actually
('Music', ['AVM91SKZ9M58T', "This remix ...", '5.0', 97])
x[1] returns this whole tuple, while it seems you need only 'Music', which is its first element. So instead of x[1] you should use x[1][0].
Example RDD:
from operator import add
joinRDD = sc.parallelize([
('B000002KXA', ('Music', ['AVM91SKZ9M58T', "This remix", 4])),
('B000002KXA', ('Music', ['AVM91SKZ9M58T', "This remix", 4])),
('B000002KXA', ('Drama', ['AVM91SKZ9M58T', "This remix", 4])),
])
Test:
joinRDD2 = joinRDD.map(lambda x: (x[1][0], 1)).reduceByKey(add)
joinRDD2.collect()
# [('Drama', 1), ('Music', 2)]
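An alternative sketch (my addition, not part of the original answer): if all you need is the per-category tally, you can map out just the category and let countByValue() do the counting on the driver; it returns a dict-like object instead of an RDD.
# collect the counts directly as a defaultdict on the driver
counts = joinRDD.map(lambda x: x[1][0]).countByValue()
# defaultdict(<class 'int'>, {'Music': 2, 'Drama': 1})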
I have a dataframe with 2 columns containing audio filenames and corresponding texts, looking like this:
import pandas as pd

data = {'Audio_Filename': ['3e2bd3d1-b9fc-095728a4d05b',
'8248bf61-a66d-81f33aa7212d',
'81051730-8a18-6bf476d919a4'],
'Text': ['On a trip to America, he saw people filling his noodles into paper cups.',
'When the young officers were told they were going to the front,',
'Yeah, unbelievable, I had not even thought of that.']}
df = pd.DataFrame(data, columns = ['Audio_Filename', 'Text'])
Now I want to add a string prefix (the speaker ID: sp1, sp2, sp3) with an underscore _ to all audio filename strings according to this pattern:
sp2_3e2bd3d1-b9fc-095728a4d05b.
My difficulty: the prefix/speaker ID is not fixed but varies depending on the audio filename. Because of this, I have zipped the audio filenames and speaker IDs together and iterated over those and over the dataframe rows with for-loops. This is my code:
zipped = list(zip(audio_filenames, speaker_ids))
for audio, speaker_id in zipped:
    for index, row in df.iterrows():
        audio_row = row['Audio_Filename']
        if audio == audio_row:
            df['Audio_Filename'] = f'{speaker_id}_' + audio_row
df.to_csv('/home/user/file.csv')
I also tried apply with lambda after the if statement:
df['Audio_Filename'] = df['Audio_Filename'].apply(lambda x: '{}_{}'.format(speaker_id, audio_row))
But nothing works so far.
Can anyone please give me a hint on how to do this?
The resulting dataframe should look like this:
Audio_Filename Text
sp2_3e2bd3d1-b9fc-095728a4d05b On a trip to America, he saw people filling hi...
sp1_8248bf61-a66d-81f33aa7212d When the young officers were told they were go...
sp3_81051730-8a18-6bf476d919a4 Yeah, unbelievable, I had not even thought of ...
(Of course, I have much more audio filenames and corresponding texts in the dataframe).
I appreciate any help, thank you!
If you have the audio_filenames and speaker_ids lists, you can use the Series.map function. For example:
audio_filenames = [
"3e2bd3d1-b9fc-095728a4d05b",
"8248bf61-a66d-81f33aa7212d",
"81051730-8a18-6bf476d919a4",
]
speaker_ids = ["sp2", "sp1", "sp3"]
mapper = {k: "{}_{}".format(v, k) for k, v in zip(audio_filenames, speaker_ids)}
df["Audio_Filename"] = df["Audio_Filename"].map(mapper)
print(df)
Prints:
Audio_Filename Text
0 sp2_3e2bd3d1-b9fc-095728a4d05b On a trip to America, he saw people filling his noodles into paper cups.
1 sp1_8248bf61-a66d-81f33aa7212d When the young officers were told they were going to the front,
2 sp3_81051730-8a18-6bf476d919a4 Yeah, unbelievable, I had not even thought of that.
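One caveat worth noting (my addition, not part of the original answer): Series.map returns NaN for any filename that is missing from the mapper, so if your lists don't cover every row, filling the original value back in keeps unmatched rows unchanged:
# keep the original filename wherever the mapper has no entry
df["Audio_Filename"] = df["Audio_Filename"].map(mapper).fillna(df["Audio_Filename"])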
I went from one dataframe to another and performed calculations on the column next to the name for each unique person. Now I have an output of the names with the calculations next to them, and I want to break it into two columns, put it in a dataframe, and print it. I'm thinking I should put the entire for loop into a dictionary and then a dataframe, but I'm not too sure how to do that. I am a beginner at this and would really appreciate people's help. See the code for the for-loop piece below:
names = df['Participant Name, Number'].unique()
for name in names:
    unique_name_df = df[df['Participant Name, Number'] == name]
    badge_types = unique_name_df['Dosimeter Location'].unique()
    if 'Collar' in badge_types:
        collar = unique_name_df[unique_name_df['Dosimeter Location'] == 'Collar']['Total DDE'].astype(float).sum()
    if 'Chest' in badge_types:
        chest = unique_name_df[unique_name_df['Dosimeter Location'] == 'Chest']['Total DDE'].astype(float).sum()
    if len(badge_types) == 1:
        if 'Collar' in badge_types:
            value = collar
        elif 'Chest' in badge_types:
            value = chest
    print(name, value)
If you expect len(badge_types) == 1 in all cases, try:
pd.DataFrame( df.groupby('Participant Name, Number')['Total DDE'].sum() )
Otherwise, to get the sum per Dosimeter Location, add it to the groupby:
pd.DataFrame( df.groupby(['Participant Name, Number', 'Dosimeter Location'])['Total DDE'].sum() )
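If you want the name and the summed dose as two ordinary columns (as the question describes) rather than as an index, a small sketch chaining reset_index, assuming the 'Total DDE' values still need the float cast from the question:
# cast once up front, then sum per participant and flatten the index
df['Total DDE'] = df['Total DDE'].astype(float)
result = df.groupby('Participant Name, Number')['Total DDE'].sum().reset_index()
print(result)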
I have a problem similar to this question.
I am importing a large .csv file into pandas for a project. One column in the dataframe ultimately contains four columns of concatenated data (I can't control the data I receive): a brand name (which I want to remove), a product description, a product size, and a UPC. Please note that the brand abbreviation in Item_UPC does not always match Brand.
For example:
import pandas as pd
df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
'xxx stuff coll tides 20 oz 09980-66-832',
'hel world sambucus elder 60 chw 0392-67-491',
'northern cold ultimate 180 sg 06580-66-832',
'ancient nuts boogs 16oz 58532-42-123 '],
'Brand': ['FUBAR OF BAZ',
'XXX STUFF',
'HELLO WORLD',
'NORTHERN COLDNITES',
'ANCIENT NUTS']})
I want to remove the brand name from the Item_UPC column, as this is redundant information, among other issues. Currently I have a function that takes the new df, pulls out the UPC, and cleans it up to match what one finds on bottles and in another database I have for a single brand, minus the final checksum digit.
def clean_upc(df):
    # take in a dataframe, expand the combined column into a temp dataframe
    temp = df["Item_UPC"].str.rsplit(" ", n=1, expand=True)
    # add columns to the main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])
    # drop the original combined column
    df.drop(columns=["Item_UPC"], inplace=True)
    # remove the leading zero and the hyphens in the UPC
    df["UPC"] = df["UPC"].apply(lambda x: x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].apply(lambda x: x.replace('-', ''))
    col_names = df.columns
    # make all columns lower case to ease searching
    for cols in col_names:
        df[cols] = df[cols].apply(lambda x: x.lower() if type(x) == str else x)
After running this I have a dataframe with three columns:
UPC, Item, Brand
The dataframe has over 300k rows and 2,300 unique brands in it, and there is no consistent manner in which brand names are shortened. When I run the following code
temp = df["Item"].str.rsplit(" ", expand = True)
temp has a shape of
temp.shape
(329868, 13)
which makes manual curation a pain, since most of columns 9-13 are empty.
Currently my logic is to first split Brand into two columns while dropping the first column of temp:
brand = df["Brand"].str.rsplit(" ", n=1, expand=True)  # produce a dataframe of two columns
temp.drop(columns=[0], inplace=True)
and then do a string replace on temp[1] to see if it contains the pattern in brand[1] (or vice versa), replace it with " ", and then concatenate temp back together:
temp["combined"] = temp[1] + temp[2]....+temp[13]
and replace the existing Item column with the combined column
df["Item"] = temp["combined"]
Or is there a better way all around? There are many brands that have only one name, which may make everything faster. I have been struggling with regex; logically it seems like this would be faster, I just have a hard time coming up with the syntax to make it work.
Because the input does not follow any well-defined rules, this looks like more of an optimization problem. You can start by stripping exact matches:
df["Item_cleaned"] = df.apply(lambda x: x.Item_UPC.lstrip(x.Brand.lower()), axis=1)
output:
Item_UPC Brand Item_cleaned
0 fubar baz dr frm prob onc dly wmn ogc 30vcp 06... FUBAR OF BAZ dr frm prob onc dly wmn ogc 30vcp 06580-66-832
1 xxx stuff coll tides 20 oz 09980-66-832 XXX STUFF coll tides 20 oz 09980-66-832
2 hel world sambucus elder 60 chw 0392-67-491 HELLO WORLD sambucus elder 60 chw 0392-67-491
3 northern cold ultimate 180 sg 06580-66-832 NORTHERN COLDNITES ultimate 180 sg 06580-66-832
4 ancient nuts boogs 16oz 58532-42-123 ANCIENT NUTS boogs 16oz 58532-42-123
This method will strip exact matches and write the output to a new column, Item_cleaned. Note that lstrip treats its argument as a set of characters, stripping any leading characters that appear anywhere in the brand string; that is what lets it handle the abbreviated brands above, but it can occasionally strip more than intended. If your input is abbreviated beyond this, you would need a more complex fuzzy string-matching algorithm, which may be prohibitively slow. In that case, I would recommend a two-step method: save all rows that are cleaned by the approach above, and do a second pass for the more complicated cleaning as needed.
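If you want to remove only true whole-brand prefixes and avoid lstrip's character-set behavior, here is a hedged sketch using re.sub with an escaped, anchored pattern (my variant, not the answer above; note it will not handle abbreviated brands such as 'hel world'):
import re

# remove the exact lower-cased brand, plus any trailing whitespace, from the start only
df["Item_cleaned"] = df.apply(
    lambda r: re.sub(r"^" + re.escape(r.Brand.lower()) + r"\s*", "", r.Item_UPC),
    axis=1,
)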
So in my code I want to produce this structure:
Mix the <ingredient>, <ingredient>, <ingredient> and <ingredient> together.
I have this code:
import random

ingredient = ['flour', 'baking powder', 'butter', 'milk', 'eggs', 'vanilla', 'sugar']

def create_recipe(main_ingredient, baking, measure, ingredient):
    """Create a random recipe and print it."""
    m_ingredient = random.choice(main_ingredient)  # random choice from the main ingredient list
    baking_ = random.choice(baking)  # random choice from the baking list
    ingredients = random.choice(ingredient)  # random choice from the ingredient list
    print("***", m_ingredient.title(), baking_.title(), "Recipe ***")
    print("Ingredients:")
    print(random.randint(1, 3), random.choice(measure), m_ingredient)
    for i in range(len(ingredients)):  # get the random ingredients
        print(random.randint(1, 3), random.choice(measure), ingredient[i])
    print("Method:")
    selected = []  # this is where the ingredients from the recipe go
    selected_items = "Mix the"
    for i in range(len(ingredients)):
        ingredients = str(random.choice(ingredient))
        selected.append(ingredients)
    selected.insert(-2, 'and')
    print(selected_items, ",".join(selected), "together.")
This was the output I got:
Mix the eggs,baking powder,sugar,and,flour,sugar together.
How do I put the 'and' before the last item in the list, so the output matches the structure I wanted?
Put all your ingredients into a list and slice it according to your chosen string formatting:
things = ['flour', 'baking powder', 'butter', 'milk', 'eggs', 'vanilla', 'sugar']
# join all but the last element using ", " as separator, then print the last element
# after the "and"
print(f"You need {', '.join(things[:-1])} and {things[-1]}")
# or
print("You need", ', '.join(things[:-1]), "and", things[-1])
Output:
You need flour, baking powder, butter, milk, eggs, vanilla and sugar
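Applied to the sentence from the question, the same slicing produces the desired structure (a sketch assuming selected already holds the chosen ingredient names, without the inserted 'and'):
selected = ['eggs', 'baking powder', 'sugar', 'flour']
print(f"Mix the {', '.join(selected[:-1])} and {selected[-1]} together.")
# Mix the eggs, baking powder, sugar and flour together.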
Further resources:
Understanding slice notation
Is there a Python equivalent to Ruby's string interpolation?
Joining pairs of elements of a list - Python (related; a possibly better approach)
I have 10M texts (they fit in RAM) and a Python dictionary of the kind:
"old substring": "new substring"
The dictionary contains ~15k substrings.
I am looking for the FASTEST way to apply the dict to each text (to find every "old substring" in every text and replace it with "new substring").
The source texts are in pandas dataframe.
So far I have tried these approaches:
1) Replace in a loop with reduce and str.replace (~120 rows/sec):
from functools import reduce

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))
2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):
from tqdm import tqdm

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))
Also, .iterrows() is about 20% slower than .itertuples().
3) Using apply on the Series (also ~160 rows/sec):
replaced = df['text'].apply(string_replace)
At this speed it takes hours to process the whole dataset.
Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? It can be tricky or ugly, but it has to be as fast as possible; it doesn't have to use pandas.
Thanks.
UPDATED:
Toy data to check the idea:
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
Expected result:
old replaced
0 first text to replace FT to rep
1 second text to replace 2nd text to rep
I've dug into this again and found a fantastic library called flashtext.
The speedup on 10M records with a 15k vocabulary is about x100 (really, one hundred times faster than regexp or the other approaches from my first post)!
Very easy to use:
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
import flashtext
processor = flashtext.KeywordProcessor()
for k, v in mapping.items():
processor.add_keyword(k, v)
print(list(map(processor.replace_keywords, df["old"])))
Result:
['FT to rep', '2nd text to rep']
It also adapts flexibly to different languages if needed, via the processor.non_word_boundaries attribute.
The trie-based search used here gives an amazing speedup.
One solution would be to convert the dictionary to a trie and write the code so that you pass only once through the text being modified.
Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.
Of course, if you need to apply the replacements to already-replaced text as well, this is harder.
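A minimal sketch of that single-pass idea (my own illustration, not the answerer's code): build a trie from the mapping, then walk the text once, taking the longest match found at each position.
def build_trie(mapping):
    root = {}
    for old, new in mapping.items():
        node = root
        for ch in old:
            node = node.setdefault(ch, {})
        node['__end__'] = new  # store the replacement at the terminal node
    return root

def replace_all(text, root):
    out, i, n = [], 0, len(text)
    while i < n:
        node, j, match = root, i, None
        # walk the trie as far as the text allows, remembering the longest hit
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if '__end__' in node:
                match = (j, node['__end__'])
        if match:
            end, new = match
            out.append(new)
            i = end  # jump past the replaced substring
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

trie = build_trie({"first text": "FT", "replace": "rep", "second": "2nd"})
print(replace_all("second text to replace", trie))  # 2nd text to rep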
I think you are looking for replace with regex=True on the df.
If you have a dictionary, pass it as the parameter:
d = {'old substring':'new substring','anohter':'another'}
For the entire dataframe:
df.replace(d, regex=True)
For a series:
df[column].replace(d, regex=True)
Example
df = pd.DataFrame({ "old":
["first text to replace",
"second text to replace"]
})
mapping = {"first text": "FT",
"replace": "rep",
"second": '2nd'}
df['replaced'] = df['old'].replace(mapping,regex=True)
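One caveat (my note, not part of the answer): with regex=True the dictionary keys are treated as regular expressions, so keys containing special characters (., +, parentheses, etc.) should be escaped first:
import re

# escape every key so it matches literally rather than as a regex pattern
safe = {re.escape(k): v for k, v in mapping.items()}
df['replaced'] = df['old'].replace(safe, regex=True)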