How to split column values in a dataframe - python-3.x

I have a dataframe like this:
PK Name Mobile questions
1 Jack 12345 [{'question':"how are you","Response":"Fine"},{"question":"whats your age","Response":"i am 19"}]
2 kim 102345 [{'question':"how are you","Response":"Not Fine"},{"question":"whats your age","Response":"i am 29"}]
3 jame 420
I want the output df to be like:
PK Name Mobile Question 1 Response 1 Question 2 Response 2
1 Jack 12345 How are you Fine Whats your age i am 19
2 Kim 102345 How are you Not Fine Whats your age i am 29
3 jame 420

You can use explode to first create a row per element in each list. Then build a dataframe from this exploded series, keeping its index. Assign a column that counts rows incrementally within each index group, then set_index and unstack to create the right shape. Finally, flatten the column names and join back to the original df:
import pandas as pd

# create a row per element in each list in each row
s = df['questions'].explode()
# create the dataframe and reshape
df_qr = (pd.DataFrame(s.tolist(), index=s.index)
           .assign(cc=lambda x: x.groupby(level=0).cumcount() + 1)
           .set_index('cc', append=True)
           .unstack())
# flatten column names
df_qr.columns = [f'{col[0]} {col[1]}' for col in df_qr.columns]
# join back to df
df_f = df.drop('questions', axis=1).join(df_qr, how='left')
print(df_f)
PK Name Mobile question 1 question 2 Response 1 Response 2
0 1 Jack 12345 how are you whats your age Fine i am 19
1 2 kim 102345 how are you whats your age Not Fine i am 29
Edit: if some rows contain empty strings instead of lists, then create s this way:
s = df.loc[df['questions'].apply(lambda x: isinstance(x, list)), 'questions'].explode()

Related

How to do fuzzy matching within the same dataset with multiple columns

I have a student rank dataset in which a few values are missing. I want to apply fuzzy logic to the name and rank columns within the same dataset, find the best matching values, update the null values in the remaining columns, and add a matched name column, a matched rank column, and a score. I'm a beginner, so it would be great if someone could help me. Thank you.
data:
Name School Marks Location Rank
0 JACK TML 90 AU 3
1 JHON SSP 85 NULL NULL
2 NULL TML NULL AU 3
3 BECK NTC NULL EU 2
4 JHON SSP NULL JP 1
5 SEON NTC 80 RS 5
Expected Data Output:
data:
Name School Marks Location Rank Matched_Name Matched_Rank Score
0 JACK TML 90 AU 3 Jack 3 100
1 JHON SSP 85 JP 1 JHON 1 100
2 BECK NTC NULL EU 2 - - -
3 SEON NTC 80 RS 5 - - -
How do I do it with fuzzy logic?
Here is my code:
import datetime
import pandas as pd
import fuzzymatcher

ds1 = pd.read_csv('dataset.csv')
ds2 = pd.read_csv('dataset.csv')
# Columns to match on from df_left
left_on = ["Name", "Rank"]
# Columns to match on from df_right
right_on = ["Name", "Rank"]
# Now perform the match
# Start the timer
a = datetime.datetime.now()
print('started at :', a)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
                                               ds2,
                                               left_on,
                                               right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b - a)
print(matched_results)
try:
    print(matched_results.columns)
    cols = matched_results.columns
except:
    pass
matched_results.to_csv('matched_results.csv', index=False)
# Let's see the best matches
try:
    matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
except:
    pass
fuzzywuzzy is usually used when names are not exact matches, which doesn't appear to be the case here. However, if your names aren't exact matches, you may do the following:
1. Create a list of all school names using df['school_name'].tolist()
2. Find the null values in your data frame.
3. Use process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)
Just remember that you should never use fuzzy matching when you have exact names. In that case you only need to filter the data frame like this:
filtered = df[df['school_name'] == x]
and use it to replace values in the original data frame.
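A minimal sketch of those steps on a toy frame built from the question's data; the best_match helper and the threshold of 80 are hypothetical choices, and fuzzywuzzy's process.extractOne returns a (match, score) tuple:
import pandas as pd
from fuzzywuzzy import fuzz, process

# Toy frame based on the question's data; None marks the missing name
df = pd.DataFrame({'Name': ['JACK', 'JHON', None, 'BECK'],
                   'Rank': [3, None, 3, 2]})

# Step 1: list of known (non-null) names to match against
known_names = df.loc[df['Name'].notna(), 'Name'].unique().tolist()

# Steps 2-3: best fuzzy match and its score per name; 80 is a
# hypothetical cut-off, tune it for your data
def best_match(name, threshold=80):
    if pd.isna(name):
        return (None, None)
    match, score = process.extractOne(name, known_names,
                                      scorer=fuzz.partial_ratio)
    return (match, score) if score >= threshold else (None, None)

df[['Matched_Name', 'Score']] = pd.DataFrame(
    [best_match(n) for n in df['Name']],
    index=df.index, columns=['Matched_Name', 'Score'])
print(df)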

How to melt data shape into a custom format in python pandas? [duplicate]

I have the following data stored in a pandas dataframe:
ID Block
MGKfdkldr Product 1
MGKfdkldr Product 2
MGKfdkldr Product 3
GLOsdasd Product 2
GLOsdasd Product 3
NewNew Product 1
OldOld Product 4
OldOld Product 8
Here is the sample dataframe code:
df1 = pd.DataFrame({'ID': ['MGKfdkldr', 'MGKfdkldr', 'MGKfdkldr', 'GLOsdasd',
                           'GLOsdasd', 'NewNew', 'OldOld', 'OldOld'],
                    'Block': ['Product 1', 'Product 2', 'Product 3', 'Product 2',
                              'Product 3', 'Product 1', 'Product 4', 'Product 8']})
I am looking for the below data format (expected output):
ID Block-1 Block-2 Block-3
MGKfdkldr Product 1 Product 2 Product 3
GLOsdasd Product 2 Product 3
NewNew Product 1
OldOld Product 4 Product 8
I have tried to melt it with the pd.melt function, but that just transposes the data into the column header, and I am looking for something a bit different. Is there any other method through which I can get my expected output?
Can anyone help me with this?
The function you're looking for is pivot, not melt. You'll also need to provide a "counter" column that simply counts the repeated "ID"s to get everything to align properly.
df1["Block_id"] = df1.groupby("ID").cumcount() + 1
new_df = (df1.pivot("ID", "Block_id", "Block") # reshapes our data
.add_prefix("Block-") # adds "Block-" to our column names
.rename_axis(columns=None) # fixes funky column index name
.reset_index()) # inserts "ID" as a regular column instead of an Index
print(new_df)
ID Block-1 Block-2 Block-3
0 GLOsdasd Product 2 Product 3 NaN
1 MGKfdkldr Product 1 Product 2 Product 3
2 NewNew Product 1 NaN NaN
3 OldOld Product 4 Product 8 NaN
If you want actual blanks (e.g. the empty string "") instead of NaN, you can use new_df.fillna("").

I want to merge 4 rows to form 1 row with 4 sub-rows in pandas Dataframe

This is my dataframe
I have tried this, but it didn't work:
df1['quarter'].str.contains('/^[-+](20)$/', re.IGNORECASE).groupby(df1['quarter'])
Thanks in advance
Hi and welcome to the forum! If I understood your question correctly, you want to form groups per year?
Of course, you can simply group by year, as you already have that column.
Assuming you didn't have the year column, you can simply group by the whole string except the last 2 characters of the quarter column, like this (I created a toy dataset for the answer):
import pandas as pd
d = {'quarter': pd.Series(['1947q1', '1947q2', '1947q3', '1947q4', '1948q1']),
     'some_value': pd.Series([1, 3, 2, 4, 5])}
df = pd.DataFrame(d)
df
This is our toy dataframe:
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
4 1948q1 5
Now we simply group by the year by stripping the last 2 characters:
grouped = df.groupby(df.quarter.str[:-2])
for name, group in grouped:
    print(name)
    print(group, '\n')
Output:
1947
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
1948
quarter some_value
4 1948q1 5
Additional comment: I used string slicing, an operation that you can always apply to strings. Check this, for example:
s = 'Hi there, Dhruv!'
#Prints the first 2 characters of the string
print(s[:2])
#Output: "Hi"
#Prints everything after the third character
print(s[3:])
#Output: "there, Dhruv!"
#Prints the text between the 10th and the 15th character
print(s[10:15])
#Output: "Dhruv"

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500 MB. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of each column's values during each month. An added complexity is that I have a column named group which denotes the sensor with which the parameter (column) was measured. Therefore, the analysis needs to iterate over each group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. The inline comments describe most lines. I recognize that the three for loops are causing the runtime issue, but I do not see a way around them. Does anyone know a faster approach?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in the last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on MultiIndex columns. Future note: use df[DATE, COL_NAME][UNIT] to access a mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0, 1, 1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through months
    for mnth in month_dates:
        # Build a mask for this month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df by divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over the column list, find the mean from the aggregation, and divide the value by it
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list=normalize_cols, axis=1)
The code runs perfectly for one iteration and then fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once, then it iterates over the same row again and fails. According to the df.apply() documentation, the first row may be processed twice; I'm just not sure why it fails the second time through.
Assuming the requirement is to normalize every column by its per-group monthly mean, here is another approach:
Create new columns, month and year, from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex:
type(df.index)  # df is the original dataframe
# pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year  # added year, assuming the grouping occurs per grp per month per year; drop this column if year is not to be considered
Now, group over (grp, month, year) and aggregate to find the mean of every column:
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values, applied with apply() over the original dataframe:
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregation, and divide the value by it
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
# df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm; a sketch of that is below. Hope this helps!
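A minimal, self-contained sketch of that merge-based alternative, assuming the toy column names used above ('grp' plus value columns A, B, C); because it is vectorized, it should also run much faster than the row-wise apply():
import pandas as pd

# Toy data mirroring the answer's example
idx = pd.to_datetime(['2019-02-01 09:30:07', '2019-03-02 09:30:07',
                      '2019-02-01 09:40:07', '2019-02-01 09:38:07'])
df = pd.DataFrame({'A': [1, 2, 2, 2], 'B': [2, 3, 3, 3],
                   'C': [3, 4, 1, 1], 'grp': [1, 1, 2, 1]}, index=idx)
df['month'] = df.index.month
df['year'] = df.index.year

value_cols = ['A', 'B', 'C']
# Per-(grp, month, year) means, merged back onto every row at once
means = df.groupby(['grp', 'month', 'year'], as_index=False)[value_cols].mean()
merged = df.reset_index().merge(means, on=['grp', 'month', 'year'],
                                suffixes=('', '_mean'))
for col in value_cols:
    merged['norm' + col] = merged[col] / merged[col + '_mean']
print(merged[['grp'] + ['norm' + c for c in value_cols]])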

How to find the row of a dataframe from column value then update row with dictionary?

I have a dataframe of the form:
import pandas as pd

df = pd.DataFrame([{'Name': 'John', 'Age': 10, 'Asset': 'Bike'},
                   {'Name': 'Sarah', 'Age': 17, 'Asset': 'Laptop'},
                   {'Name': 'Noah', 'Age': 14, 'Asset': 'Book'}])
df
Name Age Asset
0 John 10 Bike
1 Sarah 17 Laptop
2 Noah 14 Book
Now I want to take a dictionary {'Name': 'John', 'Age': 11, 'Asset': 'Phone'}, find the row of the df with Name John, change the Age to 11, and change the Asset to 'Phone'. Assume every column of the dataframe is a key in the dictionary.
Since .loc retrieves rows, I thought this would work:
df.loc[df['Name'] == 'John'] = {'Name': 'John', 'Age': 11, 'Asset': 'Phone'}
However, this doesn't work; you need to update with a list.
What's the efficient way to update a dataframe row with a dictionary?
You can set the index to Name and then call df.update() to update the dataframe against the matching index. Finally, .reset_index() turns the index back into a column:
d = {'Name': 'John', 'Age': 11, 'Asset': 'Phone'}
d_ = pd.DataFrame.from_dict(d, orient='index').T.set_index('Name')
m_ = df.set_index('Name')
m_.update(d_)
df = m_.reset_index()
df
Name Age Asset
0 John 11 Phone
1 Sarah 17 Laptop
2 Noah 14 Book
I think you are trying to make the change not just for John, but for any dictionary? Let's set the dictionary you provided as
di = {'Name': 'John', 'Age': 11, 'Asset': 'Phone'}
Then we can filter rows by 'Name' using .loc, select the columns 'Age' and 'Asset', and set the values from the dictionary:
df.loc[df['Name'] == di['Name'], ['Age', 'Asset']] = [di['Age'], di['Asset']]
print(df)
Name Age Asset
0 John 11 Phone
1 Sarah 17 Laptop
2 Noah 14 Book
You may assign dict.values() directly to the slice you want to modify via .loc as follows (note: this relies on the insertion order of d, which is guaranteed on Python 3.7+; on older versions use collections.OrderedDict):
d = {'Name' :'John','Age':11,'Asset' :'Phone'}
df.loc[df.Name.eq(d['Name']), :] = list(d.values())
Out[643]:
Name Age Asset
0 John 11 Phone
1 Sarah 17 Laptop
2 Noah 14 Book
