turn three columns into dictionary python - python-3.x

Name = [list(['Amy', 'A', 'Angu']),
list(['Jon', 'Johnson']),
list(['Bob', 'Barker'])]
Other = [list(['Amy', 'Any', 'Anguish']),
list(['Jon', 'Jan']),
list(['Baker', 'barker'])]
import pandas as pd
df = pd.DataFrame({'Other' : Other,
'ID': ['E123','E456','E789'],
'Other_ID': ['A123','A456','A789'],
'Name' : Name,
})
     ID            Name                Other Other_ID
0  E123  [Amy, A, Angu]  [Amy, Any, Anguish]     A123
1  E456  [Jon, Johnson]           [Jon, Jan]     A456
2  E789   [Bob, Barker]      [Baker, barker]     A789
I have the df shown above. I want to turn the columns ID, Name and Other into a dictionary with the key being ID. I tried this, following "python pandas dataframe columns convert to dict key and value":
todict = dict(zip(df.ID, df.Name))
Which is close to what I want
{'E123': ['Amy', 'A', 'Angu'],
'E456': ['Jon', 'Johnson'],
'E789': ['Bob', 'Barker']}
But I would like to get this output, which also includes the values from the Other column:
{'E123': ['Amy', 'A', 'Angu','Amy', 'Any','Anguish'],
'E456': ['Jon', 'Johnson','Jon','Jan'],
'E789': ['Bob', 'Barker','Baker','barker']
}
If I add the third column, Other, it gives me errors:
todict = dict(zip(df.ID, df.Name, df.Other))
How do I get the output I want?

Why not just combine the Name and Other columns before creating the dict? Note that this overwrites the Name column in df.
df['Name'] = df['Name'] + df['Other']
dict(zip(df.ID, df.Name))
Gives
{'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
'E789': ['Bob', 'Barker', 'Baker', 'barker']}
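If overwriting df['Name'] is not desirable, the same dictionary can probably be built without touching the frame. A minimal sketch, assuming the df from the question; the variable name combined is my own:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['E123', 'E456', 'E789'],
    'Name': [['Amy', 'A', 'Angu'], ['Jon', 'Johnson'], ['Bob', 'Barker']],
    'Other': [['Amy', 'Any', 'Anguish'], ['Jon', 'Jan'], ['Baker', 'barker']],
})

# Concatenate the two lists row by row; df itself is left unchanged.
combined = {
    key: name + other
    for key, name, other in zip(df['ID'], df['Name'], df['Other'])
}
print(combined)
```

This also shows why dict(zip(df.ID, df.Name, df.Other)) fails: dict() expects pairs, while a three-way zip yields triples, so the two list columns have to be merged into one value first.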

Related

column comprehension robust to missing values

I have only been able to create a two column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
columns=['id', 'value'])
What I would like to be able to do is using this basic format also initiate the dataframe with three columns: 'id', 'id2' and 'value'. I have a separate defined dict that contains the necessary look up info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
columns=['id', 'id2','value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup covers every key that can occur. For my purposes, putting it all together and placing 'N/A' or something similar wherever a key is missing would be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id': [1, 2, 3, 4],
'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
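Closer to the list comprehension in the question, dict.get with a default avoids the KeyError without a loop. A minimal sketch; the contents of output and id_lookup here are made-up stand-ins for the question's data:

```python
from collections import defaultdict
import pandas as pd

# Hypothetical stand-ins for the question's `output` and `id_lookup`.
output = defaultdict(int, {1: 10, 2: 20, 3: 30, 4: 40})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# dict.get returns the fallback value instead of raising KeyError.
df_mydata = pd.DataFrame(
    [(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
    columns=['id', 'id2', 'value'],
)
print(df_mydata)
```

Key 4 is missing from id_lookup, so its id2 entry becomes 'N/A' while the other rows get their looked-up value.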

Change a dataframe column value based on the current value?

I have a pandas dataframe with several columns and in one of them, there are string values. I need to change these strings to an acceptable value based on the current value. The dataframe is relatively large (40.000 x 32)
I've written a small function that takes the string to be changed as a parameter and looks up what it should be changed to.
df = pd.DataFrame({
'A': ['Script','Scrpt','MyScript','Sunday','Monday','qwerty'],
'B': ['Song','Blues','Rock','Classic','Whatever','Something']})
def lut(txt):
    my_lut = {'Script' : ['Script','Scrpt','MyScript'],
              'Weekday' : ['Sunday','Monday','Tuesday']}
    # Return the first category whose list contains txt
    for key, value in my_lut.items():
        if txt in value:
            return key
    return 'Unknown'
The desired output should be:
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Weekday   Whatever
5  Unknown  Something
I can't figure out how to apply this to the dataframe.
I've struggled over this for some time now so any input will be appreciated
Check this out:
import pandas as pd
df = pd.DataFrame({
'A': ['Script','Scrpt','MyScript','Sunday','sdfsd','qwerty'],
'B': ['Song','Blues','Rock','Classic','Whatever','Something']})
dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], 'Script': ['Script','Scrpt','MyScript']}
# Replace every known variant with its category name
for k, v in dic.items():
    for item in v:
        df.loc[df.A == item, 'A'] = k
# Anything that is still not a category name is unknown
df.loc[~df.A.isin(list(dic)), 'A'] = "Unknown"
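Output (printing df after running the code above):
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Unknown   Whatever
5  Unknown  Something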
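For a frame of 40,000 rows, a single vectorised lookup may be simpler than repeated .loc assignments. A minimal sketch using the question's data, assuming each variant belongs to exactly one category; variant_to_category is a name I made up:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
          'Weekday': ['Sunday', 'Monday', 'Tuesday']}

# Invert the table so that each variant points at its category.
variant_to_category = {
    variant: category
    for category, variants in my_lut.items()
    for variant in variants
}

# Values missing from the mapping become NaN, then 'Unknown'.
df['A'] = df['A'].map(variant_to_category).fillna('Unknown')
print(df)
```

The original lut function could also be applied directly with df['A'].apply(lut); the inverted dictionary just avoids scanning every list for every row.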

filter dataframe columns as you iterate through rows and create dictionary

I have the following table of data in a spreadsheet:
Name  Description  Value
foo   foobar       5
baz   foobaz       4
bar   foofoo       8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to json following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which returns:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']); however, this returns only the values, and I also want the keys.
How can I do this?
I wrote this entire thing only to realize that @anky_91 had already suggested it. Oh well...
import pandas as pd
data = {
"name": ["foo", "abc", "baz", "bar"],
"description": ["foobar", "foofoo", "foobaz", "foofoo"],
"value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')
rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict(
"records"
)
print(rec_dicts)
Output:
  name description  value
0  foo      foobar      5
1  abc      foofoo      3
2  baz      foobaz      4
3  bar      foofoo      8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary, you can delete the key you don't need:
row_dict = row.to_dict()
del row_dict['Value']
Now the dictionary has only Name and Description.
You can try this:
import io
import pandas as pd
s="""Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        # Select the wanted columns first, then convert to a dict
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried like this :
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the duplicate entries in each row and column of dataframe?
a,a,b,c should be a,b,c
From what I understand, you don't want to include values which repeat in a sequence, you can try with this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]  # keep a value only if it differs from the previous one
    return ','.join(res.values)
print(df.applymap(myfunc))
    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a
Another version of the function can be written with itertools.groupby, for example:
from itertools import groupby
def myfunc(x):
    l = [x[0] for x in groupby(x.split(','))]
    return ','.join(l)
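Put together as a runnable whole, the groupby approach might look like the sketch below. The helper name collapse_runs is my own, and it uses DataFrame.apply with Series.map instead of applymap, which newer pandas releases deprecate:

```python
from itertools import groupby
import pandas as pd

df = pd.DataFrame({'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']})

def collapse_runs(cell):
    # groupby yields one (key, group) pair per run of equal neighbours
    return ','.join(key for key, _ in groupby(cell.split(',')))

# Map the helper over every column, one Series at a time.
result = df.apply(lambda col: col.map(collapse_runs))
print(result)
```

Only consecutive repeats are collapsed, so 'a,b,b,a' keeps its trailing a, which matches the expected output in the question.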
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)       # remove duplicate list elements (a set does not keep their order)
    return ','.join(uniques)   # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns:
    col1 col2
0    a,b  a,b
1  a,c,b  a,b
Edit: This looks slightly different from your expected output. Why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)
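If installing more_itertools is not an option, dict.fromkeys from the standard library gives a similar order-preserving result. A minimal sketch; remove_dups_ordered is a hypothetical name, and like the set version it drops every repeat, not only consecutive ones:

```python
def remove_dups_ordered(string):
    parts = string.split(',')
    # dict keys are unique and keep insertion order (Python 3.7+)
    return ','.join(dict.fromkeys(parts))

print(remove_dups_ordered('a,c,c,b'))
print(remove_dups_ordered('a,b,b,a'))
```

For 'a,b,b,a' this returns 'a,b', so it does not reproduce the a,b,a from the question's expected output; the groupby or shift-based functions are needed for that.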

xlsxwriter - Conditional formatting based on column name of the dataframe

I have a dataframe as below. I want to apply conditional formatting on column "Data2" using the column name. I know how to define format for a specific column but I am not sure how to define it based on column name as shown below.
So basically I want to apply the same formatting by column name, because the order of the columns might change.
df1 = pd.DataFrame({'Data1': [10, 20, 30],
'Data2': ["a", "b", "c"]})
writer = pd.ExcelWriter('pandas_filter.xlsx', engine='xlsxwriter', )
workbook = writer.book
df1.to_excel(writer, sheet_name='Sheet1', index=False)
worksheet = writer.sheets['Sheet1']
blue = workbook.add_format({'bg_color':'#000080', 'font_color': 'white'})
red = workbook.add_format({'bg_color':'#E52935', 'font_color': 'white'})
l = ['B2:B500']
for columns in l:
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'a',
                                           'format': blue})
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'b',
                                           'format': red})
writer.close()  # writer.save() was removed in pandas 2.0
With xlsxwriter, xl_col_to_name returns the column letter for a zero-based column index:
from xlsxwriter.utility import xl_col_to_name
target_col = xl_col_to_name(df1.columns.get_loc("Data2"))
l = [f'{target_col}2:{target_col}500']
for columns in l:
    ...
With openpyxl, get_column_letter returns the column letter for a one-based column index:
from openpyxl.utils import get_column_letter
target_col = get_column_letter(df1.columns.get_loc("Data2") + 1) # add 1 because get_column_letter index start from 1
l = [f'{target_col}2:{target_col}500']
for columns in l:
    ...
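To make the index-to-letter step concrete, here is a dependency-free sketch of how the range string is derived from the column name. col_letter is a hypothetical stand-in I wrote for illustration; in real code the library helpers above are the better choice:

```python
import pandas as pd

def col_letter(index):
    # Hypothetical stand-in for xl_col_to_name: 0 -> 'A', 25 -> 'Z', 26 -> 'AA'
    letters = ''
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

df1 = pd.DataFrame({'Data1': [10, 20, 30], 'Data2': ['a', 'b', 'c']})

# get_loc gives the zero-based position of the column, whatever the order.
target_col = col_letter(df1.columns.get_loc('Data2'))
cell_range = f'{target_col}2:{target_col}500'
print(cell_range)
```

Because the letter is looked up from the column name each time, the range keeps pointing at Data2 even if the columns are reordered before writing.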
