Populating a pandas dataframe from an odd dictionary - python-3.x

I have a dictionary as follows:
{'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
I would like to come up with a dataframe like this:
+----+----------+--------+
| ID | header | body |
+----+----------+--------+
| 1 | header_1 | body_1 |
+----+----------+--------+
| 2 | header_1 | body_3 |
+----+----------+--------+
| 3 | header_1 | body_2 |
+----+----------+--------+
| 4 | header_2 | body_6 |
+----+----------+--------+
| 5 | header_2 | body_4 |
+----+----------+--------+
| 6 | header_2 | body_5 |
+----+----------+--------+
| 7 | header_4 | body_7 |
+----+----------+--------+
Where blank items (such as for the key header_10 in the dict above) would receive a value of None. I have tried a number of varieties for df.loc such as:
for header_name, body_list in all_unique.items():
for body_name in body_list:
metadata.loc[metadata.index[-1]] = [header_name, body_name]
To no avail. Surely there must be a quick way in Pandas to append rows and autoincrement the index? Something similar to the SQL INSERT INTO statement only using pythonic code?

Use dict comprehension for add Nones for empty lists and then flatten for list of tuples:
d = {'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
d = {k: v if bool(v) else [None] for k, v in d.items()}
data = [(k, y) for k, v in d.items() for y in v]
df = pd.DataFrame(data, columns= ['a','b'])
print (df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
Another solution:
data = []
for k, v in d.items():
if bool(v):
for y in v:
data.append((k, y))
else:
data.append((k, None))
df = pd.DataFrame(data, columns= ['a','b'])
print (df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None

If the dataset is too big, this solution would be slow, but it should still work.
for key in data.keys():
vals= data[key]
# Create temp df with data from a single key
t_df = pd.DataFrame({'header':[key]*len(vals),'body':vals})
# Append it to your full dataframe.
df = df.append(t_df)

This is another unnesting problem again
Borrow Jez's setting up for your d
d = {k: v if bool(v) else [None] for k, v in d.items()}
1st convert your dict into dataframe
df=pd.Series(d).reset_index()
df.columns
Out[204]: Index(['index', 0], dtype='object')
Then using this function in here
yourdf=unnesting(df,[0])
yourdf
Out[208]:
0 index
0 body_1 header_1
0 body_3 header_1
0 body_2 header_1
1 body_6 header_2
1 body_4 header_2
1 body_5 header_2
2 body_7 header_4
2 body_8 header_4
3 body_9 header_3
4 body_10 header_9
5 None header_10
def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')

Related

how to change values in a df specifying by index contain in multiple lists, and each list for one column

I have a list where I have all the index of values to be replaced. I have to change them in 8 diferent columns with 8 diferent lists. The replacement could be a simple string.
How can I do it?
I have more than 20 diferent columns in this df
Eg:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
Column A
Column B
marcos
peter
Julila
mike
Fran
Ramon
Pedri
Gavi
Olmo
Torres
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo | no data |`
Use DataFrame.loc with zipped lists and columns names:
list1 = [0,1,2]
list2 =[2,4]
L = [list1,list2]
cols = ['Column A','Column B']
sustitution = 'no data'
for c, i in zip(cols, L):
df.loc[i, c] = sustitution
print (df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
list1 = [0,1,2]
list2 = [2,4]
lists = [list1, list2]
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data

Pandas find max column, subtract from another column and replace the value

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row in column A,B,C, I want the find the max value and from it subtract column D and replace it.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14(the max out of 14,5,10), subtract column D from it (14-5 =9) and replace the result(replace initial value 14 with 9)
I know how to find the max value of A,B,C and from it subctract D, but I am stucked on the replacing part.
I tought on putting the result in another column called E, and then find again the max of A,B,C and replace with column E, but that would make no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
#Exmaple df
list_columns = ['A', 'B', 'C','D']
list_data = [ [14, 5, 10,5],[4, 7, 15,6],[100, 220, 6,7]]
df= pd.DataFrame(columns=list_columns, data=list_data)
#Calculate the max and subctract
df['e'] = df[['A', 'B']].max(axis=1) - df['D']
#To replace, maybe something like this. But this line makes no sense since it's backwards
df[['A', 'B','C']].max(axis=1) = df['D']
Use DataFrame.mask for replace only maximal value matched by compare all values of filtered columns with maximals:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7

I have pandas dataframe with 3 columns and want output like this

DataFrame of 3 Column
a b c
1 2 4
1 2 4
1 2 4
Want Output like this
a b c a+b a+c b+c a+b+c
1 2 4 3 5 6 7
1 2 4 3 5 6 7
1 2 4 3 5 6 7
Create all combinations with length 2 or more by columns and then assign sum:
from itertools import chain, combinations
#https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns)+1)))
for c in comb:
df[f'{"+".join(c)}'] = df.loc[:, c].sum(axis=1)
print (df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
You should always post your approach while asking a question. However, here it goes. This the easiest but probably not the most elegant way to solve it. For a more elegant approach, you should follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a" + "b" + "c"] = df["a"] + df["b"] + df["c"]
This gives you:
| | a | b | c | a+b | a+c | b+c | abc |
|---:|----:|----:|----:|------:|------:|------:|------:|
| 0 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 1 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
| 2 | 1 | 2 | 4 | 3 | 5 | 6 | 7 |

Filtering columns in a pandas dataframe

I have a dataframe with the following column.
A
55B
<lhggkkk>
66c
dggfhhjjjj
I need to filter the records which start with number(such as 55B and 66C) separately and the others separately. Can anyone please help?
Try:
import pandas as pd
df = pd.DataFrame()
df['A'] = ['55B','<lhggkkk>','66c','dggfhhjjjj']
df['B'] = df['A'].apply(lambda x:x[0].isdigit())
print(df)
A B
0 55B True
1 <lhggkkk> False
2 66c True
3 dggfhhjjjj False
Try to check if the first number is digit then boolen index i.e
mask = df['A'].str[0].str.isdigit()
one = df[mask]
two = df[~mask]
print(one,'\n',two)
A
0 55B
2 66c
A
1 <lhggkkk>
3 dggfhhjjjj
To check first string is digit or not:
df['A'].str[0].str.isdigit()
So:
import pandas as pd
import numpy as np
df:
-----------------
| A
-----------------
0 | 55B
1 | <lhggkkk>
2 | 66c
3 | dggfhhjjjj
df['Result'] = np.where(df['A'].str[0].str.isdigit(), 'Numbers', 'Others')
df:
----------------------------
| A | Result
----------------------------
0 | 55B | Numbers
1 | <lhggkkk> | Others
2 | 66c | Numbers
3 | dggfhhjjjj | Others

How to compare all values in a dictionary

I have a dictionary which has names as keys and numbers as values. I want to make a list with the values of the dictionary that are closer to each other. All values represent a cell in an imaginary 5x5 grid. So I want to check which 2 values are closer to each other in the grid.
Ex.
my_dict = {Mark:2, Luke:6, Ferdinand:10, Martin:20, Marvin: 22}
I would want to get the values Martin and Marvin because its values are closer to each other.
This will work for any size dictionary and get you the pair with the smallest value. It uses itertools to go through all combinations.
from itertools import combinations
my_dict = {'Mark': 2, 'Luke': 6, 'Ferdinand': 7, 'Martin': 20, 'Marvin': 22}
for value in combinations(my_dict.items(), 2):
current_diff = abs(value[0][1] - value[1][1])
pair_of_interest = (value[0][0], value[1][0])
try:
if current_diff < difference:
difference = current_diff
pair = pair_of_interest
except NameError:
difference = current_diff
pair = pair_of_interest
print("{0} and {1} have the smallest distance of {2}".format(pair[0], pair[1], difference))
I assume values in dictionary is like this.
+----+----+----+----+-------->x
| 1 | 2 | 3 | 4 | 5 |
+----+----+----+----+----+
| 6 | 7 | 8 | 9 | 10 |
+----+----+----+----+----+
| 11 | 12 | 13 | 14 | 15 |
+----+----+----+----+----+
| 16 | 17 | 18 | 19 | 20 |
+----+----+----+----+----+
| 21 | 22 | 23 | 24 | 25 |
+----+----+----+----+----+
|
v
y
ie:
1 => (y,x)=(0,0)
2 => (y,x)=(0,1)
...
24 => (y,x)=(4,3)
25 => (y,x)=(4,4)
source code:
import itertools
my_dict = {'Mark':2, 'Luke':6, 'Ferdinand':10, 'Martin':20, 'Marvin': 22}
val2vec = lambda v: (v/5, v%5)
name2vec = lambda name: val2vec(my_dict[name])
vec2dis2 = lambda vec1, vec2: (vec2[0] - vec1[0])**2 + (vec2[1] - vec1[1])**2 #Use 'math.sqrt' if you want.
for dis2, grp in sorted((vec2dis2(name2vec(name1), name2vec(name2)), (name1, name2)) for name1, name2 in itertools.combinations(my_dict.iterkeys(), 2)):
print str(grp).ljust(30), "distance^2 =", dis2
output:
('Luke', 'Ferdinand') distance^2 = 2
('Luke', 'Mark') distance^2 = 2
('Ferdinand', 'Martin') distance^2 = 4
('Martin', 'Marvin') distance^2 = 4
('Ferdinand', 'Mark') distance^2 = 8
('Ferdinand', 'Marvin') distance^2 = 8
('Luke', 'Martin') distance^2 = 10
('Luke', 'Marvin') distance^2 = 10
('Marvin', 'Mark') distance^2 = 16
('Martin', 'Mark') distance^2 = 20

Resources