I have a dictionary which has names as keys and numbers as values. I want to find the pair of keys whose values are closest to each other. All values represent cells in an imaginary 5x5 grid, so I want to check which two values are closest to each other in the grid.
Ex.
my_dict = {'Mark': 2, 'Luke': 6, 'Ferdinand': 10, 'Martin': 20, 'Marvin': 22}
I would want to get Martin and Marvin because their values are closest to each other.
This will work for a dictionary of any size and get you the pair with the smallest difference. It uses itertools.combinations to go through all pairs.
from itertools import combinations
my_dict = {'Mark': 2, 'Luke': 6, 'Ferdinand': 7, 'Martin': 20, 'Marvin': 22}
for value in combinations(my_dict.items(), 2):
    current_diff = abs(value[0][1] - value[1][1])
    pair_of_interest = (value[0][0], value[1][0])
    try:
        if current_diff < difference:
            difference = current_diff
            pair = pair_of_interest
    except NameError:  # first iteration: 'difference' does not exist yet
        difference = current_diff
        pair = pair_of_interest
print("{0} and {1} have the smallest distance of {2}".format(pair[0], pair[1], difference))
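As a side note, the same search can be written without the try/except bookkeeping by giving min() a key function; a sketch using the same sample dict:

```python
from itertools import combinations

my_dict = {'Mark': 2, 'Luke': 6, 'Ferdinand': 7, 'Martin': 20, 'Marvin': 22}

# min() with a key function finds the pair with the smallest value difference
(name1, v1), (name2, v2) = min(
    combinations(my_dict.items(), 2),
    key=lambda pair: abs(pair[0][1] - pair[1][1]),
)
print("{0} and {1} have the smallest distance of {2}".format(name1, name2, abs(v1 - v2)))
```

min() scans the same 10 combinations but keeps the tracking state internal, so no sentinel or exception handling is needed.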
I assume the values in the dictionary map onto the 5x5 grid like this:
+----+----+----+----+----+--> x
| 1 | 2 | 3 | 4 | 5 |
+----+----+----+----+----+
| 6 | 7 | 8 | 9 | 10 |
+----+----+----+----+----+
| 11 | 12 | 13 | 14 | 15 |
+----+----+----+----+----+
| 16 | 17 | 18 | 19 | 20 |
+----+----+----+----+----+
| 21 | 22 | 23 | 24 | 25 |
+----+----+----+----+----+
|
v
y
i.e.:
1 => (y,x)=(0,0)
2 => (y,x)=(0,1)
...
24 => (y,x)=(4,3)
25 => (y,x)=(4,4)
source code:
import itertools

my_dict = {'Mark': 2, 'Luke': 6, 'Ferdinand': 10, 'Martin': 20, 'Marvin': 22}
val2vec = lambda v: ((v - 1) // 5, (v - 1) % 5)  # cell number -> (y, x)
name2vec = lambda name: val2vec(my_dict[name])
vec2dis2 = lambda vec1, vec2: (vec2[0] - vec1[0])**2 + (vec2[1] - vec1[1])**2  # use 'math.sqrt' if you want the real distance
for dis2, grp in sorted((vec2dis2(name2vec(name1), name2vec(name2)), (name1, name2)) for name1, name2 in itertools.combinations(my_dict.keys(), 2)):
    print(str(grp).ljust(30), "distance^2 =", dis2)
output:
('Mark', 'Luke')               distance^2 = 2
('Ferdinand', 'Martin')        distance^2 = 4
('Luke', 'Marvin')             distance^2 = 10
('Mark', 'Ferdinand')          distance^2 = 10
('Martin', 'Marvin')           distance^2 = 10
('Luke', 'Ferdinand')          distance^2 = 16
('Mark', 'Marvin')             distance^2 = 16
('Ferdinand', 'Marvin')        distance^2 = 18
('Mark', 'Martin')             distance^2 = 18
('Luke', 'Martin')             distance^2 = 20
I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B, and C, subtract column D from it, and replace the original value with the result.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max of 14, 5, 10), subtract column D from it (14 - 5 = 9), and replace the result (replace the initial value 14 with 9).
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column called E, and then finding the max of A, B, C again and replacing it with column E, but that would make no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
# Example df
import pandas as pd

list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)

# Calculate the max and subtract
df['E'] = df[['A', 'B', 'C']].max(axis=1) - df['D']

# To replace, maybe something like this. But this line makes no sense since it's backwards
df[['A', 'B', 'C']].max(axis=1) = df['D']
Use DataFrame.mask to replace only the maximal value in each row, found by comparing all values of the filtered columns against the row-wise maximum:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7
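To see which cells the mask actually picks out, the intermediate boolean frame can be inspected; a small sketch rebuilding the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [14, 4, 100], 'B': [5, 7, 220],
                   'C': [10, 15, 6], 'D': [5, 6, 7]})

cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)

# True exactly where a cell holds its row's maximum
print(df[cols].eq(s, axis=0))

# mask() replaces the True cells with (max - D), leaving the rest untouched
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
```

Note that if two columns tie for the row maximum, both cells match and both get replaced; the question's data has no ties, so this does not matter here.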
I have a DataFrame of 3 columns:
a b c
1 2 4
1 2 4
1 2 4
I want output like this:
a b c a+b a+c b+c a+b+c
1 2 4 3 5 6 7
1 2 4 3 5 6 7
1 2 4 3 5 6 7
Create all combinations of length 2 or more from the columns, then assign the sums:
from itertools import chain, combinations
#https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns)+1)))
for c in comb:
    df["+".join(c)] = df.loc[:, list(c)].sum(axis=1)
print (df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
You should always post your approach when asking a question. However, here it goes. This is the easiest but probably not the most elegant way to solve it. For a more elegant approach, you should follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a+b+c"] = df["a"] + df["b"] + df["c"]
This gives you:
|    |   a |   b |   c |   a+b |   a+c |   b+c |   a+b+c |
|---:|----:|----:|----:|------:|------:|------:|--------:|
|  0 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
|  1 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
|  2 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
I convert my dataframe values to str, but when I concatenate them together, the values that were previously ints include trailing decimals.
df["newcol"] = df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
This is giving me output like 500.0. How can I get rid of this trailing decimal? Sometimes my data in column a will have non-alphanumeric characters.
+---------+---------+---------+------------------+----------------------+
| columna | columnb | columnc | expected | currently getting |
+---------+---------+---------+------------------+----------------------+
| | -1 | 27 | _-1_27 | _-1.0_27.0 |
| | -1 | 42 | _-1_42 | _-1.0_42.0 |
| | -1 | 67 | _-1_67 | _-1.0_67.0 |
| | -1 | 95 | _-1_95 | _-1.0_95.0 |
| 91_CCMS | 14638 | 91 | 91_CCMS_14638_91 | 91_CCMS_14638.0_91.0 |
| DIP96 | 1502 | 96 | DIP96_1502_96 | DIP96_1502.0_96.0 |
| 106 | 11694 | 106 | 106_11694_106 | 00106_11694.0_106.0 |
+---------+---------+---------+------------------+----------------------+
Error:
invalid literal for int() with base 10: ''
Edit:
If your df has more than 3 columns and you want to join only 3 of them, you can specify those columns in the command using column slicing. Assume your df has 5 columns named AA, BB, CC, DD, EE, and you want to join only columns CC, DD, EE. You just need to specify those 3 columns before the fillna and assign the result to newcol as you want:
df["newcol"] = df[['CC', 'DD', 'EE']].fillna('') \
.applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Note: I just broke the command into 2 lines using '\' for easier reading.
Original:
I guess your real data in columna, columnb, columnc contains str, float, int, empty strings, blanks, and maybe even NaN.
Floats whose decimal part is .00 in an object-dtype column should show without the decimal.
Assuming your df has only the 3 columns columna, columnb, columnc as you said, the command below handles str, float, int, and NaN, and joins the 3 columns into one as you want:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
I created a sample similar to yours:
columna columnb columnc
0 -1 27
1 NaN -1 42
2 -1 67
3 -1 95
4 91_CCMS 14638 91
5 DIP96 96
6 106 11694 106
Using your command returns the concatenated string containing '.0', as you described:
df['columna'].map(str) + '_' + df['columnb'].map(str) + '_' + df['columnc'].map(str)
Out[1926]:
0 _-1.0_27.0
1 nan_-1.0_42.0
2 _-1.0_67.0
3 _-1.0_95.0
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
Using my command:
df.fillna('').applymap(lambda x: x if isinstance(x, str) else str(int(x))).agg('_'.join, axis=1)
Out[1927]:
0 _-1_27
1 _-1_42
2 _-1_67
3 _-1_95
4 91_CCMS_14638_91
5 DIP96__96
6 106_11694_106
dtype: object
I couldn't reproduce this error, but maybe you could try something like:
df["newcol"] = df['columna'].map(lambda x: str(int(x)) if isinstance(x, int) else str(x)) + '_' + df['columnb'].map(lambda x: str(int(x))) + '_' + df['columnc'].map(lambda x: str(int(x)))
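Another possibility, if the '.0' only appears because the numeric columns are floats, is casting to pandas' nullable integer dtype ('Int64', available since pandas 0.24). A sketch on made-up data resembling the table above:

```python
import pandas as pd

# Made-up frame resembling the question's table
df = pd.DataFrame({
    'columna': ['', '91_CCMS', 'DIP96', '106'],
    'columnb': [-1.0, 14638.0, 1502.0, 11694.0],
    'columnc': [27.0, 91.0, 96.0, 106.0],
})

# Casting to the nullable 'Int64' dtype drops the '.0'
# while still allowing missing values in the column
for col in ['columnb', 'columnc']:
    df[col] = df[col].astype('Int64')

df['newcol'] = (df['columna'].astype(str) + '_'
                + df['columnb'].astype(str) + '_'
                + df['columnc'].astype(str))
```

One caveat: any actual NaN in an 'Int64' column renders as the string '<NA>' after astype(str), so truly missing numeric cells would still need a fillna step.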
I want to create a new column in a Python dataframe with specific requirements based on other columns. For example, my dataframe df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create a new column C with the below requirements:

1. When B = 0, then C = 0.
2. Rows with the same value in B get the same value in C. Each group of equal B values is classified into a start row, middle rows, and an end row: the group with value 1 has 1 start, 2 middle, and 1 end row, while the group with value 3 has 1 start, 0 middle, and 1 end row. I specify a threshold = 10. Using 0-based row indices, the calculation for each section of the group B = 1 is:
Start:
C.loc[1] = min(threshold, A.loc[0]) + A.loc[1]
Middle:
C.loc[2] = A.loc[2]
C.loc[3] = A.loc[3]
End:
C.loc[4] = min(threshold, A.loc[5])
The output value of C for every row of the group is the sum of the above calculations.
3. When the value of B is unique and not 0, both neighbours are clipped at the threshold. For example, when B = 4:
C.loc[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])

I can solve points 1 and 3, but I'm struggling to solve point 2.
So, the final output will be:
A | B | C
--------------------
5 | 0 | 0
5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5 | 3 | 25
15 | 3 | 25
10 | 4 | 20
20 | 0 | 0
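No solution code was posted for this one, but the grouping logic described above could be sketched like this (my own attempt; the group_value helper and the assumption that each non-zero B value forms one contiguous block of rows are mine, not from the question):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    "B": [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})
threshold = 10

def group_value(idx, df, threshold):
    """C for one block of consecutive rows sharing a B value."""
    first, last = idx[0], idx[-1]
    # neighbours just outside the block, clipped at the threshold
    before = min(threshold, df.at[first - 1, "A"]) if first > 0 else 0
    after = min(threshold, df.at[last + 1, "A"]) if last < len(df) - 1 else 0
    if len(idx) == 1:                         # requirement 3: unique B value
        return before + after
    start = before + df.at[first, "A"]        # start row
    middle = df.loc[idx[1:-1], "A"].sum()     # middle rows (may be empty)
    return start + middle + after             # end row contributes 'after'

df["C"] = 0                                   # requirement 1: B == 0 -> C == 0
for b, grp in df.groupby("B"):
    if b == 0:
        continue
    df.loc[grp.index, "C"] = group_value(list(grp.index), df, threshold)
print(df)
```

This reproduces the expected C values (45, 50, 25, 20) for the sample; if the same non-zero B value could appear in separate blocks, the groupby would need to be done on consecutive runs instead (e.g. with `df["B"].ne(df["B"].shift()).cumsum()`).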
I have a dictionary as follows:
{'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
I would like to come up with a dataframe like this:
+----+----------+--------+
| ID | header | body |
+----+----------+--------+
| 1 | header_1 | body_1 |
+----+----------+--------+
| 2 | header_1 | body_3 |
+----+----------+--------+
| 3 | header_1 | body_2 |
+----+----------+--------+
| 4 | header_2 | body_6 |
+----+----------+--------+
| 5 | header_2 | body_4 |
+----+----------+--------+
| 6 | header_2 | body_5 |
+----+----------+--------+
| 7 | header_4 | body_7 |
+----+----------+--------+
Where blank items (such as for the key header_10 in the dict above) would receive a value of None. I have tried a number of variations of df.loc, such as:
for header_name, body_list in all_unique.items():
    for body_name in body_list:
        metadata.loc[metadata.index[-1]] = [header_name, body_name]
To no avail. Surely there must be a quick way in Pandas to append rows and autoincrement the index? Something similar to the SQL INSERT INTO statement only using pythonic code?
Use a dict comprehension to add None for empty lists, then flatten to a list of tuples:
d = {'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
d = {k: v if bool(v) else [None] for k, v in d.items()}
data = [(k, y) for k, v in d.items() for y in v]
df = pd.DataFrame(data, columns= ['a','b'])
print (df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
Another solution:
data = []
for k, v in d.items():
    if bool(v):
        for y in v:
            data.append((k, y))
    else:
        data.append((k, None))
df = pd.DataFrame(data, columns=['a', 'b'])
print (df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
If the dataset is too big, this solution would be slow, but it should still work.
df = pd.DataFrame(columns=['header', 'body'])
for key in data.keys():
    vals = data[key]
    # Create temp df with data from a single key
    t_df = pd.DataFrame({'header': [key] * len(vals), 'body': vals})
    # Append it to your full dataframe (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, t_df], ignore_index=True)
This is another unnesting problem.
Borrowing jezrael's setup for your d:
d = {k: v if bool(v) else [None] for k, v in d.items()}
First, convert your dict into a dataframe:
df=pd.Series(d).reset_index()
df.columns
Out[204]: Index(['index', 0], dtype='object')
Then use the unnesting function defined at the end of this answer:
yourdf=unnesting(df,[0])
yourdf
Out[208]:
0 index
0 body_1 header_1
0 body_3 header_1
0 body_2 header_1
1 body_6 header_2
1 body_4 header_2
1 body_5 header_2
2 body_7 header_4
2 body_8 header_4
3 body_9 header_3
4 body_10 header_9
5 None header_10
import numpy as np

def unnesting(df, explode):
    # repeat each index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
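For what it's worth, on pandas 0.25 or later the same unnesting can be done without a helper via Series.explode; empty lists become NaN rows, which matches the None requirement:

```python
import pandas as pd

d = {'header_1': ['body_1', 'body_3', 'body_2'],
     'header_2': ['body_6', 'body_4', 'body_5'],
     'header_4': ['body_7', 'body_8'],
     'header_3': ['body_9'],
     'header_9': ['body_10'],
     'header_10': []}

# Series.explode turns each list element into its own row;
# an empty list yields a single NaN row
df = (pd.Series(d)
        .explode()
        .rename_axis('header')
        .reset_index(name='body'))
print(df)
```

rename_axis names the index before reset_index moves it into a regular column, giving the header/body layout the question asked for.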