I have the following data:
from io import StringIO
import pandas as pd
import collections
stg = """
target predictor value
10 predictor1 A
10 predictor1 C
10 predictor2 1
10 predictor2 2
10 predictor3 X
20 predictor1 A
20 predictor2 3
20 predictor3 Y
30 predictor1 B
30 predictor2 1
30 predictor3 X
40 predictor1 B
40 predictor2 2
40 predictor2 3
40 predictor3 X
40 predictor3 Y
50 predictor1 C
50 predictor2 3
50 predictor3 Y
60 predictor1 C
60 predictor2 4
60 predictor3 Z
"""
I've done this to get the list of predictors and values that have the same list of targets:
src = pd.read_csv(StringIO(stg), delim_whitespace=True, dtype=str)
grouped = src.groupby(["predictor","value"])['target'].apply(','.join).reset_index()
print(grouped)
predictor value target
0 predictor1 A 10,20
1 predictor1 B 30,40
2 predictor1 C 10,50,60
3 predictor2 1 10,30
4 predictor2 2 10,40
5 predictor2 3 20,40,50
6 predictor2 4 60
7 predictor3 X 10,30,40
8 predictor3 Y 20,40,50
9 predictor3 Z 60
From here, I ultimately want to create, for each list of targets, a list of named tuples representing the predictor and the value:
Predicate = collections.namedtuple('Predicate',('predictor', 'value'))
EDIT:
To clarify, I want to create a list of Predicates so that, in a separate process, I can iterate over them and construct query strings like so:
# target 10,20
data_frame.query('predictor1 == "A"')
# target 10,30
data_frame.query('predictor2 == "1"')
# target 10,30,40
data_frame.query('predictor3 == "X"')
# target 20,40,50
data_frame.query('predictor2 == "3" or predictor3 == "Y"')
I thought I'd use the target list to create lists of predictors and values, like so:
grouped_list = grouped.groupby('target').agg(lambda x: x.tolist())
print(grouped_list)
predictor value
target
10,20 [predictor1] [A]
10,30 [predictor2] [1]
10,30,40 [predictor3] [X]
10,40 [predictor2] [2]
10,50,60 [predictor1] [C]
20,40,50 [predictor2, predictor3] [3, Y]
30,40 [predictor1] [B]
60 [predictor2, predictor3] [4, Z]
This gives me two columns, each containing a list. I can iterate over these rows like so:
for index, row in grouped_list.iterrows():
    print("--------")
    for pred in row["predictor"]:
        print(pred)
But I can't see how to get from here to something like this (which does not work but hopefully illustrates what I mean):
for index, row in grouped_list.iterrows():
    Predicates = []
    for pred, val in row["predicate", "value"]:
        Predicates.append(Predicate(pred, val))
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2563, in get_value
    return libts.get_value_box(s, key)
  File "pandas/_libs/tslib.pyx", line 1018, in pandas._libs.tslib.get_value_box
  File "pandas/_libs/tslib.pyx", line 1026, in pandas._libs.tslib.get_value_box
TypeError: 'tuple' object cannot be interpreted as an integer
Any pointers would be greatly appreciated. I'm pretty new to Python and figuring things out step by step, so there may well be a far better way of achieving the above.
Cheers
David
I think you need a list comprehension:
L = [Predicate(x.predictor, x.value) for x in grouped.itertuples()]
print (L)
[Predicate(predictor='predictor1', value='A'),
Predicate(predictor='predictor1', value='B'),
Predicate(predictor='predictor1', value='C'),
Predicate(predictor='predictor2', value='1'),
Predicate(predictor='predictor2', value='2'),
Predicate(predictor='predictor2', value='3'),
Predicate(predictor='predictor2', value='4'),
Predicate(predictor='predictor3', value='X'),
Predicate(predictor='predictor3', value='Y'),
Predicate(predictor='predictor3', value='Z')]
EDIT:
d = {k: [Predicate(x.predictor, x.value) for x in v.itertuples()]
     for k, v in grouped.groupby('target')}
print (d)
{'10,30': [Predicate(predictor='predictor2', value='1')],
'30,40': [Predicate(predictor='predictor1', value='B')],
'20,40,50': [Predicate(predictor='predictor2', value='3'),
Predicate(predictor='predictor3', value='Y')],
'10,30,40': [Predicate(predictor='predictor3', value='X')],
'10,40': [Predicate(predictor='predictor2', value='2')],
'10,20': [Predicate(predictor='predictor1', value='A')],
'60': [Predicate(predictor='predictor2', value='4'),
Predicate(predictor='predictor3', value='Z')],
'10,50,60': [Predicate(predictor='predictor1', value='C')]}
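From there, building your query strings is mostly string formatting. A minimal sketch of what the separate process could do (the " or " join and the quoting convention are my assumptions, adapt as needed):
for target, predicates in d.items():
    # e.g. 'predictor2 == "3" or predictor3 == "Y"' for target '20,40,50'
    expr = " or ".join('{} == "{}"'.format(p.predictor, p.value) for p in predicates)
    print("# target", target)
    print("data_frame.query('{}')".format(expr))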
Related
Would you help me, please, to calculate all the variations of 10 factors (each with 15 values), grouped by 3?
We have 10 factors.
Each factor has 15 values, e.g. 1, 2, 3, 4, 5, 6, ..., 15.
All the possible combinations of the first triple of factors (e.g. factor1, factor2, factor3) are:
15 (factor1 values) x 15 (factor2 values) x 15 (factor3 values) = 3,375
This should be calculated for all the possible triplets among the 10 factors:
3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 = 59,049 possible combinations of 3 factors
(except duplicates like factor1, factor1, factor2)
As a result we have 59,049 combinations of 3 factors x 3,375 combinations of their values = 199 million records.
Desirable output:
1st place 2nd place 3rd place 1st place value 2nd place value 3rd place value
factor1 factor2 factor3 1 1 1
factor1 factor2 factor3 1 1 2
factor1 factor2 factor3 1 1 3
… … … … … …
factor8 factor9 factor10 15 15 15
Thank you for any hints on how to meet this goal.
Key to your question: the number of combinations "except duplicates" is simply a binomial coefficient, and the instances can be generated by itertools.product() or pandas.MultiIndex.from_product() (see also this answer).
Therefore, the exact number of (factor1, factor2, factor3) triples is binom(10, 3) = 120 instead of 3**10 = 59,049, and the total number of rows is thus 120 * 3,375 = 405,000.
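A quick sanity check of these counts (a sketch; math.comb requires Python 3.8+, scipy.special.comb works on older versions):
from math import comb

n_factors, k, n_vals = 10, 3, 15
print(comb(n_factors, k))                  # 120 distinct factor triples
print(n_vals ** k)                         # 3375 value combinations per triple
print(comb(n_factors, k) * n_vals ** k)    # 405000 rows in total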
Solution:
I parameterized all the numbers just to make the mathematical logic clear. In addition, this solution can be applied to a varying number of values per factor by recalculating comb_facs accordingly.
import pandas as pd
import numpy as np
import itertools
from scipy.special import comb

# data and parameters
n_cols = 10
k_cols = 3   # binomial coefficient (n k)
n_vals = 15  # 15 values per factor

dic = {}
for i in range(1, n_cols + 1):
    dic[f"f{i}"] = np.array([j for j in range(1, 1 + n_vals)], dtype=object)
df = pd.DataFrame(dic)

# preallocate the output arrays: factors and values
comb_cols = comb(n_cols, k_cols)   # binom(10, 3) = 120
comb_facs = int(n_vals ** k_cols)  # NOTE: must recalculate if the number of values is not constant
total_len = int(comb_cols * comb_facs)
factors = np.zeros((total_len, k_cols), dtype=object)
values = np.zeros((total_len, k_cols), dtype=int)

# the actual iteration
for i, tup in enumerate(itertools.combinations(df.columns, k_cols)):
    # 1. Cartesian product of (facA, facB, facC);
    #    list(itertools.product()) would work as well
    vals = pd.MultiIndex.from_product(
        [df[col].values for col in tup]  # df.f1, df.f2, df.f3
    )
    arr_vals = pd.DataFrame(index=vals).reset_index().values
    # 2. Populate factor names and values into the output arrays
    factors[i * comb_facs:(i + 1) * comb_facs, :] = tup  # broadcasting
    values[i * comb_facs:(i + 1) * comb_facs, :] = arr_vals

# result
pd.concat([pd.DataFrame(factors, columns=["1p fac", "2p fac", "3p fac"]),
           pd.DataFrame(values, columns=["1p val", "2p val", "3p val"])], axis=1)
Out[41]:
1p fac 2p fac 3p fac 1p val 2p val 3p val
0 f1 f2 f3 1 1 1
1 f1 f2 f3 1 1 2
2 f1 f2 f3 1 1 3
3 f1 f2 f3 1 1 4
4 f1 f2 f3 1 1 5
... ... ... ... ... ...
404995 f8 f9 f10 15 15 11
404996 f8 f9 f10 15 15 12
404997 f8 f9 f10 15 15 13
404998 f8 f9 f10 15 15 14
404999 f8 f9 f10 15 15 15
[405000 rows x 6 columns]
I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad, but still seems less than ideal):
for i, col in enumerate(df.columns):
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way is to use pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
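If you want to avoid apply altogether, here is a fully vectorised sketch. Note it uses the original bounds pair (before the dict(zip(...)) rebinding above) and strict inequalities as in the question, whereas between() is inclusive by default:
lo, hi = bounds  # the original ([lows], [highs]) pair
mask = df.gt(lo).all(axis=1) & df.lt(hi).all(axis=1)  # each list is broadcast across the columns
new_df = df[mask]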
I have 3 lists of the same size. I tried to loop through them with a for loop, but I always get the following error. My question is how to do parallel iteration within a for loop.
a = [...]
b = [...]
c = [...]
arrayList = [a, b, c]
for x, y, z in a, b, c:
    # do something
or
for x, y, z in arrayList:
    # do something
The error is:
ValueError: too many values to unpack (expected 3)
You should probably use zip(), which creates tuples of same-index elements from the given collections:
>>> xs = [1, 2, 3, 4]
>>> ys = [5, 6, 7, 8]
>>> zs = [9, 10, 11, 12]
>>> for x, y, z in zip(xs, ys, zs):
...     print(x, y, z)
the output here is:
1 5 9
2 6 10
3 7 11
4 8 12
>>>
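Note that zip() stops at the shortest input, which is fine here since your lists are the same size. If they could differ in length, itertools.zip_longest pads the shorter ones instead:
>>> from itertools import zip_longest
>>> for x, y, z in zip_longest([1, 2], [5, 6, 7], [9]):
...     print(x, y, z)
1 5 9
2 6 None
None 7 None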
I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the var and val columns combined and converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
maximum = 0
for column in df['var'].unique():
    for value in df['val'].unique():
        if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
            maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
        print(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))

print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results, I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe, right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique. I don't understand why this happens, but I suddenly have 3000 unique ids, while there are only 92 unique var/val combinations.
I don't know exactly why; maybe the lambda uses a fixed-width int type rather than Python's arbitrary-precision int.
I have a workaround that may be useful for you: convert the result to a string (object dtype):
df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
Edit:
After reading the link above, I assume that when you use a plain loop you are using the Python int type, while a DataFrame uses the types that numpy provides, and numpy's integer types are fixed-width. So I think the correct way to work with integers this large is to use the object dtype.
That's my conclusion, but maybe I am wrong.
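A quick check supports this: the id for 'clump_thickness' with value 2 (the same id shown in the example below) is far larger than the biggest fixed-width unsigned integer numpy offers. A minimal sketch:
import numpy as np

big = int.from_bytes('clump_thickness2'.encode(), 'little')
print(big)                             # 67060854441299308307031611503233297507
print(np.iinfo(np.uint64).max)         # 18446744073709551615
print(big > np.iinfo(np.uint64).max)   # True, hence the OverflowError inside apply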
Edit, regarding the second question:
A simple example shows that identical var/val combinations do get identical ids:
d2 = {'val': [2, 1, 1, 2],
      'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507
I have a dataset with one row per loan, and a borrower can have multiple loans. The 'Property' flag shows whether there is any security behind the loan. I am trying to aggregate this flag at the borrower level: if any of a borrower's Property flags is 'Y', I want an additional column that is 'Y' for all of that borrower's rows.
The short example below shows what the end result should look like. Any help would be appreciated.
import pandas as pd
data = {'Borrower': [1, 2, 2, 2, 3, 3, 4, 5, 6, 6],
        'Loan': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Property': ["Y", "N", "Y", "Y", "N", "Y", "N", "Y", "N", "N"],
        'Result': ['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'N']}
df = pd.DataFrame.from_dict(data)
You can use transform on Property after grouping by Borrower. Because the ASCII code of 'Y' is greater than that of 'N', if any Property value for a borrower is 'Y', max(Property) will give 'Y'.
df['Result2'] = df.groupby('Borrower')['Property'].transform(max)
df
Out[202]:
Borrower Loan Property Result Result2
0 1 1 Y Y Y
1 2 2 N Y Y
2 2 3 Y Y Y
3 2 4 Y Y Y
4 3 5 N Y Y
5 3 6 Y Y Y
6 4 7 N N N
7 5 8 Y Y Y
8 6 9 N N N
9 6 10 N N N
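If you would rather not rely on the alphabetical ordering of 'Y' and 'N', an equivalent sketch that makes the intent explicit:
# 'Y' for every row of a borrower that has at least one 'Y' loan
df['Result2'] = df.groupby('Borrower')['Property'].transform(lambda s: 'Y' if s.eq('Y').any() else 'N')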