Generate df with all possible combinations of values - python-3.x

I have 3 columns (A, B, C) whose values can vary from 0 to 100 in increments of 0.1. How do I generate a df with all possible combinations of the values of these columns? For example:
A B C
0 0 0
0 0 0.01
0 0 0.02
… … … and so on

Edit: it's not combinations but a Cartesian product.
You could use itertools.product:
import itertools
import pandas as pd
import numpy as np
# Note: the full 0-100 range in 0.1 steps would give 1001**3 (about 10**9)
# rows, so a smaller range is used here. Built-in range does not accept
# float steps, but np.arange does.
my_range = np.arange(0, 1.01, 0.01)
combinations = itertools.product(my_range, repeat=3)
df = pd.DataFrame(combinations, columns=['A', 'B', 'C'])
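If you prefer to stay inside pandas, here is a sketch of the same Cartesian product via MultiIndex.from_product (column names taken from the question):
import numpy as np
import pandas as pd
my_range = np.arange(0, 1.01, 0.01)
# from_product builds the product of the three ranges lazily;
# to_frame(index=False) turns the index levels into ordinary columns.
df = pd.MultiIndex.from_product([my_range] * 3, names=['A', 'B', 'C']).to_frame(index=False)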

Related

How to change one column of data to multiple column based on row using Python Pandas?

I don't know if I put the question correctly.
For example, I want
1
0
1
0
1
0
1
0
change into
1 0 1
0 1 0
1 0 x
The first list should not be changed, and the type should become a DataFrame.
I tried using numpy.array: flatten the array and reshape it to columns using reshape(-1,3).T,
but since there are some missing values, I cannot reshape the array properly.
A possible solution is to add the missing values to the array before reshaping.
Starting point:
import numpy as np
import pandas as pd
# I assume you flattened the array.
data = np.array([1, 0, 1, 0, 1, 0, 1, 0])
Adding the new data based on the required shape and fill value:
new_shape = (3, 3)
fill_value = np.nan
# The reshaped array needs np.prod(new_shape) cells in total.
missing_length = np.prod(new_shape) - data.size
missing_array = np.full(missing_length, fill_value)
data = np.hstack([data, missing_array])
Then apply the reshape and convert it to a dataframe:
data = data.reshape(new_shape)
df = pd.DataFrame(data)
print(df)
output:
0 1 2
0 1.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 NaN
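As a hedged aside, numpy's pad can do the same fill in one call; note the cast to float first, since NaN cannot be stored in an integer array:
import numpy as np
import pandas as pd
data = np.array([1, 0, 1, 0, 1, 0, 1, 0])
new_shape = (3, 3)
missing_length = np.prod(new_shape) - data.size
# Pad nothing at the front and missing_length NaNs at the end, then reshape.
padded = np.pad(data.astype(float), (0, missing_length), constant_values=np.nan)
df = pd.DataFrame(padded.reshape(new_shape))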

Remove a character from a pandas dataframe columns

I have a dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
df
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
I want to remove all 0's immediately after the character 'L'.
My expected output:
col1
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
In [114]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
...: df
Out[114]:
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
In [115]: df.col1.str.replace(r"L([0]*)", "L", regex=True)
Out[115]:
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
Name: col1, dtype: object
Pandas string replace suffices for this. The code below uses a lookbehind to match one or more 0s immediately preceded by L and replaces them with an empty string (pass regex=True explicitly, since pandas 2.0 changed the default to literal replacement):
df.col1.str.replace(r"(?<=L)0+", "", regex=True)
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
If you need more speed, you can drop into plain Python with a list comprehension, pre-compiling the pattern so it is parsed only once:
import re
pattern = re.compile(r"(?<=L)0+")
df["cleaned"] = [pattern.sub("", entry) for entry in df.col1]
df
col1 cleaned
0 AA_L8_ZZ AA_L8_ZZ
1 AA_L08_YY AA_L8_YY
2 AA_L800_XX AA_L800_XX
3 AA_L0008_CC AA_L8_CC

Operations on selective rows on pandas dataframe

I have a dataframe with phone calls, some of which have zero duration. I want to replace those zeros with random int values ranging from 0 to 7, but every attempt I make leads to errors or data loss.
I wrote function:
def calls_new(dur):
    dur = random.randint(0, 7)
    return dur
and I tried to use it like this (one of these lines):
df_calls['duration'] = df_calls['duration'].apply(lambda row: x = random.randint(0,7) if x == 0 )
df_calls['duration'] = df_calls['duration'].where(df_calls['duration'] == 0, df_calls.apply(calls_new))
df_calls['duration'] = df_calls[df_calls['duration']==0].apply(calls_new)
Use .loc to set the values only where duration is 0. You can generate all of the random numbers up front and assign them in one step. If you want 7 to be possible, the upper bound of randint needs to be 8, since the docs state that high is one above the largest integer to be drawn.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'duration': [0,10,20,0,15,0,0,211]})
m = df['duration'].eq(0)
df.loc[m, 'duration'] = np.random.randint(0, 8, m.sum())  # m.sum() = how many numbers are needed
print(df)
duration
0 4
1 10
2 20
3 7
4 15
5 6
6 2
7 211
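A hedged one-liner alternative: Series.mask substitutes values wherever the condition is True, drawing here from a pre-generated random array of the same length (surplus numbers are simply ignored):
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'duration': [0, 10, 20, 0, 15, 0, 0, 211]})
# Keep values where the condition is False; substitute from the array where True.
df['duration'] = df['duration'].mask(df['duration'].eq(0), np.random.randint(0, 8, len(df)))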

How to draw venn diagram from a dummy variable in Python Matplotlib_venn?

I have the following code to draw the venn diagram.
import numpy as np
import pandas as pd
import matplotlib_venn as vplt
x = np.random.randint(2, size=(10,3))
df = pd.DataFrame(x, columns=['A', 'B','C'])
print(df)
v = vplt.venn3(subsets=(1,1,1,1,1,1,1))
and the output is a generic three-set Venn diagram with every region labelled 1 (image from the original post not reproduced here).
What I actually want is to compute the numbers for subsets() from the dataset. How can I do that, or is there another easy way to make these Venn diagrams directly from the dataset?
I also want to draw a box around the diagram and annotate the remaining area as the people for whom A, B and C are all 0. Then I want to calculate the percentage of people in each circle and use it as the label. I am not sure how to achieve this.
Background of the Problem:
I have a dataset of more than 500 observations, and these three columns are recorded from one question where multiple choices can be selected as answers.
I want to visualize how many people chose the 1st, 2nd, etc., as well as how many chose both the 1st and 2nd, the 1st and 3rd, and so on.
Use numpy.argwhere to get the indices of the 1s in each column, turn them into sets, and pass the resulting sets to venn3:
In [85]: df
Out[85]:
A B C
0 0 1 1
1 1 1 0
2 1 1 0
3 0 0 1
4 1 1 0
5 1 1 0
6 0 0 0
7 0 0 0
8 1 1 0
9 1 0 0
In [86]: from matplotlib_venn import venn3
    ...: import matplotlib.pyplot as plt
    ...: sets = [set(np.argwhere(v).ravel()) for k, v in df.items()]
    ...: venn3(sets, df.columns)
    ...: plt.show()
Note: if you want to draw an additional box annotated with the number of items that fall in none of the categories, add these lines:
In [87]: ax = plt.gca()
In [88]: xmin, _, ymin, _ = ax.axes.axis('on')
In [89]: ax.text(xmin, ymin, (df == 0).all(1).sum(), ha='left', va='bottom')
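If you want the explicit numbers for the subsets= argument instead, here is a sketch that counts the seven regions directly from the dummy columns, in the order venn3 documents (A, B, A∩B, C, A∩C, B∩C, A∩B∩C):
from matplotlib_venn import venn3
a, b, c = (df[col].astype(bool) for col in ['A', 'B', 'C'])
# Count each of the seven regions of the diagram.
subsets = (
    (a & ~b & ~c).sum(),   # only A
    (~a & b & ~c).sum(),   # only B
    (a & b & ~c).sum(),    # A and B, not C
    (~a & ~b & c).sum(),   # only C
    (a & ~b & c).sum(),    # A and C, not B
    (~a & b & c).sum(),    # B and C, not A
    (a & b & c).sum(),     # all three
)
venn3(subsets=subsets, set_labels=('A', 'B', 'C'))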

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1, df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    # df == df is False exactly where a value is NaN, so these masks
    # enumerate every NaN / non-NaN combination of the two frames.
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).max(level=0)
#
# 0 1 2
# 0 3.0 2.0 3.0
#
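Note that DataFrame.max(level=...) was removed in pandas 2.0; the equivalent there (a hedged update of the snippet above) groups rows by index label:
# Same result in pandas 2.0+: rows sharing an index label are grouped,
# and max() skips NaN by default.
pd.concat([A, B]).groupby(level=0).max()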
You can use where to test your df against another df: where the condition is True the values from df are returned, and where it is False the values from df1 are returned. Additionally, where df1 itself contains NaN, an extra call to fillna(df) uses the values from df to fill those NaN and returns the desired df:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.nan
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.nan
print(df1)
0 1 2
0 2.671118 1.412880 1.666041
1 -0.281660 1.187589 NaN
2 -0.067425 0.850808 1.461418
3 -0.447670 0.307405 1.038676
4 -0.130232 -0.171420 1.192321
0 1 2
0 NaN -0.244273 -1.963712
1 -0.043011 -1.588891 0.784695
2 1.094911 0.894044 -0.320710
3 -1.537153 0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
0 1 2
0 2.671118 1.412880 1.666041
1 -0.043011 1.187589 0.784695
2 1.094911 0.894044 1.461418
3 -0.447670 0.558547 1.038676
4 -0.130232 -0.171420 1.192321
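A further option (a hedged sketch, not from the original answers): numpy's fmax ufunc takes the element-wise maximum and returns the non-NaN operand when exactly one side is NaN, which matches the requirement in a single call:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a': [1.0, np.nan, 5.0], 'b': [2.0, 3.0, np.nan]})
df2 = pd.DataFrame({'a': [4.0, 2.0, np.nan], 'b': [np.nan, 9.0, 1.0]})
# fmax is NaN only where both inputs are NaN; applied to DataFrames it
# returns a DataFrame with the same index and columns.
result = np.fmax(df1, df2)
print(result)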
