Pandas: Random integer between values in two columns - python-3.x

How can I create a new column containing a random integer between the values of two other columns in a particular row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Result:
   start  end
0      1   10
1      2   20
2      3   30
3      4   40
4      5   50
5      6   60
6      7   70
7      8   80
8      9   90
9     10  100
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But of course it doesn't work, because data.start is a Series, not a number.
How can I use numpy.random with data from the columns as a vectorized operation?

You are close; you need to specify axis=1 to process the data by rows, and to change data.start/data.end to x.start/x.end so that you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s, e in zip(data['start'], data['end'])]
print(data)
   start  end  rand_between
0      1   10             8
1      2   20             3
2      3   30            23
3      4   40            35
4      5   50            30
5      6   60            28
6      7   70            60
7      8   80            14
8      9   90            85
9     10  100            83

If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it with your start/end values:
(
    data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0     1
1    18
2    18
3    35
4    22
5    27
6    35
7    23
8    33
9    81
dtype: int64
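One caveat: because of the + 1, the expression above can return the end value itself, while np.random.randint(start, end) excludes the upper bound; drop the + 1 if you need identical semantics.
If your NumPy is recent enough, another fully vectorized option is the Generator API, which broadcasts array-like low/high arguments. A minimal sketch, assuming NumPy >= 1.17 and the same dataframe as above:
import numpy as np
import pandas as pd

data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Generator.integers accepts array-like low/high and, like randint,
# excludes the upper bound by default.
rng = np.random.default_rng()
data['rand_between'] = rng.integers(data['start'], data['end'])
print(data)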

Related

pandas dataframe slicing to a subset from row #y1 to row #y2

I can't see the forest for the trees right now:
I have a Pandas dataframe:
import pandas as pd
df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
                   'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
                   'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})
print(df)
    UTCs  Temperature  Humidity  pressure
0  32776            5        50       998
1  32777            7        50       998
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
7  32783            4        52      1000
Now I want to create a subset of all dataframe columns for UTCs between 32778 and 32782.
I can choose a subset with:
df_sub=df.iloc[2:7,:]
print(df_sub)
    UTCs  Temperature  Humidity  pressure
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
But how can I do that with a condition like 'choose the rows between UTCs=32778 and UTCs=32782'?
Something like
df_sub = df.iloc[df[df.UTCs == 32778] : df[df.UTCs == 32783], : ]
does not work.
Any hint for me?
Use between for boolean indexing:
df_sub = df[df['UTCs'].between(32778, 32783, inclusive='left')]
Output:
    UTCs  Temperature  Humidity  pressure
2  32778            7        48       999
3  32779            9        47       999
4  32780           12        46       999
5  32781            9        47       999
6  32782            9        48      1000
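If your pandas version does not support the string options for inclusive (they were added in pandas 1.3), an explicit boolean mask gives the same result; a small sketch using the df from the question:
# keep the rows where 32778 <= UTCs <= 32782
df_sub = df[(df['UTCs'] >= 32778) & (df['UTCs'] <= 32782)]
print(df_sub)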

Pandas: apply list of functions on columns, one function per column

Setting: for a dataframe with 10 columns, I have a list of 10 functions which I wish to apply in a function1(column1), function2(column2), ..., function10(column10) fashion. I have looked into pandas.DataFrame.apply and pandas.DataFrame.transform, but they seem to broadcast and apply each function to all columns.
IIUC, with zip and a for loop:
Example
def function1(x):
    return x + 1

def function2(x):
    return x * 2

def function3(x):
    return x**2

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]

print(df)
#    A  B  C
# 0  1  1  1
# 1  2  2  2
# 2  3  3  3

for col, func in zip(df, functions):
    df[col] = df[col].apply(func)

print(df)
#    A  B  C
# 0  2  2  1
# 1  3  4  4
# 2  4  6  9
You could do something like:
# list containing your functions, e.g. [function1, function2, ..., function10]
fun_list = []
# assume df is your dataframe
for i, fun in enumerate(fun_list):
    df.iloc[:, i] = fun(df.iloc[:, i])
You can also map your N functions onto each row by using a lambda that returns a Series with your operations; check the following code:
import pandas as pd
matrix = [(22, 34, 23), (33, 31, 11), (44, 16, 21), (55, 32, 22), (66, 33, 27),
          (77, 35, 11)]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abcdef'))
Will produce:
    x   y   z
a  22  34  23
b  33  31  11
c  44  16  21
d  55  32  22
e  66  33  27
f  77  35  11
and then:
res_df = df.apply(lambda row: pd.Series([row[0] + 1, row[1] + 2, row[2] + 3]), axis=1)
will give you:
    0   1   2
a  23  36  26
b  34  33  14
c  45  18  24
d  56  34  25
e  67  35  30
f  78  37  14
You can also simply apply a function to a specific column:
df['x'] = df['x'].apply(lambda x: x*2)
Similar to @Chris Adams's answer, but it makes a copy of the dataframe using a dictionary comprehension and zip.
def function1(x):
    return x + 1

def function2(x):
    return x * 2

def function3(x):
    return x**2

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]

print(df)
#    A  B  C
# 0  1  1  1
# 1  2  2  2
# 2  3  3  3

df_2 = pd.DataFrame({col: func(df[col]) for col, func in zip(df, functions)})

print(df_2)
#    A  B  C
# 0  2  2  1
# 1  3  4  4
# 2  4  6  9
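A related option, if you want the column-to-function pairing to be explicit rather than relying on column order, is to pass a dict of column -> function to DataFrame.transform; a short sketch with the same three functions written as lambdas:
import pandas as pd

# explicit column -> function mapping, applied column-wise
func_map = {'A': lambda x: x + 1, 'B': lambda x: x * 2, 'C': lambda x: x**2}

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
df_3 = df.transform(func_map)
print(df_3)
#    A  B  C
# 0  2  2  1
# 1  3  4  4
# 2  4  6  9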

How to do element wise operation of Pandas Series to get a new DataFrame

I have a Pandas Series of numbers, let's say [1, 2, 3, 4, 5]. I want an efficient way to create a dataframe with each combination of series elements passed through a function, the result being the corresponding dataframe element.
Let's say the function is
def f1(a, b):
    return a*b + 5
then I want the dataframe to look like the one below, with each cell being the result of the function call on the corresponding combination of series elements.
    1   2   3   4   5
1   6   7   8   9  10
2   7   9  11  13  15
3   8  11  14  17  20
4   9  13  17  21  25
5  10  15  20  25  30
The series can sometimes be as much as 500 elements.
Thanks in advance!
You can do:
def f1(a, b):
    return a*b + 5

s = [1, 2, 3, 4, 5]
df = pd.DataFrame(columns=s, index=s)
df = df.apply(lambda x: f1(x.name, x.index))
Output:
    1   2   3   4   5
1   6   7   8   9  10
2   7   9  11  13  15
3   8  11  14  17  20
4   9  13  17  21  25
5  10  15  20  25  30
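If f1 only uses operations that already work on arrays (as a*b + 5 does), a fully vectorized alternative is to broadcast the series values against each other with NumPy instead of calling the function once per row; a sketch under that assumption:
import numpy as np
import pandas as pd

s = [1, 2, 3, 4, 5]
vals = np.asarray(s)

# Broadcasting an (n, 1) column against a (1, n) row gives the full
# n x n grid, so cell (i, j) holds s[i] * s[j] + 5.
res = pd.DataFrame(vals[:, None] * vals[None, :] + 5, index=s, columns=s)
print(res)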
Using pandas:
import pandas as pd

# create pandas dataframe with one column "col_" with data
df = pd.DataFrame({'col_': list(range(1, 6))})
print(df['col_'])  # this is the data you provided above

# create a new column based on a function
b = 1  # constant
df['1_'] = df['col_'].apply(lambda x: x*b + 5)
# and another column
df['2_'] = df['col_'].apply(lambda x: x*(b + b) + 5)

How to use two for loops to copy a list variable into a location variable in a dataframe?

I have a dataframe with two columns, locStuff and data. Someone was kind enough to show me how to index a location range in the df so that it correctly addresses the data by the integer attached to locStuff instead of by the dataframe index, and that works fine. Now I cannot see how to change the data values of that location range with a list of values.
import pandas as pd
INDEX = list(range(1, 11))
LOCATIONS = [3, 10, 6, 2, 9, 1, 7, 5, 8, 4]
DATA = [94, 43, 85, 10, 81, 57, 88, 11, 35, 86]
# Make dataframe
DF = pd.DataFrame(LOCATIONS, columns=['locStuff'], index=INDEX)
DF['data'] = pd.Series(DATA, index=INDEX)
# Location and new value inputs
LOC_TO_CHANGE = 8
NEW_LOC_VALUE = 999
NEW_LOC_VALUE = [999,666,333]
LOC_RANGE = list(range(3, 6))
DF.iloc[3:6, 1] = ('%03d' % NEW_LOC_VALUE)
print(DF)
#I TRIED BOTH OF THESE SEPARATELY
for i in NEW_LOC_VALUE:
    for j in LOC_RANGE:
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
print(DF)

i=0
while i<len(NEW_LOC_VALUE):
    for j in LOC_RANGE:
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
    i=+1
print(DF)
Neither of these works.
I know how to do this using loops or list comprehensions for an empty list, but I have no idea how to adapt what I have above for a DataFrame.
Expected behaviour would be:
    locStuff  data
 1         3   999
 2        10    43
 3         6    85
 4         2    10
 5         9    81
 6         1    57
 7         7    88
 8         5   333
 9         8    35
10         4   666
Try setting locStuff as the index, assigning the values, and then resetting the index:
DF.set_index('locStuff', inplace=True)
DF.loc[LOC_RANGE, 'data'] = NEW_LOC_VALUE
DF.reset_index(inplace=True)
Output:
   locStuff  data
0         3   999
1        10    43
2         6    85
3         2    10
4         9    81
5         1    57
6         7    88
7         5   333
8         8    35
9         4   666
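An alternative that leaves the index untouched is to build a locStuff -> new value mapping and assign through a boolean mask; a sketch reusing the variables from the question, before any re-indexing:
# map each target locStuff label to its new value, then assign row by row
mapping = dict(zip(LOC_RANGE, NEW_LOC_VALUE))   # {3: 999, 4: 666, 5: 333}
mask = DF['locStuff'].isin(LOC_RANGE)
DF.loc[mask, 'data'] = DF.loc[mask, 'locStuff'].map(mapping)
print(DF)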

Can I create a dataframe from few 1d arrays as columns?

Is it possible to create a dataframe from a few 1d arrays and place them as columns?
If I create a dataframe from a single 1d array, everything is OK:
arr1 = np.array([11, 12, 13, 14, 15])
arr1_arr2_df = pd.DataFrame(data=arr1, index=None, columns=None)
arr1_arr2_df
Out:
    0
0  11
1  12
2  13
3  14
4  15
But if I make a dataframe from 2 arrays, they are placed as rows:
arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])
arr1_arr2_df = pd.DataFrame(data=(arr1,arr2), index=None, columns=None)
arr1_arr2_df
Out:
    0   1   2   3   4
0  11  12  13  14  15
1  21  22  23  24  25
I know that I can achieve it by using transpose:
arr1_arr2_df = arr1_arr2_df.transpose()
arr1_arr2_df
Out:
    0   1
0  11  21
1  12  22
2  13  23
3  14  24
4  15  25
But is it possible to get it from the start?
You can use a dictionary:
arr1_arr2_df = pd.DataFrame(data={0: arr1, 1: arr2})
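If you would rather keep working with the plain arrays, stacking them as columns with NumPy before handing them to pandas also works; a small sketch:
import numpy as np
import pandas as pd

arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])

# np.column_stack treats each 1d array as one column of the result
arr1_arr2_df = pd.DataFrame(np.column_stack((arr1, arr2)))
print(arr1_arr2_df)
#     0   1
# 0  11  21
# 1  12  22
# 2  13  23
# 3  14  24
# 4  15  25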
