How to specify a random seed while using Python's numpy random choice? - python-3.x

I have a list of four strings. In a Pandas dataframe I want to create a column by randomly selecting a value from this list for each row. I am using numpy's random choice, but reading its documentation, there is no seed option. How can I specify a random seed so that the random assignment is the same every time?
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = [np.random.choice(service_code_options) for i in df.index]

You need to define the seed beforehand with numpy.random.seed. Also, the list comprehension is not necessary, because numpy.random.choice accepts a size parameter:
np.random.seed(123)
df = pd.DataFrame({'a':range(10)})
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = np.random.choice(service_code_options, size=len(df))
print (df)
a SERVICE_CODE
0 0 13.59P
1 1 12.42R
2 2 13.59P
3 3 13.59P
4 4 899.59O
5 5 13.59P
6 6 13.59P
7 7 12.42R
8 8 204.68L
9 9 13.59P

See the documentation for numpy.random.seed:
np.random.seed(this_is_my_seed)
The seed can be an integer or a list of integers
np.random.seed(300)
Or
np.random.seed([3, 1415])
Example
np.random.seed([3, 1415])
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
np.random.choice(service_code_options, 3)
array(['899.59O', '204.68L', '13.59P'], dtype='<U7')
Notice that I passed a 3 to the choice function to specify the size of the array.
numpy.random.choice
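As a quick sanity check (not part of the original answer), re-seeding with the same value before each call gives identical draws with this legacy API:

import numpy as np

service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']

# Seed, draw, re-seed with the same value, draw again.
np.random.seed([3, 1415])
first = np.random.choice(service_code_options, 3)

np.random.seed([3, 1415])
second = np.random.choice(service_code_options, 3)

print(np.array_equal(first, second))  # True: same seed, same sequence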

According to the notes for numpy.random.seed in recent versions of the numpy documentation:
Best practice is to use a dedicated Generator instance rather than the random variate generation methods exposed directly in the random module.
Such a Generator is constructed using np.random.default_rng.
Thus, instead of np.random.seed, the current best practice is to construct a Generator with np.random.default_rng and a seed, and then use that Generator for reproducible results.
Combining jezrael's answer and the current best practice, we have:
import pandas as pd
import numpy as np
rng = np.random.default_rng(seed=121)
df = pd.DataFrame({'a':range(10)})
service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']
df['SERVICE_CODE'] = rng.choice(service_code_options, size=len(df))
print(df)
a SERVICE_CODE
0 0 12.42R
1 1 13.59P
2 2 12.42R
3 3 12.42R
4 4 899.59O
5 5 204.68L
6 6 204.68L
7 7 13.59P
8 8 12.42R
9 9 13.59P
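A small reproducibility check (my addition, not part of the answer above): rebuilding the Generator with the same seed reproduces the same column. The make_codes helper below is hypothetical, introduced only for illustration:

import numpy as np
import pandas as pd

service_code_options = ['899.59O', '12.42R', '13.59P', '204.68L']

def make_codes(seed, n):
    # Each call constructs its own Generator, so the output depends only on the seed.
    rng = np.random.default_rng(seed=seed)
    return rng.choice(service_code_options, size=n)

df = pd.DataFrame({'a': range(10)})
df['SERVICE_CODE'] = make_codes(121, len(df))

# Re-running with the same seed gives exactly the same assignment.
print(np.array_equal(make_codes(121, len(df)), df['SERVICE_CODE'].to_numpy()))  # True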

Related

Compare two dataframes and export unmatched data using pandas or other packages?

I have two dataframes, and one is a subset of the other (picture below). I am not sure whether pandas can compare two dataframes, filter the rows that are not in the subset, and export them as a dataframe. Or is there any package that does this kind of task?
The subset dataframe was generated by RandomUnderSampler, but RandomUnderSampler does not have a function that exports the unselected data. Any comments are welcome.
Use drop_duplicates with the keep=False parameter:
Example:
>>> df1
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
>>> df2
A B
0 0 1
1 2 3
2 6 7
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
A B
2 4 5
4 8 9
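For completeness, a self-contained sketch of the same approach; the frames below just mirror the example output above:

import pandas as pd

# Full data and its subset, as in the example above.
df1 = pd.DataFrame({'A': [0, 2, 4, 6, 8], 'B': [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({'A': [0, 2, 6], 'B': [1, 3, 7]})

# Rows present in both frames become duplicates after concat;
# keep=False drops every duplicated row, leaving only the unmatched ones.
unmatched = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(unmatched)
#    A  B
# 2  4  5
# 4  8  9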

Plot Bar graph with multiple series

I currently have three dictionaries that have the same keys but different values, since they cover three different years. I am trying to create a comparison between the years and was wondering how I could plot all three on the same axes. I have the following code:
import pandas as pd
dt = {'2018':nnum.values, '2019':nnum2.values, '2020':nnum3.values,}
df1=pd.DataFrame(dt,index=nnum.keys())
df1.plot.bar()
I am trying to do something along these lines since each dictionary contains the same keys, but I keep getting an error. Is there any way to set the index to the keys of the dictionary without having to type them out manually?
Add () to call values, and convert them to lists so that dt is a dictionary of lists:
nnum = {1:4,5:3,7:3}
nnum2 = {7:8,9:1,1:0}
nnum3 = {0:7,4:3,8:5}
dt = {'2018': list(nnum.values()), '2019':list(nnum2.values()), '2020':list(nnum3.values())}
df1=pd.DataFrame(dt,index=nnum.keys())
print(df1)
2018 2019 2020
1 4 8 7
5 3 1 3
7 3 0 5
df1.plot.bar()
EDIT: If the dictionaries have different lengths and the missing values need to be filled with 0, it is possible to use:
nnum = {1:4,5:3,7:3}
nnum2 = {7:8,9:1,1:0}
nnum3 = {0:7,4:3}
from itertools import zip_longest
L = [list(nnum.values()), list(nnum2.values()), list(nnum3.values())]
L = list(zip_longest(*L, fillvalue=0))
df1 = pd.DataFrame(L,index=nnum.keys(), columns=['2018','2019','2020'])
print(df1)
2018 2019 2020
1 4 8 7
5 3 1 3
7 3 0 0
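As an alternative (my suggestion, not part of the answer above), pandas can build the frame directly from a dict of dicts: outer keys become columns, inner keys become the index, and missing combinations turn into NaN, which can then be filled with 0. Note that this aligns values by key rather than by position, so the rows can differ from the zip_longest version above:

import pandas as pd

nnum = {1: 4, 5: 3, 7: 3}
nnum2 = {7: 8, 9: 1, 1: 0}
nnum3 = {0: 7, 4: 3}

# Columns come from the outer keys, the index is the union of the inner keys,
# and any missing year/key combination is filled with 0.
df1 = pd.DataFrame({'2018': nnum, '2019': nnum2, '2020': nnum3}).fillna(0).astype(int)
print(df1)
df1.plot.bar()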

How can I delete useless strings by index from a Pandas DataFrame defining a function?

I have a DataFrame, named 'traj', as follows:
x y z
0 5 3 4
1 4 2 8
2 1 1 7
3 Some string here
4 This is spam
5 5 7 8
6 9 9 7
... #continues repeatedly a lot with the same strings here in index 3 and 4
79 4 3 3
80 Some string here
I'm defining a function in order to delete useless strings positioned in certain index from the DataFrame. Here is what I'm trying:
def spam(names, df):  # names is a list composed, for instance, of "Some" and "This" in 'traj'
    return df.drop(index=([traj[(traj.iloc[:, 0] == n)].index for n in names]))
But when I call it, it returns the error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try alone:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.
I solved it in a different way:
df = traj[~traj[:].isin(names)].dropna()
where names is a list of the terms you wish to delete; df will then contain only the rows without these terms.
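For reference, here is a hedged sketch of how the original spam function could be fixed instead: the KeyError most likely comes from passing a list of Index objects to drop, so building one boolean mask (or flattening the labels) avoids it. This assumes, as in the question, that the unwanted rows can be identified by matching the first column against the names:

def spam(names, df):
    # One boolean mask instead of a list of Index objects:
    # True for rows whose first column matches any unwanted name.
    mask = df.iloc[:, 0].isin(names)
    return df.drop(index=df.index[mask])

# traj_clean = spam(['Some', 'This'], traj)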

Using relative positioning with Python 3.5 and pandas

I am formatting some csv files, and I need to add columns that use other columns for arithmetic. Like in Excel, B3 = sum(A1:A3)/3, then B4 = sum(A2:A4)/3. I've looked up relative indexes and haven't found what I'm trying to do.
def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) Column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the value of Price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this to be a common task, so I assume I'm looking in the wrong places. If anyone has some advice I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns = ['A'])
df
Out[151]:
A
0 2
1 4
2 1
3 1
4 4
5 2
6 4
7 2
8 4
9 1
Take the averages of A1:A3, A2:A4 etc:
df.rolling(3).mean()
Out[152]:
A
0 NaN
1 NaN
2 2.333333
3 2.000000
4 2.000000
5 2.333333
6 3.333333
7 2.666667
8 3.333333
9 2.333333
This requires pandas 0.18 or later. For earlier versions, use pd.rolling_mean():
pd.rolling_mean(df['A'], 3)
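Applied to the original question, the AVG(12) column is just a 12-row rolling mean of the Price column; here is a minimal sketch (assuming a 'Price' column as in the question, and filling the first eleven rows with 0 the way the original code does):

def avg_12(df):
    # 12-row rolling window ending at the current row; the first 11 rows
    # have no complete window, so fill them with 0 as the question does.
    df['AVG(12)'] = df['Price'].rolling(12).mean().fillna(0)
    return df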

Reshaping in julia

When I reshape in Python, I use this:
import numpy as np
y= np.asarray([1,2,3,4,5,6,7,8])
x=2
z=y.reshape(-1, x)
print(z)
and get this
>>>
[[1 2]
[3 4]
[5 6]
[7 8]]
How would I get the same thing in julia? I tried:
z = [1,2,3,4,5,6,7,8]
x= 2
a=reshape(z,x,4)
println(a)
and it gave me:
[1 3 5 7
2 4 6 8]
If I use reshape(z,4,x) it would give
[1 5
2 6
3 7
4 8]
Also, is there a way to do the reshape without specifying the second dimension, like reshape(z,x), or for when the second dimension is more ambiguous?
I think what you have hit upon is that NumPy stores arrays in row-major order while Julia stores them in column-major order, as covered here.
So Julia is doing what numpy would do if you used
z=y.reshape(-1,x,order='F')
What you want is the transpose of your first attempt, which is:
z = [1,2,3,4,5,6,7,8]
x= 2
a=reshape(z,x,4)'
println(a)
You want to know if there is something that will compute the second dimension, assuming the array is 2-dimensional? Not that I know of. Possibly ArrayViews? Here's a simple function to start:
julia> shape2d(x,shape...)=length(shape)!=1?reshape(x,shape...):reshape(x,shape[1],Int64(length(x)/shape[1]))
shape2d (generic function with 1 method)
julia> shape2d(z,x)'
4x2 Array{Int64,2}:
1 2
3 4
5 6
7 8
How about
z = [1,2,3,4,5,6,7,8]
x = 2
a = reshape(z,x,4)'
which gives
julia> a = reshape(z,x,4)'
4x2 Array{Int64,2}:
1 2
3 4
5 6
7 8
As for your bonus question
"Also is there a way to do reshape without specifying the second
dimension like reshape(z,x) or if the secondary dimension is more
ambiguous?"
the answer is not exactly, because it would be ambiguous: reshape can make 3D, 4D, ... tensors, so it's not clear what is expected. You can, however, do something like
matrix_reshape(z,x) = reshape(z, x, div(length(z),x))
which does what I think you expect.
"Also is there a way to do reshape without specifying the second dimension like reshape(z,x) or if the secondary dimension is more ambiguous?"
Use : instead of -1
I'm using Julia 1.1 (not sure whether this feature existed when the question was originally answered):
julia> z = [1,2,3,4,5,6,7,8]; a = reshape(z,:,2)
4×2 Array{Int64,2}:
1 5
2 6
3 7
4 8
However, if you want the first row to be 1 2 and match Python, you'll need to follow the other answer mentioning row-major vs column-major ordering and do
julia> z = [1,2,3,4,5,6,7,8]; a = reshape(z,2,:)'
4×2 LinearAlgebra.Adjoint{Int64,Array{Int64,2}}:
1 2
3 4
5 6
7 8
