Is it possible to create a DataFrame from a few 1D arrays and place them as columns?
If I create a DataFrame from a single 1D array, everything is fine:
arr1 = np.array([11, 12, 13, 14, 15])
arr1_arr2_df = pd.DataFrame(data=arr1, index=None, columns=None)
arr1_arr2_df
Out:
0
0 11
1 12
2 13
3 14
4 15
But if I make a DataFrame from 2 arrays, they are placed as rows:
arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])
arr1_arr2_df = pd.DataFrame(data=(arr1,arr2), index=None, columns=None)
arr1_arr2_df
Out:
0 1 2 3 4
0 11 12 13 14 15
1 21 22 23 24 25
I know that I can achieve it by using transpose:
arr1_arr2_df = arr1_arr2_df.transpose()
arr1_arr2_df
Out:
0 1
0 11 21
1 12 22
2 13 23
3 14 24
4 15 25
But is it possible to get this layout from the start?
You can use a dictionary:
arr1_arr2_df = pd.DataFrame(data={0:arr1,1:arr2})
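Spelled out a bit more, here is a sketch of the same idea (the column names 'a' and 'b' are my own choice, not from the question); np.column_stack is an equivalent NumPy-side route:
import numpy as np
import pandas as pd

arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])

# dict keys become column labels, so each array lands as a column
df = pd.DataFrame({'a': arr1, 'b': arr2})

# alternatively, stack the 1D arrays into one 2D block column-wise
df = pd.DataFrame(np.column_stack([arr1, arr2]))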
I can't see the forest for the trees right now:
I have a Pandas dataframe:
import pandas as pd
df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})
print(df)
UTCs Temperature Humidity pressure
0 32776 5 50 998
1 32777 7 50 998
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
7 32783 4 52 1000
Now I want to create a subset of all dataset columns for UTCs between 32778 and 32782
I can choose a subset with:
df_sub=df.iloc[2:7,:]
print(df_sub)
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
But how can I do that with a condition like 'choose rows between UTCs=32778 and UTCs=32782'?
Something like
df_sub = df.iloc[df[df.UTCs == 32778] : df[df.UTCs == 32783], : ]
does not work.
Any hint for me?
Use between for boolean indexing:
df_sub = df[df['UTCs'].between(32778, 32783, inclusive='left')]
output:
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
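Since the upper bound 32782 is itself wanted, an equivalent form (assuming pandas >= 1.3, where inclusive accepts a string) makes both ends inclusive, which is the default:
df_sub = df[df['UTCs'].between(32778, 32782)]  # inclusive='both' is the default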
Given a building infos dataframe as follows:
id floor type
0 1 13 office
1 2 12 office
2 3 9 office
3 4 9 office
4 5 7 office
5 6 6 office
6 7 9 office
7 8 5 office
8 9 5 office
9 10 5 office
10 11 4 retail
11 12 3 retail
12 13 2 retail
13 14 1 retail
14 15 -1 parking
15 16 -2 parking
16 17 13 office
I want to check whether any floors are missing from the column floor (except floor 0, which by convention does not exist).
Code:
set(df['floor'])
Out:
{-2, -1, 1, 2, 3, 4, 5, 6, 7, 9, 12, 13}
For example, for the dataset above (-2, -1, 1, 2, ..., 13), I want to return an indication that floors 8, 10 and 11 are missing from the dataset. Otherwise, just return 'no missing floor in your dataset'.
How could I do that in Pandas or NumPy? Thanks a lot for your help in advance.
Use np.setdiff1d for the difference against a full range created with np.arange, with 0 omitted:
arr = np.arange(df['floor'].min(), df['floor'].max() + 1)
arr = arr[arr != 0]
out = np.setdiff1d(arr, df['floor'])
out = ('no missing floor in your dataset'
       if len(out) == 0
       else f'floor(s) {", ".join(out.astype(str))} are missing in your dataset')
print (out)
floor(s) 8, 10, 11 are missing in your dataset
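A plain set difference works too; this is a sketch under the same assumptions (floors should run from the minimum to the maximum, and 0 never counts as missing):
floors = set(df['floor'])
expected = set(range(df['floor'].min(), df['floor'].max() + 1)) - {0}
missing = sorted(expected - floors)
print(missing)  # [8, 10, 11]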
I have two different data sets:
1. state VDM MDM OM
AP 1 2 5
GOA 1 2 1
GU 1 2 4
KA 1 5 1
2. Attribute:Value Support Item
VDM:1 4 1
VDM:2 0 2
VDM:3 0 3
VDM:4 0 4
VDM:5 0 5
MDM:1 0 6
MDM:2 3 7
MDM:3 0 8
MDM:4 0 9
MDM:5 1 10
OM:1 2 11
OM:2 0 12
OM:3 0 13
OM:4 1 14
OM:5 1 15
The first dataset only contains the values 1-5.
The second dataset holds each Attribute:Value pair, its occurrence count (Support) and a sequence number (Item).
I want a dataset which looks like the one below:
state Item Number
AP 1, 7, 15
GOA 1, 7, 11
GU 1, 7, 14
KA 1, 10, 11
None of these are really appealing to me. But sometimes you just have to thrash about to get your data munged.
Attempt #0
a = dict(zip(df2['Attribute:Value'], df2['Item']))
cols = ['VDM', 'MDM', 'OM']
b = {
    'Item Number':
        [', '.join([str(a[f'{c}:{t._asdict()[c]}']) for c in cols]) for t in df1.itertuples()]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #1
a = dict(zip(df2['Attribute:Value'], df2['Item'].astype(str)))
d1 = df1.set_index('state').astype(str)
r1 = (d1.columns + ':' + d1).replace(a)  # Thanks @anky_91
# r1 = (d1.columns + ':' + d1).applymap(a.get)
r1
VDM MDM OM
state
AP 1 7 15
GOA 1 7 11
GU 1 7 14
KA 1 10 11
Then
pd.DataFrame({'state': r1.index, 'Item Number': [*map(', '.join, zip(*map(r1.get, r1)))]})
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #2
a = dict(zip(df2['Attribute:Value'], df2['Item'].astype(str)))
cols = ['VDM', 'MDM', 'OM']
b = {
    'Item Number':
        [*map(', '.join, zip(*[[a[f'{c}:{i}'] for i in df1[c]] for c in cols]))]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #3
from itertools import cycle
a = dict(zip(map(tuple, df2['Attribute:Value'].str.split(':')), df2['Item'].astype(str)))  # keys like ('VDM', '1')
d = df1.set_index('state')
b = {
    'Item Number':
        [*map(', '.join, zip(*[map(a.get, zip(cycle(d), np.ravel(d).astype(str)))] * 3))]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #4
a = pd.Series(dict(zip(
    map(tuple, df2['Attribute:Value'].str.split(':')),  # keys like ('VDM', '1')
    df2.Item.astype(str)
)))
df1.set_index('state').stack().astype(str).groupby(level=0).apply(
    lambda s: ', '.join(map(a.get, s.xs(s.name).items()))
).reset_index(name='Item Number')
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Here is another approach using stack, map and unstack:
s = df1.set_index('state').stack()
s_map = df2.set_index(['Attribute:Value'])['Item']
s.loc[:] = (s.index.get_level_values(1) + ':' + s.astype(str)).map(s_map)
s.unstack().astype(str).apply(', '.join, axis=1).reset_index(name='Item Number')
[out]
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
I feel like this is a merge and pivot problem:
s = df2['Attribute:Value'].str.split(':', expand=True).assign(Item=df2.Item)
s[1] = s[1].astype(int)
s1 = df1.melt('state')
s1.merge(s, right_on=[0, 1], left_on=['variable', 'value']).pivot(index='state', columns='variable', values='Item')
Out[113]:
variable MDM OM VDM
state
AP 7 15 1
GOA 7 11 1
GU 7 14 1
KA 10 11 1
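To land on the exact requested string column from here, one could reorder the pivoted columns back to VDM/MDM/OM and join them; a sketch building on the frames above:
wide = (s1.merge(s, right_on=[0, 1], left_on=['variable', 'value'])
          .pivot(index='state', columns='variable', values='Item'))
out = (wide[['VDM', 'MDM', 'OM']]  # restore the original column order
       .astype(str)
       .agg(', '.join, axis=1)
       .reset_index(name='Item Number'))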
I have a 2D matrix. I want to shuffle the last few columns of each row, independently per row.
I tried using np.random.shuffle, but it applies the same column permutation to every row:
def randomize_the_data(original_matrix, reordering_sz):
    new_matrix = np.transpose(original_matrix)
    np.random.shuffle(new_matrix[reordering_sz:])
    shuffled_matrix = np.transpose(new_matrix)
    print(shuffled_matrix)
a = np.arange(20).reshape(4, 5)
print(a)
print()
randomize_the_data(a, 2)
My original matrix is this:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
I am getting this.
[[ 0 1 3 4 2]
[ 5 6 8 9 7]
[10 11 13 14 12]
[15 16 18 19 17]]
But I want something like this.
[[ 0 1 3 2 4]
[ 5 6 7 8 9]
[10 11 14 12 13]
[15 16 17 18 19]]
Another example would be:
Original =
-1.3702 0.3341 -1.2926 -1.4690 -0.0843
0.0170 0.0332 -0.1189 -0.0234 -0.0398
-0.1755 0.2182 -0.0563 -0.1633 0.1081
-0.0423 -0.0611 -0.8568 0.0184 -0.8866
Randomized =
-1.3702 0.3341 -0.0843 -1.2926 -1.4690
0.0170 0.0332 -0.0398 -0.0234 -0.1189
-0.1755 0.2182 -0.0563 0.1081 -0.1633
-0.0423 -0.0611 0.0184 -0.8866 -0.8568
To shuffle the last elements of each row, go through the rows independently and shuffle the tail of each one. Because the shuffle runs once per row, each row ends up in its own random order, unlike before, where every row was shuffled the same way.
import numpy as np
def randomize_the_data(original_matrix, reordering_sz):
    for ln in original_matrix:
        np.random.shuffle(ln[reordering_sz:])
    print(original_matrix)
a = np.arange(20).reshape(4, 5)
print(a)
print()
randomize_the_data(a, 2)
OUTPUT:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
[[ 0 1 4 2 3]
[ 5 6 8 7 9]
[10 11 13 14 12]
[15 16 17 18 19]]
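For larger matrices, the Python loop can be avoided; this is a vectorized sketch (my own variant, not from the answer above) that sorts random keys per row to get an independent permutation of each row's tail:
import numpy as np

def randomize_the_data_vectorized(m, reordering_sz):
    tail = m[:, reordering_sz:]
    # argsort of uniform random keys gives one random order per row
    order = np.argsort(np.random.rand(*tail.shape), axis=1)
    m[:, reordering_sz:] = np.take_along_axis(tail, order, axis=1)
    print(m)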
How can I create a new column containing a random integer drawn between the values of two other columns in the same row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work, of course, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close; specify axis=1 to process the data by rows, and change data.start/end to x.start/end so that you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s, e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it into each row's start/end range:
(
    data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
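Side note, hedged on the NumPy version: recent NumPy lets randint (and the Generator API's integers) broadcast array-like low/high directly, which keeps the same half-open [start, end) semantics as the apply-based answer:
rng = np.random.default_rng()
data['rand_between'] = rng.integers(data['start'], data['end'])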