Create a dataset from two different datasets - python-3.x

I have two different data sets:
1. state  VDM  MDM  OM
   AP     1    2    5
   GOA    1    2    1
   GU     1    2    4
   KA     1    5    1
2. Attribute:Value  Support  Item
   VDM:1            4        1
   VDM:2            0        2
   VDM:3            0        3
   VDM:4            0        4
   VDM:5            0        5
   MDM:1            0        6
   MDM:2            3        7
   MDM:3            0        8
   MDM:4            0        9
   MDM:5            1        10
   OM:1             2        11
   OM:2             0        12
   OM:3             0        13
   OM:4             1        14
   OM:5             1        15
The first dataset only contains the values 1-5.
The second dataset holds each Attribute:Value pair, its number of occurrences (Support), and a sequence number (Item).
I want a dataset which looks like the one below:
state  Item Number
AP     1, 7, 15
GOA    1, 7, 11
GU     1, 7, 14
KA     1, 10, 11
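For reference, the two frames can be rebuilt like this (a minimal sketch; the names df1 and df2 match the attempts below):
import pandas as pd

df1 = pd.DataFrame({
    'state': ['AP', 'GOA', 'GU', 'KA'],
    'VDM': [1, 1, 1, 1],
    'MDM': [2, 2, 2, 5],
    'OM': [5, 1, 4, 1],
})
df2 = pd.DataFrame({
    'Attribute:Value': [f'{c}:{v}' for c in ('VDM', 'MDM', 'OM') for v in range(1, 6)],
    'Support': [4, 0, 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 1, 1],
    'Item': range(1, 16),
})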

None of these are really appealing to me. But sometimes you just have to thrash about to get your data munged.
Attempt #0
# map each 'Attribute:Value' string to its Item number
a = dict(zip(df2['Attribute:Value'], df2['Item']))
cols = ['VDM', 'MDM', 'OM']
b = {
    'Item Number': [
        ', '.join([str(a[f'{c}:{t._asdict()[c]}']) for c in cols])
        for t in df1.itertuples()
    ]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #1
a = dict(zip(df2['Attribute:Value'], df2['Item'].astype(str)))
d1 = df1.set_index('state').astype(str)
r1 = (d1.columns + ':' + d1).replace(a)  # thanks @anky_91
# r1 = (d1.columns + ':' + d1).applymap(a.get)
r1
VDM MDM OM
state
AP 1 7 15
GOA 1 7 11
GU 1 7 14
KA 1 10 11
Then
pd.DataFrame({'state': r1.index, 'Item Number': [*map(', '.join, zip(*map(r1.get, r1)))]})
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
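A more direct way to collapse r1 row-wise (same result, starting from the r1 built above):
r1.apply(', '.join, axis=1).reset_index(name='Item Number')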
Attempt #2
a = dict(zip(df2['Attribute:Value'], df2['Item'].astype(str)))
cols = ['VDM', 'MDM', 'OM']
b = {
    'Item Number':
        [*map(', '.join, zip(*[[a[f'{c}:{i}'] for i in df1[c]] for c in cols]))]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #3
from itertools import cycle
import numpy as np

# keys are ('VDM', '1')-style tuples; map(tuple, ...) replaces the old
# zip(*series.str.split(':').str) trick, whose .str iteration newer pandas removed
a = dict(zip(map(tuple, df2['Attribute:Value'].str.split(':')), df2['Item'].astype(str)))
d = df1.set_index('state')
b = {
    'Item Number':
        [*map(', '.join, zip(*[map(a.get, zip(cycle(d), np.ravel(d).astype(str)))] * 3))]
}
df1[['state']].assign(**b)
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11
Attempt #4
a = pd.Series(dict(zip(
    map(tuple, df2['Attribute:Value'].str.split(':')),
    df2.Item.astype(str)
)))
df1.set_index('state').stack().astype(str).groupby(level=0).apply(
    lambda s: ', '.join(map(a.get, s.xs(s.name).items()))
).reset_index(name='Item Number')
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11

Here is another approach using stack, map and unstack:
s = df1.set_index('state').stack()
s_map = df2.set_index('Attribute:Value')['Item']
s.loc[:] = (s.index.get_level_values(1) + ':' + s.astype(str)).map(s_map)
s.unstack().astype(str).apply(', '.join, axis=1).reset_index(name='Item Number')
[out]
state Item Number
0 AP 1, 7, 15
1 GOA 1, 7, 11
2 GU 1, 7, 14
3 KA 1, 10, 11

I feel like this is a merge-and-pivot problem:
s = df2['Attribute:Value'].str.split(':', expand=True).assign(Item=df2.Item)
s[1] = s[1].astype(int)
s1 = df1.melt('state')
s1.merge(s, right_on=[0, 1], left_on=['variable', 'value']).pivot(index='state', columns='variable', values='Item')
Out[113]:
variable MDM OM VDM
state
AP 7 15 1
GOA 7 11 1
GU 7 14 1
KA 10 11 1
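To go from this pivot to the requested Item Number strings, one extra step does it (a sketch building on s and s1 above; reindex restores the VDM, MDM, OM order, since pivot sorts columns alphabetically):
out = (s1.merge(s, right_on=[0, 1], left_on=['variable', 'value'])
         .pivot(index='state', columns='variable', values='Item')
         .reindex(columns=['VDM', 'MDM', 'OM'])
         .astype(str)
         .apply(', '.join, axis=1)
         .reset_index(name='Item Number'))
print (out)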

Related

Spotfire calculate the difference of values in a single column based on date for each group

I have a requirement to calculate, in Spotfire, the difference between values in a single column based on their order of occurrence.
The data looks like:
ID  Value  Date
1   7      07/01/2021
1   8      09/01/2021
1   10     10/01/2021
1   15     11/01/2021
1   6      12/01/2021
1   3      15/01/2021
2   10     07/01/2021
2   11     08/01/2021
2   12     09/01/2021
The expected output is:
ID  Value  Date        Flag
1   7      07/01/2021  True
1   8      09/01/2021
1   10     10/01/2021
1   15     11/01/2021
1   6      12/01/2021
1   3      15/01/2021
2   10     07/01/2021  False
2   11     08/01/2021
2   12     09/01/2021
The logic is:
we need to find the flag by comparing the first received value and the latest received value for each ID.
For ID 1 the first received value is 7.
For ID 1 the latest received value is 3.
Is 7 > 3? True.
For easy understanding I have sorted the ID column.
Thanks in advance.
Hi, I think you can do this in two steps.
First step:
Concatenate(String([Value])) OVER ([ID])
Let's say you name this column test; for ID 1 it will give you 7, 8, 10, 15, 6, 3.
Second step:
Integer(left([test],1)) > Integer(right([test],1))
ID  Value  Date        test                flag
1   7      07/01/2021  7, 8, 10, 15, 6, 3  True
1   8      09/01/2021  7, 8, 10, 15, 6, 3  True
1   10     10/01/2021  7, 8, 10, 15, 6, 3  True
1   15     11/01/2021  7, 8, 10, 15, 6, 3  True
1   6      12/01/2021  7, 8, 10, 15, 6, 3  True
1   3      15/01/2021  7, 8, 10, 15, 6, 3  True
2   10     07/01/2021  10, 11, 12          False
2   11     08/01/2021  10, 11, 12          False
2   12     09/01/2021  10, 11, 12          False
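For comparison outside Spotfire, the same first-versus-latest check is a short groupby in pandas (a sketch using the question's sample data, assuming rows are already in received order within each ID):
import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Value': [7, 8, 10, 15, 6, 3, 10, 11, 12],
    'Date':  ['07/01/2021', '09/01/2021', '10/01/2021', '11/01/2021',
              '12/01/2021', '15/01/2021', '07/01/2021', '08/01/2021', '09/01/2021'],
})

# compare the first received value with the latest received value per ID
g = df.groupby('ID')['Value']
df['Flag'] = g.transform('first') > g.transform('last')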

Check the missing integers from a range in Python

Given a building info dataframe as follows:
id floor type
0 1 13 office
1 2 12 office
2 3 9 office
3 4 9 office
4 5 7 office
5 6 6 office
6 7 9 office
7 8 5 office
8 9 5 office
9 10 5 office
10 11 4 retail
11 12 3 retail
12 13 2 retail
13 14 1 retail
14 15 -1 parking
15 16 -2 parking
16 17 13 office
I want to check whether, in the column floor, there are missing floors (except for floor 0, which by default does not exist).
Code:
set(df['floor'])
Out:
{-2, -1, 1, 2, 3, 4, 5, 6, 7, 9, 12, 13}
For example, for the dataset above (-2, -1, 1, 2, ..., 13), I want it to return an indication that floors 8, 10, 11 are missing from the dataset. Otherwise, just return that there are no missing floors.
How could I do that in Pandas or NumPy? Thanks a lot for your help in advance.
Use np.setdiff1d for the difference against a range created with np.arange, omitting 0:
import numpy as np

arr = np.arange(df['floor'].min(), df['floor'].max() + 1)
arr = arr[arr != 0]  # floor 0 does not exist by default
out = np.setdiff1d(arr, df['floor'])
out = ('no missing floor in your dataset'
       if len(out) == 0
       else f'floor(s) {", ".join(out.astype(str))} are missing in your dataset')
print (out)
floor(s) 8, 10, 11 are missing in your dataset
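The same check also works with plain Python sets, if a NumPy-free variant is preferred (a sketch under the same assumptions):
floors = set(df['floor'])
expected = set(range(min(floors), max(floors) + 1)) - {0}  # floor 0 never exists
missing = sorted(expected - floors)
print (missing)  # [8, 10, 11]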

pd.Series(pred).value_counts(): how to get the first column into a dataframe?

I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I create a list, I get only the second column:
c_list = list(pd.Series(pred).value_counts())
Out:
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I ultimately get a dataframe that looks like this, including a new column for each class's size as a percentage of the total?
df=
[class , size ,relative_size]
0 2084 , x%
-1 15 , y%
1 13 , etc.
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
You are very nearly there. Typing this blind as you didn't provide a sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()  # fraction of total; multiply by 100 for a percentage
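Alternatively, value_counts(normalize=True) returns the relative frequencies directly; a sketch of the same frame built that way:
s = pd.Series(pred)
df = s.value_counts().rename_axis('class').reset_index(name='size')
df['relative_size'] = s.value_counts(normalize=True).values * 100  # percent of total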

Set 3 level of column names in pandas DataFrame

I'm trying to have a frame with the following structure:
        h/a                    totals
   sub1       sub2        sub1       sub2
a b ... f   g ... m    a b ... f   g ... m
That is: two labels for the first layer, again two labels for the second one, and then a subset of column names, where sub1 and sub2 do not have the same column names.
In order to do so I did the following:
columnas = pd.MultiIndex.from_product([['h/a', 'totals'], ['means', 'percentages'],
                                       [('means', 'a'), ('means', 'b'), ....('percentage', 'g'), ....]],
                                      names=['data level 1', 'data level 2', 'data level 3'])
data = [data, pata, ......]
newframe = pd.DataFrame(data, columns=columnas)
What I get is this error:
>ValueError: Shape of passed values is (1, 21), indices imply (84, 21)
How can I fix this to have a multi leveled frame by column names?
Thank you
I think you need MultiIndex.from_tuples built from list comprehensions:
L1 = list('abc')
L2 = list('ghi')
tups = ([('h/a', 'means', x) for x in L1] +
        [('h/a', 'percentage', x) for x in L2] +
        [('totals', 'means', x) for x in L1] +
        [('totals', 'percentage', x) for x in L2])
columnas = pd.MultiIndex.from_tuples(tups, names=['data level 1', 'data level 2', 'data level 3'])
print (columnas)
MultiIndex(levels=[['h/a', 'totals'],
                   ['means', 'percentage'],
                   ['a', 'b', 'c', 'g', 'h', 'i']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
                   [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
           names=['data level 1', 'data level 2', 'data level 3'])
# some random data
import numpy as np
np.random.seed(785)
data = np.random.randint(10, size=(3, 12))
print (data)
[[8 0 4 1 2 5 4 1 4 1 1 8]
[1 5 0 7 4 8 4 1 3 8 0 2]
[5 9 4 9 4 6 3 7 0 5 2 1]]
newframe=pd.DataFrame(data,columns=columnas)
print (newframe)
data level 1 h/a totals
data level 2 means percentage means percentage
data level 3 a b c g h i a b c g h i
0 8 0 4 1 2 5 4 1 4 1 1 8
1 1 5 0 7 4 8 4 1 3 8 0 2
2 5 9 4 9 4 6 3 7 0 5 2 1
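Selection then works level by level; for example, against the frame just built:
# all columns under the 'h/a' block
print (newframe['h/a'])
# the 'percentage' sub-block inside 'totals'
print (newframe[('totals', 'percentage')])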

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day and category and get a vector of the counts per time, where time can be between 1 and 10. I have defined the max and min of time in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to turn this aggregation into a vector?
Use reindex with MultiIndex.from_product to append the missing time values, then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
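If the time bounds live in the two variables the question mentions, the hard-coded range can come from them instead (a small sketch; tmin/tmax are used here to avoid shadowing the built-ins min and max):
tmin, tmax = 1, 10
b = range(tmin, tmax + 1)  # replaces the hard-coded range(1, 11)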
EDIT:
df = (df.set_index(['day', 'time', 'category'])['count']
        .unstack(1, fill_value=0)
        .reindex(columns=range(1, 11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, axis=1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]
