python/pandas - converting column headers into index

The data I have to deal with stores the hours of the day as columns. I want to convert those columns into an index. A sample looks like this:
year month day 1 2 3 4 5 ... 24
2015 1     1   a b         ... c
2015 1     2   d e         ... f
2015 1     3   g h         ... i
I want the output file to look something like this:
year month day hour value
2015 1 1 1 a
2015 1 1 2 b
. . . . .
2015 1 1 24 c
2015 1 2 1 d
. . . . .
I am currently using Python 3.4 with the pandas module.
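For reference, the sample frame the answers below operate on can be built like this (a minimal sketch of mine; the elided hours 3-23 are left out):

import pandas as pd

# cut-down sample: hour columns 1, 2 and 24 only, matching the question's elision
df = pd.DataFrame({'year': [2015, 2015, 2015],
                   'month': [1, 1, 1],
                   'day': [1, 2, 3],
                   1: ['a', 'd', 'g'],
                   2: ['b', 'e', 'h'],
                   24: ['c', 'f', 'i']})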

Use set_index with stack:
print (df.set_index(['year','month','day'])
         .stack()
         .reset_index(name='value')
         .rename(columns={'level_3':'hour'}))
year month day hour value
0 2015 1 1 1 a
1 2015 1 1 2 b
2 2015 1 1 24 c
3 2015 1 2 1 d
4 2015 1 2 2 e
5 2015 1 2 24 f
6 2015 1 3 1 g
7 2015 1 3 2 h
8 2015 1 3 24 i
Another solution with melt and sort_values:
print (pd.melt(df, id_vars=['year','month','day'], var_name='hour')
         .sort_values(['year', 'month', 'day', 'hour']))
year month day hour value
0 2015 1 1 1 a
3 2015 1 1 2 b
6 2015 1 1 24 c
1 2015 1 2 1 d
4 2015 1 2 2 e
7 2015 1 2 24 f
2 2015 1 3 1 g
5 2015 1 3 2 h
8 2015 1 3 24 i
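One caveat worth noting (mine, not part of either answer): if the hour columns were read from a file, their labels may be strings, in which case sort_values orders '10' before '2'. Casting the new column first avoids that (out stands for either result above):

out['hour'] = out['hour'].astype(int)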

Related

Row comparison on different tables

Hello friends.
I'm trying to figure out a formula that verifies whether each row of Table 2 has a matching row in Table 1. If not, the formula should flag the row as not listed, as shown in column E (CHECK). Is that possible? Or maybe it needs a VBA macro?
TABLE 1
A   B  C  D
29  1  1  1
29  2  1  2
30  3  1  2
15  1  1  1
15  2  1  2
15  3  1  2
20  1  1  1
20  2  1  2
20  3  2  1
20  4  2  2
20  5  1  3

TABLE 2
A    B  C  D  CHECK
29   1  1  1  EXISTS
15   1  1  2  NOT
15   2  1  2  EXISTS
15   3  1  2  EXISTS
20   6  1  1  NOT
100  1  2  3  NOT LISTED
Thanks, guys, would appreciate some help.
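For what it's worth, a hedged pandas sketch of the same check (the asker wanted an Excel formula; there, a COUNTIFS across columns A:D wrapped in an IF would play the same role):

import pandas as pd

# the two tables from the question
t1 = pd.DataFrame([[29, 1, 1, 1], [29, 2, 1, 2], [30, 3, 1, 2],
                   [15, 1, 1, 1], [15, 2, 1, 2], [15, 3, 1, 2],
                   [20, 1, 1, 1], [20, 2, 1, 2], [20, 3, 2, 1],
                   [20, 4, 2, 2], [20, 5, 1, 3]], columns=list('ABCD'))
t2 = pd.DataFrame([[29, 1, 1, 1], [15, 1, 1, 2], [15, 2, 1, 2],
                   [15, 3, 1, 2], [20, 6, 1, 1], [100, 1, 2, 3]],
                  columns=list('ABCD'))

# left-merge with indicator=True: '_merge' is 'both' where the row exists in Table 1
m = t2.merge(t1.drop_duplicates(), on=list('ABCD'), how='left', indicator=True)
t2['CHECK'] = m['_merge'].eq('both').map({True: 'EXISTS', False: 'NOT LISTED'}).values
print(t2)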

Check whether values from a given list exist in multiple columns, and count the number of columns

I have the DataFrame below:
B C D E
2 2 4 11
11 0 5 3
12 10 1 11
5 9 7 15
First I want the unique values from the whole DataFrame, like below:
[0,1,2,3,4,5,7,9,10,11,12,15]
Then I want this final output:
value   number of columns the value exists in
0 1
1 1
2 2
3 1
4 1
5 1
7 1
9 1
10 1
11 2
12 1
15 1
That is, for each value, I want the number of columns in which it appears.
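For reference, the example frame can be built like this (a minimal sketch, column names taken from the question):

import pandas as pd

df = pd.DataFrame({'B': [2, 11, 12, 5],
                   'C': [2, 0, 10, 9],
                   'D': [4, 5, 1, 7],
                   'E': [11, 3, 11, 15]})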
Using plain Python you can do something like this:
# your input df as a list of lists (one inner list per column)
df = [[2, 11, 12, 5], [2, 0, 10, 9], [4, 5, 1, 7], [11, 3, 11, 15]]
# remove duplicates in each column
dfU = [list(set(l)) for l in df]
# sort each list (not required for this approach)
for l in dfU:
    l.sort()
# the requested unique list (sorted, since set order is arbitrary)
flatList = [item for sublist in df for item in sublist]
uniqueList = sorted(set(flatList))
print(uniqueList)
# for each unique value, count the columns that contain it
output = []
for num in uniqueList:
    cnt = 0
    for idx in range(len(dfU)):
        if dfU[idx].count(num) > 0:
            cnt += 1
    output.append([num, cnt])
print(output)
As a side note, the count call is computationally expensive, so a linear scan along the sorted columns would be more efficient.
Use DataFrame.melt to reshape, drop duplicates across both columns, and count with GroupBy.size, adding Series.reset_index to get a DataFrame back:
df1 = (df.melt(value_name='value')
         .drop_duplicates()
         .groupby('value')
         .size()
         .reset_index(name='count'))
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1
Details:
print (df.melt(value_name='value'))
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
14 E 11
15 E 15
The duplicate 11 at index 14 is removed:
print (df.melt(value_name='value').drop_duplicates())
variable value
0 B 2
1 B 11
2 B 12
3 B 5
4 C 2
5 C 0
6 C 10
7 C 9
8 D 4
9 D 5
10 D 1
11 D 7
12 E 11
13 E 3
15 E 15
If you want a pure Python solution:
from collections import Counter

# df.T.values iterates the columns; set(x) dedupes within each column,
# so the Counter counts how many columns each value appears in
L = sorted(Counter([y for x in df.T.values for y in set(x)]).items())
df1 = pd.DataFrame(L, columns=['value','count'])
print (df1)
value count
0 0 1
1 1 1
2 2 2
3 3 1
4 4 1
5 5 2
6 7 1
7 9 1
8 10 1
9 11 2
10 12 1
11 15 1

How to randomly generate unobserved data in Python 3

I have a DataFrame which contains the observed data:
import pandas as pd
d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month': [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df looks like this:
humanID dogID month
0 1 1 1
1 1 2 1
2 2 1 2
3 2 5 3
4 2 4 1
5 2 6 2
6 2 7 3
7 2 20 1
8 2 9 2
9 2 7 2
We have two humans and twenty dogs in total, and the df above contains the observed data. For example:
The first row means: human1 adopted dog1 in January
The second row means: human1 adopted dog2 in January
The third row means: human2 adopted dog1 in February
========================================================================
My goal is to randomly generate two unobserved samples for each (human, month) pair that appears in the observed data.
For example, human1 in January did not adopt the dogs [3, 4, 5, 6, 7, ..., 20], and I want to randomly create two unobserved samples in triple form:
humanID dogID month
1 20 1
1 10 1
However, the following sample is not allowed, since it appears in the original df:
humanID dogID month
1 2 1
Human1 has no activity in February, so we don't need to sample unobserved data there.
Human2 has activity in January, February and March, so for each of those months we want to randomly create unobserved data. For example, in January human2 adopted dog1, dog4 and dog20, so the two random unobserved samples could be:
humanID dogID month
2 2 1
2 6 1
The same process applies to February and March.
I want to put all of the unobserved samples in one DataFrame, like the following:
humanID dogID month
0 1 20 1
1 1 10 1
2 2 2 1
3 2 6 1
4 2 13 2
5 2 16 2
6 2 1 3
7 2 20 3
Is there a fast way to do this?
PS: this is from a coding interview at a start-up company.
Using groupby and random.choices:
import random

dogs = list(range(1, 21))
dfs = []
n_sample = 2
for i, d in df.groupby(['humanID', 'month']):
    h_id, month = i
    # sample dog IDs from the dogs this human did NOT adopt in this month
    unseen = list(set(dogs) - set(d['dogID']))
    sample = pd.DataFrame([(h_id, dogID, month)
                           for dogID in random.choices(unseen, k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)
humanID dogID month
0 1 11 1
1 1 5 1
2 2 19 1
3 2 18 1
4 2 15 2
5 2 14 2
6 2 16 3
7 2 18 3
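Note (not part of the original answer): random.choices draws with replacement, so the two picks can occasionally be the same dog. Swapping in random.sample(sorted(set(dogs) - set(d['dogID'])), k=n_sample) guarantees two distinct dogs per (human, month) pair.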
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of the column:
import numpy as np

df_new = df.copy()
df_new['dogID'] = np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
humanID dogID month
0 1 1 1
1 1 20 1
4 2 9 1
7 2 1 1
2 2 4 2
5 2 5 2
8 2 2 2
9 2 7 2
3 2 7 3
6 2 6 3
Or, to create a random sampling of values within the range of dogID:
df_new = df.copy()
a = np.random.permutation(range(df_new.dogID.min(), df_new.dogID.max()))
df_new['dogID'] = np.random.choice(a, df_new.shape[0])
print(df_new.sort_values('month'))
humanID dogID month
0 1 18 1
1 1 16 1
4 2 1 1
7 2 8 1
2 2 4 2
5 2 2 2
8 2 16 2
9 2 14 2
3 2 4 3
6 2 12 3
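Note (my observation, not the answerer's): neither variant excludes (humanID, dogID, month) triples that were already observed, and range(min, max) stops one short of the largest dogID, so dog 20 can never be drawn. A final anti-join against df would enforce the "unobserved" requirement, e.g. (a sketch assuming the observed triples are unique):

# keep only generated rows that do not appear in the observed df
mask = df_new.merge(df, on=['humanID', 'dogID', 'month'],
                    how='left', indicator=True)['_merge'].eq('left_only')
df_new = df_new[mask.values]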

reproduce/break rows based on field value

I have my dataframe as:
id date value
1 2016 3
2 2016 1
1 2018 1
1 2016 1.1
Now I want to reproduce rows, for some weird reason, with logic like this:
if value > 1:
    split the row into one row per whole unit, each with value = 1,
    plus one row for whatever fraction is left over
For a better understanding, consider only the first row of the dataframe, i.e.:
id date value
1 2016 3
which I have broken down into 3 rows as:
id date value
1 2016 1
1 2016 1
1 2016 1
but consider the last row, i.e.:
id date value
1 2016 1.1
which is broken down as:
id date value
1 2016 1
1 2016 0.1
i.e. any fractional part is broken out separately; otherwise the row splits into whole units.
Grouping by id and sorting by date afterwards is then easy.
i.e. new dataframe will look like:
id date value
1 2016 1
1 2016 1
1 2016 1
1 2016 1
1 2016 0.1
1 2018 1
2 2016 1
The main problem is reproducing the rows.
UPDATED
Sample dataframe code:
df = pd.DataFrame([[1,2018,5.1],[2,2018,2],[1,2016,1]], columns=["id", "date", "value"])
A generator:
def f(df):
    # t holds the id/date part of each row, v the value
    for i, *t, v in df.itertuples():
        # emit one row per whole unit, then the leftover fraction
        while v > 0:
            yield t + [min(v, 1)]
            v -= 1

pd.DataFrame([*f(df)], columns=df.columns)
id date value
0 1 2018 1.0
1 1 2018 1.0
2 1 2018 1.0
3 1 2018 1.0
4 1 2018 1.0
5 1 2018 0.1
6 2 2018 1.0
7 2 2018 1.0
8 1 2016 1.0
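Note (my addition, not the answerer's): the generator keeps the original row order, so the group-by-id / sort-by-date step the question mentions still has to be applied, e.g.:

out = pd.DataFrame([*f(df)], columns=df.columns).sort_values(['id', 'date']).reset_index(drop=True)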
Using // and % with pandas repeat:
s1 = df.value // 1   # whole-unit count per row
s2 = df.value % 1    # leftover fraction per row
# repeat each index once per whole unit, then append the non-zero fractions
s = pd.concat([s1.loc[s1.index.repeat(s1.astype(int))], s2[s2 != 0]]).sort_index()
s.loc[s >= 1] = 1
# repeat each original row once per output row it produces
newdf = df.reindex(df.index.repeat((s1 + s2.ne(0)).astype(int)))
newdf['value'] = s.values
print(newdf)
id date value
0 1 2016 1.0
0 1 2016 1.0
0 1 2016 1.0
1 2 2016 1.0
2 1 2018 1.0
3 1 2016 1.0
3 1 2016 0.1

how to use groupby in a filter condition in pandas

I have the data below stored in a dataframe, and I want to remove the rows where id equals finalid and there are multiple rows for that id.
example:
df_target
id finalid month year count_ph count_sh
1 1 1 2012 12 20
1 2 1 2012 6 18
1 32 1 2012 6 2
2 2 1 2012 2 6
2 23 1 2012 2 6
3 3 1 2012 2 2
output
id finalid month year count_ph count_sh
1 2 1 2012 6 18
1 32 1 2012 6 2
2 23 1 2012 2 6
3 3 1 2012 2 2
The functionality is something like: remove the records matching the condition below, and get the final dataframe:
(df_target.groupby(['id','month','year']).size() > 1) & (df_target['id'] == df_target['finalid'])
I think you need transform to get a Series the same length as the original DataFrame, and ~ to invert the final boolean mask:
df = df_target[~((df_target.groupby(['id','month','year'])['id'].transform('size') > 1) &
                 (df_target['id'] == df_target['finalid']))]
Alternative solution:
df = df_target[((df_target.groupby(['id','month','year'])['id'].transform('size') <= 1) |
                (df_target['id'] != df_target['finalid']))]
print (df)
id finalid month year count_ph count_sh
1 1 2 1 2012 6 18
2 1 32 1 2012 6 2
4 2 23 1 2012 2 6
5 3 3 1 2012 2 2
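To see why transform matters here (a sketch of mine, rebuilding the sample frame from the question): groupby(...).size() returns one row per group, while transform('size') broadcasts each group's size back onto every original row, so it can be combined element-wise with the id == finalid comparison:

import pandas as pd

df_target = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                          'finalid': [1, 2, 32, 2, 23, 3],
                          'month': [1, 1, 1, 1, 1, 1],
                          'year': [2012] * 6,
                          'count_ph': [12, 6, 6, 2, 2, 2],
                          'count_sh': [20, 18, 2, 6, 6, 2]})

# one value per original row (3, 3, 3, 2, 2, 1), aligned with df_target,
# unlike .size(), which returns one value per group
print (df_target.groupby(['id', 'month', 'year'])['id'].transform('size'))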
