Adding empty rows based on two columns in a Pandas DataFrame - python-3.x

I have a DataFrame with the following structure:
x y z
93 122 787.185547
93 123 847.964905
93 124 908.932190
93 125 1054.865845
93 126 1109.340576
x and y are coordinates, and I know their ranges. For example:
x_range=np.arange(90,130)
y_range=np.arange(100,130)
z is the measurement data.
Now I want to insert the missing points, with NaN as the z value,
so that it looks like:
x y z
90 100 NaN
90 101 NaN
...........................
93 121 NaN
93 122 787.185547
93 123 847.964905
93 124 908.932190
...........................
129 128 NaN
129 129 NaN
This can be done with a simple but clumsy for loop.
Is there a cleaner way to do it?

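I recommend using itertools.product followed by merge:
import itertools
import pandas as pd
# Build every (x, y) pair, then left-merge so missing points get NaN in z
df = pd.DataFrame(itertools.product(x_range, y_range), columns=['x', 'y']).merge(df, how='left')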
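To make the product-then-merge idea concrete, here is a minimal, self-contained sketch. It uses three of the rows from the question; the names grid and full are illustrative only and do not come from the original post.

```python
import itertools

import numpy as np
import pandas as pd

# A few of the measured points from the question
df = pd.DataFrame({
    "x": [93, 93, 93],
    "y": [122, 123, 124],
    "z": [787.185547, 847.964905, 908.932190],
})

x_range = np.arange(90, 130)   # 40 values
y_range = np.arange(100, 130)  # 30 values

# Every (x, y) pair on the grid, in row-major order
grid = pd.DataFrame(list(itertools.product(x_range, y_range)), columns=["x", "y"])

# A left merge keeps every grid point; points without a measurement get NaN in z
full = grid.merge(df, on=["x", "y"], how="left")

print(full.shape)
print(full[full["z"].notna()])
```

Passing on=["x", "y"] explicitly is a design choice here: it avoids accidentally merging on any other column names the two frames might share.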

Related

How does one use an assignment expression in a dictionary comprehension?

Suppose I have the below data frame:
df = pd.DataFrame([
[100,90,80,70,36,45],
[101,78,65,88,55,78],
[92,77,42,79,43,32],
[103,98,76,54,45,65]],
index = pd.date_range(start='2022-01-01', periods=4)
)
df.columns = pd.MultiIndex.from_tuples(
(("mkf", "Open"),
("mkf", "Close"),
("tdf", "Open"),
("tdf","Close"),
("ghi","Open"),
("ghi", "Close"))
)
And then I execute the following dictionary comprehension:
{c:df[c].assign(r=np.log(df[(c, 'Close')]).diff()) for c in df.columns.levels[0]}
{'ghi': Open Close r
2022-01-01 36 45 NaN
2022-01-02 55 78 0.550046
2022-01-03 43 32 -0.890973
2022-01-04 45 65 0.708651,
'mkf': Open Close r
2022-01-01 100 90 NaN
2022-01-02 101 78 -0.143101
2022-01-03 92 77 -0.012903
2022-01-04 103 98 0.241162,
'tdf': Open Close r
2022-01-01 80 70 NaN
2022-01-02 65 88 0.228842
2022-01-03 42 79 -0.107889
2022-01-04 76 54 -0.380464}
How would one produce the same result with an assignment expression (i.e. the symbol := )?
https://www.digitalocean.com/community/tutorials/how-to-use-assignment-expressions-in-python
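One possible way to do this, offered as a sketch rather than a definitive answer: bind the sub-frame to a name with := inside the comprehension value, then reuse that name in assign. The names sub and result are illustrative, and the column index is built with MultiIndex.from_product only for brevity (it yields the same six columns as the question). Assignment expressions require Python 3.8 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[100, 90, 80, 70, 36, 45],
     [101, 78, 65, 88, 55, 78],
     [92, 77, 42, 79, 43, 32],
     [103, 98, 76, 54, 45, 65]],
    index=pd.date_range(start="2022-01-01", periods=4),
)
df.columns = pd.MultiIndex.from_product([["mkf", "tdf", "ghi"], ["Open", "Close"]])

# (sub := df[c]) is evaluated before the arguments of .assign(),
# so sub is already bound when np.log(sub["Close"]) runs.
result = {
    c: (sub := df[c]).assign(r=np.log(sub["Close"]).diff())
    for c in df.columns.levels[0]
}

print(result["mkf"])
```

The assignment expression has to be parenthesized in this position. Whether it reads better than indexing df[(c, 'Close')] directly is a matter of taste.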

Fill NaN values from its Previous Value pandas

I have the data below from an Excel sheet, and I want every NaN to be filled from the value just before it, even when there are several NaNs in a row. I tried the ffill() method, but it doesn't do what I want: it seems to take the very first value before a NaN in the column and populate all NaNs with it.
Could someone help, please?
My DataFrame:
import pandas as pd
df = pd.read_excel("Example-sheat.xlsx",sheet_name='Sheet1')
#df = df.fillna(method='ffill')
#df = df['AuthenticZTed domaTT controller'].ffill()
print(df)
My DataFrame output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 NaN TTv1670
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 NaN xxgb-gen
8 ZTPGRKMIK1DC200.example.com TTva038
9 DU2RDCRDC1DC204.example.com TTv0071
10 NaN ttv0032
11 KT1MUC02DUDC201.example.com TTv0073
12 NaN TTv0679
13 TN1SZZ67DC200.example.com TTv1180
14 TT1NDZ45DC202.example.com TTv1181
15 TT1BLR01APDC200.example.com TTv0859
16 TN1SZZ67DC200.example.com xxg2089
17 NaN TTv1846
18 ZTPGRKMIK1DC200.example.com TTvtp064
19 PR1CPQ01DC200.example.com TTv0950
20 PR1CPQ01DC200.example.com TTc7005
21 NaN TTv0678
22 KT1MUC02DUDC201.example.com TTv257032798
23 PR1CPQ01DC200.example.com xxg2016
24 NaN TTv0313
25 TT1BLR01APDC200.example.com TTc4901
26 NaN TTv0710
27 DU2RDCRDC1DC204.example.com xxg3008
28 NaN TTv1080
29 PR1CPQ01DC200.example.com xxg2022
30 NaN xxg2057
31 NaN TTv1522
32 TN1SZZ67DC200.example.com TTv258998881
33 PR1CPQ01DC200.example.com TTv259064418
34 ZTPGRKMIK1DC200.example.com TTv259129955
35 TT1BLR01APDC200.example.com xxg2034
36 NaN TTv259326564
37 TNHSZPBCD2DC200.example.com TTv259129952
38 KT1MUC02DUDC201.example.com TTv259195489
39 ZTPGRKMIK1DC200.example.com TTv0683
40 ZTPGRKMIK1DC200.example.com TTv0885
41 TT1BLR01APDC200.example.com dbexh
42 NaN TTvtp065
43 TN1PEK01APDC200.example.com TTvtp057
44 ZTPGRKMIK1DC200.example.com TTvtp007
45 NaN TTvtp063
46 TT1BLR01APDC200.example.com TTvtp032
47 KTphbgsa11dc201.example.com TTvtp046
48 NaN TTvtp062
49 PR1CPQ01DC200.example.com TTv0235
50 NaN TTv0485
51 TT1NDZ45DC202.example.com TTv0236
52 NaN TTv0486
53 PR1CPQ01DC200.example.com TTv0237
54 NaN TTv0487
55 TT1BLR01APDC200.example.com TTv0516
56 TN1CQI02DC200.example.com TTv1285
57 TN1PEK01APDC200.example.com TTv0440
58 NaN liv9007
59 HR1GDL28DC200.example.com TTv0445
60 NaN tuv006
61 FTGFTPTP34DC203.example.com TTv0477
62 NaN tuv002
63 TN1CQI02DC200.example.com TTv0534
64 TN1SZZ67DC200.example.com TTv0639
65 NaN TTv0825
66 NaN TTv1856
67 TT1BLR01APDC200.example.com TTva101
68 TN1SZZ67DC200.example.com TTv1306
69 KTphbgsa11dc201.example.com TTv1072
70 NaN webx02
71 KT1MUC02DUDC201.example.com TTv1310
72 PR1CPQ01DC200.example.com TTv1151
73 TN1CQI02DC200.example.com TTv1165
74 NaN tuv90
75 TN1SZZ67DC200.example.com TTv1065
76 KTphbgsa11dc201.example.com TTv1737
77 NaN ramn01
78 HR1GDL28DC200.example.com ramn02
79 NaN ptb001
80 HR1GDL28DC200.example.com ptn002
81 NaN ptn003
82 TN1SZZ67DC200.example.com TTs0057
83 PR1CPQ01DC200.example.com TTs0058
84 NaN TTs0058-duplicZTe
85 PR1CPQ01DC200.example.com xxg2080
86 KTphbgsa11dc204.example.com xxg2081
87 TN1PEK01APDC200.example.com xxg2082
88 NaN xxg3002
89 TN1SZZ67DC200.example.com xxg2084
90 NaN xxg3005
91 ZTPGRKMIK1DC200.example.com xxg2086
92 NaN xxg3007
93 KT1MUC02DUDC201.example.com xxg2098
94 NaN xxg3014
95 TN1PEK01APDC200.example.com xxg2026
96 NaN xxg2094
97 TN1PEK01APDC200.example.com livtp005
98 KT1MUC02DUDC201.example.com xxg2059
99 ZTPGRKMIK1DC200.example.com acc9102
100 NaN xxg2111
101 TN1CQI02DC200.example.com xxgtp009
Desired Output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 TT1NDZ45DC202.example.com TTv1670 <---
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 DU2RDCRDC1DC204.example.com xxgb-gen <---
1- You are already close to a solution; using shift() together with ffill() should work. Note that this shifts every column down by one row, so prefer option 2 if the rows must stay aligned.
df = df.apply(lambda x: x.fillna(df['AuthenticZTed domaTT controller']).shift()).ffill()
2- As Quang suggested in the comments, this also works:
df['AuthenticZTed domaTT controller'] = df['AuthenticZTed domaTT controller'].ffill()
3- Or you can try the following:
df = df.fillna({var: df['AuthenticZTed domaTT controller'].shift() for var in df}).ffill()
4- Alternatively, if you have multiple columns, define a cols list and loop through it:
cols = ['AuthenticZTed domaTT controller', 'source KTvice naHR']
for col in cols:
    df[col] = df[col].ffill()
print(df)
or, without a loop:
df.loc[:, cols] = df.loc[:, cols].ffill()
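As a quick check of what a plain column-wise ffill() does, here is a small sketch on made-up data. The column names controller and source and the values are hypothetical stand-ins for the real sheet. Each NaN takes the nearest non-missing value above it, including runs of consecutive NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data
df = pd.DataFrame({
    "controller": ["dc1", "dc2", np.nan, "dc3", np.nan, np.nan, "dc4"],
    "source": list("abcdefg"),
})

# Forward-fill only the column that has gaps; other columns are untouched
df["controller"] = df["controller"].ffill()

print(df)
```

If ffill() appears to have no effect on real data, one possible cause (an assumption, not something confirmed in the question) is that the cells hold the literal string "NaN" or whitespace rather than real missing values. In that case they would need to be converted to real missing values first.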

Python fill up blank space from web extraction text with NaN

I have extracted some text from the web and saved it using numpy with a string format (fmt="%s").
The data was transferred successfully and is readable, as follows:
250.0 1000 39.9 45.9 53 60 210 16
250.0 1000 39.9 45.9 53 60 210 16
250.0 1020 40.7 70 200 10
250.0 1010 40.1 95 175 9
250.0 1010 39.9 43.7 67 150 120 16
250.0 1000 39.5 49.5 34 80 190 15
The data contains 2 blank fields at rows 3 and 4, which I believe are values missing from the web source. I tried to read the file (sample-250.dat) using numpy's loadtxt:
data5 = np.loadtxt(path1+"sample-250.dat",dtype=object)
PRES=data5[:,0]
HIGHT=data5[:,1]
TEMP=data5[:,2]
DWPT=data5[:,3]
RELH=data5[:,4]
DRCT=data5[:,5]
md=data5[:,6]
SKNT=data5[:,7]
Sadly, this raises an error: ValueError: Wrong number of columns at line 3.
Does anyone have ideas on how to read such data, ideally replacing the blank fields with NaN values?
Thanks
How about using pandas instead of numpy?
import pandas as pd
import numpy as np
data5 = pd.read_table(path1+"sample-250.dat", header = None).values
data5 is what you need.
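If the file keeps its fixed-width column alignment (an assumption, since the sample above has lost its spacing), pandas.read_fwf is another option: blank fields come back as NaN. The sketch below builds a small hypothetical file in memory. The column widths and the position of the missing fields are made up for illustration and would have to match the real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical column widths; adjust to the real layout of sample-250.dat
widths = [5, 5, 5, 5, 4, 4, 4, 3]

# Hypothetical rows; None marks a field that is blank in the file
rows = [
    [250.0, 1000, 39.9, 45.9, 53, 60, 210, 16],
    [250.0, 1020, 40.7, None, None, 70, 200, 10],
    [250.0, 1000, 39.5, 49.5, 34, 80, 190, 15],
]

# Render each value right-aligned in its column, leaving blanks for missing fields
text = "\n".join(
    "".join(("" if v is None else str(v)).rjust(w) for v, w in zip(row, widths))
    for row in rows
)

# read_fwf slices each line by position, so blank fields become NaN
data5 = pd.read_fwf(io.StringIO(text), widths=widths, header=None).to_numpy()

print(data5)
```

For a real file, pass its path instead of the StringIO object. The column slicing such as data5[:, 0] from the question then works unchanged.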

create lag features based on multiple columns

I have a time series dataset and need to extract lag features. I am using the code below, but I get all NaNs:
df.groupby(['week','id1','id2','id3'],as_index=False)['value'].shift(1)
input
week,id1,id2,id3,value
1,101,123,001,45
1,102,231,004,89
1,203,435,099,65
2,101,123,001,48
2,102,231,004,75
2,203,435,099,90
output
week,id1,id2,id3,value,t-1
1,101,123,001,45,NAN
1,102,231,004,89,NAN
1,203,435,099,65,NAN
2,101,123,001,48,45
2,102,231,004,75,89
2,203,435,099,90,65
You want to shift to the next week so remove 'week' from the grouping:
df['t-1'] = df.groupby(['id1','id2','id3'],as_index=False)['value'].shift()
# week id1 id2 id3 value t-1
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
That approach is error-prone when weeks are missing. In that case we can instead merge after incrementing the week, which guarantees the lagged value comes from the prior week regardless of gaps.
df2 = df.assign(week=df.week+1).rename(columns={'value': 't-1'})
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
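To illustrate the difference when a week is missing, here is a small sketch with hypothetical data in which week 2 is absent. The names keys, shifted, prior and merged are illustrative.

```python
import pandas as pd

# Hypothetical data: one id combination, weeks 1 and 3 only (week 2 is missing)
df = pd.DataFrame({
    "week": [1, 3],
    "id1": [101, 101],
    "id2": [123, 123],
    "id3": [1, 1],
    "value": [45, 48],
})
keys = ["id1", "id2", "id3"]

# groupby + shift takes the previous *row* in the group, whatever its week
shifted = df.groupby(keys)["value"].shift()
print(shifted.tolist())  # week 3 picks up the week 1 value

# Merging on week + 1 only matches a row from exactly one week earlier
prior = df.assign(week=df["week"] + 1).rename(columns={"value": "t-1"})
merged = df.merge(prior, on=["week"] + keys, how="left")
print(merged)  # t-1 is NaN for week 3 because week 2 does not exist
```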
Another way to bring in and rename many columns is to use the suffixes argument of merge. It renames all overlapping non-key columns; here only the right DataFrame's columns get a suffix.
df.merge(df.assign(week=df.week+1),  # Manually lag
         on=['week', 'id1', 'id2', 'id3'],
         how='left',
         suffixes=['', '_lagged']  # Right df columns -> _lagged
         )
# week id1 id2 id3 value value_lagged
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0

Generating all the combinations of 7 columns in a dataframe and add the corresponding rows to generate new columns

I have a dataframe that looks similar to below:
Wave A B C
340 77 70 15
341 80 73 15
342 83 76 16
343 86 78 17
I want to generate columns that will have all the possible combinations of the existing columns. I showed 3 cols here but in my actual data, I have 7 columns and therefore 127 total combinations. The desired output is as follows:
Wave A B C AB AC AD BC ... ABC
340 77 70 15 147 92 ...
341 80 73 15 153 95 ...
342 83 76 16 159 99 ...
I implemented a quite inefficient version where the user inputs the combinations (AB, AC, etc.) and a new col is created with the sum of the rows. This seems almost impossible to accomplish for 127 combinations, esp with descriptive col names.
Create a list of all combinations with chain + combinations from itertools, then sum the appropriate columns:
from itertools import combinations, chain
cols = [*df.iloc[:, 1:]]  # every column except 'Wave'
# All combinations of 2 or more columns
l = list(chain.from_iterable(combinations(cols, n) for n in range(2, len(cols) + 1)))
#[('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]
for items in l:
    df[''.join(items)] = df[list(items)].sum(axis=1)
Wave A B C AB AC BC ABC
0 340 77 70 15 147 92 85 162
1 341 80 73 15 153 95 88 168
2 342 83 76 16 159 99 92 175
3 343 86 78 17 164 103 95 181
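For the 7-column case from the question, the same idea can be sketched as follows. The column names A to G and the values are hypothetical. Building all the sums first and attaching them with a single pd.concat is a variation on the loop above, not the answer's exact method. It avoids inserting 120 columns one at a time.

```python
from itertools import chain, combinations

import numpy as np
import pandas as pd

# Hypothetical 7-column frame standing in for the real data
names = list("ABCDEFG")
df = pd.DataFrame(np.arange(14).reshape(2, 7), columns=names)
df.insert(0, "Wave", [340, 341])

# Every combination of 2..7 columns: 127 non-empty subsets minus the 7 singles = 120
combos = list(chain.from_iterable(
    combinations(names, k) for k in range(2, len(names) + 1)
))

# One Series of row sums per combination, keyed by the joined column names
sums = pd.concat(
    {"".join(c): df[list(c)].sum(axis=1) for c in combos},
    axis=1,
)

out = pd.concat([df, sums], axis=1)
print(len(combos), out.shape)
```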
First build all the combinations, then create a mapping (a dict or Series) from each original column to the combined names it contributes to:
import itertools
l=df.columns[1:].tolist()
l1=[list(map(list, itertools.combinations(l, i))) for i in range(len(l) + 1)]
d=[dict.fromkeys(y,''.join(y))for x in l1 for y in x ]
maps=pd.Series(d).apply(pd.Series).stack()
df.set_index('Wave',inplace=True)
df=df.reindex(columns=maps.index.get_level_values(1))
# reindex repeats each original column once per combination it belongs to, in the order of maps
df.columns=maps.tolist()
# relabel the repeated columns with their combination names; this works because the order matches maps
df.T.groupby(level=0, sort=False).sum().T
# sums the columns that share a name (older pandas: df.sum(level=0, axis=1); the level argument was removed in pandas 2.0)
Out[303]:
A B C AB AC BC ABC
Wave
340 77 70 15 147 92 85 162
341 80 73 15 153 95 88 168
342 83 76 16 159 99 92 175
343 86 78 17 164 103 95 181
