Adding attribute "keys" to concatenated dataframes - python-3.x

I am concatenating two dataframes along axis=1 (columns) and am trying to use "keys" so that I can later distinguish between same-named columns from the two dataframes.
df1 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'dd': ['z1', 'z2', 'z3', 'z4']},
                   index=['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'])
df2 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'rf': ['z1', 'z2', 'z3', 'z4']},
                   index=['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'])
data = pd.concat([df1, df2], keys=['snow', 'wind'], axis=1, ignore_index=True)
However, when trying to print all the columns belonging to one of the keys, as suggested by @YashTD in Pandas add keys while concatenating dataframes at column level:
print(data.snow.tl)
I get the following error message:
AttributeError: 'DataFrame' object has no attribute 'snow'
I think the keys are just not being added to the dataframe, but I don't know why. They also don't show up when printing the dataframe head(), as they should according to Pandas add keys while concatenating dataframes at column level.
Do you know how to add the keys to the dataframe?

First remove the parameter ignore_index=True (it discards the MultiIndex that keys would create in the columns), then select by tuple:
data = pd.concat([df1, df2],keys=['snow','wind'], axis=1)
print (data)
           snow         wind
             tl  ff  dd   tl  ff  rf
2016-01-01   x1  y1  z1   x1  y1  z1
2016-01-02   x2  y2  z2   x2  y2  z2
2016-01-03   x3  y3  z3   x3  y3  z3
2016-01-04   x4  y4  z4   x4  y4  z4
print (data[('snow','tl')])
2016-01-01 x1
2016-01-02 x2
2016-01-03 x3
2016-01-04 x4
Name: (snow, tl), dtype: object
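For completeness, a runnable sketch of the fix on the toy frames from the question (with the index dates quoted as strings so they stay labels rather than being evaluated as arithmetic):

```python
import pandas as pd

idx = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04']
df1 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'dd': ['z1', 'z2', 'z3', 'z4']}, index=idx)
df2 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'rf': ['z1', 'z2', 'z3', 'z4']}, index=idx)

# Without ignore_index=True, the keys become the first level
# of a MultiIndex in the columns
data = pd.concat([df1, df2], keys=['snow', 'wind'], axis=1)

# Tuple selection and attribute access both work now
print(data[('snow', 'tl')].tolist())   # ['x1', 'x2', 'x3', 'x4']
print(data['snow'].columns.tolist())   # ['tl', 'ff', 'dd']
```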

Related

Rolling window with pandas

I want to separate a dataset in the following fashion:
import pandas as pd
import numpy as np
df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
sepal_length = df["sepal_length"]
sepal_length
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
...
145 6.7
146 6.3
147 6.5
148 6.2
149 5.9
Name: sepal_length, Length: 150, dtype: float64
I would like to create another dataset to predict those values based on, say, the 10 previous observations (suppose this dataset is ordered and date-dependent).
So for my predictors, I would like another dataset holding the 10 previous values for each index, that is:
10 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
11 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
...
where $ x_i $ is the sepal length at the i-th index.
This does what you want:
for i in range(1, 11):
    df[f'feature_{i}'] = df['sepal_length'].shift(i)
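To sketch how the loop builds a supervised dataset, here is a self-contained version on a small synthetic series instead of the iris download; the dropna step is my addition (the original answer leaves the incomplete-history rows in):

```python
import pandas as pd

# Toy stand-in for the sepal_length series: any ordered numeric series works
s = pd.Series(range(20), name='sepal_length', dtype=float)
df = s.to_frame()

# Each feature_i column holds the value observed i steps earlier
for i in range(1, 11):
    df[f'feature_{i}'] = df['sepal_length'].shift(i)

# The first 10 rows have incomplete history, so drop them before modelling
df = df.dropna()
print(df.shape)  # (10, 11): 10 usable rows, target plus 10 lag features
```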

get distinct columns dataframe

Hello, how can I keep only the rows where val differs between the two dataframes?
Notice that a row can have id1 or id2 or both, as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)
First merge on all columns using the left_on and right_on parameters, then drop the rows matched in both frames and remove the missing values by reshaping with stack and unstack:
df = pd.merge(F1, F2, left_on=['id1', 'id2', 'VAL1'],
              right_on=['id1', 'id2', 'VAL2'], how='outer', indicator=True)
df = (df[df['_merge'] != 'both']
        .set_index(['id1', 'id2'])
        .drop(columns='_merge')
        .stack()
        .unstack()
        .reset_index())
print (df)
   id1 id2  VAL1  VAL2
0  X02  Y4     3     1
1  X22  Y1     1     3
F1.merge(F2, how='left', on=['id1', 'id2']).query("VAL1 != VAL2")
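Putting the first approach together as a runnable sketch, pairing VAL1 (which lives in F1) against VAL2 (in F2) and using drop(columns=...) instead of the deprecated positional axis. Note that on recent pandas versions the rows whose ids contain NaN may also survive the reshape, so only the two fully-keyed rows are guaranteed:

```python
import numpy as np
import pandas as pd

F1 = pd.DataFrame({'id1': ['X22', 'X13', np.nan, 'X02', 'X14'],
                   'id2': ['Y1', 'Y2', 'Y3', 'Y4', np.nan],
                   'VAL1': [1, 0, 2, 3, 0]})
F2 = pd.DataFrame({'id1': ['X02', 'X13', np.nan, 'X22', 'X14'],
                   'id2': ['Y4', 'Y2', 'Y3', 'Y1', 'Y22'],
                   'VAL2': [1, 0, 4, 3, 1]})

# Outer merge matching each F1 value against the F2 value for the same ids;
# rows where the values agree come back tagged 'both' by indicator=True
df = pd.merge(F1, F2, left_on=['id1', 'id2', 'VAL1'],
              right_on=['id1', 'id2', 'VAL2'],
              how='outer', indicator=True)

# Keep only the disagreeing rows, then collapse the VAL1-only and VAL2-only
# halves of each (id1, id2) pair back into a single row via stack/unstack
df = (df[df['_merge'] != 'both']
        .set_index(['id1', 'id2'])
        .drop(columns='_merge')
        .stack()
        .unstack()
        .reset_index())
print(df)
```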

how to eliminate 3 letter words or 4 letter words from a column of a dataframe

I have a dataframe as below:
import pandas as pd
import dask.dataframe as dd
a = {'b': ['category', 'categorical', 'cater pillar', 'coming and going',
           'bat', 'No Data', 'calling', 'cal'],
     'c': ['strd1', 'strd2', 'strd3', 'strd4', 'strd5', 'strd6', 'strd7', 'strd8']}
df11 = pd.DataFrame(a, index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
I want to remove (replace with NaN) the values whose length is exactly three.
I expect results to be like:
b c
category strd1
categorical strd2
cater pillar strd3
coming and going strd4
NaN strd5
No Data strd6
calling strd7
NaN strd8
Use Series.str.len() to get the length of each string in the series and compare with Series.eq(); then with df.loc[] you can assign np.nan to b where the condition matches:
df11.loc[df11.b.str.len().eq(3),'b']=np.nan
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
Use str.len to get the length of each string and then conditionally replace values with NaN using np.where if the length is equal to 3:
df11['b'] = np.where(df11['b'].str.len().eq(3), np.NaN, df11['b'])
b c
0 category strd1
1 categorical strd2
2 cater pillar strd3
3 coming and going strd4
4 NaN strd5
5 No Data strd6
6 calling strd7
7 NaN strd8
Maybe check mask:
df11['b'] = df11['b'].mask(df11['b'].str.len() <= 3)
df11
Out[16]:
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
You could use a where conditional:
df11['b'] = df11['b'].where(df11.b.map(len) != 3, np.nan)
Something like:
for i, ele in enumerate(df11['b']):
    if len(ele) == 3:
        df11.loc[df11.index[i], 'b'] = np.nan
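A self-contained sketch of the str.len/.loc approach on the frame from the question (the dask import in the question isn't needed for this part):

```python
import numpy as np
import pandas as pd

df11 = pd.DataFrame({'b': ['category', 'categorical', 'cater pillar',
                           'coming and going', 'bat', 'No Data', 'calling', 'cal'],
                     'c': ['strd1', 'strd2', 'strd3', 'strd4',
                           'strd5', 'strd6', 'strd7', 'strd8']},
                    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])

# str.len() gives the length of every string; .eq(3) flags exactly the
# three-letter values, which .loc then overwrites with NaN in place
df11.loc[df11['b'].str.len().eq(3), 'b'] = np.nan
print(df11['b'].isna().sum())  # 2 ('bat' and 'cal')
```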

Replace column values based on partial string match from another dataframe python pandas

I need to update some cell values, based on keys from a different dataframe. The keys are always unique strings, but the second dataframe may or may not contain some extra text at the beginning or at the end of the key. (not necessarily separated by " ")
Frame:
Keys Values
x1 1
x2 0
x3 0
x4 0
x5 1
Correction:
Name Values
SS x1 1
x2 AA 1
x4 1
Expected output Frame:
Keys Values
x1 1
x2 1
x3 0
x4 1
x5 1
I am using the following:
frame.loc[frame['Keys'].isin(correction['Keys']), ['Values']] = correction['Values']
The problem is that isin returns True only on an exact match (as far as I know), which works for only about 30% of my data.
First extract the keys from Correction['Name'], using a pattern of all Frame['Keys'] values joined by | (regex OR):
pat = '|'.join(x for x in Frame['Keys'])
Correction['Name'] = Correction['Name'].str.extract('('+ pat + ')', expand=False)
#remove non matched rows filled by NaNs
Correction = Correction.dropna(subset=['Name'])
print (Correction)
Name Values
0 x1 1
1 x2 1
2 x4 1
Then create a dictionary from Correction and map it onto Frame['Keys'] with map:
d = dict(zip(Correction['Name'], Correction['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print (Frame)
Keys Values
0 x1 1
1 x2 1
2 x3 0
3 x4 1
4 x5 1
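The whole answer can be sketched end to end on toy frames reconstructed from the tables above (if the keys could contain regex metacharacters, each one would need re.escape before joining):

```python
import pandas as pd

Frame = pd.DataFrame({'Keys': ['x1', 'x2', 'x3', 'x4', 'x5'],
                      'Values': [1, 0, 0, 0, 1]})
Correction = pd.DataFrame({'Name': ['SS x1', 'x2 AA', 'x4'],
                           'Values': [1, 1, 1]})

# Build an alternation pattern from the known keys and pull the first
# matching key out of each Name; non-matching rows become NaN and are dropped
pat = '|'.join(Frame['Keys'])
Correction['Name'] = Correction['Name'].str.extract('(' + pat + ')', expand=False)
Correction = Correction.dropna(subset=['Name'])

# Map the corrected values onto Frame, keeping the old values elsewhere
d = dict(zip(Correction['Name'], Correction['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print(Frame['Values'].tolist())  # [1, 1, 0, 1, 1]
```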

Cross table in Spotfire

*UPDATE based on ksp's answer (thank you very much for that, it was almost what I was looking for.)
Can somebody help me with the following problem.
Given the data table:
Key Rec Period DOW Category Value
Key1 Rec1 Period1 dow1 KPIa x1
Key1 Rec2 Period1 dow1 KPIb z1
Key1 Rec3 Period2 dow1 KPIa y1
Key2 Rec4 Period1 dow1 KPIa x1
Key2 Rec5 Period1 dow1 KPIb z1
Key2 Rec6 Period2 dow1 KPIa y1
Key1 Rec7 Period1 dow2 KPIa x2
Key1 Rec8 Period1 dow2 KPIb z2
Key1 Rec9 Period2 dow2 KPIa y2
Key2 Rec10 Period1 dow2 KPIa x2
Key2 Rec11 Period1 dow2 KPIb z2
Key2 Rec12 Period2 dow2 KPIa y2
Key1 Rec13 Period1 dow1 Delta d1
Key1 Rec14 Period1 dow2 Delta d2
Key2 Rec15 Period1 dow1 Delta d3
Key2 Rec16 Period1 dow2 Delta d4
In Spotfire, it is possible to create the following cross table:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1 Period1
dow1 dow2 dow1 dow2 dow1 dow2 dow1 dow2
Key1 x1 x2 y1 y2 z1 z2 d1 d2
Key2 x1 y1 y2 z1 z2 d3 d4
Now there is something I would want to change in this cross table but I can’t manage to figure out how:
Delta is a column which is only valid for Period1. Is it possible to apply the extra Period and DOW level only to certain columns of the cross table?
So what I want is:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1
dow1 dow2 dow1 dow2 dow1 dow2
Key1 x1 x2 y1 y2 z1 z2 (d1 + d2) / 2
Key2 x1 y1 y2 z1 z2 (d3 + d4) / 2
And when the dow2 is filtered out:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1
dow1 dow1 dow1
Key1 x1 y1 z1 d1
Key2 x1 y1 z1 d3
Thanks in advance.
@user6076025 - Please check this solution and let me know if it helps.
I have considered X as 1, Y as 2 and Z as 3 for computation purpose.
I have unpivoted your data which is there in the first screenshot of your post and then created a cross table from the unpivoted data.
Attached are the screenshots for your reference.
@user6076025 - I have assigned values to the 'dummy value' column in your table for computation purposes and added a calculated column 'new delta', which averages d1, d2 and d3, d4.
Here is the formula:
Now, I have created a cross table from this data. Below are the screenshots of the table and cross table.
Please let me know if this helps.
Regarding the Dow issue, I would place a Drop-down list in a text area with Fixed Value options
Display Name : 'Include Dow2' Value: 0
Display Name : 'Exclude Dow2' Value: 1
Which would have a script for on-change that would do the following:
if Document.Properties["udDowChoice"] == '0':
    Document.Properties["PivotString"] = '<[Category] NEST [Period] NEST [DOW]>'
else:
    Document.Properties["PivotString"] = '<[Category] NEST [Period]>'
Then in the Custom Expression for the Horizontal Axes, you make it equal ${PivotString}
And Limit Data Using Expression
If(${udDowChoice} = 0, 1=1, [DOW] <> 'dow2')
To avoid potential confusion from the users, I also recommend hiding the DOW filter from the Filtering Scheme.
