How to join a series into a dataframe - python-3.x

I counted the frequency of each value in the column 'address' of the dataframe 'df_two' and saved the result as a dict, then used that dict to create a series 'new_series'. Now I want to join this series to the dataframe to make 'df_three', so that I can do some maths with the column 'new_count' from 'new_series' and the column 'number' from 'df_two'.
I have tried merge and concat, but the values of 'new_count' were changed to NaN.
What I got (NaN):
df_three
number address name new_Count
14 12 ab pra NaN
49 03 cd ken NaN
97 07 ef dhi NaN
91 10 fg rav NaN
Input:
new_series
new_count
12 ab 8778
03 cd 6499
07 ef 5923
10 fg 5631
df_two
number address name
14 12 ab pra
49 03 cd ken
97 07 ef dhi
91 10 fg rav
Expected output:
df_three
number address name new_Count
14 12 ab pra 8778
49 03 cd ken 6499
97 07 ef dhi 5923
91 10 fg rav 5631

It seems you forgot the parameter on:
df = df_two.join(new_series, on='address')
print(df)
   number address name  new_count
0      14   12 ab  pra       8778
1      49   03 cd  ken       6499
2      97   07 ef  dhi       5923
3      91   10 fg  rav       5631
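As a self-contained check of the join above, here is a minimal sketch that rebuilds the sample data from the question. The literal construction of df_two and new_series, and the name df_alt, are assumptions made for illustration. It also shows Series.map as an equivalent lookup keyed on 'address'.

```python
import pandas as pd

# Sample data rebuilt from the question (construction is an assumption)
df_two = pd.DataFrame({
    "number": [14, 49, 97, 91],
    "address": ["12 ab", "03 cd", "07 ef", "10 fg"],
    "name": ["pra", "ken", "dhi", "rav"],
})
new_series = pd.Series(
    [8778, 6499, 5923, 5631],
    index=["12 ab", "03 cd", "07 ef", "10 fg"],
    name="new_count",
)

# Match the values in df_two['address'] against the index of new_series
df_three = df_two.join(new_series, on="address")

# Equivalent lookup: map each address to its count (df_alt is a hypothetical name)
df_alt = df_two.assign(new_count=df_two["address"].map(new_series))

print(df_three)
```

Both variants align on the address values rather than on the row labels 0..3, which the series index does not contain.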

Related

Data frame transformation using transposing and flattening

I have a data frame that looks like:
tdelta A B label
1 11 21 Lab1
2 24 45 Lab2
3 44 65 Lab3
4 77 22 Lab4
5 12 64 Lab5
6 39 09 Lab6
7 85 11 Lab7
8 01 45 Lab8
And I need to transform this dataset into:
For selected window: 4
A1 A2 A3 A4 B1 B2 B3 B4 L1 label
11 24 44 77 21 45 65 22 Lab1 Lab4
12 39 85 01 64 09 11 45 Lab5 Lab8
So based on the selected window 'w', I need to transpose w rows, with the first corresponding label as my X values and the corresponding last label as my Y value. Here is what I have developed so far:
def data_process(data, window):
    n = len(data)
    A = pd.DataFrame(data['A'])
    B = pd.DataFrame(data['B'])
    lb = pd.DataFrame(data['label'])
    # Only rows 0..window-1 are used here, hence the limitation described below
    df_A = pd.concat([A.loc[i] for i in range(0, window)], axis=1).reset_index()
    df_B = pd.concat([B.loc[i] for i in range(0, window)], axis=1).reset_index()
    df_lb = pd.concat([lb.loc[0]], axis=1).reset_index()
    X = pd.concat([df_A, df_B, df_lb], axis=1)
    Y = pd.DataFrame(data['label']).shift(-window)
    return X, Y
I think this works for only the first 'window' rows. I need it to work for my entire dataframe.
This is essentially a pivot, with a lot of cleaning up after the pivot. For the pivot to work we need to use integer and modulus division so that we can group the rows into windows of length w and figure out which column they then belong to.
import numpy as np

# Number of rows to group together
w = 4
# Position within each window (1..w) and window number (0, 1, ...)
df['col'] = np.arange(len(df)) % w + 1
df['i'] = np.arange(len(df)) // w
# Reshape and flatten the MultiIndex
df = (df.drop(columns='tdelta')
        .pivot(index='i', columns='col')
        .rename_axis(index=None))
df.columns = [f'{x}{y}' for x, y in df.columns]
# Define these columns and remove the intermediate label columns.
df['L1'] = df['label1']
df['label'] = df[f'label{w}']
df = df.drop(columns=[f'label{i}' for i in range(1, w+1)])
print(df)
A1 A2 A3 A4 B1 B2 B3 B4 L1 label
0 11 24 44 77 21 45 65 22 Lab1 Lab4
1 12 39 85 1 64 9 11 45 Lab5 Lab8
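To make the grouping step in this answer concrete, the small sketch below shows what the modulus and integer division produce for 8 rows and w = 4. The variable names pos, col and win are assumptions chosen for illustration.

```python
import numpy as np

w = 4                      # window length, as in the answer
pos = np.arange(8)         # row positions 0..7 for the 8 sample rows

col = pos % w + 1          # position inside the window: 1, 2, 3, 4, 1, 2, 3, 4
win = pos // w             # window number: 0, 0, 0, 0, 1, 1, 1, 1

print(col.tolist())
print(win.tolist())
```

Rows that share a window number end up in the same output row after the pivot, and the position inside the window becomes the numeric suffix of the column name (A1..A4, B1..B4).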

Modelling a moving window with a shift( ) function in python problem

Problem: Let's suppose that we supply robots to a factory. Each of these robots is programmed to switch into the work mode after 3 days (e.g. if it arrives on day 1, it starts working on day 3), and then they work for 5 days. After that, the battery runs out and they stop working. The number of robots supplied each day varies.
The following code sets up the supplies for the first 15 days:
import pandas as pd
df = pd.DataFrame({
    'date': ['01', '02', '03', '04', '05', '06', '07', '08',
             '09', '10', '11', '12', '13', '14', '15'],
    'value': [10, 20, 20, 30, 20, 10, 30, 20, 10, 20, 30, 40, 20, 20, 20]
})
df.set_index('date', inplace=True)
df
Let's now estimate the number of working robots on each of these days (we move two days back and sum up only the numbers within the past 5 days):
04 10
05 20+10 = 30
06 20+20 = 40
07 30+20 = 50
08 20+30 = 50
09 10+20 = 30
10 30+10 = 40
11 20+30 = 50
12 10+20 = 30
13 20+10 = 30
14 30+20 = 50
15 40+30 = 70
Is it possible to model this in Python? I have tried this, which is close but not quite right:
df_p = (((df.rolling(2)).sum())).shift(5).rolling(1).mean().shift(-3)
P.S. If you don't think it's complicated enough, I also need to include the last 7-day average for each of these numbers for my real problem.
Let's first shift forward by the window (5) less the rolling window length (2), then take a rolling sum with min_periods set to 1:
shift_window = 5
rolling_window = 2
df['new_col'] = (
    df['value'].shift(shift_window - rolling_window)
    .rolling(rolling_window, min_periods=1).sum()
)
Or with hard coded values:
df['new_col'] = df['value'].shift(3).rolling(2, min_periods=1).sum()
df:
value new_col
date
01 10 NaN
02 20 NaN
03 20 NaN
04 30 10.0
05 20 30.0
06 10 40.0
07 30 50.0
08 20 50.0
09 10 30.0
10 20 40.0
11 30 50.0
12 40 30.0
13 20 30.0
14 20 50.0
15 20 70.0
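The following sketch checks the shift-then-roll result against the hand-computed numbers in the question, and adds one possible reading of the "last 7-day average" from the P.S. Treating it as a trailing 7-day mean of the working counts is an assumption, as are the names working, explicit and avg7.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [f"{d:02d}" for d in range(1, 16)],
    "value": [10, 20, 20, 30, 20, 10, 30, 20, 10, 20, 30, 40, 20, 20, 20],
}).set_index("date")

# Answer's approach: shift by 3 days, then sum a 2-day rolling window
working = df["value"].shift(3).rolling(2, min_periods=1).sum()

# The same quantity written out: supply from 3 days ago plus supply from 4 days ago
explicit = df["value"].shift(3).add(df["value"].shift(4), fill_value=0)

# Assumed meaning of the P.S.: trailing 7-day mean of the working counts
avg7 = working.rolling(7).mean()

print(pd.DataFrame({"working": working, "explicit": explicit, "avg7": avg7}))
```

With a full 7-day window required, avg7 is NaN until day 10, the first day with seven computed working counts behind it.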

Multiple nested if conditions inside lambda

I have a column called zipcode in a pandas data frame. Some of the rows contain NaN values, some contain the correct string format like '160 00', and the rest contain the wrong format like '18000'. What I want is to skip NaN values (not drop them) and convert wrong zipcodes into correct ones; for example: '18000' -> '180 00'.
Is it possible to do that by applying a lambda? All I have so far is this:
df['zipcode'].apply(lambda row: print(row[:3] + ' ' + row[3:]) if type(row) == str else row)
Sample of dataframe:
df = pd.DataFrame(np.array(['11100', '246 00', '356 50',
np.nan, '18000', '156 00', '163 00']), columns=['zipcode'])
zipcode
0 11100
1 246 00
2 356 50
3 nan
4 18000
5 156 00
6 163 00
Thank you.
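Let us try .str.replace (regex=True must be passed explicitly in pandas 2.0 and later, where the pattern is otherwise treated as a literal string):
df['zipcode'] = df['zipcode'].str.replace(r'(\d{3})\s*(\d+)', r'\g<1> \g<2>', regex=True)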
zipcode
0 111 00
1 246 00
2 356 50
3 nan
4 180 00
5 156 00
6 163 00
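Since the question asks specifically about apply, here is a minimal sketch of that route. The helper name fix_zip and the new column name fixed are assumptions for illustration, and the sample frame is built from a plain list so that the missing entry stays a real NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zipcode": ["11100", "246 00", "356 50", np.nan, "18000", "156 00", "163 00"]
})

def fix_zip(z):
    """Return the zipcode as 'ddd dd'; leave non-strings (NaN) untouched."""
    if not isinstance(z, str):
        return z
    digits = z.replace(" ", "")          # drop any existing space
    return digits[:3] + " " + digits[3:]

# Return the value from the function; printing inside it would yield None
df["fixed"] = df["zipcode"].apply(fix_zip)
print(df)
```

The same logic fits in a lambda with a conditional expression, but a named function is easier to read once there is more than one branch.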

How to convert repeated rows data to columns in python?

Hi, I have a data frame df1 where the same set of values in the name column repeats (every 4 rows in the sample below). I need to convert each repeated group into a single row.
This is how df looks
name marks
john 63
mark 45
pieter 32
beth 02
john 25
mark 01
pieter 23
beth 42
john 03
mark 43
pieter 42
beth 23
I need the output in the following format
type john mark pieter beth
marks 63 45 32 02
marks 25 01 23 42
marks 03 43 42 23
Consider this:
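df = df.assign(id=df.groupby("name").cumcount()) \
    .pivot(columns="name", index="id") \
    .stack(level=0).reset_index(level=1) \
    .rename(columns={"level_1": "type"})
# Index names cannot be removed with del in recent pandas versions; set them to None instead
df.index.name = None
df.columns.name = None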
Outputs:
type beth john mark pieter
0 marks 02 63 45 32
1 marks 42 25 01 23
2 marks 23 03 43 42
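The key step in the pivot-based answers is groupby("name").cumcount(), which numbers each occurrence of a name and so gives the pivot a row id. A minimal sketch, assuming the marks are stored as integers and using the hypothetical names occurrence and wide:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "mark", "pieter", "beth"] * 3,
    "marks": [63, 45, 32, 2, 25, 1, 23, 42, 3, 43, 42, 23],
})

# 0 for the first time a name appears, 1 for the second, and so on
occurrence = df.groupby("name").cumcount()

# Use the occurrence number as the row id and the names as columns
wide = df.assign(id=occurrence).pivot(index="id", columns="name", values="marks")
print(wide)
```

Without such an id the pivot has no way to tell which row each repeated name belongs to.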
IIUC:
new_df = (df.pivot_table(index=df.groupby('name').cumcount(), columns='name')
            .rename_axis(columns=['type', None])
            .stack(level=0)
            .reset_index(level=1))
print(new_df)
type beth john mark pieter
0 marks 2 63 45 32
1 marks 42 25 1 23
2 marks 23 3 43 42
or
new_df = (df.assign(index=df.groupby('name').cumcount())
            .melt(['index', 'name'], var_name='type')
            .pivot_table(index=['type', 'index'], columns='name', values='value')
            .reset_index('type'))
An alternative:
res = (df
       .pivot(columns='name', values='marks')
       .bfill()
       # remove repetitions on the beth column
       # this impacts the other columns as well
       .drop_duplicates('beth')
       .rename_axis(columns=None)
       .astype(int)
       .assign(type='marks')
       # adjust column positions to match your output
       .reindex(['type', 'john', 'mark', 'pieter', 'beth'], axis=1)
       .reset_index(drop=True)
       )
res
type john mark pieter beth
0 marks 63 45 32 2
1 marks 25 1 23 42
2 marks 3 43 42 23
You could also step out of pandas into numpy, use the reshape method, and create a new dataframe at the end:
name_len = df.name.nunique()
names = df.name.unique()
df_len = len(df)
reshape_tuple = (df_len//name_len,name_len)
reshaped = df.marks.to_numpy().reshape(reshape_tuple)
#create new dataframe
res = pd.DataFrame(reshaped, columns = names)
#insert the 'type' column at the beginning of the dataframe
res.insert(0,'type','marks')
print(res)
type john mark pieter beth
0 marks 63 45 32 2
1 marks 25 1 23 42
2 marks 3 43 42 23

match multiple partial strings from string in excel

So here I have 2 tables.
ring layer1 layer2 output
12 45 46 bingo
12 34 75
13 23 47
14 23 34 nice_work
14 12 15
14 45 23
14 67 89 wow
25 90 124
67 76 341
ring whole_string value as output
12 23_45_12_78_46 bingo
12 78_89_23_45_90 great
13 23_89_90 awesome
14 45_78_23_45_34 nice_work
14 88_86_85_12 cool
14 67_89_111 wow
What I need is the 'value as output' from tbl2 if:
1. tbl1 ring = tbl2 ring
2. tbl1 layer1 and layer2 values are both present in tbl2 whole_string
Can someone help me with an Excel formula?
Thank you...
I tried using a for loop. It takes a whole lot of time.
You could use:
Formula in D2:
=IFERROR(INDEX($H$1:$H$7,AGGREGATE(14,3,($F$2:$F$7=A2)*(IF(ISNUMBER(SEARCH("_"&B2&"_","_"&$G$2:$G$7&"_")),1,""))*(IF(ISNUMBER(SEARCH("_"&C2&"_","_"&$G$2:$G$7&"_")),1,""))*ROW($F$2:$F$7),1)),"")
Entered as array formula through: Ctrl+Shift+Enter
Drag down...
