Hi, the following is my dataset:
MD Incl. Azi.
0 0.00 0.00 350.00
1 161.00 0.00 350.00
2 261.00 0.00 350.00
3 361.00 0.00 350.00
4 461.00 0.00 350.00
I would like to perform a calculation, create a new column, and in that column add each row's value to the previous row's value in the column, and so on.
import pandas as pd
import math
import numpy as np
# open the Excel file with the survey data ('survey.xlsx' is a placeholder name)
df = pd.read_excel('survey.xlsx')
print(df)
for i in range(1, len(df)):
    incl = np.deg2rad(df['Incl.'])
    df['TVD'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))
print(df)
MD Incl. Azi. TVD
0 0.00 0.00 350.00 NaN
1 161.00 0.00 350.00 161.000000
2 261.00 0.00 350.00 100.000000
3 361.00 0.00 350.00 100.000000
4 461.00 0.00 350.00 100.000000
I would like the TVD column to be
TVD
NaN
161
261
361
461
and so on, adding each value to the one before it.
Use cumsum:
df['TVD'] = df['TVD'].cumsum()
Example:
import pandas as pd
import numpy as np
from io import StringIO
txt = """ MD Incl. Azi.
0 0.00 0.00 350.00
1 161.00 0.00 350.00
2 261.00 0.00 350.00
3 361.00 0.00 350.00
4 461.00 0.00 350.00"""
df = pd.read_csv(StringIO(txt), sep=r'\s\s+', engine='python')
for i in range(1, len(df)):
    incl = np.deg2rad(df['Incl.'])
    df['TVD_diff'] = ((df['MD'] - df['MD'].shift()) / 2) * (np.cos(incl).shift() + np.cos(incl))
df['TVD_diff'] = df['TVD_diff'].cumsum()
print(df)
Output:
MD Incl. Azi. TVD_diff
0 0.0 0.0 350.0 NaN
1 161.0 0.0 350.0 161.0
2 261.0 0.0 350.0 261.0
3 361.0 0.0 350.0 361.0
4 461.0 0.0 350.0 461.0
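As a side note, the for loop in the example is redundant: everything inside it is already vectorised and simply gets recomputed on every pass. A minimal sketch of the same calculation in a single pass, assuming the same MD / Incl. columns as in the sample:
import numpy as np
# per-station increment: half the MD difference times the sum of consecutive cosines
incl = np.deg2rad(df['Incl.'])
step = (df['MD'].diff() / 2) * (np.cos(incl) + np.cos(incl).shift())
# running total; the first row stays NaN, as in the output above
df['TVD_diff'] = step.cumsum()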
I have a data frame with over three million rows. I am trying to group the rows by the Bar_Code column and extract only those Bar_Codes for which SOH, Cost and Sold_Date are zero (or NULL) in every row of the group.
My dataframe is as follows:
Location Bar_Code SOH Cost Sold_Date
1 00000003589823 0 0.00 NULL
2 00000003589823 0 0.00 NULL
3 00000003589823 0 0.00 NULL
1 0000000151818 -102 0.00 NULL
2 0000000151818 0 8.00 NULL
3 0000000151818 0 0.00 2020-10-06T16:35:25.000
1 0000131604108 0 0.00 NULL
2 0000131604108 0 0.00 NULL
3 0000131604108 0 0.00 NULL
1 0000141073505 -53 3.00 2020-10-06T16:35:25.000
2 0000141073505 0 0.00 NULL
3 0000141073505 -20 20.00 2020-09-25T10:11:30.000
I have tried the below code:
df.groupby(['Bar_Code','SOH','Cost','Sold_Date']).sum()
but I am getting the below output:
Bar_Code SOH Cost Sold_Date
0000000151818 -102.0 0.0000 2021-12-13T10:01:59.000
0.0 8.0000 2020-10-06T16:35:25.000
0000131604108 0.0 0.0000 NULL
0000141073505 -53.0 0.0000 2021-11-28T16:57:59.000
3.0000 2021-12-05T11:23:02.000
0.0 0.0000 2020-04-14T08:02:45.000
0000161604109 -8.0 4.1000 2020-09-25T10:11:30.000
00000003589823 0 0.00 NULL
I need to check whether it is possible to get the desired output below, i.e. only those Bar_Codes where SOH, Cost and Sold_Date are 0 or NULL in every row; it is fine for the code to ignore the first column (Location):
Bar_Code SOH Cost Sold_Date
00000003589823 0 0.00 NULL
0000131604108 0.0 0.0000 NULL
The idea is to filter out every group that has a non-zero / non-NaN value in SOH, Cost or Sold_Date: first select the rows that do not match, take their Bar_Code values, and then invert the mask with isin so only the fully matching groups remain:
g = df.loc[df[['SOH','Cost','Sold_Date']].fillna(0).ne(0).any(axis=1), 'Bar_Code']
df1 = df[~df['Bar_Code'].isin(g)].drop_duplicates('Bar_Code').drop('Location', axis=1)
print (df1)
Bar_Code SOH Cost Sold_Date
0 00000003589823 0 0.0 NaN
6 0000131604108 0 0.0 NaN
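For completeness, a similar sketch using GroupBy.filter keeps whole groups whose three columns are 0 / NaN in every row; it reads more directly but is slower than the isin approach on three million rows, and it relies on the same assumption as above that the NULLs were read in as NaN:
# keep only the Bar_Code groups where SOH, Cost and Sold_Date are 0 or NaN throughout
cols = ['SOH', 'Cost', 'Sold_Date']
df1 = (df.groupby('Bar_Code')
         .filter(lambda g: g[cols].fillna(0).eq(0).all().all())
         .drop_duplicates('Bar_Code')
         .drop('Location', axis=1))
print(df1)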
I have a df like this,
Date Value
0 2019-03-01 0
1 2019-04-01 1
2 2019-09-01 0
3 2019-10-01 1
4 2019-12-01 0
5 2019-12-20 0
6 2019-12-20 0
7 2020-01-01 0
Now, I need to group them by quarter and get the proportions of 1s and 0s, so that my final output looks like this:
Date Value1 Value0
0 2019-03-31 0 1
1 2019-06-30 1 0
2 2019-09-30 0 1
3 2019-12-31 0.25 0.75
4 2020-03-31 0 1
I tried the following code, but it doesn't seem to work.
def custom_resampler(array):
    import numpy as np
    return array / np.sum(array)

df.set_index('Date').resample('Q')['Value'].apply(custom_resampler)
Is there a pandastic way I can achieve my desired output?
Resample by quarter, get the value_counts, and unstack. Next, rename the columns using the name property of the columns. Last, divide each row's values by the row total:
df = pd.read_clipboard(sep=r'\s{2,}', parse_dates=['Date'])
res = (df
.resample(rule="Q",on="Date")
.Value
.value_counts()
.unstack("Value",fill_value=0)
)
res.columns = [f"{res.columns.name}{ent}" for ent in res.columns]
res = res.div(res.sum(axis=1),axis=0)
res
Value0 Value1
Date
2019-03-31 1.00 0.00
2019-06-30 0.00 1.00
2019-09-30 1.00 0.00
2019-12-31 0.75 0.25
2020-03-31 1.00 0.00
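The same proportions can also be sketched with pd.crosstab and normalize='index', which skips the manual division (this assumes Date has already been parsed as datetime, as in the read_clipboard call above; the index then holds quarterly periods rather than quarter-end dates):
# proportion of each Value per quarter; normalize='index' makes every row sum to 1
res = (pd.crosstab(df['Date'].dt.to_period('Q'), df['Value'], normalize='index')
         .add_prefix('Value'))
print(res)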
I've got a pandas dataframe like this. This contains a timestamp, id, foo and bar.
The timestamps are roughly 10 minutes apart.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.10 0.05
2019-04-14 00:10:00 2 0.30 0.10
For each id, I'd like to create 5 additional rows with timestamp split equally between successive rows, and foo & bar values containing random values between the successive rows.
The start time should be the earliest timestamp for each id and the end time should be the latest timestamp for each id
So the output would be like this.
timestamp id foo bar
2019-04-14 00:00:10 1 0.10 0.05
2019-04-14 00:02:10 1 0.14 0.06
2019-04-14 00:04:10 1 0.11 0.06
2019-04-14 00:06:10 1 0.29 0.07
2019-04-14 00:08:10 1 0.22 0.09
2019-04-14 00:10:02 1 0.30 0.10
2019-04-14 00:00:00 2 0.80 0.50
2019-04-14 00:02:00 2 0.45 0.48
2019-04-14 00:04:00 2 0.52 0.42
2019-04-14 00:06:00 2 0.74 0.48
2019-04-14 00:08:00 2 0.41 0.45
2019-04-14 00:10:00 2 0.40 0.40
I can reindex the timestamp column and create additional timestamp rows (e.g. Pandas create new date rows and forward fill column values).
But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.
Appreciate if someone can point me in the right direction!
Close; what you need is date_range together with DataFrame.reindex, built from the first and last value of each group's DatetimeIndex:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = (df.set_index('timestamp')
        .groupby('id')[['foo', 'bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))
Then create a helper DataFrame of the same shape as the original and fill the missing values with DataFrame.fillna:
df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print (df)
foo bar
id
1 2019-04-14 00:00:10.000 0.100000 0.050000
2019-04-14 00:02:08.400 0.903435 0.755841
2019-04-14 00:04:06.800 0.956002 0.253878
2019-04-14 00:06:05.200 0.388454 0.257639
2019-04-14 00:08:03.600 0.225535 0.195306
2019-04-14 00:10:02.000 0.300000 0.100000
2 2019-04-14 00:00:00.000 0.100000 0.050000
2019-04-14 00:02:00.000 0.180865 0.327581
2019-04-14 00:04:00.000 0.417956 0.414400
2019-04-14 00:06:00.000 0.012686 0.800948
2019-04-14 00:08:00.000 0.716216 0.941396
2019-04-14 00:10:00.000 0.300000 0.100000
If the 'randomness' is not crucial, we can use Series.interpolate, which keeps the values between your min and max per group:
df_new = pd.concat([
d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'] = df_new['id'].ffill()
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())
Which gives the following output:
print(df_new)
timestamp id foo bar
0 2019-04-14 00:00:10.000 1.0 0.10 0.05
1 2019-04-14 00:02:08.400 1.0 0.14 0.06
2 2019-04-14 00:04:06.800 1.0 0.18 0.07
3 2019-04-14 00:06:05.200 1.0 0.22 0.08
4 2019-04-14 00:08:03.600 1.0 0.26 0.09
5 2019-04-14 00:10:02.000 1.0 0.30 0.10
6 2019-04-14 00:00:00.000 2.0 0.10 0.05
7 2019-04-14 00:02:00.000 2.0 0.14 0.06
8 2019-04-14 00:04:00.000 2.0 0.18 0.07
9 2019-04-14 00:06:00.000 2.0 0.22 0.08
10 2019-04-14 00:08:00.000 2.0 0.26 0.09
11 2019-04-14 00:10:00.000 2.0 0.30 0.10
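If the randomness should stay bounded by the known values rather than being uniform over [0, 1), one possible sketch replaces the df1 / fillna step of the first approach: for each id and each column, draw the missing values uniformly between the first and last observation. Here fill_between is just an illustrative helper name, applied to the reindexed frame before its NaNs are filled:
import numpy as np
import pandas as pd

def fill_between(col):
    # col: one column (foo or bar) of one id group; NaN everywhere except the endpoints
    lo, hi = col.iloc[0], col.iloc[-1]
    rand = pd.Series(np.random.uniform(min(lo, hi), max(lo, hi), size=len(col)),
                     index=col.index)
    return col.fillna(rand)

# 'id' is an index level after the groupby/apply above, so it can still be used as the key
for c in ['foo', 'bar']:
    df[c] = df.groupby('id')[c].transform(fill_between)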
I'm working with dynamic .csvs, so I never know what the column names will be. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
As I would like to create a new column with the SUM of all columns between META and %, I need to get all the names of each column, so I can create something like that:
df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the relevant columns; and 2) then sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is that the columns to be summed are between META and %, but even those two are not fixed in position.
Select all columns except the first and last with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select columns between META and %:
# assumes META and % are not numeric, so they do not contribute to the sum
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)
# assumes META is not numeric; % is excluded by the slice
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)
# most general: sums only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
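For instance, applied to example 3), where unrelated columns sit on both sides, the last (most general) variant sums only G and F. A small self-contained sketch of that case:
import pandas as pd
from io import StringIO

# example 3) from the question: extra columns on both sides of the META .. % span
txt = """FILL ETC META G F %
T 2.0 A 0.00 6.70 -0.00%
F 4.2 B 2.90 0.00 3.55%
T 5.0 C 0.00 34.53 6.14%"""
df = pd.read_csv(StringIO(txt), sep=r'\s+')

# sum only the columns strictly between META and %
start, stop = df.columns.get_loc('META') + 1, df.columns.get_loc('%')
df['Total'] = df.iloc[:, start:stop].sum(axis=1)
print(df)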
I want to find the rows of pandas Dataframe_1 whose value in the fourth column exists anywhere in the first column of Dataframe_2, and copy these rows to a new table.
EDIT
Here I also include the dataframes:
Dataframe_1:
1 2 3 4
0
chr1 128611 128681 cuffs_1_128645 .
chr1 186868 186933 cuffs_2_186901 .
chr1 186978 187035 cuffs_3_187015 .
chr1 187054 187122 cuffs_4_187082 .
chr1 262712 262773 cuffs_5_262742 .
Dataframe_2:
1 2 3 4 5 6 7 8
0
cuffs_100001_101338862 1.24 3.11 1.86 11.19 5.59 8.08 0.62 0
cuffs_100004_101354225 2.49 0.62 1.86 1.86 2.49 1.24 0.00 0
cuffs_100045_101386584 14.92 14.92 3.11 10.57 5.59 15.54 0.62 0
cuffs_100089_101719129 2.49 0.62 1.86 5.59 1.86 1.86 0.00 0
cuffs_100111_101726996 6.84 0.00 3.73 3.11 6.84 2.49 0.62 0
Both dataframes are imported from .csv and are huge, so here I've put only a few rows and columns.
This is what I tried:
import pandas as pd
df1 = pd.DataFrame.from_csv(Dataframe_1, sep = '\t', index_col=list(range(0,1,2)), header = None)
df2 = pd.DataFrame.from_csv(Dataframe_2, sep = '\t', index_col=list(range(0,1,2)), header = None)
df1 = df1[df1[3] == df2[0]]
df1.to_csv(fileout, sep = '\t', header = False)
When running this I get eight (or so) lines of traceback referring to the pandas package files index.pyx and hashtable.pyx, which I don't understand.
Got it!
Apparently, none of the filtering commands I tested, be it df1 = df1[df1[3].isin(df2[0])] or df1 = df1[df1[3] == df2[0]], recognises the "0" column, because that column was being used as the row index. The way out is to import Dataframe_2 assigning the index columns not as (0,1,2) but as (1,2,3), which leads to the following formatting of df2:
0 2 3 4 5 6 7 8
1
1.24 cuffs_100001_101338862 3.11 1.86 11.19 5.59 8.08 0.62 0
2.49 cuffs_100004_101354225 0.62 1.86 1.86 2.49 1.24 0.00 0
14.92 cuffs_100045_101386584 14.92 3.11 10.57 5.59 15.54 0.62 0
2.49 cuffs_100089_101719129 0.62 1.86 5.59 1.86 1.86 0.00 0
6.84 cuffs_100111_101726996 0.00 3.73 3.11 6.84 2.49 0.62 0
Where the "0" column is no longer the index for rows. Then we can apply df1 = df1[df1[3].isin(df2[0])]. NOTE: application of df1 = df1[df1[3] == df2[0]] will raise the error message Series lengths must match to compare
Thanks!
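As a side note, pd.DataFrame.from_csv no longer exists in current pandas versions; a hedged sketch of the same filtering with pd.read_csv (the file names below are placeholders, and no data column is used as the row index, which avoids the index problem described above):
import pandas as pd

# read both tab-separated files without headers; keep the default integer index
df1 = pd.read_csv('Dataframe_1.tsv', sep='\t', header=None)  # placeholder file name
df2 = pd.read_csv('Dataframe_2.tsv', sep='\t', header=None)  # placeholder file name

# keep the df1 rows whose fourth column (position 3) appears in df2's first column
out = df1[df1[3].isin(df2[0])]
out.to_csv('fileout.tsv', sep='\t', header=False, index=False)  # placeholder output name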