Keep only the last record if the value occurs continuously - python-3.x

Keep only the last record if the same value occurs continuously.
Input_df:
Date        Value
2022/01/01  5
2022/01/03  4
2022/01/05  3
2022/01/06  3
2022/01/07  3
2022/01/08  4
2022/01/09  3
Output_df:
Date        Value
2022/01/01  5
2022/01/03  4
2022/01/07  3
2022/01/08  4
2022/01/09  3
-- The value 3 repeats continuously for three dates, so we keep only the latest of those records. If a different value is transmitted in between, the continuity breaks, so that record is not deleted.

You can use pandas.Series.diff to create a flag showing whether the column value is continuous or not (see the pandas.Series.diff documentation).
Then drop the rows that are continuous.
import pandas as pd

# Create the dataframe
df = pd.DataFrame({
    "Date": ["2022/01/01", "2022/01/03", "2022/01/05", "2022/01/06", "2022/01/07", "2022/01/08", "2022/01/09"],
    "Value": [5, 4, 3, 3, 3, 4, 3]
})

# Flag rows whose value equals the next row's value (diff of 0)
df['Diff'] = df['Value'].diff(periods=-1).fillna(1)

# Keep only the rows where the value changes, then drop the helper column
df = df.loc[df['Diff'] != 0, :].drop('Diff', axis=1)
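Equivalently, a minimal sketch of the same idea without the helper column, comparing each value to the next row with pandas.Series.shift:

import pandas as pd

df = pd.DataFrame({
    "Date": ["2022/01/01", "2022/01/03", "2022/01/05", "2022/01/06", "2022/01/07", "2022/01/08", "2022/01/09"],
    "Value": [5, 4, 3, 3, 3, 4, 3]
})

# Keep a row when its value differs from the next row's value;
# shift(-1) yields NaN on the last row, so the last row is always kept.
out = df[df['Value'] != df['Value'].shift(-1)]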

Try this with SQL, using the LEAD window function to compare each row's value with the next row's, keeping a row only where the run of equal values ends:

SELECT date, value
FROM (
    SELECT date, value,
           LEAD(value) OVER (ORDER BY date) AS next_value
    FROM your_table
) t
WHERE next_value IS NULL OR value <> next_value;

Related

python3.7 & pandas - use column value in row as lookup value to return different column value

I've got a tricky situation - tricky for me since I'm really new to python. I've got a dataframe in pandas and I need to logic my way through building a new column that will be used later in a data match from a difference source. Basically, the picture tells what I can't figure out.
For any of the LOW labels I need to retrieve their MID_LEVEL label and copy it to a new column. The DESIRED OUTPUT column is what I need to create.
You can see that the LABEL_PATH is formatted in a way that I can use the first 9 digits as a "lookup" to find the corresponding LABEL, but I can't figure out how to achieve that. As an example, for any row that the LABEL_PATH starts with "0.02.0004" the desired output needs to be "MID_LEVEL1".
This dataset has around 25k rows, so wanted to avoid row iteration as well.
Any help would be greatly appreciated!
Choosing a similar example to yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["1", "1.1", "1.1.1", "1.1.2", "2"], "b": range(5)})
df["c"] = np.nan

# Mark rows whose path splits into fewer than 3 parts (i.e., at most one dot)
mask = df.a.apply(lambda x: len(x.split(".")) < 3)
df.loc[mask, "c"] = df.b[mask]
df.c.fillna(method="ffill", inplace=True)
Most of the magic takes place in the line where mask is defined, but it's not that difficult: if the value in a gets split into less than 3 parts (i.e., has at most one dot), mark it as True, otherwise not.
Use that mask to copy over the values, and then fill unspecified values with valid values from above.
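For reference, a quick check of the sketch above on the toy frame:

print(df)
#        a  b    c
# 0      1  0  0.0
# 1    1.1  1  1.0
# 2  1.1.1  2  1.0
# 3  1.1.2  3  1.0
# 4      2  4  4.0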
I am using this data for comparison:
test_dict = {
    "label_path": [1, 2, 3, 4, 5, 6],
    "label": ["low1", "low2", "mid1", "mid2", "high1", "high2"],
    "desired_output": ["mid1", "mid2", "mid1", "mid2", "high1", "high2"],
}
df = pd.DataFrame(test_dict)
Which gives:
   label_path  label desired_output
0           1   low1           mid1
1           2   low2           mid2
2           3   mid1           mid1
3           4   mid2           mid2
4           5  high1          high1
5           6  high2          high2
With a bit of logic and a merge:
desired_label_df = df.drop_duplicates("desired_output", keep="last")
desired_label_df = desired_label_df[["label_path", "desired_output"]]
desired_label_df.columns = ["desired_label_path", "desired_output"]
df = df.merge(desired_label_df, on="desired_output", how="left")
Gives us:
   label_path  label desired_output  desired_label_path
0           1   low1           mid1                   3
1           2   low2           mid2                   4
2           3   mid1           mid1                   3
3           4   mid2           mid2                   4
4           5  high1          high1                   5
5           6  high2          high2                   6
Edit: if you want to create the desired_output column, just do the following:
df["desired_output"] = df["label"].apply(lambda x: x.replace("low", "mid"))

How to return index of a row 60 seconds before current row

I have a large (>32 M rows) Pandas dataframe.
In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...).
I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:
In [2]: df
Out[2]:
   Time_Stamp  Value
0           1    2.4
1           2    3.1
2           4    6.3
3           6    7.2
4           6    6.1
5           9    6.0
So with the precious help of ALollz, I managed to achieve what I wanted to do in the end. Here's my code:
Time_gap = 60  # we want the row 60 seconds before the current one

# make a copy of the dataframe
df2 = df[['Time_Stamp', 'Value']].copy()

# add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap

# sort df2 on Time_Stamp
df2.sort_values(by='Time_Stamp', ascending=True, inplace=True)
df2 = df2.reset_index(drop=True)

# match each row of df with the first shifted row at or after its Time_Stamp
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
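For intuition, a minimal sketch of the same steps on the sample data, with a smaller hypothetical gap of 2 seconds so the sample actually produces matches:

import pandas as pd

df = pd.DataFrame({'Time_Stamp': [1, 2, 4, 6, 6, 9],
                   'Value': [2.4, 3.1, 6.3, 7.2, 6.1, 6.0]})
Time_gap = 2  # the real data would use 60

df2 = df[['Time_Stamp', 'Value']].copy()
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
df2 = df2.sort_values('Time_Stamp').reset_index(drop=True)

df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
# Value_y holds the value from ~2 seconds earlier: [2.4, 2.4, 3.1, 6.3, 6.3, 6.0]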

Delete dataframe rows based upon two dependent conditions

I have a fairly large dataframe (a few hundred columns) and I want to perform the following operation on it. I am using a toy dataframe below with a simple condition to illustrate what I need.
For every row:
Condition #1:
Check two of the columns for a value of zero (0). If either column has a zero, the condition is True: keep the row and move on to the next.
If Condition #1 is False (no zeros in either column 1 or 4):
Check all remaining columns in the row.
If any of the remaining columns has a value of zero, drop the row.
I would like the filtered dataframe returned as a new, separate dataframe.
My code so far:
# https://codereview.stackexchange.com/questions/185389/dropping-rows-from-a-pandas-dataframe-where-some-of-the-columns-have-value-0/185390
# https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
# https://stackoverflow.com/questions/29763620/how-to-select-all-columns-except-one-column-in-pandas
import pandas as pd
df = pd.DataFrame({'Col1': [7, 6, 0, 1, 8],
                   'Col2': [0.5, 0.5, 0, 0, 7],
                   'Col3': [0, 0, 3, 3, 6],
                   'Col4': [7, 0, 6, 4, 5]})
print(df)
print()
exclude = ['Col1', 'Col4']
all_but_1_and_4 = df[df.columns.difference(exclude)] # Filter out columns 1 and 4
print(all_but_1_and_4)
print()
def delete_rows(row):
    if row['Col1'] == 0 or row['Col4'] == 0:  # Is the value in either Col1 or Col4 zero(0)
        skip = True  # If it is, keep the row
        if not skip:  # If not, check the second condition
            is_zero = all_but_1_and_4.apply(lambda x: 0 in x.values, axis=1).any()  # Are any values in the remaining columns zero(0)
            if is_zero:  # If any of the remaining columns has a value of zero(0)
                pass
                # drop the row being analyzed # Drop the row.

new_df = df.apply(delete_rows, axis=1)
print(new_df)
I don't know how to actually drop the row if both of my conditions are met.
In my toy dataframe, rows 1, 2 and 4 should be kept, 0 and 3 dropped.
I do not want to manually check all columns for step 2 because there are several hundred. That is why I filtered using .difference().
What I will do:
s1 = df[exclude].eq(0).any(axis=1)                         # a zero in Col1 or Col4
s2 = df[df.columns.difference(exclude)].eq(0).any(axis=1)  # a zero in any other column
~(~s1 & s2)  # s1 | ~s2
Out[97]:
0    False
1     True
2     True
3    False
4     True
dtype: bool
yourdf = df[s1 | ~s2].copy()
WeNYoBen's answer is excellent, so I will only show the mistakes in your code:
The condition in the following if statement will never be fulfilled:
        skip = True  # If it is, keep the row
        if not skip:  # If not, check the second condition
You probably wanted to unindent the following rows, i.e. something like
        skip = True  # If it is, keep the row
    if not skip:  # If not, check the second condition
which is the same as a simple else:, without the need for skip = True:
else: # If not, check the second condition
The condition in the following if statement will always be fulfilled if at least one value in your whole table is zero (so not only in the current row, as you supposed):
is_zero = all_but_1_and_4.apply(lambda x: 0 in x.values, axis=1).any()  # Are any values in the remaining columns zero(0)
if is_zero:  # If any of the remaining columns has a value of zero(0)
because all_but_1_and_4.apply(lambda x: 0 in x.values, axis=1) is a series of True / False values - one for every row of the all_but_1_and_4 table. So after applying the .any() method to it, you get what I said.
Note:
Your approach is not bad, you may add a variable dropThisRow in your function, set it to True or False depending on conditions, and return it.
Then you may use your function to make the True / False series and use it for creating your target table:
dropRows = df.apply(delete_rows, axis=1) # True/False for dropping/keeping - for every row
new_df = df[~dropRows] # Select only rows with False
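For completeness, a minimal sketch of that variant; dropThisRow is the flag suggested above, exclude is the list from the question, and the row-wise apply is slower than the vectorized answer:

def delete_rows(row):
    # Condition #1: a zero in Col1 or Col4 means we keep the row
    if row['Col1'] == 0 or row['Col4'] == 0:
        dropThisRow = False
    else:
        # Condition #2: otherwise drop the row if any remaining column holds a zero
        dropThisRow = (row[df.columns.difference(exclude)] == 0).any()
    return dropThisRow

dropRows = df.apply(delete_rows, axis=1)
new_df = df[~dropRows]  # keeps rows 1, 2 and 4 of the toy dataframe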

How to get values of one column based on another column using specific match values

I have 6 columns: [Voltage, Bus, Load, load_Values, transmission, transmission_Values]. Every column whose name ends in Values contains the numerical value for its corresponding column. The CSV file looks like below:
Voltage     Bus  Load     load_Values  transmission     transmission_Values
Voltage(1)  2    load(1)  3            transmission(1)  2
Voltage(2)  2    load(2)  4            transmission(2)  3
Voltage(5)  3    load(3)  5            transmission(3)  5
I have to fetch the value of Bus based on transmission and Load. For example, to get the value of Bus: first, I need to fetch the value of transmission(2), which is 3. Based on this value, I need to get the value of Load, which is load(3) = 5. Next, based on that value, I have to get the value of Voltage(5), which is 3.
I tried to get the value of a single column based on its corresponding column value:
total=df[df['load']=='load(1)']['load_Values']
next_total= df[df['transmission']=='transmission['total']']['transmission_Values']
v_total= df[df['Voltage']=='Voltage(5)']['Voltage_Values']
How can I get all these values automatically? For example, if I have 1100 values in every column, how can I fetch all 1100 values from these columns?
This is how the dataset looks.
So, to get the value of VRES_LD, which is a new column: I have to look in the I__ND_LD column for I__ND_LD(1), whose corresponding value stored in I__ND_LD_Values is 10. Once I have the value 10, I have to look in the I__BS_ND column for I__BS_ND(10), whose value in I__BS_ND_Values is 5.0. Based on this value, I have to find the value of V_BS(5), which is 0.986009. This value should then be stored in the new column VRES_LD. Please let me know if you get it now.
I generalized your solution so you can work with as many values as you want.
I changed the name "Load_Value" to "load_value_name" to avoid confusion since there is a variable named "load_value" in lowercase.
You can start with as many values as you want; in our example we start with "1":
start_values = [1]
load_value_name = [f"I__ND_LD({n})" for n in start_values]
# Output (you'll have more than one if needed):
# ['I__ND_LD(1)']
Then we fetch all the values:
load_values = df[df['I__ND_LD'].isin(load_value_name)]['I__ND_LD_Values'].values.astype(int)
# output (again, more if needed):
# array([10])
let's get the bus names:
bus_names = [f"^I__BS_ND({n})" for n in load_values]
bus_values = df[df['I__BS_ND'].isin(bus_names)]['I__BS_ND_Values'].values.astype(np.int)
#output
array([5])
And finally voltage:
voltage_names = [f"V_BS({n})" for n in bus_values]
voltage_values = df[df['V_BS'].isin(voltage_names)]['V_BS_Values'].values
# output:
# array([0.98974069])
Notes:
Instead of rounding I downcast to int, and the .isin() method looks for all occurrences, so you can fetch all of the values.
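Putting the three steps together, a minimal sketch under the same assumptions (column names as in the screenshot, each *_Values lookup feeding the next step):

def chain_lookup(df, start_values):
    # Step 1: I__ND_LD -> I__ND_LD_Values
    load_names = [f"I__ND_LD({n})" for n in start_values]
    load_values = df.loc[df['I__ND_LD'].isin(load_names), 'I__ND_LD_Values'].astype(int)
    # Step 2: I__BS_ND -> I__BS_ND_Values
    bus_names = [f"I__BS_ND({n})" for n in load_values]
    bus_values = df.loc[df['I__BS_ND'].isin(bus_names), 'I__BS_ND_Values'].astype(int)
    # Step 3: V_BS -> V_BS_Values
    voltage_names = [f"V_BS({n})" for n in bus_values]
    return df.loc[df['V_BS'].isin(voltage_names), 'V_BS_Values'].values

# e.g. chain_lookup(df, [1]) should return array([0.98974069])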
If I understand correctly, you should be able to create key/value tables and use merge. The step to voltage is a little unclear, but the basic idea below should work, I think:
import pandas as pd

df = pd.DataFrame({'voltage': {0: 'Voltage(1)', 1: 'Voltage(2)', 2: 'Voltage(5)'},
                   'bus': {0: 2, 1: 2, 2: 3},
                   'load': {0: 'load(1)', 1: 'load(2)', 2: 'load(3)'},
                   'load_values': {0: 3, 1: 4, 2: 5},
                   'transmission': {0: 'transmission(1)',
                                    1: 'transmission(2)',
                                    2: 'transmission(3)'},
                   'transmission_values': {0: 2, 1: 3, 2: 5}})

# Build key/value lookup tables, extracting the numeric key from each label
load = df[['load', 'load_values']].copy()
trans = df[['transmission', 'transmission_values']].copy()
load['load'] = load['load'].str.extract(r'(\d)').astype(int)
trans['transmission'] = trans['transmission'].str.extract(r'(\d)').astype(int)

# Chain the lookups: bus -> transmission_values -> load_values
(df[['bus']].merge(trans, how='left', left_on='bus', right_on='transmission')
            .merge(load, how='left', left_on='transmission_values', right_on='load'))
resulting in:
   bus  transmission  transmission_values  load  load_values
0    2             2                    3   3.0          5.0
1    2             2                    3   3.0          5.0
2    3             3                    5   NaN          NaN
I think you need to do 3 things.
1. You need to put a number inside a string. You do it like this:
n_cookies = 3
f"I want {n_cookies} cookies"
# Output
# 'I want 3 cookies'
2. Let's say the values you need to fetch are:
transmission_values = [2, 5, 20]
You then need to fetch those load values:
load_values_to_fetch = [f"transmission({n})" for n in transmission_values]
# output
# ['transmission(2)', 'transmission(5)', 'transmission(20)']
3. Get all the voltage values from the df, using the .isin() method:
voltage_value = df[df['Voltage'].isin(load_values_to_fetch)]['Voltage_Values'].values
I hope I understood the problem correctly. Try it and let us know, because I can't test the code without the data.

How to get data in groupby like SQL HAVING with pandas

I have data like below.
id, name, password, note, num
1, hoge, xxxxxxxx, aaaaa, 2
2, hoge, xxxxxxxx, bbbbb, 1
3, moge, yyyyyyyy, ccccc, 2
4, zape, zzzzzzzz, ddddd, 3
I would like to make a dataframe using groupby on the same name and password. In this case, rows 1 (hoge) and 2 (hoge) are treated as the same data, and I would like to get the total 3 from their num column.
I tried like below.
df1 = pd.read_csv("sample.csv")
df2 = df1.groupby(['name', 'password']).count()
print(df2[df2['note'] > 1])
It goes like this.
name, password, note, num
hoge, xxxxxxxx, 2, 2
How can I get the sum of the num values?
I believe you need GroupBy.size, or count to exclude NaN rows, with transform to get a new Series the same size as the original DataFrame, so filtering with sum is possible:
s = df1.groupby(['name','password'])['note'].transform('size')
s = df1.groupby(['name','password'])['note'].transform('count')
out = df1.loc[s > 1, 'num'].sum()
print (out)
3
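The difference between size and count only shows up with missing values; a minimal sketch on a hypothetical frame with one missing note:

import numpy as np
import pandas as pd

d = pd.DataFrame({'name': ['hoge', 'hoge'], 'password': ['xxxxxxxx', 'xxxxxxxx'],
                  'note': ['aaaaa', np.nan], 'num': [2, 1]})
print(d.groupby(['name', 'password'])['note'].transform('size').tolist())   # [2, 2]
print(d.groupby(['name', 'password'])['note'].transform('count').tolist())  # [1, 1]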
If you want to count only duplicated rows, filter by DataFrame.duplicated, specifying the columns to check for dupes:
out = df1.loc[df1.duplicated(['name','password'], keep=False), 'num'].sum()
print (out)
3
