Removing duplicates based on 2 columns but keeping the row where the 3rd column is not null

I have a dataset with some duplicates in the name and date columns. The third column holds either a number or a null value. When removing duplicates, I want to keep the row whose third column is not null.
Example:
(table image not preserved)
I want to keep all the rows highlighted in blue, so, as you can see, it isn't that I want to get rid of all the null values. If I highlight the date and name columns and remove duplicates, it keeps the first row, and there is no pattern as to whether the number appears first or second among the duplicates.
Any help would be appreciated.
Thank you

Try this:
library(dplyr)  # tibble(), filter(), arrange() and %>%
library(gdata)  # duplicated2()
# Mixing "null" with numbers in c() coerces the whole Request column to character
your.data <- tibble(
  Name = c("Wanita Jones" , "Wanita Jones", "Corbin Dallas", "Corbin Dallas", "Corbin Dallas", "Corbin Dallas", "Alex Mills", "Alex Mills", "Andrea Coyle", "Andrea Coyle"),
  Date = c("01/24/2021" , "01/24/2021", "03/28/2021", "03/28/2021", "03/29/2021", "03/31/2021", "01/24/2021", "01/24/2021", "03/27/2021", "04/01/2021"),
  Request = c("null" , 6, 7, "null", "null", 1, "null", 10, "null", "null")
)
# Keep a row if its Name/Date pair occurs only once, or if its Request is not "null"
your.data %>% filter(
  !duplicated2(interaction(Name, Date)) | Request != "null"
)
Output:
# A tibble: 7 x 3
  Name          Date       Request
  <chr>         <chr>      <chr>
1 Wanita Jones  01/24/2021 6
2 Corbin Dallas 03/28/2021 7
3 Corbin Dallas 03/29/2021 null
4 Corbin Dallas 03/31/2021 1
5 Alex Mills    01/24/2021 10
6 Andrea Coyle  03/27/2021 null
7 Andrea Coyle  04/01/2021 null
If you have no interest in installing gdata, you can first sort on Request to make sure the first item of each duplicate group is always the one without null:
your.data %>% arrange( Request == "null" ) %>%
  filter(
    !duplicated(interaction(Name, Date)) | Request != "null"
  ) %>% arrange( Date, Name )
Or just paste the duplicated2 function from the gdata source.
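
For readers who want to check the rule outside R, here is a minimal sketch in plain Python. It is not part of the original answer. It assumes nulls are represented as None, and the name dedupe_prefer_non_null is made up for illustration. Unlike the filter above, it keeps exactly one row per Name/Date pair (the first non-null one, or the first row if every Request in the group is null).

```python
rows = [
    ("Wanita Jones", "01/24/2021", None),
    ("Wanita Jones", "01/24/2021", 6),
    ("Corbin Dallas", "03/28/2021", 7),
    ("Corbin Dallas", "03/28/2021", None),
    ("Corbin Dallas", "03/29/2021", None),
    ("Corbin Dallas", "03/31/2021", 1),
    ("Alex Mills", "01/24/2021", None),
    ("Alex Mills", "01/24/2021", 10),
    ("Andrea Coyle", "03/27/2021", None),
    ("Andrea Coyle", "04/01/2021", None),
]


def dedupe_prefer_non_null(rows):
    """Keep one row per (name, date), preferring a row whose request is not None.

    Hypothetical helper for illustration; nulls are assumed to be None.
    """
    best = {}  # dicts keep insertion order, so the original row order survives
    for name, date, request in rows:
        key = (name, date)
        # Store the first row seen for a key, then replace it only when the
        # stored request is None and a non-null request shows up.
        if key not in best or (best[key][2] is None and request is not None):
            best[key] = (name, date, request)
    return list(best.values())


result = dedupe_prefer_non_null(rows)
for row in result:
    print(row)
```

On the sample data this yields the same seven Name/Date pairs as the R output above.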

Related

Keep only the last record if the value occurs continuously

Input_df:
Date        Value
2022/01/01  5
2022/01/03  4
2022/01/05  3
2022/01/06  3
2022/01/07  3
2022/01/08  4
2022/01/09  3
Output_df:
Date        Value
2022/01/01  5
2022/01/03  4
2022/01/07  3
2022/01/08  4
2022/01/09  3
The value 3 repeats on three consecutive dates, so we keep only the latest of those three records. If a different value appears in between, the run is broken, so that record must not be deleted.

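You can use pandas.Series.diff to create a flag that shows whether the column value is continuous or not (see the pandas documentation for Series.diff).
Then drop the rows that are continuous.
import pandas as pd

# Create the dataframe
df = pd.DataFrame({
    "Date": ["2022/01/01", "2022/01/03", "2022/01/05", "2022/01/06", "2022/01/07", "2022/01/08", "2022/01/09"],
    "Value": [5, 4, 3, 3, 3, 4, 3]
})
# Flag each row with the difference to the next row's value; the last row has
# no successor, so its NaN is filled with a non-zero value to keep it
df['Diff'] = df['Value'].diff(periods=-1).fillna(1)
# Keep only the rows whose value differs from the next row's, then drop the flag
df = df.loc[df['Diff'] != 0, :].drop('Diff', axis=1)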
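
If pandas is not available, the same "keep the last record of each run" idea can be sketched with itertools.groupby from the standard library, which groups consecutive items that share a key. This is an illustrative alternative, not the answer's method. The name keep_last_of_each_run is made up, and the records are assumed to be already sorted by date.

```python
from itertools import groupby

records = [
    ("2022/01/01", 5),
    ("2022/01/03", 4),
    ("2022/01/05", 3),
    ("2022/01/06", 3),
    ("2022/01/07", 3),
    ("2022/01/08", 4),
    ("2022/01/09", 3),
]


def keep_last_of_each_run(records):
    """Return the last (date, value) pair of every run of equal consecutive values."""
    # groupby only merges neighbours, so a value that reappears later
    # (3 on 2022/01/09) starts a new run and is kept separately.
    return [list(run)[-1] for _, run in groupby(records, key=lambda r: r[1])]


kept = keep_last_of_each_run(records)
for date, value in kept:
    print(date, value)
```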

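Try this with SQL. LEAD fetches the next row's value in date order, and a row is kept when that next value is different or missing (your_table is a placeholder for your table name):
SELECT date, value
FROM (
    SELECT date, value,
           LEAD(value) OVER (ORDER BY date) AS next_value
    FROM your_table
) t
WHERE next_value IS NULL OR next_value <> value;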

In Power Query: how to create a conditional column that removes numbers and keeps text

Col1 contains rows that hold either just numbers or just text. Example:
row 1 = 70
row 2 = RS
row 3 = abcddkss
row 5 = 5
row 6 = 88
and so on.
What I want to do is add a column using logic like this: if Col1 is not a number then Col1 else null.
What I have so far:
= let mylist = List.RemoveItems(List.Transform({1..126}, each Character.FromNumber(_)), {"0".."9"})
in
if List.Contains(mylist, Text.From([Column1])) then [Column1] else null
However, this only works for rows that contain a single character. It does not work for rows with more than one letter.

You can use this:
if Value.Is(Value.FromText([dat]), type number) then null else [dat]

You could also check if the string is purely digit characters.
if [Column1] = Text.Select([Column1], {"0".."9"}) then null else [Column1]
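
The rule behind both answers ("null when the cell parses as a number, otherwise keep the text") can be illustrated outside Power Query with a small Python sketch. This is only an analogy, not M code. The helper name text_or_none is made up, and the cells are assumed to arrive as strings.

```python
def text_or_none(value):
    """Return None when value parses as a number, otherwise return it unchanged."""
    try:
        float(value)
    except (TypeError, ValueError):
        # Not parseable as a number: keep the original text
        return value
    return None


column = ["70", "RS", "abcddkss", "5", "88"]
cleaned = [text_or_none(cell) for cell in column]
print(cleaned)
```

Note that float() also accepts strings such as "1e3", "inf" and "nan", so this sketch treats them as numbers. The digits-only comparison in the second answer is stricter in that respect.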

Search column name with multiple conditions pandas

I have a list comprehension that retrieves all columns with date in the name, as below:
date_raw_cols = [col for col in df_raw.columns if 'date' in col]
That also picks up columns containing updated, which I want to exclude. I've also tried a regex filter, as below, with the same problem of returning updated:
df_dates = df_raw.filter(regex='date', axis='columns')
How do I combine conditions to filter column names? That is, the column name contains date but not update, and could be date1, _date, or date_.

Instead of searching for 'date' in a column name, you can be more explicit:
# Assume example df_raw
>>> df_raw
   date  date1  prev_date  update
0     1      2          3     200
1     4      2          5     300
2     5      5          3     100
>>> date_raw_cols = [col for col in df_raw.columns if col == 'date']
>>> print(date_raw_cols)
['date']
EDIT: If your question fully covers your data at hand, you can add an extra condition in the list comprehension with len() < 6, which will only grab column names with number of characters less than 6. This way you don't have to explicitly deal with underscores or digits.
>>> df_raw
   date  date1  prev_date  update _date date_
0     1      2          3     200     a     g
1     4      2          5     300     s     h
2     5      5          3     100     v     a
>>> date_raw_cols = [col for col in df_raw.columns if 'date' in col and len(col) < 6]
>>> print(date_raw_cols)
['date', 'date1', '_date', 'date_']

Try the following regex:
\b(\w*(?=[^a-z]date)|(?=date[^a-z]))\w*\b
It finds words in which "date" is bounded by digits or punctuation marks rather than letters:
re.findall(r'\b\w*(?=[^a-z]date)\w*\b|\b(?=date[^a-z])\w*\b',
'date1 date date_1 update new_date 234_date 33date datetime ')
['date1', '', 'date', '', 'date_1', 'new_date', '234_date', '33date', '']
Note that the result contains empty strings. They come from zero-width matches and have to be filtered out before use.
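
A simpler variation, offered here as a sketch rather than as either answerer's method, is to test each column name separately with a lookbehind and a lookahead that forbid a letter directly before or after "date". The column list below is made up for illustration and extends the question's examples.

```python
import re

# "date" not immediately preceded or followed by an ASCII letter
DATE_TOKEN = re.compile(r"(?<![A-Za-z])date(?![A-Za-z])")

# Hypothetical column names based on the examples in the question
columns = ["date", "date1", "prev_date", "update", "_date", "date_", "updated", "datetime"]

date_cols = [col for col in columns if DATE_TOKEN.search(col)]
print(date_cols)
```

Because each name is tested on its own, no empty matches appear. The same pattern string should also be usable with df_raw.filter(regex=..., axis='columns') from the question, since that method matches column names with a regular-expression search.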

SWITCH True Multiple column and criteria in Power BI

I have two columns: one is Name and the other is ID.
The Name column contains text and numbers, and the number of characters is not always the same.
The ID column contains numbers only.
1. If the ID column contains 38 and the Name column contains "-", then Train.
2. If the ID column contains 56 and the Name column contains "-", then Air.
3. If the ID column contains 38 and the Name column does not contain "-", then Road.
4. If the ID column contains 56 and the Name column does not contain "-", then Road.
In Excel I am applying the following formula
=IF(A3="","",IFERROR(IF(REPLACE(A3,1,SEARCH("-",A3),)+0,IF(B3&""="38","TRAIN",IF(B3&""="56","AIR","ROAD"))),"ROAD"))
in order to get the result.
I want a calculated column.

This might be implemented in DAX as a calculated column that uses the SWITCH function:
SWITCH(
    TRUE(),
    T[ID] = 38 && SEARCH( "-", T[NAME], 1, 0 ) > 0, "Train",
    T[ID] = 56 && SEARCH( "-", T[NAME], 1, 0 ) > 0, "Air",
    T[ID] IN { 38, 56 } && SEARCH( "-", T[NAME], 1, 0 ) = 0, "Road"
)
The default (no matching condition) is to return BLANK().
Nested IFs could also be used; it is a matter of taste.
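
To sanity-check the four rules before building the column, the decision logic can be mirrored in a few lines of Python. This is only a sketch of the rules as stated in the question, not DAX. The function name transport_mode and the sample Name values are made up for illustration.

```python
def transport_mode(id_value, name):
    """Mirror the four rules: first matching condition wins, otherwise None."""
    has_dash = "-" in name
    if id_value == 38 and has_dash:
        return "Train"
    if id_value == 56 and has_dash:
        return "Air"
    if id_value in (38, 56):
        # Reached only when the name has no "-"
        return "Road"
    return None  # plays the role of BLANK() when nothing matches


# Hypothetical sample rows (ID, Name)
samples = [(38, "AB-123"), (56, "XY-9"), (38, "AB123"), (56, "XY9"), (12, "ZZ-1")]
modes = [transport_mode(id_value, name) for id_value, name in samples]
print(modes)
```

Evaluating the conditions top to bottom and stopping at the first match is the same ordering that SWITCH( TRUE(), ... ) relies on.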

How to test element in one Excel array is present in another

I've got a spreadsheet of time series data with a series of columns that mark the presence of certain events that took place.
Like so:
     A           B       C       D
1    Date        Event1  Event2  Event3
2    24/10/2016  T       NULL    NULL
3    31/10/2016  S       NULL    NULL
4    06/10/2016  NULL    NULL    NULL
5    20/10/2016  V       NULL    NULL
6    20/10/2016  T       S       V
7    01/12/2016  T       NULL    NULL
8    01/12/2016  S       T       NULL
9    29/11/2016  NULL    NULL    NULL
10   10/10/2016  T       NULL    NULL
I've then got a lookup table with a column of the events:
A
1 T
2 S
3 V
What I'd like to do is create a new column in the time series to flag a single value, say 1, if at least one, but possibly more, of the events in the lookup has taken place.
What's an effective way of doing that?
UPDATE:
The problem is more complicated in that there may be non-NULL event types that don't appear in my lookup list, and for those I wouldn't want to trigger the flag.
For instance if I had:
9 29/11/2016 G NULL NULL
I would want to flag 0, but
10 10/10/2016 G T NULL
I would want to flag 1.

My understanding is that it is sufficient to check whether the Event1 column holds one of the values in your lookup table.
For simplicity I will assume that the two tables are on separate sheets.
Hence, all you need to do is apply the VLOOKUP formula in a newly created column on the first sheet (let's simply call it "Flag").
The formula for the first cell (E2) should be:
=VLOOKUP(B2, Sheet2!A:A, 1, false)
Just drag the same formula to the rest of the rows or double-click the bottom-right corner of this cell and you should be good to go.
This will also show you what the first event for that date is. If you only want a bit value (1/0), you can encase the formula in a simple IF.
Hope this helps!
EDIT:
After the new info, the solution doesn't change much:
=IF(IsNA(VLOOKUP(B2, Sheet2!A:A, 1, false)), 0, 1) + IF(IsNA(VLOOKUP(C2, Sheet2!A:A, 1, false)), 0, 1) + IF(IsNA(VLOOKUP(D2, Sheet2!A:A, 1, false)), 0, 1)
This will even tell you how many of those values each row has.
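
The underlying test (is at least one event in the row present in the lookup list?) can be illustrated in Python with a set membership check. This is a sketch of the logic only, not an Excel formula. NULL cells are assumed to be None, the name flag_row is made up, and the rows are a subset of the question's examples.

```python
lookup_events = {"T", "S", "V"}

# (date, [Event1, Event2, Event3]) with NULL represented as None
rows = [
    ("24/10/2016", ["T", None, None]),
    ("06/10/2016", [None, None, None]),
    ("20/10/2016", ["T", "S", "V"]),
    ("29/11/2016", ["G", None, None]),
    ("10/10/2016", ["G", "T", None]),
]


def flag_row(events, lookup):
    """Return 1 if any event of the row is in the lookup set, else 0."""
    return int(any(event in lookup for event in events))


flags = [flag_row(events, lookup_events) for _, events in rows]
print(flags)
```

Unlike the summed VLOOKUP formula, which counts the matches per row, this returns a plain 0/1 flag, which is what the question asks for.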
