How to extract rows with some processing steps using python pandas? - python-3.x

My dataframe:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A1 | [3-5] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |
I would like to remove rows that (i) share a query_name with another row and (ii) have a position_description range that lies entirely within that other row's range.
Desired output:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |

If there can be no more than one row contained in another we can use:
import pandas as pd
from ast import literal_eval
df2 = pd.DataFrame(df['position_description'].str.replace('-', ',')
                   .apply(literal_eval).tolist(),
                   index=df.index).sort_values(0)
print(df2)
0 1
0 1 10
2 1 20
3 1 15
5 1 15
1 3 5
4 10 20
check = df2.groupby(df['query_name']).shift()
df.loc[~(df2[0].gt(check[0]) & df2[1].lt(check[1]))]
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]
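The caveat above matters because shift() compares each range only with the row immediately before it in its group. A minimal sketch with made-up data (three hypothetical A1 ranges that are not in the question) suggests what can happen when two ranges sit inside the same larger one. It parses the bounds with str.extract instead of literal_eval, purely for brevity:

```python
import pandas as pd

# Hypothetical data: both [2-3] and [4-5] lie inside [1-10].
df = pd.DataFrame({
    "query_name": ["A1", "A1", "A1"],
    "position_description": ["[1-10]", "[2-3]", "[4-5]"],
})

# Parse "[start-end]" into two integer columns (0 and 1), sorted by start.
bounds = (
    df["position_description"]
    .str.extract(r"\[(\d+)-(\d+)\]")
    .astype(int)
    .sort_values(0)
)

# Each row is compared only with the previous row of the same group.
prev = bounds.groupby(df["query_name"]).shift()
inside_prev = bounds[0].gt(prev[0]) & bounds[1].lt(prev[1])

kept = df.loc[~inside_prev]
print(kept)
```

Here [4-5] is compared with [2-3] rather than with [1-10], so it is kept even though [1-10] contains it. That is the situation the second approach is meant to handle.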

This should work no matter how many ranges are contained in other ranges.
First, extract the boundaries:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'query_name': ['A1', 'A1', 'A2', 'A3', 'A4', 'A4'],
    'position_description': ['[1-10]', '[3-5]', '[1-20]', '[1-15]', '[10-20]', '[1-15]'],
})
df[['pos_x', 'pos_y']] = df['position_description'].str.extract(r'\[(\d+)-(\d+)\]').astype(int)
Then we will define the function that can choose what ranges to keep:
def non_contained_ranges(df):
    # Duplicated ranges count as containing one another, so none of them
    # would pass the check below. Keep only the first of each duplicate.
    df = df.drop_duplicates('position_description', keep='first')
    range_min = df['pos_x'].min()
    range_max = df['pos_y'].max()
    range_size = range_max - range_min + 1
    # One row per range; a cell is 1 where the range covers that position.
    b = np.zeros((len(df), range_size))
    for i, (x, y) in enumerate(df[['pos_x', 'pos_y']].values - range_min):
        b[i, x: y+1] = 1.
    # b2[i, j] is True when range j covers a position that range i does not,
    # i.e. range j is not contained in range i.
    b2 = np.logical_and(np.logical_xor(b[:, np.newaxis], b), b).any(axis=2)
    np.fill_diagonal(b2, True)
    # Keep range j only if no other range contains it.
    b3 = b2.all(axis=0)
    return df[b3]
If there are N ranges within a group (query_name), this function will do N x N comparisons, using boolean array operations.
Then we can group by query_name and apply the function to get the expected result:
df.groupby('query_name')\
  .apply(non_contained_ranges)\
  .droplevel(0, axis=0).drop(columns=['pos_x', 'pos_y'])
Outcome:
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]
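For readers who find the boolean-array version dense, the same containment test can be sketched with plain Python comparisons. This is an illustrative alternative, not the answer's method: the helper name drop_contained is made up here, and it loops over pairs instead of using array operations, so it is probably slower on large groups.

```python
import pandas as pd


def drop_contained(group: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose [pos_x, pos_y] range is not inside another row's range."""
    # Identical ranges contain each other, so keep only the first copy.
    group = group.drop_duplicates(["pos_x", "pos_y"], keep="first")
    bounds = list(zip(group["pos_x"], group["pos_y"]))
    keep = []
    for i, (x, y) in enumerate(bounds):
        # Range i is contained if some other range starts no later and ends no earlier.
        contained = any(
            j != i and ox <= x and y <= oy
            for j, (ox, oy) in enumerate(bounds)
        )
        keep.append(not contained)
    return group[pd.Series(keep, index=group.index)]


df = pd.DataFrame({
    "query_name": ["A1", "A1", "A2", "A3", "A4", "A4"],
    "position_description": ["[1-10]", "[3-5]", "[1-20]", "[1-15]", "[10-20]", "[1-15]"],
})
df[["pos_x", "pos_y"]] = (
    df["position_description"].str.extract(r"\[(\d+)-(\d+)\]").astype(int)
)

# Apply the check per query_name and restore the original row order.
parts = [drop_contained(g) for _, g in df.groupby("query_name")]
result = pd.concat(parts).sort_index()[["query_name", "position_description"]]
print(result)
```

Only the [3-5] row of A1 is dropped; the two A4 ranges overlap but neither contains the other, so both stay.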

Related

Splitting a column into multiple columns

I have a pandas dataframe as below :
| A | Value |
+----------+--------+
|ABC001035 | 34 |
|USN001185 | 45 |
|UCT010.75 | 23 |
|ATC001070 | 21 |
+----------+--------+
I want to split the column in A (based on last three digits in A) into columns X and Y, and it should look like below
| A | Value | X | Y |
+----------+--------+---------+-----+
|ABC001035 | 34 | ABC001 | 035 |
|USN001185 | 45 | USN001 | 185 |
|UCT010.75 | 23 | UCT01 | 0.75|
|ATC001070 | 21 | ATC001 | 070 |
+----------+--------+---------+-----+
So how do I split column A?
You can index all strings in a series with the .str accessor:
>>> df['X'] = df['A'].str[:-3]
>>> df['Y'] = df['A'].str[-3:]
>>> df
A Value X Y
0 ABC001035 34.0 ABC001 035
1 USN001185 45.0 USN001 185
2 UCT010.75 23.0 UCT010 .75
3 ATC001070 21.0 ATC001 070
Split your problem into smaller ones, easier to solve! :)
How to split a string (take the last 3 characters):
'Hello world!'[-3:]
# Returns: ld!
How to apply a function over a DataFrame value?
df.A.apply(lambda x: x[-3:])
# Returns pandas.Series: [035, 185, .75, 070]
How to save a Series to a new DataFrame column?
# Create Y column.
df['Y'] = df.A.apply(lambda x: x[-3:])
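Putting the pieces together, here is a small self-contained sketch that rebuilds the question's frame and fills both X and Y. It also shows a regex-based variant with str.extract, which is an alternative added here for comparison and not something either answer proposed; the column names X2 and Y2 are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["ABC001035", "USN001185", "UCT010.75", "ATC001070"],
    "Value": [34, 45, 23, 21],
})

# Slice every string in the column: all but the last three characters,
# then the last three characters. Y stays a string, so "035" keeps its zero.
df["X"] = df["A"].str[:-3]
df["Y"] = df["A"].str[-3:]

# Alternative: one regex with two capture groups
# (greedy prefix, then exactly three trailing characters).
df[["X2", "Y2"]] = df["A"].str.extract(r"^(.*)(.{3})$")

print(df)
```

Both variants split purely by position, so "UCT010.75" yields "UCT010" and ".75" rather than the "UCT01" and "0.75" shown in the question's desired table.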

Pandas: for each row count occurrence in another df within specific dates

I have the following 2 dfs:
df1
|company|company_id| date | max_date |
| A21 | 5 |2021-02-04| 2021-02-11|
| A21 | 10 |2020-10-04| 2020-10-11|
| B52 | 8 |2021-03-04| 2021-04-11|
| B52 | 6 |2020-04-04| 2020-04-11|
-------------------------------------------
and
df2:
|company|company_id| date_df2 |
| A21 | 5 |2021-02-05|
| A21 | 5 |2021-02-08|
| A21 | 5 |2021-02-12|
| A21 | 5 |2021-02-11|
| A21 | 10 |2020-10-07|
| B52 | 8 |2021-03-07|
| B52 | 6 |2020-04-08|
| B52 | 6 |2020-04-12|
| B52 | 6 |2020-04-05|
-------------------------------
Logic:
For each company and company_id in df1, I want to count how many rows in df2 have a date_df2 between the date and max_date from df1.
Expected results:
|company|company_id| date | max_date |count|
| A21 | 5 |2021-02-04| 2021-02-11| 3 |
| A21 | 10 |2020-10-04| 2020-10-11| 1 |
| B52 | 8 |2021-03-04| 2021-04-11| 1 |
| B52 | 6 |2020-04-04| 2020-04-11| 2 |
------------------------------------------------
How can this be achieved in pandas?
Code to reproduce the df:
#df1
list_columns = ['company','company_id','date','max_date']
list_data = [
['A21',5,'2021-02-04','2021-02-11'],
['A21',10,'2020-10-04','2020-10-11'],
['B52',8,'2021-03-04','2021-04-11'],
['B52',6,'2020-04-04','2020-04-11']
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
#df2
list_columns = ['company','company_id','date']
list_data = [
['A21',5,'2021-02-05'],
['A21',5,'2021-02-08'],
['A21',5,'2021-02-12'],
['A21',5,'2021-02-11'],
['A21',10,'2020-10-07'],
['B52',8,'2021-03-07'],
['B52',6,'2020-04-08'],
['B52',6,'2020-04-12'],
['B52',6,'2020-04-05']
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Use DataFrame.merge with the default inner join, then filter the matched values with Series.between, aggregate the counts with GroupBy.size, and append the result as a new column, replacing missing values if necessary:
df1['date'] = pd.to_datetime(df1['date'])
df1['max_date'] = pd.to_datetime(df1['max_date'])
df2['date'] = pd.to_datetime(df2['date'])
df = df1.merge(df2, on=['company','company_id'], suffixes=('','_'))
s = (df[df['date_'].between(df['date'], df['max_date'])]
.groupby(['company','company_id'])
.size())
df1 = df1.join(s.rename('count'), on=['company','company_id']).fillna({'count':0})
print (df1)
company company_id date max_date count
0 A21 5 2021-02-04 2021-02-11 3
1 A21 10 2020-10-04 2020-10-11 1
2 B52 8 2021-03-04 2021-04-11 1
3 B52 6 2020-04-04 2020-04-11 2
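The join above attaches counts by (company, company_id), which works when each pair appears once in df1. If the same pair could appear with several date windows, one possible variation is to carry df1's row label through the merge and count per row. This is a sketch under that assumption; the "_df2" suffix and the intermediate names are choices made here, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    columns=["company", "company_id", "date", "max_date"],
    data=[
        ["A21", 5, "2021-02-04", "2021-02-11"],
        ["A21", 10, "2020-10-04", "2020-10-11"],
        ["B52", 8, "2021-03-04", "2021-04-11"],
        ["B52", 6, "2020-04-04", "2020-04-11"],
    ],
)
df2 = pd.DataFrame(
    columns=["company", "company_id", "date"],
    data=[
        ["A21", 5, "2021-02-05"], ["A21", 5, "2021-02-08"],
        ["A21", 5, "2021-02-12"], ["A21", 5, "2021-02-11"],
        ["A21", 10, "2020-10-07"], ["B52", 8, "2021-03-07"],
        ["B52", 6, "2020-04-08"], ["B52", 6, "2020-04-12"],
        ["B52", 6, "2020-04-05"],
    ],
)

for col in ["date", "max_date"]:
    df1[col] = pd.to_datetime(df1[col])
df2["date"] = pd.to_datetime(df2["date"])

# reset_index() exposes df1's row label as a column named "index",
# so counts can be attached to rows rather than to key pairs.
merged = df1.reset_index().merge(
    df2, on=["company", "company_id"], suffixes=("", "_df2")
)

# Series.between is inclusive on both ends by default.
in_window = merged["date_df2"].between(merged["date"], merged["max_date"])
counts = merged.loc[in_window].groupby("index").size()

# Rows of df1 without any match get 0 instead of NaN.
df1["count"] = counts.reindex(df1.index, fill_value=0)
print(df1)
```

Because both bounds are inclusive, 2021-02-11 counts for the first row while 2021-02-12 does not.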

How to count only same line specific column (A) is true and column (B) not empty

| A | B |
1 | Boolean | number |
2 | TRUE | 0
3 | FALSE | 1
4 | TRUE |
5 | FALSE | 1
- - - - - - - - -
6 | 2 COUNTIF(A2:A5,TRUE) | ?
In B6, how do I count only the rows where column A is TRUE and column B is not empty?
E.g. A2 and A4 are TRUE; B2 is not empty but B4 is, so only row 2 counts and the result is 1.
=COUNTIF(A2:A5,TRUE,COUNTIF(...))
Use
=COUNTIFS(A2:A5,TRUE,B2:B5,"<>")

How to reference a cell where sheet name is cited as a value from a different workbook?

I have two workbooks: data.xlsx (which is read-only and contains the sheets mainsheet, a, b, c, d, ...) and result.xlsx (where I will put all my computations and formulas).
data.xlsx!mainsheet contains:
-------
A | B |
-------
1 | c |
2 | b |
3 | a |
.
.
.
-------
and results.xlsx contains
-------------------
| A | B | C |
-------------------
1 |S1 | S2 | Sum |
2 | 3 | 1 | |
3 | 2 | 3 | |
4 | 1 | 2 | |
Values of cell A1 of sheets a, b, c are 10, 5 and 50 respectively.
What should be the formula so that:
Cell C2 should be the sum of the A1 values of sheets a and c
Cell C3 should be the sum of the A1 values of sheets b and a
Cell C4 should be the sum of the A1 values of sheets c and b
So the expected result will be cell C2 = 10+50=60, C3=5+10=15, C4=50+5=55.
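Use VLOOKUP to get the names of the sheets that need to be summed.
Use INDIRECT to get the values. Because sheets a, b and c live in data.xlsx, the workbook name has to be part of the reference text, and data.xlsx must be open for INDIRECT to resolve it:
=INDIRECT("'[data.xlsx]"&VLOOKUP(A4,[data.xlsx]mainsheet!$A:$B,2,0)&"'!A1")+INDIRECT("'[data.xlsx]"&VLOOKUP(B4,[data.xlsx]mainsheet!$A:$B,2,0)&"'!A1")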

Excel - How to highlight duplicate values in their own row when values aren't adjacent

In my spreadsheet, I am trying to highlight duplicate values in a row.
Catch #1 is, every row is assessed differently.
Catch #2 is, the values are not adjacent in their rows.
Example:
A | B | C | D | E | F | G | H | I |
1 Bob | 1 | Jim | 2 | Pat | 3 | Sam | 4 | |
2 Bob | 3 | Pat | 1 | Sam | 1 | Jim | 2 | |
3 Jim | 2 | Bob | 2 | Pat | 3 | Sam | 2 | |
4 Pat | 3 | Pat | 3 | | | | | |
5 | | | | | | | | |
In the example, I am checking each row for duplicates in columns B, D, F and H. Basically the number columns are being assessed against each other.
Row 1: None are highlighted.
Row 2: D2 and F2 are highlighted
Row 3: B3, D3, H3 are highlighted
Row 4: B4 and D4 are highlighted, but F4 and H4 are not because they're empty. A4 and C4 aren't highlighted either because columns A, C, E and G aren't being checked.
Try this formula in conditional formatting's formula based rule:
=AND(ISNUMBER(A1), COUNTIF(1:1,A1)>1)
Hope this helps!
