Unable to understand DataFrame method "loc" logic when we use incorrect names of labels - python-3.x

I am using the method loc to extract columns by their labels. I encountered an issue when using incorrect label names, which produced the output below. Please help me understand the logic behind the loc method in terms of label use.
import pandas as pd
Dic={'empno':(101,102,103,104),'name':('a','b','c','d'),'salary':(3000,5000,8000,9000)}
df=pd.DataFrame(Dic)
print(df)
print()
print(df.loc[0:2,'empsfgsdzfsdfsdaf':'salary'])
print(df.loc[0:2,'empno':'salarysadfsa'])
print(df.loc[0:2,'name':'asdfsdafsdaf'])
print(df.loc[0:2,'sadfsadfsadf':'sasdfsdflasdfsdfsdry'])
print(df.loc[0:2,'':'nasdfsd'])
OUTPUT:
   empno name  salary
0    101    a    3000
1    102    b    5000
2    103    c    8000
3    104    d    9000

  name  salary
0    a    3000
1    b    5000
2    c    8000

   empno name  salary
0    101    a    3000
1    102    b    5000
2    103    c    8000

Empty DataFrame
Columns: []
Index: [0, 1, 2]

   salary
0    3000
1    5000
2    8000

   empno name
0    101    a
1    102    b
2    103    c

.loc[A : B, C : D] will select:
index (row) labels from (and including) A to (and including) B; and
column labels from (and including) C to (and including) D.
Let's look at the column label slice 'a':'salary'. Since 'a' sorts before the first column label ('empno'), we get empno, name, salary.
print(df.loc[0:2, 'a':'salary'])
   empno name  salary
0    101    a    3000
1    102    b    5000
2    103    c    8000
It works the same way at the upper end of the slice:
print(df.loc[0:2, 'name':'z'])
  name  salary
0    a    3000
1    b    5000
2    c    8000
Here is a list comprehension that shows how the second slice works:
# code
[col for col in df.columns if 'name' <= col <= 'z']
# result
['name', 'salary']
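One caveat worth noting (general pandas slicing behaviour, not spelled out in the original answer): slicing with labels that don't exist only works here because the column index happens to be lexicographically sorted, so pandas can fall back to a sorted-position lookup. On a non-monotonic index the boundary cannot be placed and the missing label raises a KeyError:
# Reorder the columns so the index is no longer sorted
df2 = df[['salary', 'empno', 'name']]
df2.loc[0:2, 'a':'salary']   # raises KeyError: 'a'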

There is a good description of the most commonly used subsetting methods here:
https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from df2's Sales column that sums up to the target value in df1's Total Sales column; the Name & Date columns should match in both dataframes when joining (as shown in the expected output).
For example: row 0 in df1 should be matched only with df2 rows 0 & 1, since the Name & Date columns are the same, namely Name: John and Date: 2021-10-01.
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
    Name        Date  Total Sales
0   John  2021-10-01        15500
1   John  2021-11-01         5500
2   Jack  2021-10-10        17600
3  Nancy  2021-10-12        20700
4  Ahmed  2021-10-30        12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
     ID   Name        Date  Sales
0   JO1   John  2021-10-01  10000
1   JO2   John  2021-10-01   5000
2   JO3   John  2021-11-01   1000
3   JO4   John  2021-11-01   5500
4   JA1   Jack  2021-10-10  10000
5   JA2   Jack  2021-10-10   7000
6   NA1  Nancy  2021-10-12  20000
7   NA2  Nancy  2021-10-12    100
8   NA3  Nancy  2021-10-12    500
9   NA4  Nancy  2021-10-12    100
10  AH1  Ahmed  2021-10-30   5000
11  AH2  Ahmed  2021-10-30   7000
12  AH3  Ahmed  2021-10-30  10000
13  AH3  Ahmed  2021-10-29  12000
Expected Output :
    Name        Date  Total Sales            Comb IDs  Comb Total
0   John  2021-10-01        15500            JO1, JO2     15000.0
1   John  2021-11-01         5500                 JO4      5500.0
2   Jack  2021-10-10        17600            JA1, JA2     17000.0
3  Nancy  2021-10-12        20700  NA1, NA2, NA3, NA4     20700.0
4  Ahmed  2021-10-30        12000            AH1, AH2     12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it across the pandas dataframes to get the expected output.
The variable numbers in the script below represents the Sales column in df2, and the variable target represents the Total Sales column in df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) with parameters for the target and a dataframe (which will be a subset of df2). It needs to return a list of IDs which correspond to the optimal combination.
Write another function which takes 3 arguments: name, date and target (let's call it calc). This function will filter df2 based on name and date, pass the result along with the target to the opt function, and return that function's result. Finally, iterate through the rows of df1 and call calc with the row arguments (or alternatively use pandas.DataFrame.apply), as in the sketch below.
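A minimal sketch of that approach (the helper names opt and calc come from the description above; returning the IDs as a joined string and the sum as a float is an assumption made to match the expected output):
import itertools
import math
import pandas as pd

def opt(target, group):
    # Brute-force the combination of rows whose Sales sum is closest to target
    best_ids, best_diff, best_sum = [], math.inf, 0.0
    rows = list(zip(group['ID'], group['Sales']))
    for r in range(1, len(rows) + 1):
        for combo in itertools.combinations(rows, r):
            total = sum(sales for _, sales in combo)
            if abs(target - total) < best_diff:
                best_diff = abs(target - total)
                best_ids = [i for i, _ in combo]
                best_sum = float(total)
    return ', '.join(best_ids), best_sum

def calc(name, date, target):
    # Restrict df2 to the matching Name/Date group before optimising
    subset = df2[(df2['Name'] == name) & (df2['Date'] == date)]
    return opt(target, subset)

df1[['Comb IDs', 'Comb Total']] = df1.apply(
    lambda row: pd.Series(calc(row['Name'], row['Date'], row['Total Sales'])),
    axis=1)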

Filtering rows in one dataframe based on two columns of another dataframe

I have two data frames. One dataframe (dfA) looks like:
Name    gender  start_coordinate  end_coordinate  ID
Peter   M       30                150             1
Hugo    M       4500              6000            2
Jennie  F       300               700             3
The other dataframe (dfB) looks like
Name    position  string
Peter   89        aa
Jennie  568       bb
Jennie  90        cc
I want to filter data from dfA such that position from dfB falls in the interval of dfA (start_coordinate and end_coordinate) and the names are the same as well. For example, the position value of row #1 of dfB falls in the interval specified by row #1 of dfA and the corresponding name value is also the same; therefore, I want this row. In contrast, row #3 of dfB also falls in the interval of row #1 of dfA, but the name value is different; therefore, I don't want this record.
The expected output therefore becomes:
##new_dfA
Name    gender  start_coordinate  end_coordinate  ID
Peter   M       30                150             1
Jennie  F       300               700             3
##new_dfB
Name    position  string
Peter   89        aa
Jennie  568       bb
In reality, dfB is of size (443068765, 10) and dfA is of size (100000, 3); therefore, I don't want to use numpy broadcasting because I run into memory errors. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3
>>> dfB_new
     Name  position string
0   Peter        89     aa
1  Jennie       568     bb
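If even the merged frame is too large for memory, a common workaround (an extension of the answer above, not part of it; the chunk size is arbitrary and assumes dfA fits in memory) is to process dfB in chunks and keep only the matching rows from each chunk:
import pandas as pd

chunks = []
for start in range(0, len(dfB), 1_000_000):
    # Merge and filter one slice of dfB at a time
    part = dfB.iloc[start:start + 1_000_000].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    chunks.append(part)
dfC = pd.concat(chunks, ignore_index=True)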
use pandasql:
from pandasql import sqldf
sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
     Name gender  start_coordinate  end_coordinate  ID
0   Peter      M                30             150   1
1  Jennie      F               300             700   3
sqldf("select dfB.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
     Name  position string
0   Peter        89     aa
1  Jennie       568     bb

Indexing based on multiple columns

I'm new to Python, and below is an ongoing data engineering issue I'm currently trying to resolve.
Table structure
Data:
Index 1: is sequential and increments by 1 as rows are added.
Index 2: the problem <<-- to tabulate index 2. This depends on the values stored in the columns [A, B, C, D, E]. If the value remains the same, we need to assign a single index to these rows.
e.g. rows 1, 2, 3 have 567 as the value for A, B, C respectively; therefore, index 2 is 100 for these 3 rows.
Record types :
1 - A
2 - B
3 - C
4 - D
5 - E
Code
import pandas as pd

data = [(100, 100, 1, 567, '', '', '', ''),
        (101, 100, 2, '', 567, '', '', ''),
        (102, 100, 3, '', '', 567, '', ''),
        (103, 101, 3, '', '', 568, '', ''),
        (104, 101, 4, '', '', '', 568, ''),
        (105, 101, 5, '', '', '', '', 568)]
# Creates the data frame
df = pd.DataFrame(data, columns=['index1', 'index2', 'record_type', 'A', 'B', 'C', 'D', 'E'], dtype=str)
# Combines columns A,B,C,D,E and adds a $ wherever it is null in order to stack these values
df['combined'] = df[['A', 'B', 'C', 'D', 'E']].stack().groupby(level=0).agg('$'.join)
# Cleans the column 'combined'
df['combined_cleaned'] = df['combined'].replace({r'\$': ''}, regex=True)
I am attempting to use the combined_cleaned column to calculate index2.
Not sure if this is the right approach; open to suggestions.
A few assumptions here, but they seem to fit your problem.
If there is only ever one value across those columns for each row, then you can take the max along the row, and then find consecutive groups by checking whether that Series is equal to itself shifted.
We add 99 because by definition the counting will start at 1, but you seem to want 100.
val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric).max(1)
#0 567.0
#1 567.0
#2 567.0
#3 568.0
#4 568.0
#5 568.0
#dtype: float64
df['index2'] = s.ne(s.shift()).cumsum() + 99
print(df)
  index1 record_type    A    B    C    D    E  index2
0    100           1  567                         100
1    101           2       567                    100
2    102           3            567               100
3    103           3            568               101
4    104           4                 568          101
5    105           5                      568     101
If, instead of there being a single value per row, 'record_type' points to the appropriate column, you can use numpy indexing.
import numpy as np
arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()
vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
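From there, the same shifted-comparison trick as above should apply to vals (this continuation is an assumption, not part of the original answer):
s = pd.Series(vals).astype(int)
df['index2'] = s.ne(s.shift()).cumsum() + 99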
The combined_cleaned column could be generated directly using
cols = ['A', 'B', 'C','D','E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)
You can also try with stack followed by factorize:
cols = ['A', 'B', 'C','D','E']
s = pd.factorize(df[cols].replace('',np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s
print(df)
  index1 index2 record_type    A    B    C    D    E  index2_new
0    100    100           1  567                             100
1    101    100           2       567                        100
2    102    100           3            567                   100
3    103    101           3            568                   101
4    104    101           4                 568              101
5    105    101           5                      568         101

Mean imputation based on certain conditions

I have the dataframe below:
Category  Value
A         100
A         -
B         -
C         50
D         200
D         400
D         -
As you can see, some values have the hyphen symbol '-'. I want to replace those hyphens with the mean of the corresponding category.
In the example, there are two entries for "A": one row with value 100 and the other with a hyphen, so the mean would be 100 itself. For B, since there are no valid values, the mean would be the mean of the entire column, which is ((100+50+200+400)/4 = 187.5). For C, no changes, and for D, the hyphen will be replaced by 300 (same logic as for "A").
Output:
Category  Value
A         100
A         100
B         187.5
C         50
D         200
D         400
D         300
Try:
import numpy as np

df = df.replace("-", np.nan)
df["Value"] = pd.to_numeric(df["Value"])
avg = df["Value"].mean()
df["Value"] = df["Value"].fillna(
    df.groupby("Category")["Value"].transform(
        lambda x: avg if x.isna().all() else x.mean()
    )
)
print(df)
Prints:
  Category  Value
0        A  100.0
1        A  100.0
2        B  187.5
3        C   50.0
4        D  200.0
5        D  400.0
6        D  300.0

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name.   gender  start_coordinate  end_coordinate  ID
Peter   M       30                150             1
Hugo    M       4500              6000            2
Jennie  F       300               700             3
The other dataframe (B) looks like
ID_sim.  position  string
1        89        aa
4        568       bb
5        938437    cc
I want to accomplish two tasks here:
I want to get a list of indices of rows (from dataframe B) for which the position column falls in an interval (specified by the start_coordinate and end_coordinate columns) in dataframe A.
The result for this task will be:
lst = [0, 1]  ### because row 0 of B falls in the interval of row 1 in A and row 1 of B falls in the interval of row 3 of A.
Using the indices that I get from task 1, I want to keep those rows of dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position  string
89        aa
568       bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
  ID_sim.  position string
0       1        89     aa
1       4       568     bb
You could use Pandas IntervalIndex to get the positions, and afterwards use a boolean to pull the relevant rows from B:
Create the IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
  ID_sim.  position string
0       1        89     aa
1       4       568     bb
This should work. Less elegant but easier to comprehend.
import pandas as pd

data = [['Name.', 'gender', 'start_coordinate', 'end_coordinate', 'ID'],
        ['Peter', 'M', 30, 150, 1],
        ['Hugo', 'M', 4500, 6000, 2],
        ['Jennie', 'F', 300, 700, 3]]
data2 = [['ID_sim.', 'position', 'string'],
         ['1', 89, 'aa'],
         ['4', 568, 'bb'],
         ['5', 938437, 'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
# Cross join pairs every row of df1 with every row of df2 so the interval
# check covers all combinations (how='cross' requires pandas >= 1.2)
merged = pd.merge(df1, df2, how='cross')
print(merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])
