Find a row matching multiple column criteria - python-3.x

I have a dataframe with 2M rows which is in the below format:
ID Number
1 30
1 40
1 60
2 10
2 30
3 60
I need to select the IDs that have both 30 and 40 present (in this case, the output should be 1).
I know we can create a new DF containing only the numbers 30 & 40 and then group by ID to see which IDs have a count greater than 1. But is there a way to do both in the groupby statement?
My code:
a=df[(df['Number']==30) | (df['Number']==40) ]
b=a.groupby('ID')['Number'].nunique().to_frame(name='tt').reset_index()
b[b['tt'] > 1]

Use groupby.filter together with set.issubset:
s = {30, 40}
df.groupby('ID').filter(lambda x: s.issubset(set(x.Number)))
Out[158]:
ID Number
0 1 30
1 1 40
2 1 60
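If only the matching IDs are needed (the question's expected output is just 1), the filtered frame can be reduced to its unique IDs; ids below is just an illustrative name, reusing the same s as above:
ids = df.groupby('ID').filter(lambda x: s.issubset(set(x.Number)))['ID'].unique()
ids
# array([1])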

I find the fact that the describe() method of GroupBy objects returns a dataframe to be extremely helpful.
Display temp1 = a.groupby("ID").describe() and temp2 = a.groupby("ID").describe()["Number"] in a Jupyter notebook to see what they look like; then the following code (which follows on from yours) should make sense.
summary = a.groupby("ID").describe()["Number"]
summary.loc[summary["count"] > 1].index

I would create a df for each condition and then inner join them on ID (joining on Number can never match, since one frame holds only 30s and the other only 40s):
df1 = df[df.Number == 30][['ID']]
df2 = df[df.Number == 40][['ID']]
df3 = df1.merge(df2, how='inner', on='ID')
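With the question's sample data, df3['ID'] then holds the IDs that contain both numbers:
print(df3['ID'].unique())   # [1]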

Related

How to filter a dataframe using a cumulative sum of a column as parameter

I have this df:
df=pd.DataFrame({'Name':['John','Mike','Lucy','Mary','Andy'],
'Age':[10,23,13,12,15],
'%':[20,20,10,25,25]})
I want to filter this df by taking rows from row 0 down to the row n at which the cumulative sum of the % column equals 50.
I don't want to sort the % column or the df; I just need the first rows whose % column sums to 50.
The output is:
filtered=pd.DataFrame({'Name':['John','Mike','Lucy'],'Age':[10,23,13],'%':[20,20,10]})
Use cumsum, build a boolean index, and slice with the loc or iloc accessor:
df.iloc[:(df['%'].cumsum()==50).idxmax()+1,:]
Name Age %
0 John 10 20
1 Mike 23 20
2 Lucy 13 10
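idxmax on the boolean series returns the label of the first True, so the slice keeps everything up to and including the row where the running total first hits 50. A minimal sketch with the question's data; the >= variant is an added assumption for the case where no prefix sums to exactly 50:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mike', 'Lucy', 'Mary', 'Andy'],
                   'Age': [10, 23, 13, 12, 15],
                   '%': [20, 20, 10, 25, 25]})

# exact match, as in the answer above
filtered = df.iloc[:(df['%'].cumsum() == 50).idxmax() + 1, :]

# variant: stop at the first row where the running total reaches or passes 50
filtered_ge = df.iloc[:(df['%'].cumsum() >= 50).idxmax() + 1, :]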

pandas explode multi column

I must use pandas 1.2.5, which supports explode() on only one column. The dataframe has several columns, each of which can hold a single value or a list. In one row several columns can hold lists, but it is guaranteed that all the lists in that row have the same length.
What is the best way to explode the dataframe?
Example of what I mean is to make this dataframe:
a              b
1              1
20             [10,20,30]
[100,200,300]  [100,200,300]
Look like this dataframe:
a    b
1    1
20   10
20   20
20   30
100  100
200  200
300  300
Since you are using an old pandas version and your columns do not have matching element counts in every row, multi-column explode is not an available option. Here is one approach: reshape the dataframe into a series in order to use the single-column explode, then create a new index using groupby + cumcount and reshape back to a dataframe.
s = df.stack().explode()
i = s.groupby(level=[0, 1]).cumcount()
s.to_frame().set_index(i, append=True)[0].unstack(1).ffill().droplevel(1)
Result
a b
0 1 1
1 20 10
1 20 20
1 20 30
2 100 100
2 200 200
2 300 300
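For reference, a minimal construction of the example frame (the exact constructor is an assumption based on the tables in the question); running the three lines above on it reproduces this result:
import pandas as pd

df = pd.DataFrame({'a': [1, 20, [100, 200, 300]],
                   'b': [1, [10, 20, 30], [100, 200, 300]]})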

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices of the rows (from dataframe B) whose position column falls in an interval (specified by the start_coordinate and end_coordinate columns) in dataframe A.
The result for this task will be:
lst = [0,1]  ### because row 0 of B falls in the interval of the row with ID 1 in A, and row 1 of B falls in the interval of the row with ID 3 in A.
I then want to keep the rows of dataframe B at the indices I get from task 1 to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
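The same mask also answers the first task, the list of matching indices in dfB (reusing the s, e and p defined above; lst is just an illustrative name):
lst = dfB.index[((p >= s) & (p <= e)).any(1)].tolist()
# [0, 1]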
You could use pandas IntervalIndex to get the positions, and afterwards use a boolean mask to pull the relevant rows from B:
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
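To get just the index list for the first task, the same boolean can be applied to B.index (this assumes the intervals in A do not overlap, which holds for the sample data, since get_indexer raises on overlapping intervals):
B.index[intervals.get_indexer(B.position) >= 0].tolist()
# [0, 1]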
This should work. Less elegant but easier to comprehend.
import pandas as pd
data = [['Name.','gender', 'start_coordinate','end_coordinate','ID'],
        ['Peter','M',30,150,1],
        ['Hugo','M',4500,6000,2],
        ['Jennie','F',300,700,3]]
data2 = [['ID_sim.','position','string'],
         ['1',89,'aa'],
         ['4',568,'bb'],
         ['5',938437,'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
# a cross join pairs every row of df1 with every row of df2, so each
# position is checked against every interval (merging on the index would
# only compare positionally aligned rows and miss position 568)
merged = pd.merge(df1, df2, how='cross')
print(merged[(merged['position'] >= merged['start_coordinate']) & (merged['position'] <= merged['end_coordinate'])])

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with the pivot_table command? If not, which is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoiding the filter solution: in my real and more complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other', but when I filter I get this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
You need sum to count the Trues created by the condition x>0:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:(x>0).sum())
print (raw)
value
id
a 2
b 1
As @Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')
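For the updated example with the extra other column, the same (x > 0).sum() idea can be combined with a per-column aggfunc dict, so other is summed over all rows while value only counts the positives; a sketch with the question's data:
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','b','b'],
                   'value': [0,15,20,30,0],
                   'other': [2,3,4,5,6]})

raw = pd.pivot_table(df, index='id', values=['value','other'],
                     aggfunc={'value': lambda x: (x > 0).sum(), 'other': 'sum'})
print(raw)
#     other  value
# id
# a       9      2
# b      11      1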

Using sc.parallelize inside map() or any other solution?

I have the following issue: I need to find all combinations of values in column B for each id in column A and return the results as a DataFrame.
For the example input DataFrame below
A B
0 5 10
1 1 20
2 1 15
3 3 50
4 5 14
5 1 30
6 1 15
7 3 33
I need to get the following output DataFrame (it is for GraphX/GraphFrames):
src dst A
0 10 14 5
1 50 33 3
2 20 15 1
3 30 15 1
4 20 30 1
The one solution I have come up with so far is:
df_result = df.drop_duplicates().\
map(lambda (A,B):(A,[B])).\
reduceByKey(lambda p, q: p + q).\
map(lambda (A,B_values_array):(A,[k for k in itertools.combinations(B_values_array,2)]))
print df_result.take(3)
output: [(1, [(20,15),(30,20),(30,15)]),(5,[(10,14)]),(3,[(50,33)])]
And here I'm stuck :( How can I turn this into the DataFrame that I need? One idea was to use parallelize:
import spark_sc
edges = df_result.map(lambda (A,B_pairs): spark_sc.sc.parallelize([(k[0],k[1],A) for k in B_pairs]))
For spark_sc I have another file named spark_sc.py:
from pyspark import SparkContext
from pyspark.sql import SQLContext

def init():
    global sc
    global sqlContext
    sc = SparkContext(conf=conf,
                      appName="blablabla",
                      pyFiles=['my_file_with_code.py'])
    sqlContext = SQLContext(sc)
but my code failed with:
AttributeError: 'module' object has no attribute 'sc'
If I use spark_sc.sc outside of map() it works.
Any idea what I'm missing in the last step? Is it possible at all to use parallelize(), or do I need a completely different solution?
Thanks!
You definitely need another solution which could be as simple as:
from pyspark.sql.functions import greatest, least, col
df.alias("x").join(df.alias("y"), ["A"]).select(
least("x.B", "y.B").alias("src"), greatest("x.B", "y.B").alias("dst"), "A"
).where(col("src") != col("dst")).distinct()
where:
df.alias("x").join(df.alias("y"), ["A"])
joins the table with itself by A,
least("x.B", "y.B").alias("src")
and
greatest("x.B", "y.B")
pick the smaller of the two B values as the source and the larger one as the destination. Finally:
where(col("src") != col("dst"))
drops self loops.
In general it is not possible to use SparkContext from an action or a transformation (not that it would make any sense to do this in your case).
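A self-contained sketch of that join solution with the question's data (the SparkSession setup and the edges name are just for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest, least, col

spark = SparkSession.builder.appName("pairs-per-id").getOrCreate()
df = spark.createDataFrame(
    [(5, 10), (1, 20), (1, 15), (3, 50), (5, 14), (1, 30), (1, 15), (3, 33)],
    ["A", "B"])

# self-join on A, order each pair as (src, dst), drop self loops and duplicates
edges = (df.alias("x").join(df.alias("y"), ["A"])
           .select(least("x.B", "y.B").alias("src"),
                   greatest("x.B", "y.B").alias("dst"), "A")
           .where(col("src") != col("dst"))
           .distinct())
edges.show()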
