df = pd.DataFrame({"ID": ['A', 'B', 'C', 'D', 'E', 'F'],
                   "IPaddress": ['12.345.678.01', '12.345.678.02', '12.345.678.01',
                                 '12.345.678.18', '12.345.678.02', '12.345.678.01'],
                   "score": [8, 9, 5, 10, 3, 7]})
I'm using Python and the Pandas library. For rows with duplicate IP addresses, I want to keep only the one row with the highest score (scores range from 0 to 10) and drop all other duplicates.
I'm having a difficult time turning this logic into a Python function.
Step 1: Using the groupby function of Pandas, split the df into groups by IPaddress.
df.groupby('IPaddress')
This creates a groupby object. If you check its type, it will be the following: pandas.core.groupby.groupby.DataFrameGroupBy
Step 2: With the groupby object created in Step 1, calling .idxmax() on the score column returns a Pandas Series holding, for each IPaddress, the index label of the row with the maximum score
df.groupby('IPaddress').score.idxmax()
(Optional) Step 3: If you want to turn the above Series into a dataframe of the winning rows, you can do the following:
df.loc[df.groupby('IPaddress').score.idxmax(),['IPaddress','score']]
Here, you are selecting all the rows with the maximum score per IP address and showing the IPaddress and score columns.
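Putting the steps together, a minimal end-to-end sketch using the question's own data looks like this:

```python
import pandas as pd

df = pd.DataFrame({"ID": ['A', 'B', 'C', 'D', 'E', 'F'],
                   "IPaddress": ['12.345.678.01', '12.345.678.02', '12.345.678.01',
                                 '12.345.678.18', '12.345.678.02', '12.345.678.01'],
                   "score": [8, 9, 5, 10, 3, 7]})

# idxmax() gives one row index per IP address (the row with the highest
# score); .loc then pulls exactly those rows, dropping all other duplicates.
best = df.loc[df.groupby('IPaddress')['score'].idxmax()]
```

This keeps all original columns (including ID), which is usually what you want when deduplicating.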
Useful references:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
2. https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
3. https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/
I'm using Python 3.9 with Pandas and NumPy.
Every day I receive a df with orders from the company I work for. Each day this df comes from a different country whose language I don't know, and these dataframes don't follow a fixed layout, so I know neither the column names nor the index.
I just know that the orders follow a pattern: 3 digits + 2 letters, like 000AA, 149KL, 555EE, etc.
I saw that this is possible with plain strings, but with pandas I have only found commands that need the name of the column.
df.column_name.str.contains(pat=r'\d\d\d\w\w', regex=True)
If I can find the column whose values all match this pattern, I will know which one is the orders column.
I started with a synthetic data set:
import pandas
df = pandas.DataFrame([{'a': 3, 'b': 4, 'c': '222BB', 'd': '2asf'},
                       {'a': 2, 'b': 1, 'c': '111AA', 'd': '942'}])
I then cycle through each column. If the datatype is object, I test whether all the elements in the Series match the regex:
for column_id in df.columns:
    if df[column_id].dtype == 'object':
        if all(df[column_id].str.contains(pat=r'\d\d\d\w\w', regex=True)):
            print("matching column:", column_id)
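One caveat: str.contains is unanchored and \w also matches digits, so a value like '1234AB' would pass. If the orders really are exactly 3 digits followed by 2 letters, a stricter variant (assuming pandas >= 1.1 for str.fullmatch) is:

```python
import pandas as pd

df = pd.DataFrame([{'a': 3, 'b': 4, 'c': '222BB', 'd': '2asf'},
                   {'a': 2, 'b': 1, 'c': '111AA', 'd': '942'}])

# fullmatch anchors the pattern to the whole string, and [A-Za-z]
# excludes digits, unlike \w.
matches = [col for col in df.columns
           if df[col].dtype == 'object'
           and df[col].str.fullmatch(r'\d{3}[A-Za-z]{2}').all()]
```

On the synthetic data this finds only column 'c'.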
I want to assign NaNs to the rows of a column in a Pandas dataframe when some conditions are met.
For a reproducible example here are some data:
'{"Price":{"1581292800000":21.6800003052,"1581379200000":21.6000003815,"1581465600000":21.6000003815,"1581552000000":21.6000003815,"1581638400000":22.1599998474,"1581984000000":21.9300003052,"1582070400000":22.0,"1582156800000":21.9300003052,"1582243200000":22.0200004578,"1582502400000":21.8899993896,"1582588800000":21.9699993134,"1582675200000":21.9599990845,"1582761600000":21.8500003815,"1582848000000":22.0300006866,"1583107200000":21.8600006104,"1583193600000":21.8199996948,"1583280000000":21.9699993134,"1583366400000":22.0100002289,"1583452800000":21.7399997711,"1583712000000":21.5100002289},"Target10":{"1581292800000":22.9500007629,"1581379200000":23.1000003815,"1581465600000":23.0300006866,"1581552000000":22.7999992371,"1581638400000":22.9599990845,"1581984000000":22.5799999237,"1582070400000":22.3799991608,"1582156800000":22.25,"1582243200000":22.4699993134,"1582502400000":22.2900009155,"1582588800000":22.3248996735,"1582675200000":null,"1582761600000":null,"1582848000000":null,"1583107200000":null,"1583193600000":null,"1583280000000":null,"1583366400000":null,"1583452800000":null,"1583712000000":null}}'
In this particular toy example, I want to assign NaNs to the column 'Price' when the column 'Target10' has NaNs. (in the general case the condition may be more complex)
This line of code achieves that specific objective:
toy_data.Price.where(toy_data.Target10.notnull(), toy_data.Target10)
However when I attempt to use a query and assign NaNs to the targeted column I fail:
toy_data.query('Target10.isnull()', engine = 'python').Price = np.nan
The above line leaves toy_data intact.
Why is that, and how should I use query to replace values in particular rows?
The query(...) call returns a copy of the matching rows, so assigning to that copy's Price attribute never touches toy_data. One way to do it on the original frame is -
import numpy as np
toy_data['Price'] = np.where(toy_data['Target10'].isna(), np.nan, toy_data['Price'])
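The same effect can also be had with boolean indexing via .loc, which writes into the frame in place; this is a minimal sketch on made-up prices rather than the full toy_data JSON:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Price': [21.68, 21.60, 21.85],
                    'Target10': [22.95, np.nan, np.nan]})

# .loc selects the rows where Target10 is NaN and assigns into Price
# in place - unlike query(), which returns a copy.
toy.loc[toy['Target10'].isna(), 'Price'] = np.nan
```

This generalizes to more complex conditions: any boolean Series can go in the row selector.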
At the moment I have 9 functions which each do a specific calculation on a data frame - average balance per month, rolling P&L, period-start balances, ratio calculations.
Each of those functions produces a Spark data frame with the same layout: the first columns are the group-by columns the function accepts (1 column if there is a single group-by variable, 2 columns if there are two, etc.), and the final column holds the calculated statistic - examples of which I listed at the beginning.
Because each of those functions does a different calculation, I need to produce a data frame for each one and then join them to produce a report.
I join them on the group-by variables because they are common to all of them (each individual statistic report).
But doing 7-8 and even more joins is very slow.
Is there a way to add those columns together without using join?
Thank you.
I can think of multiple approaches, but this looks like a good use case for the newer pandas UDF Spark API.
You can define one grouped-map UDF. The UDF receives each group as a pandas dataframe. You apply the 9 aggregate functions to the group and return a pandas dataframe with 9 additional aggregated columns. Spark then combines the returned pandas dataframes into one large Spark dataframe.
e.g.
# given you want to aggregate average and ratio
@pandas_udf("month long, avg double, ratio double", PandasUDFType.GROUPED_MAP)
def compute(pdf):
    # pdf is a pandas.DataFrame holding one group
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    return pdf

df.groupby("month").apply(compute).show()
See Pandas-UDF#Grouped-Map
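Before wiring this into Spark, the grouped-map logic can be sanity-checked in plain pandas. Here compute inlines two stand-in calculations (the avg/ratio definitions are hypothetical placeholders, not the original 9 functions), and concat plays the role Spark plays when it combines the per-group results:

```python
import pandas as pd

def compute(pdf):
    # pdf is one group as a pandas DataFrame; add derived columns, no joins.
    pdf = pdf.copy()
    pdf['avg'] = pdf['amount'].mean()                   # group-level average
    pdf['ratio'] = pdf['amount'] / pdf['amount'].sum()  # share of group total
    return pdf

df = pd.DataFrame({'month': [1, 1, 2], 'amount': [10.0, 30.0, 5.0]})

# One pass per group, all statistic columns produced together.
out = pd.concat([compute(group) for _, group in df.groupby('month')])
```

The key point is that every statistic is computed inside the same per-group pass, so no join on the group-by variables is needed afterwards.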
If your cluster is on a lower version, you have 2 options:
Stick to the dataframe api and write custom aggregate functions. See answer. They have a horrible api, but usage would look like this:
df.groupBy(df.month).agg(
    my_avg_func(df.amount).alias('avg'),
    my_ratio_func(df.amount).alias('ratio'),
)
Fall back to the good ol' rdd map-reduce api:
# pseudocode
def _create_tuple_key(record):
    return (record.month, record)

def _compute_stats(acc, record):
    acc['count'] += 1
    acc['avg'] = _accumulate_avg(acc['count'], record)
    acc['ratio'] = _accumulate_ratio(acc['count'], record)
    return acc

# _compute_stats folds records into a per-key accumulator; a second function
# to merge accumulators across partitions (_merge_stats) is left as pseudocode
df.rdd.map(_create_tuple_key).aggregateByKey({'count': 0}, _compute_stats, _merge_stats)
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, the yhat_df column values are simply the last observed value from the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following, but that populates the column values only where the PeriodIndex values match. It seems fillna() internally does a join() of sorts on the index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
index = test_df.index, columns = test_df.columns)
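As a quick check of the dictionary approach, here is a self-contained sketch with made-up column names (Sales/Price) and dates standing in for the real data:

```python
import numpy as np
import pandas as pd

idx_train = pd.period_range('2020-01-01', periods=2, freq='D')
idx_test = pd.period_range('2020-01-03', periods=2, freq='D')
train_df = pd.DataFrame({'Sales': [1.0, 2.0], 'Price': [9.0, 8.0]}, index=idx_train)
test_df = pd.DataFrame(np.nan, index=idx_test, columns=train_df.columns)

# One dict of {column: last observed value}; fillna then ignores the index
# and fills every NaN in each column with that column's value.
last_row = train_df.tail(1).to_dict('records')[0]   # {'Sales': 2.0, 'Price': 8.0}
yhat_df = test_df.fillna(last_row)
```

Because the fill values come from a plain dict rather than a DataFrame, no index alignment happens, which is exactly what the question asked for.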
I want to select the values of a variable in a pandas dataframe above a certain percentile. I have tried using binned data with pd.cut, but the result of cut is a pandas Interval (I think unordered) and I don't know how to select values based on it:
df = pd.DataFrame(np.random.randint(0, 100, size=150), columns=['whatever'])
df_bins = np.linspace(df.min(), df.max(), 101)
df['bin'] = pd.cut(df.iloc[:, 0], df_bins)
How do I select rows based on the values of the bin column, e.g. above the 0.95 quantile?
You don't need bins; Pandas already has a quantile function.
df[df.whatever > df.whatever.quantile(0.95)]
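A short self-contained sketch of the quantile approach (seeded random data, so the exact cutoff value is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'whatever': rng.integers(0, 100, size=150)})

# quantile(0.95) returns a scalar: the 95th-percentile value of the column.
cutoff = df['whatever'].quantile(0.95)

# Boolean indexing keeps only rows strictly above that cutoff.
top = df[df['whatever'] > cutoff]
```

No binning step is needed; the cutoff is an ordinary number, so any comparison operator works against it.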