indexing interval data with pandas - python-3.x

I want to select the values of a variable in a pandas dataframe that lie above a certain percentile. I have tried binning the data with pd.cut, but the result of cut is a pandas Interval (unordered, I think) and I don't know how to select values from it:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=150), columns=['whatever'])
df_bins = np.linspace(df['whatever'].min(), df['whatever'].max(), 101)
df['bin'] = pd.cut(df.iloc[:, 0], df_bins)
How do I select rows based on the values of the bin column, e.g. those above the 0.95 quantile?

You don't need bins; pandas already has a quantile function.
df[df.whatever > df.whatever.quantile(0.95)]
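A small self-contained sketch applying this to the dataframe from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=150), columns=['whatever'])

# Value at the 95th percentile of the column
threshold = df['whatever'].quantile(0.95)

# Rows whose value lies strictly above that percentile
top_rows = df[df['whatever'] > threshold]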

Related

How to drop an entire record if more than 90% of features have missing values in pandas

I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% missing values.
But how can I drop an entire record in pandas if 90% or more of its columns have missing values?
I have seen a similar post for "R" but I am coding in python at the moment.
You can use df.dropna() and set the thresh parameter to the value that corresponds to 10% of your columns (the minimum number of non-NA values a row needs in order to be kept). With 500 columns, that is 50:
df.dropna(axis=0, thresh=50, inplace=True)
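If the column count is not always 500, the threshold can be computed from the dataframe itself instead of being hard-coded; a minimal sketch of the same 10% rule:
import math

# Minimum number of non-NA values a row must have to survive (10% of columns)
min_non_na = math.ceil(df.shape[1] * 0.1)   # 50 for a 500-column dataframe
df.dropna(axis=0, thresh=min_non_na, inplace=True)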
You could use isna + mean on axis=1 to find the fraction of NaN values in each row, then select the rows where it is less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]

How can I get the same result as iloc in Pandas when using PySpark?

In a Pandas dataframe I can get the first 1000 rows with data.iloc[1:1000, :]. How can I do that in PySpark?
You can use df.limit(1000) to get 1000 rows from your dataframe. Note that Spark has no concept of an index, so this simply returns 1000 arbitrary rows, not "the first" 1000. If you need a particular ordering, you can assign a row number based on a certain column and filter on it, e.g.
import pyspark.sql.functions as F
from pyspark.sql import Window

df2 = df.withColumn('rn', F.row_number().over(Window.orderBy('col_to_order'))) \
        .filter('rn <= 1000')
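If the goal is simply the first 1000 rows under that ordering, sorting and then limiting also works (col_to_order is again just a placeholder column name):
# Spark optimizes orderBy + limit into a top-N query
first_1000 = df.orderBy('col_to_order').limit(1000)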

Assigning np.nans to rows of a Pandas column using a query

I want to assign NaNs to the rows of a column in a Pandas dataframe when some conditions are met.
For a reproducible example here are some data:
'{"Price":{"1581292800000":21.6800003052,"1581379200000":21.6000003815,"1581465600000":21.6000003815,"1581552000000":21.6000003815,"1581638400000":22.1599998474,"1581984000000":21.9300003052,"1582070400000":22.0,"1582156800000":21.9300003052,"1582243200000":22.0200004578,"1582502400000":21.8899993896,"1582588800000":21.9699993134,"1582675200000":21.9599990845,"1582761600000":21.8500003815,"1582848000000":22.0300006866,"1583107200000":21.8600006104,"1583193600000":21.8199996948,"1583280000000":21.9699993134,"1583366400000":22.0100002289,"1583452800000":21.7399997711,"1583712000000":21.5100002289},"Target10":{"1581292800000":22.9500007629,"1581379200000":23.1000003815,"1581465600000":23.0300006866,"1581552000000":22.7999992371,"1581638400000":22.9599990845,"1581984000000":22.5799999237,"1582070400000":22.3799991608,"1582156800000":22.25,"1582243200000":22.4699993134,"1582502400000":22.2900009155,"1582588800000":22.3248996735,"1582675200000":null,"1582761600000":null,"1582848000000":null,"1583107200000":null,"1583193600000":null,"1583280000000":null,"1583366400000":null,"1583452800000":null,"1583712000000":null}}'
In this particular toy example, I want to assign NaNs to the column 'Price' when the column 'Target10' has NaNs. (in the general case the condition may be more complex)
This line of code achieves that specific objective:
toy_data.Price.where(toy_data.Target10.notnull(), toy_data.Target10)
However when I attempt to use a query and assign NaNs to the targeted column I fail:
toy_data.query('Target10.isnull()', engine = 'python').Price = np.nan
The above line leaves toy_data intact.
Why is that, and how should I use query to replace values in particular rows?
query returns a new dataframe, so the assignment above only modifies that copy and never touches toy_data. One way to do the replacement on toy_data itself is:
import numpy as np
toy_data['Price'] = np.where(toy_data['Target10'].isna(), np.nan, toy_data['Price'])
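An alternative sketch that assigns through a boolean mask with .loc, so the assignment happens on toy_data directly rather than on an intermediate copy:
import numpy as np

# Set Price to NaN wherever Target10 is missing
toy_data.loc[toy_data['Target10'].isna(), 'Price'] = np.nan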

spark - get average of past N records excluding the current record

Given a Spark dataframe that I have
val df = Seq(
  ("2019-01-01", 100),
  ("2019-01-02", 101),
  ("2019-01-03", 102),
  ("2019-01-04", 103),
  ("2019-01-05", 102),
  ("2019-01-06", 99),
  ("2019-01-07", 98),
  ("2019-01-08", 100),
  ("2019-01-09", 47)
).toDF("day", "records")
I want to add a new column that holds, for each day, the average of the last N records. For example, if N=3, then on a given day that value should be the average of the last 3 values, EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe, with no missing dates in between. If not, aggregate the rows by date before applying the window function, as sketched below.
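A minimal sketch of that pre-aggregation, assuming the per-day records should be summed before the window is applied (swap in whichever aggregate fits your data):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Collapse to one row per day first (assumed aggregate: sum of records)
daily = df.groupBy("day").agg(F.sum("records").alias("records"))

# Same frame as above: the three rows immediately before the current one
w = Window.orderBy("day").rowsBetween(-3, -1)
daily_with_avg = daily.withColumn("rsum_prev_3_days", F.avg("records").over(w))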

Python Pandas unique values

df = pd.DataFrame({"ID":['A','B','C','D','E','F'],
"IPaddress":['12.345.678.01','12.345.678.02','12.345.678.01','12.345.678.18','12.345.678.02','12.345.678.01'],
"score":[8,9,5,10,3,7]})
I'm using Python and the Pandas library. For rows with duplicate IP addresses, I want to keep only the row with the highest score (scores range from 0 to 10) and drop the rest.
I'm having a difficult time turning this logic into a Python function.
Step 1: Using the groupby function of Pandas, split the df into groups of IPaddress.
df.groupby('IPaddress')
The result is a groupby object. If you check its type, it will be the following: pandas.core.groupby.groupby.DataFrameGroupBy
Step 2: On the groupby object from step 1, calling .idxmax() on the score column returns a Pandas Series holding, for each IPaddress, the index label of the row with the maximum score:
df.groupby('IPaddress').score.idxmax()
(Optional) Step 3: If you want to use those index labels to pull the matching rows back out as a dataframe, you can do the following:
df.loc[df.groupby('IPaddress').score.idxmax(),['IPaddress','score']]
Here you are selecting the row with the maximum score for each IP address and showing only the IPaddress and score columns.
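To keep every column (including ID) rather than only IPaddress and score, drop the column list; a sketch, together with an alternative based on sort_values and drop_duplicates (the two may differ only in how ties are broken):
# All columns of the highest-scoring row for each IP address
best = df.loc[df.groupby('IPaddress')['score'].idxmax()]

# Alternative: sort by score, then keep the first (highest) row per IP address
best_alt = (df.sort_values('score', ascending=False)
              .drop_duplicates(subset='IPaddress'))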
Useful references:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
2. https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
3. https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/
