python pandas dataframe calculation - python-3.x

I have a dataframe with two numeric columns, item_cnt_day and item_price.
I want to create a new column called rev, calculated as item_cnt_day * item_price. However, I only want rev = item_cnt_day * item_price when item_cnt_day is greater than or equal to 0; otherwise rev = 0. Could you help me write the code for this condition when creating the new column rev?

You can use the loc accessor with a boolean mask:
df['rev'] = 0
df.loc[df['item_cnt_day'].ge(0), 'rev'] = df['item_cnt_day'].mul(df['item_price'])
OR
You can use where():
df['rev'] = df['item_cnt_day'].mul(df['item_price']).where(df['item_cnt_day'].ge(0), 0)
# equivalently: df['item_cnt_day'].mul(df['item_price']).mask(df['item_cnt_day'].lt(0), 0)
OR
via np.where():
import numpy as np
df['rev'] = np.where(df['item_cnt_day'].ge(0), df['item_cnt_day'].mul(df['item_price']), 0)
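For reference, a minimal, self-contained sketch of the np.where variant on made-up sample values (not from the original question):
import numpy as np
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'item_cnt_day': [3, -1, 0, 2],
                   'item_price': [10.0, 5.0, 7.5, 2.0]})

# rev = item_cnt_day * item_price when item_cnt_day >= 0, otherwise 0
df['rev'] = np.where(df['item_cnt_day'].ge(0),
                     df['item_cnt_day'].mul(df['item_price']),
                     0)
print(df)
# expected rev column: 30.0, 0.0, 0.0, 4.0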

Related

Understanding an example of window function

I was running the code script below and got the following result. I don't understand why I got the xyz1 column as shown in the image. For example, why is the first row of xyz1 equal to 0? According to the window function, its corresponding group should be the first two rows, so why does F.count(F.col("xyz")).over(w) return 0 here?
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# sample data; named 'data' to avoid shadowing the built-in list
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])

w = Window().partitionBy("I_id", "p_id").orderBy(F.col("xyz"))
df.withColumn("xyz1", F.count(F.col("xyz")).over(w)).show()
Note that count only counts non-null items, and that the grouping is defined only by the partitionBy clause, not by the orderBy clause.
When you specify an ordering column, the default window frame is (according to the docs)
(rangeFrame, unboundedPreceding, currentRow)
So your window definition is actually
w = (Window().partitionBy("I_id", "p_id")
             .orderBy(F.col("xyz"))
             .rangeBetween(Window.unboundedPreceding, Window.currentRow))
And so the window only includes the rows from xyz = -infinity up to the value of xyz in the current row. That's why the first row has a count of zero: it counts the non-null items from xyz = -infinity to xyz = null, i.e. the first two rows of the dataframe, both of which are null.
For the row where xyz = 2, the count includes the non-null items from xyz = -infinity to xyz = 2, i.e. the first four rows. That's why you get a count of 2: the only non-null items in that frame are 1 and 2.
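If what you actually want is the count of non-null xyz values over the whole partition rather than a running count, a minimal sketch (reusing the df and imports from above) is to drop the ordering, or to keep it and widen the frame explicitly:
# Without orderBy, the frame defaults to the entire partition
w_full = Window().partitionBy("I_id", "p_id")
df.withColumn("xyz1", F.count(F.col("xyz")).over(w_full)).show()

# Equivalent: keep the ordering but make the frame span the whole partition
w_unbounded = (Window().partitionBy("I_id", "p_id")
                       .orderBy(F.col("xyz"))
                       .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("xyz1", F.count(F.col("xyz")).over(w_unbounded)).show()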

How can I get the same result as Pandas iloc in PySpark?

In a Pandas dataframe I can get the first 1000 rows with data.iloc[1:1000,:]. How can I do it in PySpark?
You can use df.limit(1000) to get 1000 rows from your dataframe. Note that Spark has no notion of a row index, so without an ordering this just returns 1000 arbitrary rows. If you need a particular ordering, you can assign a row number based on a certain column and filter on it, e.g.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = (df.withColumn('rn', F.row_number().over(Window.orderBy('col_to_order')))
         .filter('rn <= 1000'))
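As an aside, data.iloc[1:1000, :] in pandas actually skips the first row and returns the rows at positions 1 through 999; if that exact slice is needed, a sketch along the same lines (col_to_order is still a hypothetical ordering column):
# keep 1-based row numbers 2..1000, i.e. positional rows 1..999
df_slice = (df.withColumn('rn', F.row_number().over(Window.orderBy('col_to_order')))
              .filter('rn >= 2 AND rn <= 1000')
              .drop('rn'))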

Assigning np.nans to rows of a Pandas column using a query

I want to assign NaNs to the rows of a column in a Pandas dataframe when some conditions are met.
For a reproducible example here are some data:
'{"Price":{"1581292800000":21.6800003052,"1581379200000":21.6000003815,"1581465600000":21.6000003815,"1581552000000":21.6000003815,"1581638400000":22.1599998474,"1581984000000":21.9300003052,"1582070400000":22.0,"1582156800000":21.9300003052,"1582243200000":22.0200004578,"1582502400000":21.8899993896,"1582588800000":21.9699993134,"1582675200000":21.9599990845,"1582761600000":21.8500003815,"1582848000000":22.0300006866,"1583107200000":21.8600006104,"1583193600000":21.8199996948,"1583280000000":21.9699993134,"1583366400000":22.0100002289,"1583452800000":21.7399997711,"1583712000000":21.5100002289},"Target10":{"1581292800000":22.9500007629,"1581379200000":23.1000003815,"1581465600000":23.0300006866,"1581552000000":22.7999992371,"1581638400000":22.9599990845,"1581984000000":22.5799999237,"1582070400000":22.3799991608,"1582156800000":22.25,"1582243200000":22.4699993134,"1582502400000":22.2900009155,"1582588800000":22.3248996735,"1582675200000":null,"1582761600000":null,"1582848000000":null,"1583107200000":null,"1583193600000":null,"1583280000000":null,"1583366400000":null,"1583452800000":null,"1583712000000":null}}'
In this particular toy example, I want to assign NaNs to the column 'Price' when the column 'Target10' has NaNs (in the general case the condition may be more complex).
This line of code achieves that specific objective:
toy_data.Price.where(toy_data.Target10.notnull(), toy_data.Target10)
However, when I attempt to use query and assign NaNs to the targeted column, it fails:
toy_data.query('Target10.isnull()', engine = 'python').Price = np.nan
The above line leaves toy_data intact.
Why is that, and how should I use query to replace values in particular rows?
query() returns a filtered copy of the dataframe, so assigning to the Price column of that copy is chained assignment and never modifies toy_data itself. One way to do it on the original frame is:
import numpy as np
toy_data['Price'] = np.where(toy_data['Target10'].isna(), np.nan, toy_data['Price'])
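An equivalent sketch that assigns through a boolean mask with .loc on the original frame (same column names as in the question):
import numpy as np

# set Price to NaN wherever Target10 is NaN, directly on toy_data
toy_data.loc[toy_data['Target10'].isna(), 'Price'] = np.nan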

indexing interval data with pandas

I want to select the values of a variable in a pandas dataframe above a certain percentile. I have tried binning the data with pd.cut, but the result of cut is a pandas Interval (I think unordered) and I don't know how to select values from it:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=150), columns=['whatever'])
df_bins = np.linspace(df['whatever'].min(), df['whatever'].max(), 101)
df['bin'] = pd.cut(df.iloc[:, 0], df_bins)
How do I select rows based on the values of the bin column, e.g. above the 0.95 quantile?
You don't need bins; pandas already has a quantile function.
df[df.whatever > df.whatever.quantile(0.95)]
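For instance, a short sketch with the df from the question, splitting the threshold out for readability:
# value of the 95th percentile of the 'whatever' column
threshold = df['whatever'].quantile(0.95)

# rows strictly above that threshold
top_rows = df[df['whatever'] > threshold]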

Assign values to a datetime column in Pandas / Rename a datetime column to a date column

[screenshot of the resulting dataframe]
I have created the following dataframe 'user_char' in Pandas with:
## Create a new workbook User Char with empty datetime columns to import data from the ledger
user_char = all_users[['createdAt', 'uuid', 'gasType', 'role']]

## Filter on consumers in the user_char table
user_char = user_char[user_char.role == 'CONSUMER']
user_char.set_index('uuid', inplace=True)

## Create the datetime columns that need to be added to the existing df
user_char_rng = pd.date_range('3/1/2016', periods=25, freq='MS')

## Convert the datetime index to a list
user_char_rng = list(user_char_rng)

## Add the empty columns
user_char = user_char.reindex(columns=user_char.columns.tolist() + user_char_rng)
user_char
and I am trying to assign a value to the highlighted column using the following command:
user_char['2016-03-01 00:00:00'] = 1
but this keeps creating a new column rather than editing the existing one. How do I assign the value 1 to every row of that existing column without adding a new one?
Also, how do I rename the datetime columns so that only the date is shown, without the timestamp?
Try
user_char.loc[:, '2016-03-01'] = 1
Because your column index is a DatetimeIndex, pandas is smart enough to translate the string '2016-03-01' into datetime format. Using loc[c] seems to hint to pandas to first look for c among the existing column labels, rather than create a new column named c.
Side note: the DatetimeIndex of time-series data is conventionally used as the (row) index of a DataFrame, not in the columns. (There's no technical reason why you can't use time in the columns, of course!) In my experience, most of the PyData stack is built to expect "tidy data", where each variable (like time) forms a column, and each observation (timestamp value) forms a row. The way you're doing it, you'll need to transpose your DataFrame before calling plot() on it, for example.
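For the renaming part of the question, a hedged sketch (assuming the added column labels are pandas Timestamps, as produced by date_range above) that keeps only the date portion of each label:
import pandas as pd

# replace Timestamp column labels with plain dates; leave the other labels untouched
user_char.columns = [c.date() if isinstance(c, pd.Timestamp) else c
                     for c in user_char.columns]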
