Create a new column in a dataframe pandas - python-3.x

How do I create a new column in a DataFrame that says "Cheap" if the price is below 50000, "Fair" if the price is between 50000 and 100000, and "Expensive" if the price is over 100000?

Although I think @mozway's solution is the cleanest one, here is another way using numpy.select:
import numpy as np
df['new_column'] = np.select([df['selling_price'] < 50_000,
                              df['selling_price'] <= 100_000],
                             ['Cheap', 'Fair'], 'Expensive')

There are many options. A nice one is pandas.cut:
df['new'] = pd.cut(df['selling_price'],
                   bins=[0, 50000, 100000, float('inf')],
                   labels=['cheap', 'fair', 'expensive'])
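One detail worth checking with pandas.cut is which side of each bin is closed. By default the bins are closed on the right, so a price of exactly 50000 lands in 'cheap' and a price of 0 falls outside the first bin. The sketch below uses invented sample prices (only the column name selling_price comes from the question) to compare the default with right=False, which matches "below 50000" more literally.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'selling_price' comes from the question.
df = pd.DataFrame({'selling_price': [0, 49_999, 50_000, 100_000, 100_001]})

bins = [0, 50_000, 100_000, float('inf')]
labels = ['cheap', 'fair', 'expensive']

# Default (right=True): bins are (0, 50000], (50000, 100000], (100000, inf]
df['right_closed'] = pd.cut(df['selling_price'], bins=bins, labels=labels)

# right=False: bins are [0, 50000), [50000, 100000), [100000, inf)
df['left_closed'] = pd.cut(df['selling_price'], bins=bins, labels=labels, right=False)

print(df)
```

With right=False, a price of 0 is labelled 'cheap' and 50000 is labelled 'fair'; with the default, 0 is left as NaN.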

You could use numpy.where() for this kind of data processing:
import numpy as np
df['Cheap'] = np.where(df['selling_price'] <= 50000, 'Cheap',  # when selling_price <= 50k: 'Cheap', otherwise...
              np.where((df['selling_price'] > 50000) & (df['selling_price'] < 100000), 'Fair',  # when > 50k and < 100k: 'Fair', otherwise...
              np.where(df['selling_price'] >= 100000, 'Expensive',  # when >= 100k: 'Expensive'
                       'N/A')))  # otherwise 'N/A', e.g. when the price is missing (NaN)

Another way, with apply() and a lambda function:
df["new"] = df.selling_price.apply(
    lambda x: "cheap" if x < 50000 else
              "fair" if x < 100000 else
              "expensive"
)
Or, more generally, in a way that lets you include multiple columns in the condition:
df["new"] = df.apply(
    lambda x: "cheap"
    if x["selling_price"] < 50000
    else "fair"
    if x["selling_price"] < 100000
    else "expensive",
    axis=1,
)
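Conditions spanning several columns can also be written without a row-wise apply, by combining boolean masks in numpy.select. The sketch below is only an illustration: the km_driven column and the rule that uses it are invented, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'km_driven' is an invented second column for illustration.
df = pd.DataFrame({
    'selling_price': [30_000, 30_000, 80_000, 150_000],
    'km_driven': [20_000, 200_000, 50_000, 10_000],
})

conditions = [
    # under 50k AND low mileage (invented rule)
    (df['selling_price'] < 50_000) & (df['km_driven'] < 100_000),
    # everything else under 100k
    df['selling_price'] < 100_000,
]
choices = ['cheap', 'fair']

# Conditions are checked in order; rows matching none of them get the default.
df['new'] = np.select(conditions, choices, default='expensive')
print(df)
```

Because the masks are evaluated on whole columns, this usually scales better than apply with axis=1, at the cost of a slightly less direct reading of the rule.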

Related

Trying to use pandas to group all data in Column B [duplicate]

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
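As a rough illustration of the suggestion above, here is a sketch on a tiny invented frame with the same column names as the question. It compares value_counts with groupby plus size, and shapes the result like the Occurrences_of_Words frame the question was trying to build (the Occurrences column name is an assumption).

```python
import pandas as pd

# Hypothetical miniature of the question's frame (same column names, invented values).
df = pd.DataFrame({
    'word': ['a', 'b', 'a', 'c', 'a', 'b'],
    'documents': [1, 1, 2, 2, 3, 3],
    'frequency': [4, 2, 7, 1, 5, 3],
})

# Occurrences per word without going through groupby; sorted by count, descending.
counts = df['word'].value_counts()

# Equivalent groupby form; size() counts rows and does not look at missing values.
sizes = df.groupby('word').size()

# Back to a two-column frame; 'Occurrences' is a hypothetical column name.
occurrences = counts.rename_axis('word').reset_index(name='Occurrences')
print(occurrences)
```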
When you want to count the frequency of categorical data in a column in pandas dataFrame use: df['Column_Name'].value_counts()
Just an addition to the previous answers: when dealing with real data there may be null values, so it's useful to include them in the count by using the option dropna=False (the default is True).
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt=='obj' else s
def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
                  ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        equality_check = lambda *x: (d:= pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )
plot_perfplot('obj')
plot_perfplot('int')

fast date based replacement of rows in Pandas

I am looking for the fastest way to replace rows in pandas based on the index.
I want to set all rows in a given date range of the index (a DatetimeIndex) to np.nan.
I tested various types of selection, but obviously, the bottleneck is setting the rows equal to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried various other methods and tested their performance, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no great success. My DataFrame (which unfortunately I cannot share) has a shape of about (200, 1500000), so it is fairly big, and the operation takes on the order of seconds of CPU time, which is far too much in my opinion.
Would appreciate any ideas!
Edit: after going through "Modifying a subset of rows in a pandas dataframe" and "Why dataframe.values is very slow", and unifying the datatypes for the operation, the problem is solved with about a 20x speedup.
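The edit credits unifying the datatypes for the speedup but does not show the code. A minimal sketch of that idea, assuming a small invented frame with mixed int and float columns: cast everything to float64 first, so that writing NaN needs no per-column upcasting, then assign over the label slice. The actual gain depends on the real data and was not measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real frame: daily DatetimeIndex, mixed int/float columns.
idx = pd.date_range('2016-01-01', '2018-12-31', freq='D')
df = pd.DataFrame({
    'a': np.arange(len(idx)),              # int64 column
    'b': np.linspace(0.0, 1.0, len(idx)),  # float64 column
}, index=idx)

# Unify dtypes up front: every column becomes float64, which can hold NaN.
df = df.astype('float64')

# Label-based slice on the DatetimeIndex; both endpoints are inclusive.
df.loc['2017-01-01':'2018-01-01'] = np.nan
```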

How to use max in a lambda function applied to a pandas DataFrame

I'm trying to apply a lambda function to a pandas DataFrame that returns the difference between each row and the max of that column.
It can be easily achieved by using a separate variable and setting that to the max of the column, but I'm curious how it can be done in a single line of code.
import pandas as pd
df = pd.DataFrame({
'measure': [i for i in range(0,10)]
})
col_max = df.measure.max()
df['diff_from_max'] = df.apply(lambda x: col_max - x['measure'], axis=1)
We usually do
max_df = df.max() - df
df = df.join(max_df.add_prefix('diff_max_'))
To fix your code (apply is not needed here):
col_max = df.measure.max()
df['diff_from_max'] = col_max - df['measure']
I think apply() is not required here. You can simply use the following line:
df['diff_from_max'] = df['measure'].max() - df['measure']

Pyspark: using udf within window

I need to detect threshold values on timeseries with Pyspark.
On the example graph below I want to detect (by storing the associated timestamp) each occurrence of the parameter ALT_STD being larger than 5000 and then lower than 5000.
For this simple case I can run simple queries such as
t_start = df.select('timestamp')\
.filter(df.ALT_STD > 5000)\
.sort('timestamp')\
.first()
t_stop = df.select('timestamp')\
.filter((df.ALT_STD < 5000)\
& (df.timestamp > t_start.timestamp))\
.sort('timestamp')\
.first()
However, in some cases the event can be cyclic and I may have several curves (i.e. ALT_STD will rise above or drop below 5000 several times). Of course, if I use the queries above I will only be able to detect the first occurrences.
I guess I should use a window function with a udf, but I can't find a working solution.
My guess is that the algorithm should be something like:
windowSpec = Window.partitionBy('flight_hash')\
                   .orderBy('timestamp')\
                   .rowsBetween(Window.currentRow, 1)
def detect_thresholds(x):
    if (x['ALT_STD'][current_row] < 5000) and (x['ALT_STD'][next_row] > 5000):
        return x['timestamp'] #Or maybe simply 1
    if (x['ALT_STD'][current_row] > 5000) and (x['ALT_STD'][next_row] < 5000):
        return x['timestamp'] #Or maybe simply 2
    else:
        return 0
import pyspark.sql.functions as F
detect_udf = F.udf(detect_thresholds, IntegerType())
df.withColumn('Result', detect_udf(F.struct('ALT_STD')).over(windowSpec)).show()
Is such an algorithm feasible in Pyspark ? How ?
Post-scriptum:
As a side note, I have understood how to use a udf on its own, and how to use built-in SQL window functions, but not how to combine a udf AND a window.
e.g.:
# This will compute the mean (built-in function)
df.withColumn("Result", F.mean(df['ALT_STD']).over(windowSpec)).show()
# This will also work
divide_udf = F.udf(lambda x: x[0]/1000., DoubleType())
df.withColumn('result', divide_udf(F.struct('timestamp')))
No need for a udf here (and Python udfs cannot be used as window functions). Just use lead / lag with when:
from pyspark.sql.functions import col, lag, lead, when
result = (when((col('ALT_STD') < 5000) & (lead(col('ALT_STD'), 1) > 5000), 1)
          .when((col('ALT_STD') > 5000) & (lead(col('ALT_STD'), 1) < 5000), 1)
          .otherwise(0))
df.withColumn("result", result)
Thanks to user9569772's answer, I found a solution. That answer did not work as written because lag() and lead() are window functions and have to be applied with .over() on a window.
from pyspark.sql.functions import when
from pyspark.sql import functions as F
# Define conditions
det_start = (F.lag(F.col('ALT_STD')).over(windowSpec) < 100)\
& (F.lead(F.col('ALT_STD'), 0).over(windowSpec) >= 100)
det_end = (F.lag(F.col('ALT_STD'), 0).over(windowSpec) > 100)\
& (F.lead(F.col('ALT_STD')).over(windowSpec) < 100)
# Combine conditions with .when() and .otherwise()
result = (when(det_start, 1)\
.when(det_end, 2)\
.otherwise(0))
df.withColumn("phases", result).show()
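To make the lead/lag logic easier to check on a small example, here is the same comparison of each row with the next row of the same flight, written in pandas rather than PySpark. This swaps in groupby plus shift(-1) for lead() over a window. The data values are invented, the 5000 threshold comes from the question, and the codes 1 and 2 follow the answer above.

```python
import pandas as pd

# Hypothetical flight data; column names follow the question, values are invented.
df = pd.DataFrame({
    'flight_hash': ['f1'] * 6 + ['f2'] * 3,
    'timestamp':   [0, 1, 2, 3, 4, 5, 0, 1, 2],
    'ALT_STD':     [4000, 6000, 7000, 4500, 5500, 3000, 6000, 6500, 4000],
})
THRESHOLD = 5000

df = df.sort_values(['flight_hash', 'timestamp'])

# Next value within the same flight: the role lead() plays over the window.
nxt = df.groupby('flight_hash')['ALT_STD'].shift(-1)

df['phases'] = 0
df.loc[(df['ALT_STD'] < THRESHOLD) & (nxt > THRESHOLD), 'phases'] = 1  # about to rise above
df.loc[(df['ALT_STD'] > THRESHOLD) & (nxt < THRESHOLD), 'phases'] = 2  # about to drop below
print(df)
```

The last row of each flight has no successor, so its comparison is False and it keeps the value 0. Every crossing is flagged, not just the first one.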

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is for each combination of UID and UID2 check if there is both a row with EventType = A and EventType = B, and then calculate the time difference, and then add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a uid, uid2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
I need to make sure Time column is a timedelta
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that I create a helper dataframe
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the diff and merge to df
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
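For reference, here is a self-contained sketch of the same reshape-then-merge idea, run on the sample data from the question. It is a variation, not the answer's exact code: it uses pivot instead of set_index plus unstack and reports the difference in minutes. It assumes at most one A row and one B row per (UID, UID2) pair, since pivot raises an error on duplicates.

```python
import io
import pandas as pd

# Sample data copied from the question.
csv = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
"""
df = pd.read_csv(io.StringIO(csv))

# "HH:MM" -> timedelta since midnight (seconds are appended to give "HH:MM:SS").
df['Time'] = pd.to_timedelta(df['Time'] + ':00')

# One row per (UID, UID2), one column per EventType; missing events become NaT.
wide = df.pivot(index=['UID', 'UID2'], columns='EventType', values='Time')

# B minus A in minutes; NaN where either event is missing.
diff = ((wide['B'] - wide['A']).dt.total_seconds() / 60).rename('TimeDiff')

# Broadcast the per-pair difference back onto every original row.
out = df.merge(diff.reset_index(), on=['UID', 'UID2'], how='left')
print(out)
```

The loop in the question rescans the whole frame for every group, whereas this version does a single reshape and a single merge.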
