Drop rows in dask DataFrame on condition - python-3.x

I'm trying to drop some rows in my dask dataframe with:
df.drop(df[(df.A <= 3) | (df.A > 1000)].index)
But this doesn't work and returns NotImplementedError: Drop currently only works for axis=1
I really need help

You can remove rows from a Pandas/Dask dataframe as follows:
df = df[condition]
In your case you might do something like the following:
df = df[(df.A > 3) & (df.A <= 1000)]
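For example, a minimal sketch with made-up data (the column name A is taken from the question):

import pandas as pd
import dask.dataframe as dd

# toy data just for illustration
pdf = pd.DataFrame({'A': [1, 5, 500, 2000]})
ddf = dd.from_pandas(pdf, npartitions=2)

# keep only the rows you want, i.e. 3 < A <= 1000
ddf = ddf[(ddf.A > 3) & (ddf.A <= 1000)]
print(ddf.compute())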

Related

Data to explode between two columns

My current dataframe looks as below:
existing_data = {'STORE_ID': ['1234', '5678', '9876', '3456', '6789'],
                 'FULFILLMENT_TYPE': ['DELIVERY', 'DRIVE', 'DELIVERY', 'DRIVE', 'DELIVERY'],
                 'FORECAST_DATE': ['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04', '2020-08-05'],
                 'DAY_OF_WEEK': ['SATURDAY', 'SUNDAY', 'MONDAY', 'TUESDAY', 'WEDNESDAY'],
                 'START_HOUR': [8, 8, 6, 7, 9],
                 'END_HOUR': [19, 19, 18, 19, 17]}
existing = pd.DataFrame(data=existing_data)
I would need the data to be exploded between the start and end hour such that each hour is a different row like below:
needed_data = {'STORE_ID': ['1234', '1234', '1234', '1234', '1234'],
               'FULFILLMENT_TYPE': ['DELIVERY', 'DELIVERY', 'DELIVERY', 'DELIVERY', 'DELIVERY'],
               'FORECAST_DATE': ['2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01'],
               'DAY_OF_WEEK': ['SATURDAY', 'SATURDAY', 'SATURDAY', 'SATURDAY', 'SATURDAY'],
               'HOUR': [8, 9, 10, 11, 12]}
required = pd.DataFrame(data=needed_data)
I'm not sure how to achieve this. I know it should be possible with explode(), but I haven't managed to get it working.
If the DataFrame is small or performance is not important, build a range from the two hour columns and use DataFrame.explode:
existing['HOUR'] = existing.apply(lambda x: range(x['START_HOUR'], x['END_HOUR'] + 1), axis=1)
existing = (existing.explode('HOUR')
                    .reset_index(drop=True)
                    .drop(['START_HOUR', 'END_HOUR'], axis=1))
If performance is important, repeat each row with Index.repeat using the difference of the two columns, then add a per-row counter from GroupBy.cumcount to START_HOUR:
s = existing["END_HOUR"].sub(existing["START_HOUR"]) + 1
df = existing.loc[existing.index.repeat(s)].copy()
add = df.groupby(level=0).cumcount()
df['HOUR'] = df["START_HOUR"].add(add)
df = df.reset_index(drop=True).drop(['START_HOUR','END_HOUR'], axis=1)

Pandas derived column for number of work days between 2 dates

numpy's busday_count works, but when I apply it to the dataframe I get errors because some of the dates are (correctly) NaT.
If it were a normal array I could iterate over each row, check for NaT and then apply the formula, but I'm not sure how to do that here:
data_raw['due'] = pd.to_datetime(data_raw['Due Date'], format="%Y%m%d")
data_raw['clo'] = pd.to_datetime(data_raw['Closed Date'], format="%Y%m%d")
data_raw['perf'] = data_raw.apply(lambda row: np.busday_count(row['due'].values.astype('datetime64[D]'),
                                                              row['clo'].values.astype('datetime64[D]')
                                                              if pd.isnull(row['clo'])
                                                              else '',
                                                              axis=1
                                                              ))
The error is KeyError: 'due'
The code below works, but I'm not sure how to join the result back:
p_df = data_raw[pd.notna(data_raw.clo)]
p_df['perf'] = np.busday_count(p_df['due'].values.astype('datetime64[D]'), p_df['clo'].values.astype('datetime64[D]'))
I found a workaround, but I'm pretty sure it's not the best way:
# split the dataframe
not_na = data_raw[pd.notna(data_raw.clo)]
is_na = data_raw[pd.isna(data_raw.clo)]
# do the calc without the NaNs
not_na['perf'] = np.busday_count(not_na['due'].values.astype('datetime64[D]'),
                                 not_na['clo'].values.astype('datetime64[D]'))
# lastly, join the dataframes back
new_df = pd.concat([is_na, not_na], axis=0)
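A slightly tidier alternative (a sketch, not necessarily the best way either) computes the counts only on the non-NaT rows and assigns them back with a boolean mask, avoiding the split and concat:

# rows where the closed date exists
mask = data_raw['clo'].notna()
data_raw['perf'] = np.nan
data_raw.loc[mask, 'perf'] = np.busday_count(
    data_raw.loc[mask, 'due'].values.astype('datetime64[D]'),
    data_raw.loc[mask, 'clo'].values.astype('datetime64[D]'))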

pyspark sql: how to count rows with multiple conditions

I have a dataframe like this after some operations:
df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])
print(df_new_1.count(), df_new_2.count())
>> 10, 15
I can find the number of rows individually as above by calling count(). But how can I do this with a single PySpark SQL aggregation, i.e. counting both conditions in one pass? I want the result to look like this:
Row(check1=10, check2=15)
Since you tagged pyspark-sql, you can do the following:
df_old.createOrReplaceTempView("df_table")
spark.sql("""
SELECT sum(int(col1 >= col2)) as check1
, sum(int(col1 < col2)) as check2
FROM df_table
""").collect()
Or use the API functions:
from pyspark.sql.functions import expr

df_old.agg(
    expr("sum(int(col1 >= col2)) as check1"),
    expr("sum(int(col1 < col2)) as check2")
).collect()
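Equivalently, the same conditional counts can be written with the column-expression API instead of SQL strings (a sketch using when/otherwise):

from pyspark.sql import functions as F

df_old.agg(
    F.sum(F.when(F.col("col1") >= F.col("col2"), 1).otherwise(0)).alias("check1"),
    F.sum(F.when(F.col("col1") < F.col("col2"), 1).otherwise(0)).alias("check2")
).collect()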

Multiple nested if conditions

I have a dataframe (DF) with size (2000 rows x 10 columns).
The structure of the code is multiple nested if conditions:
DF['NewColumn'] = ''
for i in range(0, len(DF)):
    if condition:
        # define variable etc.
        if condition:
            DF['NewColumn'].values[i] = some value
        else:
            DF['NewColumn'].values[i] = some value
    # etc.
Basically, I loop over each row of the dataframe, check the conditions and populate each row of a new column according to the set of if conditions.
Apologies if my question is not specific enough, but I am looking for a way to code this problem more efficiently. I am keen to hear your thoughts.
Can I use a class, or vectorise? I am not sure how to restructure my problem.
Many Thanks
You can vectorize your loop like this:
temp = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
condition1 = (temp["A"] > 20) & (temp["B"] < 20)
temp["NewColumn"] = condition1.astype(int)
condition2 = (temp["C"] > 20) & (temp["A"] < 50)
temp["NewColumn2"] = np.where(condition2, "between", "out")

Filter Spark RDD based on result from filtering of different RDD

conf = SparkConf().setAppName("my_app")
with SparkContext(conf=conf) as sc:
    sqlContext = SQLContext(sc)
    df = sqlContext.read.parquet(*s3keys)

    # this gives me distinct values as list
    rdd = df.filter(
        (1442170800000 <= df.timestamp) &
        (df.timestamp <= 1442185200000) &
        (df.lat > 40.7480) & (df.lat < 40.7513) &
        (df.lon > -73.8492) & (df.lon < -73.8438)
    ).map(lambda p: p.userid).distinct()

    # how do I apply the above list to filter another rdd?
    df2 = sqlContext.read.parquet(*s3keys_part2)
    # example:
    rdd = df2.filter(df2.col1 in (rdd values from above))
As mentioned by Matthew Graves, what you need here is a join. It looks more or less like this:
pred = ((1442170800000 <= df.timestamp) &
        (df.timestamp <= 1442185200000) &
        (df.lat > 40.7480) &
        (df.lat < 40.7513) &
        (df.lon > -73.8492) &
        (df.lon < -73.8438))

users = df.filter(pred).select("userid").distinct()
users.join(df2, users.userid == df2.col1)
This is Scala code, instead of Python, but hopefully it can still serve as an example.
val x = 1 to 9
val df2 = sc.parallelize(x.map(a => (a,a*a))).toDF()
val df3 = sc.parallelize(x.map(a => (a,a*a*a))).toDF()
This gives us two dataframes, each with columns named _1 and _2, which are the first nine natural numbers and their squares/cubes.
val fil = df2.filter("_1 < 5") // Nine is too many, let's go to four.
val filJoin = fil.join(df3, fil("_1") === df3("_1"))
filJoin.collect
This gets us:
Array[org.apache.spark.sql.Row] = Array([1,1,1,1], [2,4,2,8], [3,9,3,27], [4,16,4,64])
To apply this to your problem, I would start with something like the following:
rdd2 = rdd.join(df2, rdd.userid == df2.userid, 'inner')
But notice that we need to tell it which columns to join on, which might be something other than userid for df2. I'd also recommend that, instead of map(lambda p: p.userid), you use .select('userid').distinct() so that it stays a DataFrame.
You can find out more about join here.
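For reference, a rough PySpark equivalent of the Scala snippet above (a sketch, assuming spark is an active SparkSession):

# first nine natural numbers with their squares and cubes
x = range(1, 10)
df2 = spark.createDataFrame([(a, a * a) for a in x], ["_1", "_2"])
df3 = spark.createDataFrame([(a, a * a * a) for a in x], ["_1", "_2"])

fil = df2.filter("_1 < 5")                         # nine is too many, let's go to four
fil_join = fil.join(df3, fil["_1"] == df3["_1"])   # join on the shared key column
fil_join.collect()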
