How to exclude multiple columns in Spark dataframe in Python - apache-spark

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?
df.drop(['col1','col2'])
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])
/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261
TypeError: col should be a string or a Column

In PySpark 2.1.0, the drop method supports multiple columns:
PySpark 2.0.2:
DataFrame.drop(col)
PySpark 2.1.0:
DataFrame.drop(*cols)
Example:
df.drop('col1', 'col2')
or using the * operator as
df.drop(*['col1', 'col2'])

Simply with select:
df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
or if you really want to use drop then reduce should do the trick:
from functools import reduce
from pyspark.sql import DataFrame
reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
Note (difference in execution time):
There should be no difference when it comes to data processing time: while these methods generate different logical plans, the physical plans are exactly the same (you can verify this yourself with explain(), as sketched below).
There is a difference, however, when we analyze the driver-side code:
the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded;
the first method generates a logical plan which is equivalent to the physical plan, while in the second case it is rewritten;
finally, comprehensions are significantly faster in Python than methods like map or reduce.
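A minimal sketch of that explain() check, reusing the column names from the select example above:
from functools import reduce
from pyspark.sql import DataFrame

cols_to_drop = ['GpuName', 'GPU1_TwoPartHwID']

# Both calls should print the same physical plan even though the logical plans differ.
df.select([c for c in df.columns if c not in set(cols_to_drop)]).explain()
reduce(DataFrame.drop, cols_to_drop, df).explain()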
Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.

The right way to do this is:
df.drop(*['col1', 'col2', 'col3'])
The * needs to come outside of the brackets if there are multiple columns to drop.

In case none of the above works for you, try this:
from pyspark.sql.functions import col
df.drop(col("col1")).drop(col("col2"))
My Spark version is 3.1.2.

Related


Pyspark dataframe - select rows that fail

Is it possible to perform a set of operations on a dataframe (adding new columns, replacing some existing values, etc.) and not fail fast on the first failed rows, but instead perform the full transformation and separately return the rows that were processed with errors?
Example:
it's more like pseudocode, but the idea must be clear:
df.withColumn('PRICE_AS_NUM', to_num(df["PRICE_AS_STR"]))
to_num is my custom function that transforms a string into a number.
Assuming I have some records where the price can't be cast to a number, I want to get those records in a separate dataframe.
I see an approach, but it will make the code a little ugly (and not very efficient):
do a filter with try/catch - if an exception happens, filter those records into a separate df.
What if I have many such transformations... Is there any better way?
I think one approach would be to wrap your transformation with a try/except function that returns a boolean. Then use when() and otherwise() to filter on the boolean. For example:
from pyspark.sql.functions import udf, when
from pyspark.sql.types import BooleanType

# Assumes to_num is usable both as a plain Python function (inside the wrapper)
# and as a column expression / UDF (inside the when branch), as in the question's pseudocode.
def to_num_wrapper(value):
    try:
        to_num(value)
        return True
    except Exception:
        return False

# The wrapper has to be registered as a UDF so it can be evaluated per row.
to_num_check = udf(to_num_wrapper, BooleanType())

df.withColumn(
    'PRICE_AS_NUM',
    when(to_num_check(df["PRICE_AS_STR"]), to_num(df["PRICE_AS_STR"]))
    .otherwise('FAILED')
)
Then you can filter out the rows where the value is 'FAILED'.
Preferred option
Always prefer built-in SQL functions over UDFs. They are safe to execute and much faster than a Python UDF. As a bonus, they follow SQL semantics: if there is a problem on a line, the output is NULL (undefined).
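For example, a minimal sketch of the built-in route using cast(), which yields NULL for values that cannot be converted (the column names follow the question; the sample data is made up):
from pyspark.sql.functions import col

df = spark.createDataFrame([("1.123",), ("foo",)], ["PRICE_AS_STR"])
parsed = df.withColumn("PRICE_AS_NUM", col("PRICE_AS_STR").cast("double"))

# Rows that converted cleanly, and rows that failed and can be returned separately.
good = parsed.filter(col("PRICE_AS_NUM").isNotNull())
failed = parsed.filter(col("PRICE_AS_NUM").isNull())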
If you go with UDF
Follow the same approach as built-in functions.
from pyspark.sql.functions import udf

def safe_udf(f, dtype):
    def _(*args):
        try:
            return f(*args)
        except:
            pass
    return udf(_, dtype)

to_num_wrapper = safe_udf(lambda x: float(x), "float")

df = spark.createDataFrame([("1.123", ), ("foo", )], ["str"])
df.withColumn("num", to_num_wrapper("str")).show()
# +-----+-----+
# | str| num|
# +-----+-----+
# |1.123|1.123|
# | foo| null|
# +-----+-----+
While swallowing exceptions might be counter-intuitive, it is just a matter of following SQL conventions.
No matter which one you choose:
Once you adopt one of the above, handling malformed data is just a matter of applying DataFrameNaFunctions (.na.drop, .na.replace), as sketched below.
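A minimal sketch of that last step, assuming the result of the safe_udf example above is assigned to a hypothetical name df_with_num:
df_with_num = df.withColumn("num", to_num_wrapper("str"))

# Keep only the rows that parsed successfully.
clean = df_with_num.na.drop(subset=["num"])

# Or substitute a default value for the failures instead of dropping them.
filled = df_with_num.na.fill({"num": 0.0})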

pandas group by in parallel

I'm having some trouble splitting the aggregation step of a group-by operation across multiple cores. I have the following working code, and would like to apply it over several processors:
import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count
mydf = pd.DataFrame({'v1':[1,2,3,4]*6,'v2':['a','b','c']*8,'v3':np.arange(20,44)})
Which I can then apply the following GroupBy operation:
(the step I wish to do in parallel)
mydf.groupby(['v1','v2']).apply(lambda x: np.percentile(x['v3'],[20,30]))
yielding the series:
1 a [22.4, 23.6]
b [26.4, 27.6]
c [30.4, 31.6]
2 a [31.4, 32.6]
b [23.4, 24.6]
c [27.4, 28.6]
I tried the following, with reference to: parallel groupby
def applyParallel(dfGrouped, func):
    with Pool(1) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)

def myfunc(df):
    df['pct1'] = df.loc[:, ['v3']].apply(np.percentile, args=([20],))
    df['pct2'] = df.loc[:, ['v3']].apply(np.percentile, args=([80],))
    return df

grouped = mydf.groupby(['v1', 'v2'])
applyParallel(grouped, myfunc)
But I'm losing the index structure and getting duplicates. I could probably solve this step with a further group by operation, but I think it shouldn't be too difficult to avoid it entirely. Any suggestions?
Not that I'm still looking for an answer, but it'd probably be better to use a library that handles parallel manipulation of pandas DataFrames, rather than trying to do so manually.
Dask is one option, which is intended to scale pandas operations with little code modification.
Another option (though maybe a little more difficult to set up) is PySpark.
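For example, a minimal Dask sketch of the groupby above (assuming dask is installed; the npartitions choice here is arbitrary):
import dask.dataframe as dd

ddf = dd.from_pandas(mydf, npartitions=cpu_count())

result = (
    ddf.groupby(['v1', 'v2'])['v3']
       .apply(lambda s: np.percentile(s, [20, 30]), meta=('v3', 'object'))
       .compute()
)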

Multiple if elif conditions to be evaluated for each row of pyspark dataframe

I need help with a PySpark dataframe topic.
I have a dataframe of, say, 1000+ columns and 100000+ rows. I also have 10000+ if/elif conditions, and under each condition a few global variables get incremented by some values.
Now my question is how I can achieve this in PySpark only.
I read about the filter and where functions, which return rows based on a condition, but I need to check those 10000+ if/elif conditions and perform some manipulations.
Any help would be appreciated.
If you could give an example with a small dataset, that would be of great help.
Thank you
You can define a function that contains all of your if/elif conditions, then apply this function to each row of the DataFrame.
Just use .rdd to convert the DataFrame to a normal RDD, then use the map() function,
e.g. DF.rdd.map(lambda row: func(row))
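A minimal sketch of that approach, with hypothetical column names (col_a, col_b) and conditions standing in for your real if/elif chain:
def func(row):
    # Each branch mirrors one of your if/elif conditions.
    if row.col_a > 0:
        return (row.col_a, row.col_b, 'branch_1')
    elif row.col_b == 'x':
        return (row.col_a, row.col_b, 'branch_2')
    else:
        return (row.col_a, row.col_b, 'no_match')

labelled = df.rdd.map(func).toDF(['col_a', 'col_b', 'branch'])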
Hope it can help you.
As I understand it, you just want to update some global counters while iterating over your DataFrame. For this, you need to:
1) Define one or more accumulators:
ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)
2) Define a function to update your accumulators for a given row, e.g:
def accumulate(row):
    if row.foo:
        ac_0.add(1)
    elif row.bar:
        ac_1.add(row.baz)
3) Call foreach on your DataFrame:
df.foreach(accumulate)
4) Inspect the accumulator values
>>> ac_0.value
123
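Putting the pieces together on a toy DataFrame (the column names foo, bar, baz follow the snippet above, the data is made up, and the accumulators are assumed to start at zero):
ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)

df = spark.createDataFrame(
    [(True, False, 10), (False, True, 5), (False, False, 1)],
    ["foo", "bar", "baz"],
)
df.foreach(accumulate)
print(ac_0.value, ac_1.value)  # 1 5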

Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala :
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
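A tiny Python sketch of that crude estimate (just the core observation, not the refined HyperLogLog++ that Spark actually implements):
import hashlib

def crude_distinct_estimate(values, bits=32):
    # Hash each value to spread it uniformly over a fixed-width range,
    # then track the largest number of leading zeros seen so far.
    max_leading_zeros = 0
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) & ((1 << bits) - 1)
        leading_zeros = bits - h.bit_length()
        max_leading_zeros = max(max_leading_zeros, leading_zeros)
    return 2 ** max_leading_zeros

print(crude_distinct_estimate(range(1000)))  # rough order of magnitude only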
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable when using count(distinct(_)) versus approxCountDistinct (or approx_count_distinct).
It's one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want a count for a particular column, the following could help. Although it's a late answer, it might help someone. (Tested with PySpark 2.2.0.)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively, if you are doing data analysis and want a rough estimate rather than an exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
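For example, in PySpark (rsd is the optional maximum relative standard deviation; 0.05 here is an arbitrary choice):
from pyspark.sql.functions import approx_count_distinct

df.agg(*(approx_count_distinct(c, rsd=0.05).alias(c) for c in df.columns)).show()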
This is one way to create a dataframe with counts for every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
Output would look like below. I used this with another dataframe to compare values when the column names are the same. The other dataframe was created the same way and then joined.
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive on very large data (say 1 TB), but it still works quite efficiently when using to_pandas_on_spark().
