Currently, I am working with PySpark to analyze some data. I have a CSV file with payroll data in it.
I want to know which job has the best pay. To do that I want the average pay per job, and ideally the median as well.
The methods available on groupBy in PySpark are these:
agg, avg, count, max, mean, min, pivot, sum
When I try the .mean() method it looks like this:
mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)
# +--------------------+-----------------+
# | JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
Here is what it looks like with the .avg() method:
average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)
# +--------------------+-----------------+
# | JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
They return the exact same values. What's the difference between mean() and avg()?
I also want to find the median, so that one person doesn't have too much of an impact.
Since there is no median() method in PySpark I don't know what to do here.
The documentation for both avg and mean says the same thing:
mean() is an alias for avg()
The two functions are identical; both names exist so that developers coming from different backgrounds feel comfortable.
Regarding the median:
Approximate (efficient) median: F.expr('percentile_approx(col_name, .5) over()')
Accurate (inefficient) median: F.expr('percentile(col_name, .5) over()')
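For the per-group median in your case, you can drop the over() clause and use the expression as an aggregate inside agg(). A sketch, assuming the DataFrame and column names from the question:

from pyspark.sql import functions as F

# Approximate median of REGULAR_PAY per job title.
median_pay_data = reduced_data.groupBy("JOB_TITLE").agg(
    F.expr("percentile_approx(REGULAR_PAY, 0.5)").alias("median_REGULAR_PAY")
)
median_pay_data.show(3)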
Related
I have an input of multiple files and I need to apply different processing rules and write the output to multiple files.
How can I do this so that the input files are read only once, while still applying different filtering, grouping, etc. and saving to different output files?
Something similar to the diagram below:
INPUT
FILES
|
|
/ \
/ \
/ \
FILTER X FILTER Y
| |
| |
GROUP BY A GROUP BY B
| |
| |
OUTPUT OUTPUT
FILE 1 FILE 2
I tried code similar to the snippet below, but it seems to read the input files multiple times.
rd = spark.read.format('dbf').load(os.path.join(
    sih_data,
    'RD??{10,11,12,13,14,15,16,17,18,19,20,21}??.{DBC,dbc}'
))

out1 = rd.where(rd['UF_ZI'] == '52')
out1 = out1.groupBy('UF_ZI').count()
out1.write.format("com.databricks.spark.csv")\
    .option("header", "true").save("out1")

out2 = rd.where(rd['IDENT'] == '1')
out2 = out2.groupBy('MUNIC_RES').count()
out2.write.format("com.databricks.spark.csv")\
    .option("header", "true").save("out2")
To answer your question specifically, as mentioned in the comments, you can use .cache() once you have read the data and keep your exact same code. cache() is just a shortcut for persist('MEMORY_AND_DISK'), so your data will be stored in memory if possible and the rest spilled to disk.
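Concretely, the cached version could look like the sketch below, reusing the read and write code from the question; the only addition is the cache() call after the load:

rd = spark.read.format('dbf').load(os.path.join(
    sih_data,
    'RD??{10,11,12,13,14,15,16,17,18,19,20,21}??.{DBC,dbc}'
))
rd.cache()  # materialized on the first action, reused by the second branch

out1 = rd.where(rd['UF_ZI'] == '52').groupBy('UF_ZI').count()
out1.write.format("com.databricks.spark.csv").option("header", "true").save("out1")

out2 = rd.where(rd['IDENT'] == '1').groupBy('MUNIC_RES').count()
out2.write.format("com.databricks.spark.csv").option("header", "true").save("out2")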
That being said, it is worth thinking about what you are doing before deciding anything. I don't know about the dbf format, but if it allows predicate pushdown and your filters are quite restrictive, it could be better to leave your code as it is so that less data is loaded (predicate pushdown on cached data is less effective in Spark). If the source does not allow predicate pushdown, or if you have many filters, other approaches could be interesting.
For instance, you could wrap all your filters into one job to (1) avoid the overhead of launching a job per filter and (2) read the data only once.
from pyspark.sql import functions as F

# You could define as many filters as you want, as a list of 3-tuples:
# the 1st element is the id of the filter, the 2nd the filter itself
# and the 3rd the grouping column.
filters = [(1, rd['UF_ZI'] == '52', 'UF_ZI'), (2, rd['IDENT'] == '1', 'MUNIC_RES')]

# Then you define an array of structs. If the filter passes, the element is
# a struct with the grouping column and the filter id. If not, it is null.
cols = [F.when(filter,
               F.struct(F.col(group).alias("group"),
                        F.lit(id).alias("filter")))
        for (id, filter, group) in filters]

# Finally, we explode the array of structs defined just before and filter out
# null values that correspond to non-matching filters.
rd\
    .withColumn("s", F.explode(F.array(*cols)))\
    .where(F.col('s').isNotNull())\
    .groupBy(F.col("s.group").alias("group"), F.col("s.filter").alias("filter"))\
    .count()\
    .write.partitionBy("filter").csv("test.csv")

# In Spark >= 2.4, you could avoid the filter and use array_except to remove null
# values before exploding the array. It might be more efficient.
rd\
    .withColumn("s", F.explode(F.array_except(
        F.array(*cols),
        F.array(F.lit(None))
    )))\
    .groupBy(F.col("s.group").alias("group"), F.col("s.filter").alias("filter"))\
    .count()\
    .write.partitionBy("filter").csv("test.csv")
Having a look at test.csv, it looks like this:
> ls test.csv
'filter=1' 'filter=2'
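If you later need one branch's result on its own, Spark's partition discovery lets you read it back by the partition column. A sketch (the data columns come back as _c0/_c1 since no header was written):

# Read everything under test.csv; the "filter" column is recovered
# from the partition directory names.
results = spark.read.csv("test.csv")
results.where(F.col("filter") == 1).show()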
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In PySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala :
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
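A PySpark sketch of the approximate variant over all columns, using approx_count_distinct:

from pyspark.sql.functions import approx_count_distinct, col

# One approximate distinct count per column, in a single aggregation.
df.agg(*(approx_count_distinct(col(c)).alias(c) for c in df.columns)).show()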
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
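As a toy illustration of that observation (not Spark's implementation), the crude estimate can be sketched in a few lines of Python:

import hashlib

def leading_zeros(h, bits=32):
    # Number of leading zeros in the 'bits'-wide binary representation of h.
    return bits - h.bit_length()

def crude_distinct_estimate(values, bits=32):
    # Hash each value into a roughly uniform 'bits'-wide integer.
    hashes = [int(hashlib.md5(str(v).encode()).hexdigest(), 16) & ((1 << bits) - 1)
              for v in values]
    # If the "emptiest" hash starts with k zeros, estimate on the order of 2^k distinct values.
    k = max(leading_zeros(h, bits) for h in hashes)
    return 2 ** k

# Very noisy single estimate; HyperLogLog refines it by averaging many such sketches.
print(crude_distinct_estimate(list(range(1000)) * 5))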
Note: Starting with Spark 1.6, when Spark executes SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance is not comparable between count(distinct(_)) and approxCountDistinct (or approx_count_distinct).
This is one of the behavior changes introduced in Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want the count for a particular column, then the following could help. Although it's a late answer, it might help someone (tested with PySpark 2.2.0).
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would give you a more intuitive result:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use SQL's count(column_name) function.
Alternatively, if you are doing data analysis and want a rough estimate rather than an exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
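For instance, in PySpark (a sketch; the second argument is the maximum relative standard deviation you are willing to accept):

from pyspark.sql.functions import approx_count_distinct

# Approximate distinct count of a single column, allowing roughly 5% relative error.
df.agg(approx_count_distinct("colName", 0.05).alias("approx_distinct")).show()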
This is one way to create a DataFrame with a distinct count for every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
The output would look like below. I used this together with another DataFrame to compare values when the column names are the same; the other DataFrame was created the same way and then joined:
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive on very large data (say 1 TB), but it is still quite efficient when using to_pandas_on_spark().
This is a very common process in Machine Learning.
I have a dataset and I split it into a training set and a test set.
Since I apply some normalization and standardization to the training set, I would like to use the same statistics from the training set (the mean/std/min/max values of each feature) to normalize and standardize the test set too. Do you know an optimal way to do that?
I am aware of functions such as MinMaxScaler, StandardScaler, etc.
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
MultivariateStatisticalSummary
http://spark.apache.org/docs/latest/mllib-statistics.html
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
Using SQL
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
On the testing data you can then manually normalize the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize, e.g.:
Student's T
val normalized = testData.map{ m =>
(m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map{ m =>
(m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)
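If you prefer to stay with the built-in scalers you mentioned, another option is to fit the scaler on the training data only and reuse the fitted model on the test data, so the test set is scaled with the training statistics. A minimal PySpark sketch, assuming hypothetical DataFrames train_df/test_df with feature columns f1, f2, f3:

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble the raw feature columns into a single vector column (hypothetical column names).
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_vec = assembler.transform(train_df)
test_vec = assembler.transform(test_df)

# Fit the scaler on the training data only.
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(train_vec)

# Apply the same training mean/std to both sets.
train_scaled = scaler_model.transform(train_vec)
test_scaled = scaler_model.transform(test_vec)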
I am new to using Python pandas. I have the script below to pull time-series data from an Excel file, set the dates as the index, and then perform various calculations on the data, referencing rows by date. Script:
df = pd.read_excel("myfile.xls")
df = df.set_index(df.Date)
df = df.drop("Date",1)
df.index.name = None
df.head()
The output of that (to give you a sense of the data) is:
Px1 Px2 Px3 Px4 Px5 Px6 Px7
2015-08-12 19.850000 10.25 7.88 10.90 109.349998 106.650002 208.830002
2015-08-11 19.549999 10.16 7.81 10.88 109.419998 106.690002 208.660004
2015-08-10 19.260000 10.07 7.73 10.79 109.059998 105.989998 210.630005
2015-08-07 19.240000 10.08 7.69 10.92 109.199997 106.430000 207.919998
2015-08-06 19.250000 10.09 7.76 10.96 109.010002 106.010002 208.350006
When I try to retrieve data for a single date like df.loc['20150806'] that works, but when I try to retrieve a slice like df.loc['20150806':'20150812'] I get an empty DataFrame back.
Again, the index is a DateTimeIndex with dtype = 'datetime64[ns]', length = 1412, freq = None, tz = None
Like I said, my ultimate goal is to be able to group the data by day, month, year, different periods, etc., and perform calculations on it. I want to give that context, but I don't even want to get into that here, since I'm clearly stuck on something more basic: perhaps a misunderstanding of how to operate with a DatetimeIndex.
Thank you.
EDIT: Meant to also include: I think the main problem I referenced with indexing has something to do with freq=0, because when I tried simpler examples with contiguous date series, I did not have this problem.
df.loc['2015-08-12':'2015-08-10'] and df.loc['2015-08-10':'2015-08-12':-1] both work, because the index is sorted in descending order and the slice has to follow the index order. df = df.sort_index() followed by slicing the way I was originally trying also works. Thank you all. I was missing the forest for the trees there, I think.
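To make the fix concrete, here is a small self-contained sketch (with made-up prices) showing why a forward slice comes back empty on a descending DatetimeIndex and how sort_index() fixes it:

import pandas as pd

# Tiny frame with a descending DatetimeIndex, mimicking the data above.
idx = pd.to_datetime(["2015-08-12", "2015-08-11", "2015-08-10",
                      "2015-08-07", "2015-08-06"])
df = pd.DataFrame({"Px1": [19.85, 19.55, 19.26, 19.24, 19.25]}, index=idx)

print(df.loc["20150806":"20150812"])               # empty: slice runs against the index order
print(df.loc["2015-08-12":"2015-08-06"])           # works: follows the descending order
print(df.sort_index().loc["20150806":"20150812"])  # works after sorting ascending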