I want to validate a date column for a PySpark dataframe. I know how to do it for pandas, but can't make it work for PySpark.
import pandas as pd
from datetime import datetime
data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005-10-21'], ['Clarke', 13, '2003-12-41']]
df = pd.DataFrame(data, columns=['Name', 'Sale_qty', 'DOB'])
sparkDF = spark.createDataFrame(df)
def validate(date_text):
    try:
        # Round-trip check: the parsed date must format back to the same string
        if date_text != datetime.strptime(date_text, "%Y-%m-%d").strftime('%Y-%m-%d'):
            raise ValueError
        return True
    except ValueError:
        return False
df = df['DOB'].apply(lambda x: validate(x))
print(df)
This works for a pandas DataFrame, but I can't make it work for PySpark. I get the following error:
sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError Traceback (most recent call last)
<ipython-input-83-5f5f1db1c7b3> in <module>
----> 1 sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError: 'Column' object is not callable
You could use the following column expression:
F.to_date('DOB', 'yyyy-M-d').isNotNull()
Full test:
from pyspark.sql import functions as F
data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005'], ['Clarke', 13, '2003-12-41']]
df = spark.createDataFrame(data, ['Name', 'Sale_qty', 'DOB'])
validation = F.to_date('DOB', 'yyyy-M-d').isNotNull()
df.withColumn('validation', validation).show()
# +------+--------+----------+----------+
# |  Name|Sale_qty|       DOB|validation|
# +------+--------+----------+----------+
# |  Alex|      10|2001-01-12|      true|
# |   Bob|      12|      2005|     false|
# |Clarke|      13|2003-12-41|     false|
# +------+--------+----------+----------+
You can use to_date() with the expected source date format. It returns null wherever the value does not match that format, which can be used for validation.
See the example below.
from pyspark.sql import functions as func
spark.sparkContext.parallelize([('01-12-2001',), ('2001-01-12',)]).toDF(['dob']). \
    withColumn('correct_date_format', func.to_date('dob', 'yyyy-MM-dd').isNotNull()). \
    show()
# +----------+-------------------+
# |       dob|correct_date_format|
# +----------+-------------------+
# |01-12-2001|              false|
# |2001-01-12|               true|
# +----------+-------------------+
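If the stricter round-trip check from the question is needed (for example, to reject values that are not zero-padded, such as 2001-1-2), the original validate function can also be wrapped in a UDF. This is only a minimal sketch and not part of the answers above: add_validation_column and the output column name validation are hypothetical, and a Python UDF is usually slower than the built-in to_date expression.

```python
from datetime import datetime


def validate(date_text):
    """Return True only if date_text is exactly a valid YYYY-MM-DD date."""
    if date_text is None:
        return False
    try:
        parsed = datetime.strptime(date_text, "%Y-%m-%d")
    except (ValueError, TypeError):
        return False
    # Round-trip check rejects values such as '2001-1-2'
    return parsed.strftime("%Y-%m-%d") == date_text


def add_validation_column(df, column="DOB"):
    """Hypothetical helper: add a boolean 'validation' column to a Spark DataFrame."""
    # Imported lazily so validate() can be used without a Spark installation
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    validate_udf = F.udf(validate, BooleanType())
    return df.withColumn("validation", validate_udf(F.col(column)))


print(validate("2001-01-12"), validate("2003-12-41"), validate("2005"))
```

With a Spark session available, add_validation_column(sparkDF).show() would be the equivalent of the pandas apply call in the question.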
I would like to create a data frame from all possible combinations of the values of each category listed in a dictionary.
I tried the code below. It works fine for a small dictionary with few keys and values, but it does not finish for a larger dictionary such as the one below.
import itertools as it
import pandas as pd
my_dict= {
"A":[0,1,.....25],
"B":[4,5,.....35],
"C":[0,1,......30],
"D":[0,1,........35],
.........
"Y":[0,1,........35],
"Z":[0,1,........35],
}
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
This is the error I get. How can I handle this problem with a large dictionary?
Traceback (most recent call last):
File "<ipython-input-11-723405257e95>", line 1, in <module>
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
MemoryError
How can I create the data frame from such a large dictionary?
If you have a sufficiently large [1] Spark cluster, each list in the dictionary can be turned into a Spark dataframe, and all these dataframes can then be cross-joined:
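from functools import reduce
def to_spark_dfs(d):
    # One single-column dataframe per dictionary key
    for key in d:
        rows = [[e] for e in d[key]]
        yield spark.createDataFrame(rows, schema=[key])
dfs = to_spark_dfs(my_dict)
# Cross-join all single-column dataframes into one dataframe of combinations
res = reduce(lambda df1, df2: df1.crossJoin(df2), dfs)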
If the original my_dict is not too large, for example
my_dict = {
    "A": [0, 1, 2],
    "B": [4, 5, 6],
    "C": [0, 1, 2],
    "D": [0, 1],
    "Y": [0, 1, 2],
    "Z": [0, 1],
}
the code produces the expected result:
res.show()
#+---+---+---+---+---+---+
#| A| B| C| D| Y| Z|
#+---+---+---+---+---+---+
#| 0| 4| 0| 0| 0| 0|
#| 0| 4| 0| 0| 0| 1|
#| 0| 4| 0| 0| 1| 0|
#| 0| 4| 0| 0| 1| 1|
#...
res.count()
#324
[1]
Using the numbers given in the comment (80 keys and approximately 30 values per key), you would need a really large Spark cluster: 30^80 is about 1.5*10^118 different combinations. This is more than the estimated number of atoms (10^80) in the known, observable universe.
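Before materializing anything, the number of rows can be checked cheaply by multiplying the list lengths. The sketch below is an illustration only: count_combinations is a hypothetical helper, and the 80-key dictionary is made up to match the numbers quoted from the comment.

```python
import math


def count_combinations(value_lists):
    """Number of rows the full cross product would contain."""
    return math.prod(len(values) for values in value_lists.values())


small = {"A": [0, 1, 2], "B": [4, 5, 6], "C": [0, 1, 2],
         "D": [0, 1], "Y": [0, 1, 2], "Z": [0, 1]}
print(count_combinations(small))  # 324, matching res.count() above

# Hypothetical dictionary with 80 keys and 30 values per key
large = {f"k{i}": list(range(30)) for i in range(80)}
print(len(str(count_combinations(large))))  # 119 digits, i.e. on the order of 10^118
```

If the count is far beyond what fits in memory or on a cluster, no data frame construction strategy will help and the problem itself has to be reduced.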
In this case, the number of possible combinations is huge. For example, if each of the columns (A, B, C ... Z) can take the values [1...10], the total number of rows is 10^26, or 100000000000000000000000000.
I see two main directions for solving this issue:
Horizontal scaling: calculate and store the results using frameworks for distributed computing (such as Apache Spark or Hadoop).
Vertical scaling: optimize CPU/RAM utilization by:
    Vectorization (e.g. avoiding loops)
    Data types with a minimal RAM footprint (use only as much precision as you need, use factorize() for strings)
    Mini-batching and offloading intermediate results (data frames) from RAM to disk in a compressed format (e.g. parquet)
    Benchmarking the execution time and object size in RAM.
Let me introduce the code that implements some of the concepts of the vertical scaling approach.
Define the following functions:
create_data_frame_baseline(): data frame generator with loop, not optimal data types (baseline)
create_data_frame_no_loop(): no loop, not optimal data types
create_data_frame_optimize_data_type(): no loop, optimal data types.
import itertools as it
import pandas as pd
import numpy as np
from string import ascii_uppercase
def create_letter_dict(cols_n: int = 10, levels_n: int = 6) -> dict:
    letter_dict = {letter: list(range(levels_n)) for letter in ascii_uppercase[0:cols_n]}
    return letter_dict
def create_data_frame_baseline(dict: dict) -> pd.DataFrame:
    df = pd.DataFrame(columns=dict.keys())
    for row in it.product(*dict.values()):
        df.loc[len(df.index)] = row
    return df
def create_data_frame_no_loop(dict: dict) -> pd.DataFrame:
    return pd.DataFrame(
        list(it.product(*dict.values())),
        columns=dict.keys()
    )
def create_data_frame_optimize_data_type(dict: dict) -> pd.DataFrame:
    return pd.DataFrame(
        np.int8(list(it.product(*dict.values()))),
        columns=dict.keys()
    )
Benchmarks:
import sys
import timeit
cols_n = 7
levels_n = 5
iteration_n = 2
# Baseline
def create_data_frame_baseline_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_baseline(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_baseline_test).timeit(number=iteration_n))
# No loop, not optimal data types
def create_data_frame_no_loop_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_no_loop(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_no_loop_test).timeit(number=iteration_n))
# No loop, optimal data types.
def create_data_frame_optimize_data_type_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_optimize_data_type(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_optimize_data_type_test).timeit(number=iteration_n))
Outputs*:
Function                                  | Dataframe shape | RAM size, MB | Execution time, sec
create_data_frame_baseline_test           | 78125x7         | 19           | 485
create_data_frame_no_loop_test            | 78125x7         | 4.4          | 0.20
create_data_frame_optimize_data_type_test | 78125x7         | 0.55         | 0.16
Using create_data_frame_optimize_data_type_test() I generated* 100M rows in less than 100 seconds.
* Ubuntu Server 20.04, Intel(R) Xeon(R) 8xCPU @ 2.60GHz, 32GB RAM
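The RAM figures for the two loop-free variants are consistent with a simple rows x columns x bytes-per-value estimate. The following back-of-the-envelope sketch assumes 8-byte values (int64) for the non-optimized frame and 1-byte values (int8) for the optimized one, and ignores index and object overhead; estimate_bytes is a hypothetical helper, not part of the benchmark above.

```python
def estimate_bytes(levels_n, cols_n, bytes_per_value):
    """Rough size of the value buffer: rows * columns * bytes per value."""
    rows = levels_n ** cols_n
    return rows * cols_n * bytes_per_value


print(5 ** 7)                          # 78125 rows for levels_n=5, cols_n=7
print(estimate_bytes(5, 7, 8) / 1e6)   # 4.375 MB with 8-byte values
print(estimate_bytes(5, 7, 1) / 1e6)   # 0.546875 MB with 1-byte values
```

The same estimate can be used to decide up front whether a given cols_n and levels_n can fit into the available RAM at all.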
In your case, you cannot generate all possible combinations at once with list(), but you can generate them in a loop, for example:
import itertools as it
import pandas as pd
from string import ascii_uppercase
N = 36
my_dict = {x: list(range(N)) for x in ascii_uppercase}
df = pd.DataFrame(columns=my_dict.keys())
for row in it.product(*my_dict.values()):
    df.loc[len(df.index)] = row
But of course, this takes a long time.
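A variation on the same idea is to pull the combinations from itertools.product in fixed-size chunks, so that only one chunk is held in memory at a time. This is a minimal sketch under that assumption; iter_product_chunks and the small example dictionary are hypothetical, and chunking only bounds memory use, it does not reduce the total number of rows.

```python
import itertools as it


def iter_product_chunks(value_lists, chunk_size):
    """Yield lists of at most chunk_size combinations, generated lazily."""
    rows = it.product(*value_lists.values())
    while True:
        chunk = list(it.islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


my_dict = {"A": [0, 1, 2], "B": [4, 5, 6], "C": [0, 1]}
columns = list(my_dict.keys())
for i, chunk in enumerate(iter_product_chunks(my_dict, chunk_size=5)):
    # Each chunk could become pd.DataFrame(chunk, columns=columns)
    # and be written to disk before the next one is generated.
    print(i, len(chunk))
```

For this small dictionary the 18 combinations arrive as chunks of 5, 5, 5 and 3 rows.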
I'd like to convert a float to a currency using Babel and PySpark
sample data:
amount currency
2129.9 RON
1700 EUR
1268 GBP
741.2 USD
142.08091153 EUR
4.7E7 USD
0 GBP
I tried:
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), F.col('currency'),locale='be_BE'))
or
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), 'EUR',locale='be_BE'))
They both give me an error:
To use Python libraries with Spark dataframes, you need to use a UDF:
from babel.numbers import format_currency
import pyspark.sql.functions as F
format_currency_udf = F.udf(lambda a, c: format_currency(a, c))
df2 = df.withColumn(
    'amount',
    format_currency_udf('amount', 'currency')
)
df2.show()
+----------------+--------+
| amount|currency|
+----------------+--------+
| RON2,129.90| RON|
| €1,700.00| EUR|
| £1,268.00| GBP|
| US$741.20| USD|
| €142.08| EUR|
|US$47,000,000.00| USD|
+----------------+--------+
There seems to be a problem in pre-processing the amount column of your dataframe. From the error it is evident that the value, after conversion to a string, is not purely numeric, which it has to be according to this table, and that it contains some additional characters. You can inspect this column to find them and remove the unnecessary characters to fix this. As an example:
>>> import decimal
>>> value = '10.0'
>>> value = decimal.Decimal(str(value))
>>> value
Decimal('10.0')
>>> value = '10.0e'
>>> value = decimal.Decimal(str(value))
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
value = decimal.Decimal(str(value))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>] # as '10.0e' is not just numeric
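To find the offending values, the same Decimal conversion can be applied to each amount before formatting. The sketch below is an assumption about how such a check might look: is_plain_number is a hypothetical helper, and the sample list mixes values from the question with made-up bad ones ('10.0e', '1,700' and an empty string).

```python
from decimal import Decimal, InvalidOperation


def is_plain_number(text):
    """True if text parses as a finite decimal number."""
    try:
        return Decimal(str(text)).is_finite()
    except InvalidOperation:
        return False


samples = ['2129.9', '4.7E7', '0', '10.0e', '1,700', '']
bad = [s for s in samples if not is_plain_number(s)]
print(bad)  # ['10.0e', '1,700', '']
```

Note that scientific notation such as 4.7E7 is accepted by Decimal, so that value on its own would not trigger the error.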
I am trying to groupBy and then calculate percentile on PySpark dataframe. I've tested the following piece of code according to this Stack Overflow post:
from pyspark.sql.types import FloatType
import pyspark.sql.functions as func
import numpy as np
qt_udf = func.udf(lambda x,qt: float(np.percentile(x,qt)), FloatType())
df_out = df_in.groupBy('Id').agg(func.collect_list('value').alias('data'))\
.withColumn('median', qt_udf(func.col('data'),func.lit(0.5)).cast("string"))
df_out.show()
But get the following error:
Traceback (most recent call last):
> df_out.show()
....
> return lambda *a: f(*a)
AttributeError: 'module' object has no attribute 'percentile'
This is because of the numpy version (1.4.1); the percentile function was added in version 1.5. It is not possible to update numpy in the short term.
Define a window and use the built-in percent_rank function to compute percentile values.
from pyspark.sql import Window
from pyspark.sql import functions as func
w = Window.partitionBy(df_in.Id).orderBy(df_in.value) #assuming default ascending order
df_out = df_in.withColumn('percentile_col',func.percent_rank().over(w))
The question's title suggests that the OP wanted to calculate percentiles, but the body shows that they needed the median within groups.
Test dataset:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df_in = spark.createDataFrame(
[('1', 10),
('1', 11),
('1', 12),
('1', 13),
('2', 20)],
['Id', 'value']
)
Percentiles of given data points in groups:
w = W.partitionBy('Id').orderBy('value')
df_in = df_in.withColumn('percentile_of_value_by_Id', F.percent_rank().over(w))
df_in.show()
#+---+-----+-------------------------+
#| Id|value|percentile_of_value_by_Id|
#+---+-----+-------------------------+
#| 1| 10| 0.0|
#| 1| 11| 0.3333333333333333|
#| 1| 12| 0.6666666666666666|
#| 1| 13| 1.0|
#| 2| 20| 0.0|
#+---+-----+-------------------------+
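The values in percentile_of_value_by_Id follow the usual definition of percent_rank, (rank - 1) / (rows in partition - 1), with 0 for a single-row partition. The plain-Python sketch below reproduces that formula for one partition, for illustration only; percent_rank here is a local helper, not the Spark function.

```python
def percent_rank(values):
    """(rank - 1) / (n - 1) for each value; 0.0 when the group has one row."""
    n = len(values)
    ordered = sorted(values)
    result = []
    for v in values:
        rank = ordered.index(v) + 1  # ties share the lowest rank
        result.append((rank - 1) / (n - 1) if n > 1 else 0.0)
    return result


print(percent_rank([10, 11, 12, 13]))  # [0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
print(percent_rank([20]))              # [0.0]
```

This shows why percent_rank describes the position of each existing row rather than returning the value at a requested percentile.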
Median (accurate and approximate):
df_out = (df_in.groupBy('Id').agg(
F.expr('percentile(value, .5)').alias('median_accurate'), # for small-mid dfs
F.percentile_approx('value', .5).alias('median_approximate') # for mid-large dfs
))
df_out.show()
#+---+---------------+------------------+
#| Id|median_accurate|median_approximate|
#+---+---------------+------------------+
#| 1| 11.5| 11|
#| 2| 20.0| 20|
#+---+---------------+------------------+
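If the UDF approach from the question has to be kept and numpy cannot be upgraded, a small pure-Python percentile function could stand in for np.percentile. This is a sketch under that assumption, using linear interpolation between neighbouring values; percentile is a hypothetical helper. Note that np.percentile expects q on a 0-100 scale, so the median is q=50, not 0.5 as in the question's code.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation; q is on the 0-100 scale."""
    if not values:
        return None
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return float(ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo))


print(percentile([10, 11, 12, 13], 50))  # 11.5
print(percentile([20], 50))              # 20.0
```

Such a function could be passed to func.udf in place of the lambda that calls np.percentile, applied to the collected list per Id.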
How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
I just wanted to complete Holden's answer.
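Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0 and later, OneHotEncoderEstimator was in turn renamed to OneHotEncoder, so on those versions the class name in the examples below has to be changed accordingly.)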
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
Since Spark 1.4.0, MLlib also provides a OneHotEncoder feature transformer, which maps a column of label indices to a column of binary vectors with at most a single one-value.
This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features.
Let's consider the following DataFrame:
val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
  .toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder :
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
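The output uses Spark's sparse vector notation (size, [indices], [values]), and the category with the last index ("b" here) is encoded as an all-zero vector because the last category is dropped by default. The plain-Python sketch below mimics the two steps to make the numbers easier to follow. It is not Spark's implementation: string_index and one_hot_drop_last are hypothetical helpers, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Assumption: ties between equally frequent labels are broken alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot_drop_last(index, n_labels):
    """Dense one-hot vector of length n_labels - 1; the last index is all zeros."""
    vec = [0.0] * (n_labels - 1)
    if index < n_labels - 1:
        vec[int(index)] = 1.0
    return vec


categories = ["a", "b", "c", "a", "a", "c"]
mapping = string_index(categories)
print(mapping)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
for c in categories:
    print(c, one_hot_drop_last(mapping[c], len(mapping)))
```

Here "a" becomes [1.0, 0.0], "c" becomes [0.0, 1.0] and "b" becomes [0.0, 0.0], which correspond to (2,[0],[1.0]), (2,[1],[1.0]) and (2,[],[]) in the table above.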
I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer, extra metadata gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (an index that corresponds to one of your categorical values), and if you look at the schema you will see that the new transformed column is of type double. However, the column created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column; it has extra metadata associated with it that is very important. You can inspect this metadata through the metadata property of the appropriate field in your dataframe's schema (you can access the schema object of your dataframe through yourdataframe.schema).
This extra metadata has two important implications:
When you call .fit() on a tree-based model, it scans the metadata of your dataframe and recognizes fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (other transformers, such as pyspark.ml.feature.VectorIndexer, also have this effect). Because of this, you DO NOT have to one-hot encode your features after transforming them with StringIndexer when using tree-based models in Spark ML (however, you still have to one-hot encode when using models that do not naturally handle categoricals, such as linear regression).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
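Inspecting that metadata could look like the sketch below. The helper name categorical_metadata is hypothetical, and the "ml_attr" key is an assumption about where Spark ML keeps its attributes; printing the whole metadata dictionary first is the safer way to check what your Spark version stores.

```python
def categorical_metadata(df, column):
    """Return the ML attribute metadata stored on a DataFrame column, if any."""
    # Assumption: Spark ML stores its attributes under the "ml_attr" key
    return df.schema[column].metadata.get("ml_attr", {})


# Usage with a real DataFrame (requires a Spark session):
# from pyspark.ml.feature import StringIndexer
# indexed = StringIndexer(inputCol="category", outputCol="categoryIndex").fit(df).transform(df)
# print(indexed.schema["categoryIndex"].metadata)
# print(categorical_metadata(indexed, "categoryIndex"))
```

A plain double column that did not pass through such a transformer would be expected to return an empty dictionary.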
There is a component of the ML pipeline called StringIndexer that you can use to convert your strings to doubles in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.
I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
    colsToFillNa = []
    if debug: print("Entering method ohcOneColumn")
    countUnique = df.groupBy(colName).count().count()
    if debug: print(countUnique)
    collectOnce = df.select(colName).distinct().collect()
    for uniqueValIndex in range(countUnique):
        uniqueVal = collectOnce[uniqueValIndex][0]
        if debug: print(uniqueVal)
        newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
        df = df.withColumn(newColName, df[colName]==uniqueVal)
        colsToFillNa.append(newColName)
    df = df.drop(colName)
    df = df.na.fill(False, subset=colsToFillNa)
    return df
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
    if debug: print("Entering method detectAndLabelCat")
    newDf = sparkDf
    colList = sparkDf.columns
    for colName in sparkDf.columns:
        uniqueVals = sparkDf.groupBy(colName).count()
        if debug: print(uniqueVals)
        countUnique = uniqueVals.count()
        dtype = str(sparkDf.schema[colName].dataType)
        #dtype = str(df.schema[nc].dataType)
        if (colName in excludeCols):
            if debug: print(str(colName) + ' is in the excluded columns list.')
        elif countUnique == 1:
            newDf = newDf.drop(colName)
            if debug:
                print('dropping column ' + str(colName) + ' because it only contains one unique value.')
            #end if debug
        #elif (1==2):
        elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
            if debug:
                print(len(newDf.columns))
                oldColumns = newDf.columns
            newDf = ohcOneColumn(newDf, colName, debug=debug)
            if debug:
                print(len(newDf.columns))
                newColumns = set(newDf.columns) - set(oldColumns)
                print('Adding:')
                print(newColumns)
                for newColumn in newColumns:
                    if newColumn in newDf.columns:
                        try:
                            newUniqueValCount = newDf.groupBy(newColumn).count().count()
                            print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
                        except:
                            print('Uncaught error discussing ' + str(newColumn))
                    #else:
                    #    newColumns.remove(newColumn)
                print('Dropping:')
                print(set(oldColumns) - set(newDf.columns))
        else:
            if debug: print('Nothing done for column ' + str(colName))
        #end if countUnique == 1, elif countUnique other condition
    #end outer for
    return newDf
You can cast a string column in a Spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read a CSV file into a data frame, cast the default string columns to integer and double, and overwrite the original data frame. We can then use VectorAssembler to merge the features into a single vector and apply any Spark ML algorithm.