I'd like to convert a float to a currency using Babel and PySpark
sample data:
amount currency
2129.9 RON
1700 EUR
1268 GBP
741.2 USD
142.08091153 EUR
4.7E7 USD
0 GBP
I tried:
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), F.col('currency'),locale='be_BE'))
or
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), 'EUR',locale='be_BE'))
They both give me an error:
To use Python libraries with Spark dataframes, you need to use an UDF:
from babel.numbers import format_currency
import pyspark.sql.functions as F
format_currency_udf = F.udf(lambda a, c: format_currency(a, c))
df2 = df.withColumn(
'amount',
format_currency_udf('amount', 'currency')
)
df2.show()
+----------------+--------+
| amount|currency|
+----------------+--------+
| RON2,129.90| RON|
| €1,700.00| EUR|
| £1,268.00| GBP|
| US$741.20| USD|
| €142.08| EUR|
|US$47,000,000.00| USD|
+----------------+--------+
There seems a problem in pre-processing the amount column of your dataframe. From the error it is evident that value after converting to string is not just numeric which it has to be according to this tableand has has some additional characters as well. You can check on this column to find that and remove unnecessary character to fix this. As as example:
>>> import decimal
>>> value = '10.0'
>>> value = decimal.Decimal(str(value))
>>> value
Decimal('10.0')
>>> value = '10.0e'
>>> value = decimal.Decimal(str(value))
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
value = decimal.Decimal(str(value))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>] # as '10.0e' is not just numeric
Related
I want to validate a date column for a PySpark dataframe. I know how to do it for pandas, but can't make it work for PySpark.
import pandas as pd
import datetime
from datetime import datetime
data = [['Alex',10, '2001-01-12'],['Bob',12, '2005-10-21'],['Clarke',13, '2003-12-41']]
df = pd.DataFrame(data,columns=['Name','Sale_qty', 'DOB'])
sparkDF =spark.createDataFrame(df)
def validate(date_text):
try:
if date_text != datetime.strptime(date_text, "%Y-%m-%d").strftime('%Y-%m-%d'):
raise ValueError
return True
except ValueError:
return False
df = df['DOB'].apply(lambda x: validate(x))
print(df)
It works for pandas dataframe. But I can't make it work for PySpark. Getting the following error:
sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError Traceback (most recent call last)
<ipython-input-83-5f5f1db1c7b3> in <module>
----> 1 sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError: 'Column' object is not callable
You could use the following column expression:
F.to_date('DOB', 'yyyy-M-d').isNotNull()
Full test:
from pyspark.sql import functions as F
data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005'], ['Clarke', 13, '2003-12-41']]
df = spark.createDataFrame(data, ['Name', 'Sale_qty', 'DOB'])
validation = F.to_date('DOB', 'yyyy-M-d').isNotNull()
df.withColumn('validation', validation).show()
# +------+--------+----------+----------+
# | Name|Sale_qty| DOB|validation|
# +------+--------+----------+----------+
# | Alex| 10|2001-01-12| true|
# | Bob| 12| 2005| false|
# |Clarke| 13|2003-12-41| false|
# +------+--------+----------+----------+
you can use a to_date() with the required source date format. It returns null where the format is incorrect, which can be used for validation.
see below example.
spark.sparkContext.parallelize([('01-12-2001',), ('2001-01-12',)]).toDF(['dob']). \
withColumn('correct_date_format', func.to_date('dob', 'yyyy-MM-dd').isNotNull()). \
show()
# +----------+-------------------+
# | dob|correct_date_format|
# +----------+-------------------+
# |01-12-2001| false|
# |2001-01-12| true|
# +----------+-------------------+
I would like to provide numbers when creating a Spark dataframe. I have issues providing decimal type numbers.
This way the number gets truncated:
df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb |numb_dec |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+
This fails:
df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)
TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>
How to correctly provide big decimal numbers so that they wouldn't get truncated?
Maybe this is related to some differences in floating points representation between Python and Spark. You can try passing string values when creating dataframe instead:
df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+----------------------+----------------------+
#|numb |numb_dec |
#+----------------------+----------------------+
#|10234567891023456789.5|10234567891023456789.5|
#+----------------------+----------------------+
Try something as below -
from pyspark.sql.types import *
from decimal import *
schema = StructType([StructField('numb', DecimalType(30,1))])
data = [( Context(prec=30, Emax=999, clamp=1).create_decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+----------------------+
|numb |
+----------------------+
|10234567891023456789.5|
+----------------------+
I am trying to format the string in one the columns using pyspark udf.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
print(type(st))
print(1)
return st.upper()
session.udf.register("format_str", format_column)
df.select("id",format_column(df.artists)).show(truncate=False)
# schema = t.StructType(
# [
# t.StructField("artists", t.ArrayType(t.StringType()), True),
# t.StructField("id", t.StringType(), True)
#
# ]
# )
df.show(truncate=False)
The UDF is still not complete but with the error, I am not able to move further. When I run the above code I am getting below error:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
df.select("id",format_column(df.artists)).show(truncate=False)
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what wrong with the code.
You get this error because you are calling the python function format_column instead of the registered UDF format_str.
You should be using :
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, the way you registered the UDF you can't use it with DataFrame API but only in Spark SQL. If you want to use it within DataFrame API you should define the function like this :
format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using annotation syntax:
#F.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.
well , I see that you are using a predined spark function in the definition of an UDF which is acceptable as you said that you are starting with some examples , your error means that there is no method called upper for a column however you can correct that error using this defintion:
#f.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
for example :
Main question:
When processing data batch per batch, how to handle changing schema in pyarrow ?
Long story:
As an example, I have the following data
| col_a | col_b |
-----------------
| 10 | 42 |
| 41 | 21 |
| 'foo' | 11 |
| 'bar' | 99 |
I'm working with python 3.7 and using pandas 1.1.0.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
col_a col_b
0 10 42
1 41 21
2 foo 11
3 bar 99
>>> df.dtypes
col_a object
col_b int64
dtype: object
>>>
I need to start working with Apache Arrow using PyArrow 1.0.1 implementation. In my application, we are working batch per batch. This means that we see part of the data, thus part of data types.
>>> dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
>>> dfi
<pandas.io.parsers.TextFileReader object at 0x7fabae915c50>
>>> dfg = next(dfi)
>>> dfg
col_a col_b
0 10 42
1 41 21
>>> sub_1 = next(dfi)
>>> sub_2 = next(dfi)
>>> sub_1
col_a col_b
2 foo 11
3 bar 99
>>> dfg2
col_a col_b
2 foo 11
3 bar 99
>>> sub_1.dtypes
col_a int64
col_b int64
dtype: object
>>> sub_2.dtypes
col_a object
col_b int64
dtype: object
>>>
My goal is to persist this whole dataframe using parquet format of Apache Arrow in the constraint of working batch per batch. It requires us to correctly fill the schema. How does one handle dtypes that change over batchs ?
Here's the full code to reproduce the problem using above data.
from pyarrow import RecordBatch, RecordBatchFileWriter, RecordBatchFileReader
import pandas as pd
pd.DataFrame([['10', 42], ['41', 21], ['foo', 11], ['bar', 99]], columns=['col_a', 'col_b']).to_csv('test.csv')
dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
sub_1 = next(dfi)
sub_2 = next(dfi)
# No schema provided here. Pyarrow should infer the schema from data. The first column is identified as a col of int.
batch_to_write_1 = RecordBatch.from_pandas(sub_1)
schema = batch_to_write_1.schema
writer = RecordBatchFileWriter('test.parquet', schema)
writer.write(batch_to_write_1)
# We expect to keep the same schema but that is not true, the schema does not match sub_2 data. So the
# following line launch an exception.
batch_to_write_2 = RecordBatch.from_pandas(sub_2, schema)
# writer.write(batch_to_write_2) # This will fail bcs batch_to_write_2 is not defined
We get the following exception
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 858, in pyarrow.lib.RecordBatch.from_pandas
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)
This behavior is intended. Some alternatives to try (I believe they should work but I haven't tested all of them):
If you know the final schema up front construct it by hand in pyarrow instead of relying on inferred one from the first record batch.
Go through all the data and compute a final schema. Then reprocess the data with the new schema.
Detect a schema change and recast previous record batches.
Detect the schema change and start a new table (you will would then end up with one parquet file per schema and you would need another process to unify the schemas).
Lastly if it works, and you are trying to tranform CSV data, you might consider using the built in Arrow CSV parser.
Using this code to find modal :
import numpy as np
np.random.seed(1)
df2 = sc.parallelize([
(int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])
cnts = df2.groupBy("x").count()
mode = cnts.join(
cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
from Calculate the mode of a PySpark DataFrame column?
returns error :
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-53-2a9274e248ac> in <module>()
8 cnts = df.groupBy("x").count()
9 mode = cnts.join(
---> 10 cnts.agg(max("count").alias("max_")), col("count") == col("max_")
11 ).limit(1).select("x")
12 mode.first()[0]
AttributeError: 'str' object has no attribute 'alias'
Instead of this solution I'm attempting this custom one:
df.show()
cnts = df.groupBy("c1").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1], ascending=False).first()
cnts = df.groupBy("c2").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1] , ascending=False).first()
which returns :
So modal of c1 & c2 are 2.0 and 3.0 respectively
Can this be applied to all columns c1,c2,c3,c4,c5 in dataframe instead of explicitly selecting each column as I have done ?
It looks like you're using built-in max, not a SQL function.
import pyspark.sql.functions as F
cnts.agg(F.max("count").alias("max_"))
To find mode over multiple columns of the same type you can reshape to long (melt as defined in Pandas Melt function in Apache Spark):
(melt(df, [], df.columns)
# Count by column and value
.groupBy("variable", "value")
.count()
# Find mode per column
.groupBy("variable")
.agg(F.max(F.struct("count", "value")).alias("mode"))
.select("variable", "mode.value"))
+--------+-----+
|variable|value|
+--------+-----+
| c5| 6.0|
| c1| 2.0|
| c4| 5.0|
| c3| 4.0|
| c2| 3.0|
+--------+-----+