Pyspark Streaming with Pandas UDF - apache-spark

I am new to Spark Streaming and Pandas UDF. I am working on pyspark consumer from kafka, payload is of xml format and trying to parse the incoming xml by applying pandas udf
#pandas_udf("col1 string, col2 string",PandasUDFType.GROUPED_MAP)
def test_udf(df):
import xmltodict
from collections import MutableMapping
xml_str=df.iloc[0,0]
df_col=['col1', 'col2']
doc=xmltodict.parse(xml_str,dict_constructor=dict)
extract_needed_fields = { k:doc[k] for k in df_col }
return pd.DataFrame( [{'col1': 'abc', 'col2': 'def'}] , index=[0] , dtype="string" )
data=df.selectExpr("CAST(value AS STRING) AS value")
data.groupby("value").apply(test_udf).writeStream.format("console").start()
I get the below error
File "pyarrow/array.pxi", line 859, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Is this the right approach ? What am i doing wrong

It looks like, as if this is a more kind of undocumented limitation than a bug.
You cannot use any pandas type which will be stored as an array object, which has a method named __arrow_array__, because pyspark always defines a mask. The string type you used, is stored in a StringArray, which is such a case. After I converted the string dtype into object, the error went away.

While converting a pandas dataframe to a pyspark one, I stumbled upon this error as well :
Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol
My pandas dataframe had datetime-like values that I tried to convert to "string". I initially used the astype("string") method, which looked like this :
df["time"] = (df["datetime"].dt.time).astype("string")
When I tried to get the info of this dataframe, it seemed like it was indeed converted to a string type :
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null string
But the error kept coming back to me.
Solution
To avoid it, I instead went on to use the apply(str) method :
df["time"] = (df["datetime"].dt.time).apply(str)
Which gave me a type of object
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null object
After that, the conversion was successful
spark.createDataFrame(df)
# DataFrame[datetime: string, date: string, year: bigint, month: bigint, day: bigint, day_name: string, time: string, hour: bigint, minute: bigint]

Related

Pyspark change string to timestamptype

I am new to PySpark and I am trying to create a function that can be used across when inputted a column from String type to a timestampType.
This the input column string looks like: 23/04/2021 12:00:00 AM
I want this to be turned in to timestampType so I can get latest date using pyspark.
Below is the function I so far created:
def datetype_change(self, key, col):
self.log.info("datetype_change...".format(self.app_name.upper()))
self.df[key] = self.df[key].withColumn("column_name", F.unix_timestamp(F.col("column_name"), 'yyyy-MM-dd HH:mm:ss').cast(TimestampType()))
When I run it I'm getting an error:
NameError: name 'TimestampType' is not defined
How do I change this function so it can take the intended output?
Found my answer:
def datetype_change(self,key,col):
self.log.info("-datetype_change...".format(self.app_name.upper()))
self.df[key] = self.df[key].withColumn(col, F.unix_timestamp(self.df[key][col], 'dd/MM/yyyy hh:mm:ss aa').cast(TimestampType()))

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a python script that:
reads in a hdfs parquet file
converts it to a pandas dataframe
loops through specific columns and changes some values
writes the dataframe back to a parquet file
Then the parquet file is imported back into hdfs using impala-shell.
The issue I'm having appears to be with step 2. I have it print out the contents of the dataframe immediately after it reads it in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes it back to a parquet file. Examples:
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
fields that should be an Int with a value of 0 in the database are changed to "0.00000" and turned into a float in the dataframe.
It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:
WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]
I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.
I have been using this reference page:
http://arrow.apache.org/docs/python/parquet.html
It uses
pq.read_table(in_file)
to read the parquet file and then
df = table2.to_pandas()
to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
If I query the database, the data would look like this:
tport
0
1
My print(df) line for the same thing looks like this:
tport
0.00000
nan
nan
1.00000
Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path
# <CLI PROCESSING SECTION HERE>
# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
field_file = open('scrambler_columns.txt', 'r')
contents = field_file.read()
scrambler_columns = contents.split('\n')
def scramble_str(xstr):
#print(xstr + '_scrambled!')
return xstr + '_scrambled!'
parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)
df = table2.to_pandas() #dataframe
print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
#df.fillna(value='', inplace=True) # np.nan # \xa0
print(df) # print before making any changes
cols = list(df)
# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
#print(cols[index])
if col_name in scrambler_columns:
print('scrambling values in column ' + col_name)
for i, val in col_data.items():
df.at[i, col_name] = scramble_str(str(val))
print(df) # print after making changes
print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))
# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
for i in range(1):
writer.write_table(new_table)
writer.close()
if os.path.isfile(out_file) == True:
print('wrote ' + out_file)
else:
print('error writing file ' + out_file)
# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas() #dataframe
print(df)
EDIT
Here are the datatypes for the 1st few columns in hdfs
and here are the same ones that are in the pandas dataframe:
id object
col1 float64
col2 object
col3 object
col4 float64
col5 object
col6 object
col7 object
It appears to convert
String to object
Int to float64
bigint to float64
How can I tell pandas what data types the columns should be?
Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
This is expected. It's just how pandas print function is defined.
It appears to convert String to object
This is also expected. Numpy/pandas does not have a dtype for variable length strings. It's possible to use a fixed-length string type but that would be pretty unusual.
It appears to convert Int to float64
This is also expected since the column has nulls and numpy's int64 is not nullable. If you would like to use Pandas's nullable integer column you can do...
def lookup(t):
if pa.types.is_integer(t):
return pd.Int64Dtype()
df = table.to_pandas(types_mapper=lookup)
Of course, you could create a more fine grained lookup if you wanted to use both Int32Dtype and Int64Dtype, this is just a template to get you started.

DF - data conversion issues

I have a question regarding a project that I am doing. I am getting a conversion error, that cannot convert the series to <class 'int'> and cannot see the reason why. the values that I've got are int64 meanwhile the system tries to convert to a base 10.
I have a csv file called "test.csv" and it is structured like this:
date,value
2016-05-09,1201
2016-05-10,2329
2016-05-11,1716
2016-05-12,10539
...
I import the data, parse the dates and set the index column to 'date'.
df = pd.read_csv("test.csv", parse_dates=True)
df = df.set_index('date')
Afterwards I clean the data of the first and last 2.5%
df = df[(df['value'] >= (df['value'].quantile(0.025))) &(df['value'] <= (df['value'].quantile(0.975)))]
I print the data types that I've got and find only one:
print (df.dtypes)
value int64
dtype: object
If I run it against this code (as part of a test):
actual = int(time_series_visualizer.df.count(numeric_only=True))
I get this error:
TypeError: cannot convert the series to <class 'int'>
I was tried to convert to another type to see if it was an issue with int64.
tried:
df.value.astype(int)
df.value.astype(float)
but both didn't work.
Does anyone have any suggestions that I could try?
thanks

I would like to convert an int to string in dataframe

I would like to convert a column in dataframe to a string
it looks like this :
company department id family name start_date end_date
abc sales 38221925 Levy nali 16/05/2017 01/01/2018
I want to convert the id from int to string
I tried
data['id']=data['id'].to_string()
and
data['id']=data['id'].astype(str)
got dtype('O')
I expect to receive string
This is intended behaviour. This is how pandas stores strings.
From the docs
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy dataframe and check it's dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])
df[0].dtype
#Output:
dtype('O')
You can do that by using apply() function in this way:
data['id'] = data['id'].apply(lambda x: str(x))
This will convert all the values of id column to string.
You can ensure the type of the values like this:
type(data['id'][0]) (It is checking the first value of 'id' column)
This will give the output str.
And data['id'].dtype will give dtype('O') that is object.
You can also use data.info() to check all the information about that DataFrame.
str(12)
>>'12'
Can easily convert to a String

Processing json strings in a Spark dataframe column

When I look for ways to parse json within a string column of a dataframe, I keep running into results that more simply read json file sources. My source is actually a hive ORC table with some strings in one of the columns which is in a json format. I'd really like to convert that to something parsed like a map.
I'm having trouble finding a way to do this:
import java.util.Date
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSON
val items = sql("select * from db.items limit 10")
//items.printSchema
val internal = items.map {
case Row(externalid: Long, itemid: Long, locale: String,
internalitem: String, version: Long,
createdat: Date, modifiedat: Date)
=> JSON.parseFull(internalitem)
}
I thought this should work, but maybe there's a more Spark way of doing this instead because I get the following error:
java.lang.ClassNotFoundException: scala.Any
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
Specifically, my input data looks approximately like this:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{'name':'thing','attr':{
21:{'attrname':'size','attrval':'big'},
42:{'attrname':'color','attrval':'red'}
}}", 1, 2017-05-05…, 2017-05-06…
Yes it's not RFC 7158 exactly.
The attr keys can be 5 to 30 of any 80,000 values, so I wanted get to something like this instead:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{"name':"thing","attr":[
{"id":21,"attrname':"size","attrval":"big"},
{"id":42,"attrname":"color","attrval":"red"}
]}", 1, 2017-05-05…, 2017-05-06…
Then flatten the internalitem to fields and explode the attr array:
externalid, itemid, locale, name, attrid, attrname attrval, version, createdat, modifiedat
123, 321, "en_us", "thing", 21, "size", "big", 1, 2017-05-05…, 2017-05-06…
123, 321, "en_us", "thing", 21, "color", "red", 1, 2017-05-05…, 2017-05-06…
I've never been using such computations, but I have an advice for you :
Before doing any operation on columns on your own, just check the sql.functions package which contain a whole bunch of helpful functions to work with columns like date extracting and formatting, string concatenation and spliting, ... and it also provide a couple of functions to work with json objects like : from_json and json_tuple.
To use those methods you simply need to import them and call them inside a select method like this :
import spark.implicits._
import org.apache.spark.sql.functions._
val transofrmedDf = df.select($"externalid", $"itemid", … from_json($"internalitem", schema), $"version" …)
First of all you have to create a schema for your json column and put it in the schema variable
Hope it helps.

Resources