Inconsistent schema in Apache Arrow - python-3.x

Main question:
When processing data batch by batch, how do you handle a changing schema in PyArrow?
Long story:
As an example, I have the following data
| col_a | col_b |
-----------------
| 10 | 42 |
| 41 | 21 |
| 'foo' | 11 |
| 'bar' | 99 |
I'm working with Python 3.7 and pandas 1.1.0.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
  col_a  col_b
0    10     42
1    41     21
2   foo     11
3   bar     99
>>> df.dtypes
col_a    object
col_b     int64
dtype: object
>>>
I need to start working with Apache Arrow, using the PyArrow 1.0.1 implementation. In my application, we work batch by batch. This means we only see part of the data at a time, and therefore only part of the data types.
>>> dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
>>> dfi
<pandas.io.parsers.TextFileReader object at 0x7fabae915c50>
>>> sub_1 = next(dfi)
>>> sub_2 = next(dfi)
>>> sub_1
   col_a  col_b
0     10     42
1     41     21
>>> sub_2
  col_a  col_b
2   foo     11
3   bar     99
>>> sub_1.dtypes
col_a    int64
col_b    int64
dtype: object
>>> sub_2.dtypes
col_a    object
col_b     int64
dtype: object
>>>
My goal is to persist this whole dataframe in Parquet format using Apache Arrow, under the constraint of working batch by batch. That requires filling in the schema correctly. How does one handle dtypes that change across batches?
Here's the full code to reproduce the problem using above data.
from pyarrow import RecordBatch, RecordBatchFileWriter, RecordBatchFileReader
import pandas as pd
pd.DataFrame([['10', 42], ['41', 21], ['foo', 11], ['bar', 99]], columns=['col_a', 'col_b']).to_csv('test.csv')
dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
sub_1 = next(dfi)
sub_2 = next(dfi)
# No schema provided here, so PyArrow infers the schema from the data. The first column is identified as a column of int64.
batch_to_write_1 = RecordBatch.from_pandas(sub_1)
schema = batch_to_write_1.schema
writer = RecordBatchFileWriter('test.parquet', schema)
writer.write(batch_to_write_1)
# We expect to keep the same schema, but that is not true: the inferred schema does not match
# the sub_2 data, so the following line raises an exception.
batch_to_write_2 = RecordBatch.from_pandas(sub_2, schema)
# writer.write(batch_to_write_2)  # This would fail because batch_to_write_2 never gets defined
We get the following exception
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 858, in pyarrow.lib.RecordBatch.from_pandas
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)
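The failure happens when col_a of the second chunk (strings) is converted against the int64 type inferred from the first chunk. A stripped-down illustration of just that conversion (my own example, not from the original post):
import pandas as pd
import pyarrow as pa

# col_a was inferred as int64 from sub_1, but sub_2 holds strings.
pa.array(pd.Series(['foo', 'bar']), type=pa.int64(), from_pandas=True)
# raises the TypeError shown above under pyarrow 1.0.1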

This behavior is intended. Some alternatives to try (I believe they should work but I haven't tested all of them):
If you know the final schema up front, construct it by hand in pyarrow instead of relying on the schema inferred from the first record batch (see the sketch below).
Go through all the data and compute a final schema. Then reprocess the data with the new schema.
Detect a schema change and recast previous record batches.
Detect the schema change and start a new table (you would then end up with one parquet file per schema and would need another process to unify the schemas).
Lastly, if it works for your use case and you are transforming CSV data, you might consider using the built-in Arrow CSV parser.
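For the first alternative, a minimal sketch (my own illustration based on the sample data above, not part of the original answer): declare col_a as string up front, cast each pandas chunk to match, and keep writing batches against that single schema.
import pandas as pd
import pyarrow as pa

# Hand-built schema: col_a is declared as string because later batches contain text.
schema = pa.schema([('col_a', pa.string()), ('col_b', pa.int64())])

dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
writer = pa.RecordBatchFileWriter('test.arrow', schema)
for chunk in dfi:
    chunk = chunk[['col_a', 'col_b']].copy()      # keep only the data columns
    chunk['col_a'] = chunk['col_a'].astype(str)   # force the pandas dtype to match the declared schema
    writer.write(pa.RecordBatch.from_pandas(chunk, schema=schema, preserve_index=False))
writer.close()
Note that RecordBatchFileWriter (as in the question's code) produces the Arrow IPC file format rather than Parquet; for real Parquet output the same idea works with pyarrow.parquet.ParquetWriter and write_table. For the last alternative, the built-in reader is pyarrow.csv.read_csv('test.csv'), which returns a pyarrow Table directly.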

Related

Import data in a csv file that is structured as list

I'm trying to import data into a pandas dataframe. The file type is 'csv' but the data in the file is structured as a python list. The below code is only returning the column headers. Any suggestions? What am I doing wrong?
import pandas as pd
data_path = pd.read_csv(r'C:\Users\john_smith\file_name.csv')
df = pd.DataFrame(data_path, columns=["article_id","author_id","viewer_id","view_date"])
df
An example of the data in the file is below. There aren't any headers in the file.
[[126,17,62,"2019-07-02"],[149,42,22,"2019-06-23"],[138,39,33,"2019-07-26"]]
Example of what is returned is below: an empty dataframe with just the column headers and no rows.
It's really not clear, but if you have a file that literally looks like:
file.csv
[[126,17,62,"2019-07-02"],[149,42,22,"2019-06-23"],[138,39,33,"2019-07-26"]]
We can attempt to read that with ast.literal_eval
from ast import literal_eval

with open('file.csv') as f:
    data = literal_eval(f.read())

print(data)
print(type(data))
# Output:
[[126, 17, 62, '2019-07-02'], [149, 42, 22, '2019-06-23'], [138, 39, 33, '2019-07-26']]
<class 'list'>
Now we can work with pandas:
df = pd.DataFrame(data, columns=["article_id","author_id","viewer_id","view_date"])
print(df)
# Output:
   article_id  author_id  viewer_id   view_date
0         126         17         62  2019-07-02
1         149         42         22  2019-06-23
2         138         39         33  2019-07-26
You can use this:
import pandas as pd

path_file = r'C:\Users\john_smith\file_name.csv'
df = pd.read_csv(path_file, delimiter=';', names=['data'])
df['data'] = df['data'][0][1:-1]
df = df.assign(**{'data': df['data'].str.split(r'\],\[')})
out = df.explode('data').replace({r'\[': '', r'\]': '', '"': ''}, regex=True)
out = (out['data'].str.split(',', expand=True)
       .rename(columns={0: 'article_id', 1: 'author_id', 2: 'viewer_id', 3: 'view_date'})
       .reset_index(drop=True)
      )
print(out)

GroupBy with ApplyInPandas in PySpark - how to implement a UDF correctly?

I am trying to use a PandasUDF in PySpark to find the 'longest unique tail' in a hierarchy.
For example, if my input is:
1.2
1.2.3
then the longest tail is '1.2.3'
I may also have multiple unique sets, for example:
1.2
1.2.3
5.6.7
5.6
in which case the output should be:
1.2.3
5.6.7
The approach I am using is:
sort the input so that similar rows are adjacent; then, if a preceding row is 'contained' in the following row, I can filter it out and return only the longest unique rows.
example input:
1.2.3
5.6.7
5.6
1.2
sorted becomes:
1.2
1.2.3
5.6
5.6.7
when I filter line on line, my output should be
1.2.3
5.6.7
I have tried two approaches.
First is to write a function that loops through a DF sent into it as follows:
def getLongestTail(key, pdf) -> pd.DataFrame:
    sortedData = pdf.sort_values(by='value')
    for i in range(len(sortedData)-1):
        if sortedData.index(i+1).loc['value'].startswith(sortedData.loc['value']):
            sortedData.index(i+1) = False
    return pd.DataFrame(sortedData)
Second is to use a lambda function inline
def getLongestTail(pdf) -> pd.DataFrame:
    pdf = pdf.sort
    return (lambda x: pdf.shift(1).loc['value'].startswith(pdf.loc['value']))
I have also tried to decorate as follows:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
Here is my overall code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd
from pyspark.sql.types import *
simpleData = [
    ('A', '1.2.3'),
    ('A', '1.2'),
    ('B', '9.8'),
    ('A', '5.6.7.8'),
    ('B', '9'),
    ('B', '9.8.7'),
    ('A', '5')]

schema = StructType([
    StructField("letter", StringType()),
    StructField("value", StringType())
])

def getLongestTail(pdf) -> pd.DataFrame:
    pdf = pdf.sort
    return pd.DataFrame((lambda x: pdf.loc['value'].startswith(pdf.shift(1).loc['value'])))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=simpleData, schema=schema)
df_result = df.groupby('letter').applyInPandas(getLongestTail, schema=schema).show()
The errors shown in my Jupyter notebook indicate that a worker crashed, along with errors relating to Py4JJavaError.
I am sure there is something basic I am missing - any comments appreciated.
Thank you.
===
error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/tmp/ipykernel_34305/1009949605.py in <module>
3 # df_grouped.show()
4
----> 5 df_result = df.groupby('letter').applyInPandas(getLongestTailL, schema=schema).show()
6
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Great question. The schema you pass to applyInPandas must describe exactly the dataframe your function returns. If the function adds a new result column, simply add that result field to your input schema and pass the extended schema as the output schema to applyInPandas.
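For completeness, here is a rough sketch of a grouped-map version that returns rows matching the (letter, value) schema already defined above; it reworks the question's getLongestTail with a plain Python loop instead of the shift/startswith lambda (my own illustration, not tested on a cluster):
import pandas as pd

def getLongestTail(pdf: pd.DataFrame) -> pd.DataFrame:
    # Sort so that any prefix appears immediately before the rows that extend it.
    sortedData = pdf.sort_values(by='value').reset_index(drop=True)
    values = sortedData['value'].tolist()
    # Keep a row only if the following row does not start with it.
    keep = [i == len(values) - 1 or not values[i + 1].startswith(values[i])
            for i in range(len(values))]
    return sortedData[keep]

# The schema passed here must describe exactly what the function returns;
# in this sketch the returned columns are the same (letter, value) pair.
df.groupby('letter').applyInPandas(getLongestTail, schema=schema).show()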

Format float to currency using PySpark and Babel

I'd like to convert a float to a currency using Babel and PySpark
sample data:
amount currency
2129.9 RON
1700 EUR
1268 GBP
741.2 USD
142.08091153 EUR
4.7E7 USD
0 GBP
I tried:
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), F.col('currency'),locale='be_BE'))
or
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), 'EUR',locale='be_BE'))
They both give me an error:
To use Python libraries with Spark dataframes, you need to use a UDF:
from babel.numbers import format_currency
import pyspark.sql.functions as F
format_currency_udf = F.udf(lambda a, c: format_currency(a, c))
df2 = df.withColumn(
    'amount',
    format_currency_udf('amount', 'currency')
)
df2.show()
+----------------+--------+
| amount|currency|
+----------------+--------+
| RON2,129.90| RON|
| €1,700.00| EUR|
| £1,268.00| GBP|
| US$741.20| USD|
| €142.08| EUR|
|US$47,000,000.00| USD|
+----------------+--------+
There seems to be a problem in pre-processing the amount column of your dataframe. From the error it is evident that the value, after being converted to a string, is not purely numeric (which it has to be according to this table) and has some additional characters as well. You can inspect the column to find and remove the unnecessary characters. As an example:
>>> import decimal
>>> value = '10.0'
>>> value = decimal.Decimal(str(value))
>>> value
Decimal('10.0')
>>> value = '10.0e'
>>> value = decimal.Decimal(str(value))
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
value = decimal.Decimal(str(value))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>] # as '10.0e' is not just numeric
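If the problem is indeed stray characters, here is a hedged sketch of one way to clean the column before formatting (the column names follow the question; the character whitelist in the regex is an assumption):
import pyspark.sql.functions as F

# Strip anything that is not a digit, sign, decimal point or exponent marker,
# then cast back to double before handing the value to format_currency.
df_clean = df.withColumn(
    'amount',
    F.regexp_replace(F.col('amount').cast('string'), '[^0-9eE+.-]', '').cast('double')
)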

How to read excel table with one column?

I have a table in Excel with one column that I want to read into a list:
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this:
df=pd.read_excel('temp.xlsx', header=None)
target_list=list(df[0].values)
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
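Both answers work for the same reason: header=None (and, in the second one, names=['A']) tells pandas that the file has no header row, so the first email address is kept as data instead of being consumed as the column name, which is exactly why it went missing in the attempts above.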

map in dataframe - pyspark [duplicate]

I wanted to convert the Spark data frame to an RDD using the code below:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
The detailed error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
1 from pyspark.mllib.clustering import KMeans
2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
842 if name not in self.columns:
843 raise AttributeError(
--> 844 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
845 jc = self._jdf.apply(name)
846 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'map'
Does anyone know what I did wrong here? Thanks!
You can't map a dataframe, but you can convert the dataframe to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
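Applied to the code in the question, that looks roughly like this (a sketch; it assumes Vectors comes from pyspark.mllib.linalg, which the original snippet does not show):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

# Drop down to the underlying RDD explicitly before mapping.
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, 2, maxIterations=10)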
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:
Converting to RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and less efficient PySpark transformations.
What should you do instead?
Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
Another example is using explode instead of flatMap (which exists on RDDs):
df.select($"name",explode($"knownLanguages"))
.show(false)
Result:
+-------+------+
|name |col |
+-------+------+
|James |Java |
|James |Scala |
|Michael|Spark |
|Michael|Java |
|Michael|null |
|Robert |CSharp|
|Robert | |
+-------+------+
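The snippet above is Scala; a rough PySpark equivalent (assuming df has a knownLanguages array column) would be:
from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)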
You can also use withColumn or UDF, depending on the use-case, or another option in the DataFrame API.
