I wanted to convert the Spark data frame to an RDD using the code below:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
The detailed error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
1 from pyspark.mllib.clustering import KMeans
2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
842 if name not in self.columns:
843 raise AttributeError(
--> 844 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
845 jc = self._jdf.apply(name)
846 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'map'
Does anyone know what I did wrong here? Thanks!
You can't map a dataframe, but you can convert the dataframe to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
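A minimal sketch of the fixed snippet (assuming the same pandas_df, and adding the Vectors import that the original code relies on but never imports):
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

spark_df = sqlContext.createDataFrame(pandas_df)

# DataFrame has no .map() in Spark 2.x; go through .rdd first.
# Each Row is iterable, so it can be turned into a dense vector directly.
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))

# `runs` is deprecated and has no effect since Spark 2.0, so it is omitted here
model = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random")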
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:
Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations become less efficient.
What should you do instead?
Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
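For instance, the float conversion done with map above can stay in the DataFrame API via selectExpr; a sketch, assuming the column names are plain SQL identifiers:
# Cast every column to double without leaving the DataFrame API
casted = spark_df.selectExpr(*["CAST({0} AS DOUBLE) AS {0}".format(c) for c in spark_df.columns])
casted.show()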
Another example is using explode instead of flatMap (which exists on RDDs):
df.select($"name",explode($"knownLanguages"))
.show(false)
Result:
+-------+------+
|name |col |
+-------+------+
|James |Java |
|James |Scala |
|Michael|Spark |
|Michael|Java |
|Michael|null |
|Robert |CSharp|
|Robert | |
+-------+------+
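Since the question is in PySpark, the same explode example might look like this in the Python API (assuming a DataFrame with name and knownLanguages columns):
from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)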
You can also use withColumn or a UDF, depending on the use case, or another option in the DataFrame API.
I am trying to format the string in one of the columns using a PySpark UDF.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
session.udf.register("format_str", format_column)
df.select("id",format_column(df.artists)).show(truncate=False)
# schema = t.StructType(
# [
# t.StructField("artists", t.ArrayType(t.StringType()), True),
# t.StructField("id", t.StringType(), True)
#
# ]
# )
df.show(truncate=False)
The UDF is still not complete, but with this error I am not able to move further. When I run the above code I get the error below:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
df.select("id",format_column(df.artists)).show(truncate=False)
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what is wrong with the code.
You get this error because you are calling the python function format_column instead of the registered UDF format_str.
You should be using:
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, with the way you registered the UDF you can only use it in Spark SQL, not with the DataFrame API. If you want to use it within the DataFrame API you should define the function like this:
from pyspark.sql.types import StringType

format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using the decorator syntax:
@F.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.
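For this specific case, a built-in alternative could be as simple as the following sketch, using pyspark.sql.functions.upper:
from pyspark.sql import functions as F

df.select("id", F.upper("artists").alias("artists")).show(truncate=False)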
Well, I see that you are using a predefined Spark function inside the definition of a UDF, which is acceptable since you said you are starting with some examples. Your error means that you cannot call upper on a Column object; however, you can correct that error using this definition:
#f.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
For example:
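Here is a minimal usage sketch with the decorated UDF, assuming the same df with artists and id columns and no null values in artists:
import pyspark.sql.functions as f

@f.udf("string")
def format_column(st):
    # st is the plain Python string value for each row here, not a Column
    return st.upper()

df.select("id", format_column(f.col("artists"))).show(truncate=False)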
Main question:
When processing data batch by batch, how do you handle a changing schema in pyarrow?
Long story:
As an example, I have the following data
| col_a | col_b |
-----------------
| 10 | 42 |
| 41 | 21 |
| 'foo' | 11 |
| 'bar' | 99 |
I'm working with python 3.7 and using pandas 1.1.0.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
col_a col_b
0 10 42
1 41 21
2 foo 11
3 bar 99
>>> df.dtypes
col_a object
col_b int64
dtype: object
>>>
I need to start working with Apache Arrow, using the PyArrow 1.0.1 implementation. In my application, we work batch by batch. This means that we only see part of the data at a time, and thus only part of the data types.
>>> dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
>>> dfi
<pandas.io.parsers.TextFileReader object at 0x7fabae915c50>
>>> sub_1 = next(dfi)
>>> sub_2 = next(dfi)
>>> sub_1
  col_a  col_b
0    10     42
1    41     21
>>> sub_2
  col_a  col_b
2   foo     11
3   bar     99
>>> sub_1.dtypes
col_a int64
col_b int64
dtype: object
>>> sub_2.dtypes
col_a object
col_b int64
dtype: object
>>>
My goal is to persist this whole dataframe in the Parquet format of Apache Arrow, under the constraint of working batch by batch. That requires filling in the schema correctly. How does one handle dtypes that change across batches?
Here's the full code to reproduce the problem using the above data.
from pyarrow import RecordBatch, RecordBatchFileWriter, RecordBatchFileReader
import pandas as pd
pd.DataFrame([['10', 42], ['41', 21], ['foo', 11], ['bar', 99]], columns=['col_a', 'col_b']).to_csv('test.csv')
dfi = pd.read_csv('test.csv', iterator=True, chunksize=2)
sub_1 = next(dfi)
sub_2 = next(dfi)
# No schema provided here. Pyarrow should infer the schema from data. The first column is identified as a col of int.
batch_to_write_1 = RecordBatch.from_pandas(sub_1)
schema = batch_to_write_1.schema
writer = RecordBatchFileWriter('test.parquet', schema)
writer.write(batch_to_write_1)
# We expect to keep the same schema, but that is not true: the schema does not match the sub_2
# data, so the following line raises an exception.
batch_to_write_2 = RecordBatch.from_pandas(sub_2, schema)
# writer.write(batch_to_write_2)  # This would also fail, because batch_to_write_2 is never defined
We get the following exception
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 858, in pyarrow.lib.RecordBatch.from_pandas
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 579, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
File "/mnt/e/miniconda/envs/pandas/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)
This behavior is intended. Some alternatives to try (I believe they should work but I haven't tested all of them):
If you know the final schema up front, construct it by hand in pyarrow instead of relying on the one inferred from the first record batch (see the sketch after this list).
Go through all the data and compute a final schema. Then reprocess the data with the new schema.
Detect a schema change and recast previous record batches.
Detect the schema change and start a new table (you would then end up with one parquet file per schema and would need another process to unify the schemas).
Lastly, if it works for your use case and you are trying to transform CSV data, you might consider using the built-in Arrow CSV parser.
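For example, option 1 could look roughly like the following sketch, which declares col_a as a string column up front and casts each pandas chunk before conversion (note that RecordBatchFileWriter produces the Arrow IPC format; for actual Parquet files one would use pyarrow.parquet.ParquetWriter with the same hand-built schema):
import pyarrow as pa
from pyarrow import RecordBatch, RecordBatchFileWriter

# Final schema declared by hand: col_a is a string even though the
# first batch happens to contain only integer-looking values.
schema = pa.schema([
    pa.field('col_a', pa.string()),
    pa.field('col_b', pa.int64()),
])

writer = RecordBatchFileWriter('test.arrow', schema)
for chunk in (sub_1, sub_2):
    # Cast each pandas chunk so it matches the declared schema
    chunk = chunk.astype({'col_a': str, 'col_b': 'int64'})
    writer.write(RecordBatch.from_pandas(chunk, schema=schema, preserve_index=False))
writer.close()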
I'm trying to return a specific structure from a pandas_udf. It worked on one cluster but fails on another.
I am trying to run a UDF on groups, which requires the return type to be a data frame.
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np
from pyspark.sql.types import *
schema = StructType([
    StructField("Distance", FloatType()),
    StructField("CarId", IntegerType())
])
def haversine(lon1, lat1, lon2, lat2):
    # Calculate distance, return scalar
    return 3.5  # Removed logic to facilitate reading
@pandas_udf(schema)
def totalDistance(oneCar):
    dist = haversine(oneCar.Longtitude.shift(1),
                     oneCar.Latitude.shift(1),
                     oneCar.loc[1:, 'Longitude'],
                     oneCar.loc[1:, 'Latitude'])
    return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0], "Distance": np.sum(dist)}, index=[0])

# Calculate the overall distance made by each car
distancePerCar = df.groupBy('CarId').apply(totalDistance)
This is the exception I'm getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
114 try:
--> 115 to_arrow_type(self._returnType_placeholder)
116 except TypeError:
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
During handling of the above exception, another exception occurred:
NotImplementedError Traceback (most recent call last)
<ipython-input-35-4f2194cfb998> in <module>()
18 km = 6367 * c
19 return km
---> 20 #pandas_udf("CarId: int, Distance: float")
21 def totalDistance(oneUser):
22 dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
62 udf_obj = UserDefinedFunction(
63 f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
---> 64 return udf_obj._wrapped()
65
66
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
184
185 wrapper.func = self.func
--> 186 wrapper.returnType = self.returnType
187 wrapper.evalType = self.evalType
188 wrapper.deterministic = self.deterministic
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
117 raise NotImplementedError(
118 "Invalid returnType with scalar Pandas UDFs: %s is "
--> 119 "not supported" % str(self._returnType_placeholder))
120 elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
121 if isinstance(self._returnType_placeholder, StructType):
NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported
I've also tried changing the schema to
#pandas_udf("<CarId:int,Distance:float>")
and
#pandas_udf("CarId:int,Distance:float")
but get the same exception. I suspect it has to do with my pyarrow version, which isn't compatible with my pyspark version.
Any help would be appreciated. Thanks!
As reported in the error message ("Invalid returnType with scalar Pandas UDFs"), you are trying to create a SCALAR vectorized pandas UDF, but using a StructType schema and returning a pandas DataFrame.
You should rather declare your function as a GROUPED MAP pandas UDF, i.e.:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
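A sketch of the full declaration, reusing the schema and function body from the question (the question mixes Longtitude and Longitude; the sketch assumes the column is named Longitude, and that haversine and df are defined as above):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType
import pandas as pd
import numpy as np

schema = StructType([
    StructField("Distance", FloatType()),
    StructField("CarId", IntegerType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    # oneCar is a pandas DataFrame holding every row of one CarId group
    dist = haversine(oneCar.Longitude.shift(1),
                     oneCar.Latitude.shift(1),
                     oneCar.loc[1:, 'Longitude'],
                     oneCar.loc[1:, 'Latitude'])
    # One output row per group, matching the declared StructType
    return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0],
                         "Distance": np.sum(dist)}, index=[0])

distancePerCar = df.groupBy('CarId').apply(totalDistance)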
The difference between scalar and grouped vectorized UDFs is explained in the PySpark docs: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf.
A scalar UDF defines a transformation: One or more pandas.Series -> A pandas.Series. The returnType should be a primitive data type, e.g., DoubleType(). The length of the returned pandas.Series must be the same as that of the input pandas.Series.
To summarize, a scalar pandas UDF processes a column at a time (a pandas Series), leading to better performance than traditional UDFs that process one row element at a time. Note that the performance improvement is due to efficient python serialization using PyArrow.
A grouped map UDF defines a transformation: A pandas.DataFrame -> A pandas.DataFrame. The returnType should be a StructType describing the schema of the returned pandas.DataFrame. The length of the returned pandas.DataFrame can be arbitrary and the columns must be indexed so that their position matches the corresponding field in the schema.
A grouped pandas UDF processes multiple rows and columns at a time (using a pandas DataFrame, not to be confused with a Spark DataFrame), and is extremely useful and efficient for multivariate operations (especially when using local python numerical analysis and machine learning libraries like numpy, scipy, scikit-learn etc.). In this case, the output is a single-row DataFrame with several columns.
Note that I did not check the internal logic of the code, only the methodology.
Using this code to find the mode:
import numpy as np
np.random.seed(1)
df2 = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])
cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
from Calculate the mode of a PySpark DataFrame column?
returns this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-53-2a9274e248ac> in <module>()
8 cnts = df.groupBy("x").count()
9 mode = cnts.join(
---> 10 cnts.agg(max("count").alias("max_")), col("count") == col("max_")
11 ).limit(1).select("x")
12 mode.first()[0]
AttributeError: 'str' object has no attribute 'alias'
Instead of this solution I'm attempting this custom one:
df.show()
cnts = df.groupBy("c1").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1], ascending=False).first()
cnts = df.groupBy("c2").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1] , ascending=False).first()
which returns the (value, count) pair with the highest count for each column. So the modes of c1 and c2 are 2.0 and 3.0 respectively.
Can this be applied to all columns c1, c2, c3, c4, c5 in the dataframe instead of explicitly selecting each column as I have done?
It looks like you're using Python's built-in max, not the SQL function.
import pyspark.sql.functions as F
cnts.agg(F.max("count").alias("max_"))
To find the mode over multiple columns of the same type you can reshape to long format (melt as defined in Pandas Melt function in Apache Spark; a sketch of such a helper appears after the result table):
(melt(df, [], df.columns)
    # Count by column and value
    .groupBy("variable", "value")
    .count()
    # Find mode per column
    .groupBy("variable")
    .agg(F.max(F.struct("count", "value")).alias("mode"))
    .select("variable", "mode.value"))
+--------+-----+
|variable|value|
+--------+-----+
| c5| 6.0|
| c1| 2.0|
| c4| 5.0|
| c3| 4.0|
| c2| 3.0|
+--------+-----+
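melt is not a built-in PySpark function; a minimal sketch of such a helper, along the lines of the linked answer (assuming the melted columns share a compatible type), is:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (variable, value) struct per melted column, packed into an array
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode the array so each original row yields one row per melted column
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)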
I have a problem restructuring data using Spark. The original data looks like this:
df = sqlContext.createDataFrame([
("ID_1", "VAR_1", "Butter"),
("ID_1", "VAR_2", "Toast"),
("ID_1", "VAR_3", "Ham"),
("ID_2", "VAR_1", "Jam"),
("ID_2", "VAR_2", "Toast"),
("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])
>>> df.show()
+----+-----+------+
| ID| VAR| VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3| Ham|
|ID_2|VAR_1| Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3| Egg|
+----+-----+------+
This is the structure I try to achieve:
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
My idea was to use:
df.groupBy("ID").pivot("VAR").show()
But I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
You need to add an aggregation after pivot(). If you are sure there is only one "VAL" for each ("ID", "VAR") pair, you can use first():
from pyspark.sql import functions as f
result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
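If the same (ID, VAR) pair could carry several values, first() silently keeps only one of them; an alternative is to collect all of them, for example:
# Each cell becomes a list of all VAL entries for that (ID, VAR) pair
result = df.groupBy("ID").pivot("VAR").agg(f.collect_list("VAL"))
result.show()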