This question already has answers here:
Pivot String column on Pyspark Dataframe
(2 answers)
Closed 6 years ago.
I have a problem restructuring data using Spark. The original data looks like this:
df = sqlContext.createDataFrame([
("ID_1", "VAR_1", "Butter"),
("ID_1", "VAR_2", "Toast"),
("ID_1", "VAR_3", "Ham"),
("ID_2", "VAR_1", "Jam"),
("ID_2", "VAR_2", "Toast"),
("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])
>>> df.show()
+----+-----+------+
| ID| VAR| VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3| Ham|
|ID_2|VAR_1| Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3| Egg|
+----+-----+------+
This is the structure I am trying to achieve:
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
My idea was to use:
df.groupBy("ID").pivot("VAR").show()
But I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
You need to add an aggregation after pivot(). If you are sure there is only one "VAL" for each ("ID", "VAR") pair, you can use first():
from pyspark.sql import functions as f
result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
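If you're not sure there is only one "VAL" per ("ID", "VAR") pair, a minimal variant (reusing the same df and the f alias from above) is to collect every value into a list instead of taking the first:

result_multi = df.groupBy("ID").pivot("VAR").agg(f.collect_list("VAL"))
result_multi.show()

Each VAR_* cell then holds an array of values rather than a single string.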
Related
I want to validate a date column for a PySpark dataframe. I know how to do it for pandas, but can't make it work for PySpark.
import pandas as pd
import datetime
from datetime import datetime
data = [['Alex',10, '2001-01-12'],['Bob',12, '2005-10-21'],['Clarke',13, '2003-12-41']]
df = pd.DataFrame(data,columns=['Name','Sale_qty', 'DOB'])
sparkDF =spark.createDataFrame(df)
def validate(date_text):
    try:
        if date_text != datetime.strptime(date_text, "%Y-%m-%d").strftime('%Y-%m-%d'):
            raise ValueError
        return True
    except ValueError:
        return False
df = df['DOB'].apply(lambda x: validate(x))
print(df)
It works for the pandas dataframe, but I can't make it work for PySpark. I get the following error:
sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError Traceback (most recent call last)
<ipython-input-83-5f5f1db1c7b3> in <module>
----> 1 sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError: 'Column' object is not callable
You could use the following column expression:
F.to_date('DOB', 'yyyy-M-d').isNotNull()
Full test:
from pyspark.sql import functions as F
data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005'], ['Clarke', 13, '2003-12-41']]
df = spark.createDataFrame(data, ['Name', 'Sale_qty', 'DOB'])
validation = F.to_date('DOB', 'yyyy-M-d').isNotNull()
df.withColumn('validation', validation).show()
# +------+--------+----------+----------+
# | Name|Sale_qty| DOB|validation|
# +------+--------+----------+----------+
# | Alex| 10|2001-01-12| true|
# | Bob| 12| 2005| false|
# |Clarke| 13|2003-12-41| false|
# +------+--------+----------+----------+
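As a follow-up sketch, the same expression can also filter rows instead of just flagging them (assuming the df and validation variables defined above):

valid_rows = df.filter(validation)
invalid_rows = df.filter(~validation)
valid_rows.show()
# Only Alex's row should remain, since '2005' and '2003-12-41' do not parse.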
You can use to_date() with the required source date format. It returns null where the format is incorrect, which can be used for validation.
See the example below.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('01-12-2001',), ('2001-01-12',)]).toDF(['dob']). \
    withColumn('correct_date_format', func.to_date('dob', 'yyyy-MM-dd').isNotNull()). \
    show()
# +----------+-------------------+
# | dob|correct_date_format|
# +----------+-------------------+
# |01-12-2001| false|
# |2001-01-12| true|
# +----------+-------------------+
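If you specifically want to keep your Python validate logic rather than to_date, a minimal sketch wrapping it in a UDF (assuming the sparkDF from the question; validate_udf is an illustrative name, and the decorator form needs a reasonably recent PySpark):

from datetime import datetime
from pyspark.sql import functions as F

@F.udf("boolean")
def validate_udf(date_text):
    # Strict %Y-%m-%d check, mirroring the pandas version in the question
    try:
        return date_text == datetime.strptime(date_text, "%Y-%m-%d").strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return False

sparkDF.withColumn("validation", validate_udf("DOB")).show()

Note that the built-in to_date approach above is generally preferable to a Python UDF for performance.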
I am trying to format the string in one of the columns using a pyspark udf.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
session.udf.register("format_str", format_column)
df.select("id",format_column(df.artists)).show(truncate=False)
# schema = t.StructType(
# [
# t.StructField("artists", t.ArrayType(t.StringType()), True),
# t.StructField("id", t.StringType(), True)
#
# ]
# )
df.show(truncate=False)
The UDF is not complete yet, but this error keeps me from moving further. When I run the above code I get the error below:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
df.select("id",format_column(df.artists)).show(truncate=False)
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what is wrong with the code.
You get this error because you are calling the python function format_column instead of the registered UDF format_str.
You should be using:
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, the way you registered the UDF, it can only be used in Spark SQL, not with the DataFrame API. If you want to use it with the DataFrame API you should define the function like this:
from pyspark.sql.types import StringType

format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using annotation syntax:
#F.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.
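For that built-in route, a minimal sketch using F.upper (no UDF involved; artists_upper is just an illustrative alias):

df.select("id", F.upper("artists").alias("artists_upper")).show(truncate=False)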
Well, I see that you are using a predefined Spark-style method inside the definition of a UDF, which is acceptable since you said you are starting with some examples. Your error means that there is no callable upper method on a Column; you can correct that error using this definition:
#f.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
For example:
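A usage sketch, assuming the same df with id and artists columns as in the question (this mirrors the call in the first answer):

df.select("id", format_column(df.artists)).show(truncate=False)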
I wanted to convert the Spark data frame to an RDD using the code below:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
The detailed error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
1 from pyspark.mllib.clustering import KMeans
2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
842 if name not in self.columns:
843 raise AttributeError(
--> 844 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
845 jc = self._jdf.apply(name)
846 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'map'
Does anyone know what I did wrong here? Thanks!
You can't map a dataframe, but you can convert the dataframe to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
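A minimal sketch of that fix applied to the code from the question (assuming the same spark_df):

from pyspark.mllib.linalg import Vectors

# Go through .rdd explicitly before mapping (required since Spark 2.0)
rdd = spark_df.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))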
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:
Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and the resulting PySpark transformations are less efficient.
What should you do instead?
Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.
Another example is using explode instead of flatMap (which exists on RDDs); the snippet below is Scala, and a PySpark equivalent is sketched after the output:
df.select($"name",explode($"knownLanguages"))
.show(false)
Result:
+-------+------+
|name |col |
+-------+------+
|James |Java |
|James |Scala |
|Michael|Spark |
|Michael|Java |
|Michael|null |
|Robert |CSharp|
|Robert | |
+-------+------+
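A PySpark sketch of the same explode idea (assuming a df with name and knownLanguages columns, matching the Scala example above):

from pyspark.sql.functions import explode

df.select("name", explode("knownLanguages")).show(truncate=False)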
You can also use withColumn or UDF, depending on the use-case, or another option in the DataFrame API.
I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pyspark.ml.classification.LogisticRegression
I tried to do the following; however, it resulted in the error below because I'm using Spark 1.5.2:
df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm
I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error
import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils
Question 1: Is my approach above to convert the dataframe to libsvm format going in the right direction?
Question 2: If "yes" to question 1, how do I get MLUtils working? If "no", what is the best way to convert a dataframe to libsvm format?
I would do it like this (it's just an example with an arbitrary dataframe; I don't know how your df1 is built, the focus is on the data transformations):
This is my way to convert a dataframe to libsvm format:
# ... your previous imports
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 3| 6|
| 4| 5| 20|
| 7| 8| 8|
+---+---+---+
# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe in a RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
What you will see in the "/your/Path/nameFolder/part-0000*" files is:
1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
See the LabeledPoint documentation for details.
I had to do this for it to work
D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))
If you want to convert sparse vectors to a 'sparse' libsvm which is more efficient, try this:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
df = spark.createDataFrame([
(0, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)]))
], ["label", "features"])
df.show()
# +-----+-------------------+
# |label| features|
# +-----+-------------------+
# | 0|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# +-----+-------------------+
MLUtils.saveAsLibSVMFile(df.rdd.map(lambda x: LabeledPoint(x.label, MLLibVectors.fromML(x.features))), './libsvm')
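To check that either saved output loads back correctly, a minimal sketch using MLUtils.loadLibSVMFile, the function mentioned in the question (assuming the path from the save call above):

loaded = MLUtils.loadLibSVMFile(spark.sparkContext, "./libsvm")
print(loaded.take(3))
# Each element should be a LabeledPoint with a sparse feature vector.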
Using this code to find the mode:
import numpy as np
np.random.seed(1)
df2 = sc.parallelize([
(int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])
cnts = df2.groupBy("x").count()
mode = cnts.join(
cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
from Calculate the mode of a PySpark DataFrame column?
returns the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-53-2a9274e248ac> in <module>()
8 cnts = df.groupBy("x").count()
9 mode = cnts.join(
---> 10 cnts.agg(max("count").alias("max_")), col("count") == col("max_")
11 ).limit(1).select("x")
12 mode.first()[0]
AttributeError: 'str' object has no attribute 'alias'
Instead of this solution I'm attempting this custom one:
df.show()
cnts = df.groupBy("c1").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1], ascending=False).first()
cnts = df.groupBy("c2").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1] , ascending=False).first()
which returns the most frequent value and its count for each column, so the modes of c1 and c2 are 2.0 and 3.0 respectively.
Can this be applied to all columns c1, c2, c3, c4, c5 in the dataframe instead of explicitly selecting each column as I have done?
It looks like you're using Python's built-in max, not the Spark SQL function:
import pyspark.sql.functions as F
cnts.agg(F.max("count").alias("max_"))
To find the mode over multiple columns of the same type you can reshape to long format with melt (as defined in Pandas Melt function in Apache Spark; a sketch of that helper is included after the output below):
(melt(df, [], df.columns)
    # Count by column and value
    .groupBy("variable", "value")
    .count()
    # Find mode per column
    .groupBy("variable")
    .agg(F.max(F.struct("count", "value")).alias("mode"))
    .select("variable", "mode.value"))
+--------+-----+
|variable|value|
+--------+-----+
| c5| 6.0|
| c1| 2.0|
| c4| 5.0|
| c3| 4.0|
| c2| 3.0|
+--------+-----+
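Since melt is referenced but not defined here, a minimal sketch adapted from the linked question (assuming all melted columns can be cast to a common type):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # Build one struct<variable, value> per value column and wrap them in an array
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode to one row per (id, column) pair, then flatten the struct back out
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [col("vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)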