PySpark: 'PipelinedRDD' object has no attribute 'show'

I want to find all the items that are in df but not in df1, and also the items that are in df1 but not in df:
df = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
df1 = sc.parallelize([4, 5, 6, 7, 8, 9, 10])
df2 = df.subtract(df1)
df2.show()
df3 = df1.subtract(df)
df3.show()
I just want to check the results to see whether I understand the function correctly, but I got this error:
'PipelinedRDD' object has no attribute 'show'
Any suggestions?

.show() only exists on Spark DataFrames, not on RDDs. To inspect the RDD, print a sample of it instead:
print(df2.take(10))

Alternatively, convert the RDD to a Spark DataFrame with createDataFrame and then call show() on it.
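A minimal sketch of that conversion, assuming a SparkSession named spark is available (the column name "value" is just an illustrative choice):
# wrap each integer in a tuple so a one-column schema can be inferred
df2_df = spark.createDataFrame(df2.map(lambda x: (x,)), ["value"])
df2_df.show()
df3_df = spark.createDataFrame(df3.map(lambda x: (x,)), ["value"])
df3_df.show()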

Related

Create column of decimal type

I would like to supply numbers when creating a Spark DataFrame, but I have trouble with decimal-type numbers.
This way the number gets truncated:
df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb                 |numb_dec              |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+
This fails:
df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)
TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>
How can I provide big decimal numbers correctly so that they don't get truncated?
Maybe this is related to differences in floating-point representation between Python and Spark. You can try passing string values when creating the DataFrame instead:
df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+----------------------+----------------------+
#|numb                  |numb_dec              |
#+----------------------+----------------------+
#|10234567891023456789.5|10234567891023456789.5|
#+----------------------+----------------------+
Try something like the following:
from pyspark.sql.types import StructType, StructField, DecimalType
from decimal import Context
schema = StructType([StructField('numb', DecimalType(30, 1))])
data = [(Context(prec=30, Emax=999, clamp=1).create_decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+----------------------+
|numb                  |
+----------------------+
|10234567891023456789.5|
+----------------------+
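As an aside (a sketch, assuming the same spark session and schema as above): the decimal.Decimal constructor stores a string argument exactly, so the explicit Context should not be required:
from decimal import Decimal
# Decimal('...') keeps all digits of the string literal
data = [(Decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)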

How to join a hive table with a pandas dataframe in pyspark?

I have a Hive table db.hive_table and a pandas DataFrame df. I wish to use pyspark.SparkSession.builder.enableHiveSupport().getOrCreate().sql to join them. How can I do this?
Convert both to PySpark DataFrames and then join them.
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
# Convert the Hive table to a DataFrame - sqlContext is of type HiveContext
df_hive = sqlContext.table(tablename)
Join the two DataFrames (the alias() calls are needed so that the 'df_sp.*' and 'df_hive.*' selectors resolve):
df_sp = df_sp.alias('df_sp')
df_hive = df_hive.alias('df_hive')
joined_df = df_sp.join(df_hive, df_sp.id == df_hive.id).select('df_sp.*', 'df_hive.*')
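Since the question mentions spark.sql, another route (a sketch, assuming the session was built with enableHiveSupport() so db.hive_table is visible, and assuming id is the join key) is to register the pandas-derived DataFrame as a temporary view and join in SQL:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df_sp = spark.createDataFrame(df)           # pandas -> Spark
df_sp.createOrReplaceTempView("pandas_df")  # expose it to SQL (Spark 2.0+)
joined_df = spark.sql("""
    SELECT h.*, p.*
    FROM db.hive_table h
    JOIN pandas_df p ON h.id = p.id
""")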

unbound method createDataFrame()

I am facing an error when trying to create a DataFrame from an RDD.
My code:
from pyspark import SparkConf, SparkContext
from pyspark import sql
conf = SparkConf()
conf.setMaster('local')
conf.setAppName('Test')
sc = SparkContext(conf = conf)
print sc.version
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sql.SQLContext.createDataFrame(rdd, ["id", "score"]).collect()
print df
Error:
df = sql.SQLContext.createDataFrame(rdd, ["id", "score"]).collect()
TypeError: unbound method createDataFrame() must be called with SQLContext
instance as first argument (got RDD instance instead)
I accomplished the same task in the Spark shell, where the last three lines of code print the values without any problem. I mainly suspect the import statements, because that is where the IDE and the shell differ.
You need to use an instance of SQLContext. So you could try something like the following:
sqlContext = sql.SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ["id", "score"]).collect()
More details in the PySpark documentation.
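For completeness, in Spark 2.x and later the usual entry point is SparkSession rather than SQLContext; a minimal sketch of the same program under that API would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('Test').getOrCreate()
rdd = spark.sparkContext.parallelize([(0, 1), (0, 1), (0, 2), (1, 2), (1, 10),
                                      (1, 20), (3, 18), (3, 18), (3, 18)])
df = spark.createDataFrame(rdd, ["id", "score"]).collect()
print(df)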

How to insert the rdd data into a dataframe in pyspark?

Please find the pseudocode below:
source dataframe with 5 columns
creating a target dataframe with a schema (6 columns)
for item in source_dataframe:
    # adding a column to the list by checking item.column2
    list = [item.column1, item.column2, newcolumn]
    # creating an rdd out of this list
    # now I need to add this rdd to a target dataframe
You could definitely explain your question in a bit more detail or give some sample code; I'm interested in how others will solve it. My proposed solution is this one:
df = (
    sc.parallelize([
        (134, "2016-07-02 12:01:40"),
        (134, "2016-07-02 12:21:23"),
        (125, "2016-07-02 13:22:56"),
        (125, "2016-07-02 13:27:07")
    ]).toDF(["itemid", "timestamp"])
)
rdd = df.rdd.map(lambda x: (x[0], x[1], 10))
df2 = rdd.toDF(["itemid", "timestamp", "newCol"])
df3 = (df.join(df2, (df.itemid == df2.itemid) & (df.timestamp == df2.timestamp), "inner")
         .drop(df2.itemid).drop(df2.timestamp))
I'm converting the RDD to a DataFrame. Afterwards I join both DataFrames, which duplicates some columns, so finally I drop those duplicated columns.
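If the new column can be computed from the existing columns, a simpler route (a sketch under that assumption; the constant 10 mirrors the example above) avoids the RDD round trip and the join entirely:
from pyspark.sql import functions as F

# add the new column directly on the DataFrame
df3 = df.withColumn("newCol", F.lit(10))
# or derive it conditionally from another column, e.g.:
# df3 = df.withColumn("newCol", F.when(df.itemid == 134, 10).otherwise(0))
df3.show()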

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce F.first(F.coalesce("code")) but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
I tried:
(df
    .groupby("id")
    .agg(F.first(F.coalesce("code")),
         F.first(F.coalesce("name")))
    .collect())
DESIRED OUTPUT
[Row(id='a', code='code1', name='name2')]
For Spark 1.3 - 1.5, this could do the trick:
from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()
+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+
Edit
Apparently, in version 1.6 they changed the way the first aggregate function is processed. Now, the underlying class First is constructed with a second ignoreNullsExpr parameter, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
Therefore, for Spark 1.6 the approach has to be different and, unfortunately, a little less efficient. One idea is the following:
from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()
+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+
Maybe there is a better option. I'll edit the answer if I find one.
Because I only had one non-null value for every grouping, using min / max in 1.6 worked for my purposes:
(df
    .groupby("id")
    .agg(F.min("code"),
         F.min("name"))
    .show())
+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+
The first function accepts an ignorenulls argument, which can be set to True:
Python:
df.groupby("id").agg(first(col("code"), ignorenulls=True).alias("code"))
Scala:
df.groupBy("id").agg(first(col("code"), ignoreNulls = true).alias("code"))
