How can I use reduceByKey on an RDD? - apache-spark

I have an RDD:
[{'date': '27/07/2022', 'user': 'User_83031', 'number_of_emails': 96},
{'date': '27/07/2022', 'user': 'User_45839', 'number_of_emails': 110},
{'date': '14/12/2022', 'user': 'User_15817', 'number_of_emails': 49}]
The code is:
from pyspark import SparkContext
sc = SparkContext(appName = "app-name")
raw_data=sc.textFile("emails.txt")
def formatEmail(row):
    return {
        "date": row.split(',')[0],
        "user": row.split(',')[1],
        "number_of_emails": int(row.split(',')[2])
    }

emailsRDD = raw_data.map(lambda r: formatEmail(r))
emailsRDD.take(3)
I run into a problem when I try to use reduceByKey.
test = emailsRDD.map(lambda x: (x.get("date"), 1)) \
    .reduceByKey(lambda x, y: x + y)
test.first()
The output gives me an error:
ValueError: RDD is empty
Does anybody know why this error occurs?
I am expecting to get a paired RDD with the date as key and the number of occurrences of that key as value, like below:
('27/07/2022', 2)

Using RDDs from Python is very inefficient. Really, in 2023 you should use the DataFrame API, which is much more efficient. Plus, you get things like loading the data as a CSV file instead of manually parsing the lines.
With the DataFrame API the code will look as follows:
import pyspark.sql.functions as F
df = spark.read.csv("emails.txt", schema="date string, user string, num int")
df2 = df.groupBy("date").agg(F.sum("num"))
df2.show()
which gives you, as expected:
+----------+--------+
| date|sum(num)|
+----------+--------+
|27/04/2021| 106|
|17/08/2022| 54|
|14/12/2022| 49|
|27/07/2022| 206|
+----------+--------+
In this case you work with high-level constructs, like:
loading the data as CSV using spark.read.csv
summarizing your data for each date
Such code is much easier to read, and it's more efficient because Spark doesn't need to serialize/deserialize data between the JVM and Python.
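If the goal is specifically the number of occurrences per date (the ('27/07/2022', 2) shape from the question) rather than the sum of emails, a minimal sketch with the same df as above would be:
import pyspark.sql.functions as F

# count rows per date instead of summing number_of_emails
counts = df.groupBy("date").agg(F.count("*").alias("cnt"))
counts.show()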

Related

How do I explode this column of type array json in a pyspark dataframe?

I am trying to explode this column into multiple columns, but it seems there is an issue with the datatype even though I have specified it to be an array datatype.
This is what the column looks like:
Column_x
[[{"Key":"a","Value":"40000.0"},{"Key":"b","Value":"0.0"},{"Key":"c","Value":"0.0"},{"Key":"f","Value":"false"},{"Key":"e","Value":"ADB"},{"Key":"d","Value":"true"}]]
[[{"Key":"a","Value":"100000.0"},{"Key":"b","Value":"1.5"},{"Key":"c","Value":"1.5"},{"Key":"d","Value":"false"},{"Key":"e","Value":"Rev30"},{"Key":"f","Value":"true"},{"Key":"g","Value":"48600.0"},{"Key":"g","Value":"0.0"},{"Key":"h","Value":"0.0"}],[{"Key":"i","Value":"100000.0"},{"Key":"j","Value":"1.5"},{"Key":"k","Value":"1.5"},{"Key":"l","Value":"false"},{"Key":"m","Value":"Rev30"},{"Key":"n","Value":"true"},{"Key":"o","Value":"48600.0"},{"Key":"p","Value":"0.0"},{"Key":"q","Value":"0.0"}]]
To something like this:
Key Value
a 10000
b 200000
.
.
.
.
a 100000.0
b 1.5
This is my work so far:
from pyspark.sql.types import *

schema = ArrayType(ArrayType(StructType([StructField("Key", StringType()),
                                         StructField("Value", StringType())])))

kn_sx = kn_s\
    .withColumn("Keys", F.explode(F.from_json("Column_x", schema)))\
    .withColumn("Key", col("Keys.Key"))\
    .withColumn("Values", F.explode(F.from_json("Column_x", schema)))\
    .withColumn("Value", col("Values.Value"))\
    .drop("Values")
Here is the error:
AnalysisException: u"cannot resolve 'jsontostructs(`Column_x`)' due to data type mismatch: argument 1 requires string type, however, '`Column_x`' is of array<array<struct<Key:string,Value:string>>> type
Really appreciate the help.
Refer to this for the documentation of get_json_object:
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = spark.createDataFrame(data, ("key", "jstring"))
>>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), \
... get_json_object(df.jstring, '$.f2').alias("c1") ).collect()
[Row(key=u'1', c0=u'value1', c1=u'value2'), Row(key=u'2', c0=u'value12', c1=None)]
This is what I did to make it work:
# Took out a single array element
df = df.withColumn('Column_x', F.col('Column_x.MetaData.Parameters').getItem(0))
# can be modified for additional array elements

# Used explode on the dataframe to make it work
df = df\
    .withColumn("Keys", F.explode(F.col("Column_x")))\
    .withColumn("Key", col("Keys.Key"))\
    .withColumn("Value", col("Keys.Value"))\
    .drop("Keys")\
    .dropDuplicates()
I hope this helps anyone looking for a solution to this problem.
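For reference, a self-contained sketch of the double-explode approach on hypothetical toy data with the same array<array<struct<Key,Value>>> shape (assuming an active spark session):
import pyspark.sql.functions as F
from pyspark.sql import Row

# hypothetical toy data shaped like Column_x
toy = spark.createDataFrame([
    Row(Column_x=[[Row(Key="a", Value="40000.0"), Row(Key="b", Value="0.0")]]),
])

result = (
    toy
    .withColumn("inner", F.explode("Column_x"))  # outer array -> array of structs
    .withColumn("kv", F.explode("inner"))        # inner array -> one struct per row
    .select(F.col("kv.Key").alias("Key"), F.col("kv.Value").alias("Value"))
)
result.show()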

How to reverse and combine string columns in a spark dataframe?

I am using pyspark version 2.4 and I am trying to write a udf which takes the values of column id1 and column id2 together and returns the reversed string of it.
For example, my data looks like:
+---+---+
|id1|id2|
+---+---+
| a|one|
| b|two|
+---+---+
the corresponding code is:
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
The returned value should be
+---+---+----+
|id1|id2| val|
+---+---+----+
| a|one|enoa|
| b|two|owtb|
+---+---+----+
My code is:
@udf(string)
def reverse_value(value):
    return value[::-1]

df.withColumn('val', reverse_value(lit('id1' + 'id2')))
My errors are:
TypeError: Invalid argument, not a string or column: <function
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.
Should be:
from pyspark.sql.functions import col, concat
df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))
Explanation:
lit is a literal while you want to refer to individual columns (col).
Columns have to be concatenated using the concat function (Concatenate columns in Apache Spark DataFrame).
Additionally, it is not clear if the argument of udf is correct. It should be either:
from pyspark.sql.functions import udf

@udf
def reverse_value(value):
    ...
or
@udf("string")
def reverse_value(value):
    ...
or
from pyspark.sql.types import StringType

@udf(StringType())
def reverse_value(value):
    ...
Additionally, the stacktrace suggests that you have some other problems in your code, not reproducible with the snippet you've shared - the reverse_value seems to return a function.
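Putting the two fixes together (concatenating the columns and declaring the return type on the udf), a minimal sketch of the corrected version could look like this:
from pyspark.sql.functions import col, concat, udf

@udf("string")
def reverse_value(value):
    return value[::-1]

df.withColumn('val', reverse_value(concat(col('id1'), col('id2')))).show()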
The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.
You will achieve much better performance by using pyspark.sql.functions.reverse:
from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+

More convenient way to reproduce pyspark sample

Most of the questions about Spark use show output as the code example, without the code that generates the dataframe, like this:
df.show()
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
| 1| 1001|1265397099|
| 1| 6022|1275846679|
| 1| 1041|1265368299|
+-------+--------+----------+
How can I reproduce this code in my programming environment without rewriting it manually? Does pyspark have some equivalent of read_clipboard in pandas?
Edit
The lack of a function to import data into my environment is a big obstacle to helping others with pyspark on Stack Overflow.
So my question is:
What is the most convenient way to reproduce data pasted on Stack Overflow from the show command into my environment?
You can always use the following function:
from pyspark.sql.functions import *

def read_spark_output(file_path):
    step1 = spark.read \
                 .option("header", "true") \
                 .option("inferSchema", "true") \
                 .option("delimiter", "|") \
                 .option("parserLib", "UNIVOCITY") \
                 .option("ignoreLeadingWhiteSpace", "true") \
                 .option("ignoreTrailingWhiteSpace", "true") \
                 .option("comment", "+") \
                 .csv("file://{}".format(file_path))
    # select not-null columns
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # deal with the 'null' string in columns
    return step2.select(*[when(~col(col_name).eqNullSafe("null"), col(col_name)).alias(col_name)
                          for col_name in step2.columns])
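A hypothetical usage, assuming the show() output (including the | and + lines) was pasted into a local file such as /tmp/sample.txt:
# /tmp/sample.txt is a hypothetical path for the pasted show() output
df = read_spark_output("/tmp/sample.txt")
df.show()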
It's one of the suggestions given in the following question: How to make good reproducible Apache Spark examples.
Note 1: Sometimes there might be special cases where this does not apply for one reason or another and which can generate errors/issues, e.g. Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord").
So please use it with caution!
Note 2: (Disclaimer) I'm not the original author of the code. Thanks to @MaxU for the code. I just made some modifications to it.
Late answer, but I often face the same issue, so I wrote a small utility for this: https://github.com/ollik1/spark-clipboard
It basically allows copy-pasting data frame show strings to Spark. To install it, add the jcenter dependency com.github.ollik1:spark-clipboard_2.12:0.1 and the Spark config .config("fs.clipboard.impl", "com.github.ollik1.clipboard.ClipboardFileSystem"). After this, data frames can be read directly from the system clipboard:
val df = spark.read
  .format("com.github.ollik1.clipboard")
  .load("clipboard:///*")
or alternatively from files if you prefer. Installation details and usage are described in the README file.
You can always read the data into pandas as a pandas dataframe and then convert it back to a spark dataframe. No, there is no direct equivalent of read_clipboard in pyspark, unlike pandas.
The reason is that pandas dataframes are mostly flat structures, whereas spark dataframes can have complex structures like structs and arrays. Since Spark has a wide variety of data types and those don't appear in the console output, it is not possible to recreate the dataframe from the output alone.
You can combine pandas read_clipboard and convert the result to a pyspark dataframe:
import pandas as pd
from pyspark.sql.types import *

pdDF = pd.read_clipboard(sep=',',
                         index_col=0,
                         names=['USER_ID',
                                'location',
                                'timestamp',
                                ])

mySchema = StructType([StructField("USER_ID", StringType(), True),
                       StructField("location", LongType(), True),
                       StructField("timestamp", LongType(), True)])
# note: True implies nullable allowed

df = spark.createDataFrame(pdDF, schema=mySchema)
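As a side note, if the inferred pandas dtypes are acceptable, a minimal sketch without an explicit schema also works:
# let Spark infer the schema from the pandas dtypes
df = spark.createDataFrame(pdDF)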
Update:
What @terry really wants is to copy the ASCII table into python, and the following is an example. Once you parse the data into python, you can convert it to anything.
def parse(ascii_table):
    header = []
    data = []
    for line in filter(None, ascii_table.split('\n')):
        if '-+-' in line:
            continue
        if not header:
            # list() so this also works on Python 3, where filter is lazy
            header = list(filter(lambda x: x != '|', line.split()))
            continue
        data.append([''] * len(header))
        splitted_line = list(filter(lambda x: x != '|', line.split()))
        for i in range(len(splitted_line)):
            data[-1][i] = splitted_line[i]
    return header, data

Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
Not an exactly concise or efficient solution, but you can use UserDefinedAggregateFunction, introduced in Spark 1.5.0:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
import scala.collection.mutable.ArrayBuffer

object GroupConcat extends UserDefinedAggregateFunction {
    def inputSchema = new StructType().add("x", StringType)
    def bufferSchema = new StructType().add("buff", ArrayType(StringType))
    def dataType = StringType
    def deterministic = true

    def initialize(buffer: MutableAggregationBuffer) = {
      buffer.update(0, ArrayBuffer.empty[String])
    }

    def update(buffer: MutableAggregationBuffer, input: Row) = {
      if (!input.isNullAt(0))
        buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
    }

    def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
      buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
    }

    def evaluate(buffer: Row) = UTF8String.fromString(
      buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract the RDD, groupByKey, mkString and rebuild the DataFrame.
You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{concat_ws, collect_list}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function:
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A")
Or you can register a UDF, something like:
sqlContext.udf.register("myzip", (a: Long, b: Long) => (a + "," + b))
and use this function in the query:
sqlContext.sql("select A, collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list

friends = spark.createDataFrame(
    [
        ('jacques', 'nicolas'),
        ('jacques', 'georges'),
        ('jacques', 'francois'),
        ('bob', 'amelie'),
        ('bob', 'zoe'),
    ],
    schema=['username', 'friend'],
)

(
    friends
    .orderBy('friend', ascending=False)
    .groupBy('username')
    .agg(
        array_join(
            collect_list('friend'),
            delimiter=', ',
        ).alias('friends')
    )
    .show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)

table.groupby('username').agg(group_concat(F.col('friends')).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the Spark SQL solution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
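For illustration, a minimal end-to-end sketch of this pre-1.6 approach on toy data (column names assumed from the question):
df = sqlContext.createDataFrame(
    [("username1", "friend1"), ("username1", "friend2"), ("username2", "friend1")],
    ["username", "friend"],
)
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
sqlContext.createDataFrame(byUsername, ["username", "friends"]).show()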
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD, then grouping by and manipulating the data in the desired way, and then converting the RDD back to a DF, as follows:
val df = sc
  .parallelize(Seq(
    ("username1", "friend1"),
    ("username1", "friend2"),
    ("username2", "friend1"),
    ("username2", "friend3")))
  .toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(row => (row(0), row(1)))
  .groupByKey()
  .map { case (username: String, groupOfFriends: Iterable[String]) => (username, groupOfFriends.mkString(",")) }
  .toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is python-based code that achieves the group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F

spark = SparkSession.builder.master('yarn').getOrCreate()

# Udf to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    collect = sep.join(car_list)
    return collect

test_udf = udf(combine_cars, StringType())

# car_list_per_customer is the DataFrame holding the input data above
car_list_per_customer.groupBy("Cust_No") \
    .agg(F.collect_list("Cust_Cars").alias("car_list")) \
    .select("Cust_No", test_udf("car_list").alias("Final_List")) \
    .show(20, False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
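The snippet above assumes car_list_per_customer already exists; a hypothetical load of the sample CSV (file name assumed) might look like:
# hypothetical file name for the Cust_No,Cust_Cars sample shown above
car_list_per_customer = spark.read.csv("cust_cars.csv", header=True, inferSchema=True)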
You can also use the Spark SQL function collect_list; afterwards you will need to cast the result to string and use the function regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
It's an easier way.
The functions concat_ws() and collect_list() can be a good alternative, along with groupBy():
import pyspark.sql.functions as F

df_grp = df.groupby("agg_col").agg(
    F.concat_ws("#;", F.collect_list(df.time)).alias("time"),
    F.concat_ws("#;", F.collect_list(df.status)).alias("status"),
    F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"),
)
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

Creating a Spark DataFrame from an RDD of lists

I have an rdd (we can call it myrdd) where each record in the rdd is of the form:
[('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]
I would like to convert this into a DataFrame in pyspark - what is the easiest way to do this?
How about using the toDF method? You only need to add the field names.
df = rdd.toDF(['column', 'value'])
The answer by @dapangmao got me to this solution:
my_df = my_rdd.map(lambda l: Row(**dict(l))).toDF()
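As a quick illustration, a minimal sketch on hypothetical toy data (two records, two columns, assuming an active SparkContext sc and SparkSession):
from pyspark.sql import Row

# hypothetical toy RDD with the [('column', value), ...] record shape
toy_rdd = sc.parallelize([
    [('column1', 1), ('column2', 'a')],
    [('column1', 2), ('column2', 'b')],
])
toy_df = toy_rdd.map(lambda l: Row(**dict(l))).toDF()
toy_df.show()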
Take a look at the DataFrame documentation to make this example work for you, but this should work. I'm assuming your RDD is called my_rdd:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# You have a ton of columns and each one should be an argument to Row
# Use a dictionary comprehension to make this easier
def record_to_row(record):
    # each record is a list of ('column N', value) tuples, so keep only the values
    schema = {'column{i:d}'.format(i=col_idx + 1): record[col_idx][1] for col_idx in range(100)}
    return Row(**schema)

row_rdd = my_rdd.map(lambda x: record_to_row(x))

# Now infer the schema and you have a DataFrame
schema_my_rdd = sqlContext.inferSchema(row_rdd)

# Now you have a DataFrame you can register as a table
schema_my_rdd.registerTempTable("my_table")
I haven't worked much with DataFrames in Spark but this should do the trick
In pyspark, let's say you have a dataframe named userDF.
>>> type(userDF)
<class 'pyspark.sql.dataframe.DataFrame'>
Let's just convert it to an RDD:
userRDD = userDF.rdd
>>> type(userRDD)
<class 'pyspark.rdd.RDD'>
and now you can do some manipulations and call, for example, the map function:
newRDD = userRDD.map(lambda x:{"food":x['favorite_food'], "name":x['name']})
Finally, let's create a DataFrame from the resilient distributed dataset (RDD).
newDF = sqlContext.createDataFrame(newRDD, ["food", "name"])
>>> type(newDF)
<class 'pyspark.sql.dataframe.DataFrame'>
That's all.
I was hitting this warning message before, when I tried to call:
newDF = sc.parallelize(newRDD, ["food", "name"])
.../spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py:336: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. ")
So no need to do this anymore...
