I created an RDD from a CSV file:
lines = sc.textFile(data)
Now I need to convert lines to a key-value RDD,
where the value is the string (after splitting) and the key is the column number in the CSV.
for example CSV
Col 1
I want to get rdd.take(1):
[(1,73), (2, 230666)]
I create an RDD of lists:
lines_of_list = lines.map(lambda line: line.split(','))
I create a function that takes a list and returns a list of tuples (key, value):
def list_of_tuple (l):
list_tup = []
for i in range(len(l[0])):
But I can’t get the correct result when I try to map this function on RDD

You can use PySpark's create_map function for this, like so:
from pyspark.sql.functions import create_map, col, lit
df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(create_map(lit(1), col("Col1")).alias("mappedCol1"), create_map(lit(2), col("Col2")).alias("mappedCol2"))
|mappedCol1| mappedCol2|
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
If you still want to use the RDD API, it is available as a property of the DataFrame (.rdd), so you can use it like so:
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]

I fixed the problem in this way:
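def list_of_tuple(line_rdd):
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        # key is the 1-based column number, value is the field as a string
        list_tup.append((i + 1, l[i]))
    return list_tup
pairs_rdd = lines.map(lambda line: list_of_tuple(line))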
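For reference, the same idea can be written more compactly with enumerate. This is only a sketch that runs on a plain Python list standing in for the RDD; the `sep` parameter and the `sample_lines` data (borrowed from the values used in the answer above) are assumptions for illustration. On a real RDD, `lines.map(list_of_tuple)` would give one list of pairs per line, while `lines.flatMap(list_of_tuple)` would give a flat RDD of (key, value) pairs.

```python
def list_of_tuple(line, sep=","):
    """Turn one CSV line into [(column_number, value), ...], numbering columns from 1."""
    return [(i, value) for i, value in enumerate(line.split(sep), start=1)]


# Plain list standing in for the RDD's lines (hypothetical sample data).
sample_lines = ["73,230666", "55,149610"]

# Equivalent of lines.map(list_of_tuple), evaluated locally.
pairs = [list_of_tuple(line) for line in sample_lines]
print(pairs[0])  # [(1, '73'), (2, '230666')]
```

Note that the values stay strings, as in the question; wrap `value` in int() if numbers are needed.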


How to iterate a RDD and remove the field if it exist in a list using PySpark

I have a list that contains a couple of field names (strings). I also have a Spark RDD, and I'd like to iterate over the RDD and remove any field whose name exists in the list. For example:
field_list = ["name_1", "name_2"]
RDD looks like this:
[Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))]
I'm not very familiar with RDDs. I understand that I can use map() to iterate, but how can I add a condition so that, whenever a field such as "name_1" or "name_2" appears in field_list, both the field and its value are removed? The expected result is a new RDD that looks like:
[Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_3='F'), Row(name_3='F')])]))]
You could recreate the whole structure without the fields you don't need. There may be a better method, but the Row documentation shows that Row offers few methods to work with.
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
    Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))
])
rdd.collect()
# [Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))]
field_list = ["name_1", "name_2"]
F4 = Row('field_4')
F3 = Row('field_3')
F2 = Row('field_1', 'field_2')
def transform(row):
    f3 = []
    for x in row['field_2']['field_3']:
        f4 = []
        for y in x['field_4']:
            # keep only the field names that are not in field_list
            Names = Row(*(set(y.asDict()) - set(field_list)))
            f4.append(Names(*[y[n] for n in Names]))
        f3.append(F4(f4))
    return F2(row['field_1'], F3(f3))
rdd = rdd.map(transform)
rdd.collect()
# [Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_3='F'), Row(name_3='F')])]))]
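If the nesting depth is not fixed, a recursive helper may be easier than rebuilding each level by hand. The sketch below is an assumption-laden alternative, not the method from the answer: it works on plain dicts and lists, such as those you would get by first calling `row.asDict(recursive=True)` on each Row, and it returns dicts rather than Rows. The name `drop_fields` is made up for this example.

```python
def drop_fields(obj, names):
    """Return a copy of a nested dict/list structure without the keys in `names`."""
    if isinstance(obj, dict):
        return {k: drop_fields(v, names) for k, v in obj.items() if k not in names}
    if isinstance(obj, list):
        return [drop_fields(item, names) for item in obj]
    return obj  # scalars are returned unchanged


field_list = ["name_1", "name_2"]

# Dict form of the Row from the question (what asDict(recursive=True) would produce).
record = {
    "field_1": 1,
    "field_2": {
        "field_3": [
            {"field_4": [
                {"name_1": "apple", "name_2": "banana", "name_3": "F"},
                {"name_1": "tomato", "name_2": "eggplant", "name_3": "F"},
            ]}
        ]
    },
}

cleaned = drop_fields(record, set(field_list))
print(cleaned)
# {'field_1': 1, 'field_2': {'field_3': [{'field_4': [{'name_3': 'F'}, {'name_3': 'F'}]}]}}
```

On an RDD this could be applied as `rdd.map(lambda row: drop_fields(row.asDict(recursive=True), set(field_list)))`, at the cost of getting dicts back instead of Row objects.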

Spark: extract value from map based on another column

I have the following data frame
|user_id |map_data |key_field |
|VG1uTie2pzg5E89148k9|[2.0 -> [11.0, another_val_for_key_2], 1.0 -> [22.0, another_val_for_key_1]] |1 |
and the following case class
case class A(d:Double, str: String)
map_data is a column of type Map[Double, A]
I am trying to create a new column that is based on the map_data column and the key_field column.
Something in the form of
When I'm using hardcoded key it works, for example:
so I'm not sure what I am missing
I managed to solve it with a UDF:
val extract = udf( (key: Int, map: Map[Double, GenericRowWithSchema]) =>
.withColumn("value_from_map", extract(col("key_field"), col("map_data")))

How to sum values of Key value pairs inside a List in PySpark

I have an RDD with key-value pairs inside a list:
rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
I have to add up the values for each key in the list of each record. I tried the following, but it didn't work, as mapValues won't allow that addition on lists.
newRDD = rdd.groupByKey().map(lambda x : (x[0],list(x[1].mapValues(sum))))
My expected result is as below:
[('12583', ('536370', 11.25)),
('17850', ('536365', 8.39)),
('13047', ('536367', 3.79),('536368', 9.9), ('536368', 10.9))]
You can define a list aggregation function using collections.defaultdict:
def agg_list(lst):
    from collections import defaultdict
    agg = defaultdict(lambda : 0)
    for k, v in lst:
        agg[k] += v
    return list(agg.items())
And then map it over the rdd:
rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)],
# ['17850', ('536365', 8.69)],
# ['13047', ('536367', 3.79), ('536369', 5.95), ('536368', 9.9)]]
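To check the aggregation without a cluster, the same logic can be run on the sample data as a plain Python list. This is a local sketch only; `records` stands in for the RDD, and the rounding is added here just to keep float noise out of the printout. On a real pair RDD, `rdd.mapValues(agg_list)` would produce the same (key, list) shape. Note that the total for '17850' comes out as 8.69, not the 8.39 given in the question.

```python
from collections import defaultdict


def agg_list(pairs):
    """Sum the values per key in a list of (key, value) pairs."""
    totals = defaultdict(float)
    for k, v in pairs:
        totals[k] += v
    return list(totals.items())


# Sample data from the question, as a plain list standing in for the RDD.
records = [
    ('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
    ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
    ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95),
               ('536368', 4.95), ('536369', 5.95)]),
]

# Local equivalent of rdd.mapValues(agg_list).collect()
result = [(key, agg_list(values)) for key, values in records]
for key, sums in result:
    print(key, [(k, round(v, 2)) for k, v in sums])
# 12583 [('536370', 11.25)]
# 17850 [('536365', 8.69)]
# 13047 [('536367', 3.79), ('536368', 9.9), ('536369', 5.95)]
```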

How to Flatten spark dataframe Row to multiple Dataframe Rows

Hi, I have a Spark data frame which prints like this (single row):
So inside a row I have wrapped arrays. I want to flatten them and create a data frame which has a single value in place of each array; for example, the row above should be transformed into something like this:
So I get a data frame with 2 rows instead of 1: each corresponding element from the wrapped arrays should go into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
My output should be:
Definitely not the best solution, but this would work:
case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
Seq("46734","1234"), "1487530800317")).toDS
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
The problem is that by default you cannot be sure that both Sequences are of equal length.
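To make the length concern concrete, here is a small illustration in Python (used only for brevity; Scala's zip likewise stops at the shorter collection). The shortened list `c` and the `zip_strict` helper are made up for this sketch.

```python
from itertools import zip_longest

b = ["11918", "1233"]
c = ["46734"]  # deliberately one element short

# Plain zip silently drops the unmatched element.
print(list(zip(b, c)))          # [('11918', '46734')]

# zip_longest keeps it and pads the gap with None.
print(list(zip_longest(b, c)))  # [('11918', '46734'), ('1233', None)]


def zip_strict(*seqs):
    """Zip sequences, but fail loudly if their lengths differ."""
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length: %s" % sorted(lengths))
    return list(zip(*seqs))
```

Which behaviour is right depends on the data: truncating hides bad records, padding keeps them visible as nulls, and failing stops the job early.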
A probably better solution would be to restructure the whole data frame so that it models the data, e.g.
-- a
-- d
-- records
---- b
---- c
Thanks for answering, @swebbo. Your answer helped me get this done.
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
  case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
Hope that is understandable :)

How to use a function over an RDD and get new column (Pyspark)?

I'm looking for a way to apply a function to an RDD using PySpark and put the result in a new column. With DataFrames, it looks easy :
Given :
rdd = sc.parallelize([(u'1751940903', u'2014-06-19', '2016-10-19'), (u'_guid_VubEgxvPPSIb7W5caP-lXg==', u'2014-09-10', '2016-10-19')])
My code can look like this :
df= rdd.toDF(['gigya', 'inscription','d_date'])
| gigya| inscription| d_date|
| 1751940903| 2014-06-19|2016-10-19|
|_guid_VubEgxvPPSI...| 2014-09-10|2016-10-19|
Then :
from pyspark.sql.functions import split, udf, col
from datetime import datetime
get_period_day = udf(lambda item : datetime.strptime(item, "%Y-%m-%d").timetuple().tm_yday)
df.select('d_date', 'gigya', 'inscription', get_period_day(col('d_date')).alias('period_day')).show()
| d_date| gigya|inscription|period_day|
|2016-10-19| 1751940903| 2014-06-19| 293|
|2016-10-19|_guid_VubEgxvPPSI...| 2014-09-10| 293|
Is there a way to do the same thing without converting my RDD to a DataFrame? Something with map, for example.
This code gives me only part of the expected result:
rdd.map(lambda x: datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday).cache().collect()
Help ?
Try:
rdd.map(lambda x:
    x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, ))
or:
def g(x):
    return x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, )
rdd.map(g)
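One thing to watch: x[1] is the inscription date, whereas the DataFrame version in the question computes the day of year from d_date, which sits at index 2. Below is a small local sketch of the tuple-appending approach with the index made explicit; `add_period_day` and its `date_index` parameter are names invented for this example, and the list `rows` stands in for the RDD.

```python
from datetime import datetime


def add_period_day(record, date_index=2):
    """Append the day of year of record[date_index] to the tuple."""
    day_of_year = datetime.strptime(record[date_index], "%Y-%m-%d").timetuple().tm_yday
    return record + (day_of_year,)


# Sample data from the question, as a plain list standing in for the RDD.
rows = [
    (u'1751940903', u'2014-06-19', '2016-10-19'),
    (u'_guid_VubEgxvPPSIb7W5caP-lXg==', u'2014-09-10', '2016-10-19'),
]

# Local equivalent of rdd.map(add_period_day).collect()
result = [add_period_day(r) for r in rows]
print(result[0])  # ('1751940903', '2014-06-19', '2016-10-19', 293)
```

With date_index=1 the first record would get 170 (19 June 2014) instead of the 293 shown in the DataFrame output.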
