I created an RDD from a CSV file:
lines = sc.textFile(data)
Now I need to convert lines into a key-value RDD, where the value is the field string (after splitting) and the key is its column number in the CSV.
For example, given this CSV:
Col1,Col2
73,230666
55,149610
I want rdd.take(1) to give:
[(1,73), (2, 230666)]
I create an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
and a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple(l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i], i))
    return list_tup
But I can't get the correct result when I map this function over the RDD.
You can use PySpark's create_map function, like so:
from pyspark.sql.functions import create_map, col, lit
df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(
    create_map(lit(1), col("Col1")).alias("mappedCol1"),
    create_map(lit(2), col("Col2")).alias("mappedCol2")
)
mapped_df.show()
+----------+-------------+
|mappedCol1| mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, the underlying RDD is a property of the DataFrame, so you can access it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]
I fixed the problem in this way:
def list_of_tuple(line_rdd):
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i], i))
    return list_tup
pairs_rdd = lines_data.map(lambda line: list_of_tuple(line))
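If you want (column_number, value) pairs with 1-based numbering, as in the expected output, a compact variant using enumerate could look like this (just a sketch, assuming lines_data is the RDD of raw CSV lines from above):

# Sketch: 1-based column number as key, split string as value.
# If the file still contains the header line, filter it out first.
pairs_rdd = lines_data.map(
    lambda line: [(i + 1, value) for i, value in enumerate(line.split(','))]
)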
I have a list which contains a couple of string values/field names. I also have a Spark RDD, and I'd like to iterate over the RDD and remove any field whose name exists in the list. For example:
field_list = ["name_1", "name_2"]
RDD looks like this:
[Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))]
I'm not very familiar with RDDs. I understand that I can use map() to iterate, but how can I add the condition: if a field name such as "name_1" or "name_2" exists in field_list, remove both the field and its value, so that the expected result is a new RDD that looks like:
[Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_3='F'), Row(name_3='F')])]))]
You could recreate the whole structure, but without the fields you don't need. There may be a better method, but looking at the Row documentation we see that it offers only a limited set of methods.
Inputs:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))
])
print(rdd.collect())
# [Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_1='apple', name_2='banana', name_3='F'), Row(name_1='tomato', name_2='eggplant', name_3='F')])]))]
field_list = ["name_1", "name_2"]
Script:
F4 = Row('field_4')
F3 = Row('field_3')
F2 = Row('field_1', 'field_2')

def transform(row):
    f3 = []
    for x in row['field_2']['field_3']:
        f4 = []
        for y in x['field_4']:
            Names = Row(*(set(y.asDict()) - set(field_list)))
            f4.append(Names(*[y[n] for n in Names]))
        f3.append(F4(f4))
    return F2(row['field_1'], F3(f3))

rdd = rdd.map(transform)
print(rdd.collect())
# [Row(field_1=1, field_2=Row(field_3=[Row(field_4=[Row(name_3='F'), Row(name_3='F')])]))]
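For comparison, a rough alternative sketch (not necessarily better) that converts each Row to plain dicts with asDict(recursive=True) and rebuilds only the wanted fields; it assumes the rdd and field_list exactly as defined under Inputs above:

# Rough sketch; apply it to the rdd as defined under Inputs (before the reassignment above).
def drop_fields(row):
    d = row.asDict(recursive=True)  # nested Rows (even inside lists) become dicts
    f3 = []
    for item in d['field_2']['field_3']:
        f4 = [Row(**{k: v for k, v in inner.items() if k not in field_list})
              for inner in item['field_4']]
        f3.append(Row(field_4=f4))
    return Row(field_1=d['field_1'], field_2=Row(field_3=f3))

result = rdd.map(drop_fields)  # rdd here = the original input RDD
print(result.collect())
# same result as above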
I have the following data frame
+--------------------+----------------------------------------------------------------------------+---------+
|user_id             |map_data                                                                    |key_field|
+--------------------+----------------------------------------------------------------------------+---------+
|VG1uTie2pzg5E89148k9|[2.0 -> [11.0, another_val_for_key_2], 1.0 -> [22.0, another_val_for_key_1]]|1        |
+--------------------+----------------------------------------------------------------------------+---------+
and the following case class
case class A(d:Double, str: String)
map_data is a column of type Map[Double, A]
I am trying to create a new column that is based on the map_data column and the key_field column.
Something in the form of
df
  .withColumn("value_from_map",
    col("map_data").getItem(col("key_field").cast(DoubleType)).getItem("str"))
When I use a hardcoded key it works, for example:
df
  .withColumn("value_from_map",
    col("map_data").getItem(2).getItem("str"))
so I'm not sure what I am missing
Managed to solve it with a UDF:
val extract = udf((key: Int, map: Map[Double, GenericRowWithSchema]) =>
  map(key).getAs[String]("str")
)
...
.withColumn("value_from_map", extract(col("key_field"), col("map_data")))
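For what it's worth, on Spark 2.4+ the built-in element_at accepts a column as the lookup key, so a UDF-free variant is possible. A sketch, assuming the column names from the question and a map keyed by Double:

import org.apache.spark.sql.functions.{col, element_at}

// Sketch (Spark 2.4+): look the struct up by the key_field column, then take its "str" field.
df.withColumn(
  "value_from_map",
  element_at(col("map_data"), col("key_field").cast("double")).getField("str")
)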
I have an RDD with key-value pairs inside a list:
rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
       ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
       ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
I have to sum the values for each key in the list of each record. I tried as below, but it didn't go through, since mapValues won't allow that addition on lists.
newRDD = rdd.groupByKey().map(lambda x : (x[0],list(x[1].mapValues(sum))))
My expected result is as below:
[('12583', ('536370', 11.25)),
 ('17850', ('536365', 8.69)),
 ('13047', ('536367', 3.79), ('536368', 9.9), ('536369', 5.95))]
You can define a list aggregation function using collections.defaultdict:
def agg_list(lst):
    from collections import defaultdict
    agg = defaultdict(lambda: 0)
    for k, v in lst:
        agg[k] += v
    return list(agg.items())
And then map it over the rdd:
rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)],
# ['17850', ('536365', 8.69)],
# ['13047', ('536367', 3.79), ('536369', 5.95), ('536368', 9.9)]]
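An alternative that stays entirely within RDD transformations (a sketch, assuming rdd is an actual RDD holding the pairs shown in the question): flatten to composite keys, sum with reduceByKey, then regroup:

# Sketch: ((customer, item), price) -> summed per pair -> (customer, [(item, total), ...])
summed = (
    rdd.flatMap(lambda kv: [((kv[0], k), v) for k, v in kv[1]])
       .reduceByKey(lambda a, b: a + b)
       .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
       .groupByKey()
       .mapValues(list)
)
summed.collect()
# e.g. [('12583', [('536370', 11.25)]), ('17850', [('536365', 8.69)]), ...]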
Hi, I have a Spark data frame which prints like this (single row):
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row I have wrapped arrays. I want to flatten them and create a dataframe with a single value from each array per row; for example, the row above should be transformed into something like this:
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So I get a dataframe with 2 rows instead of 1, and each corresponding element from the wrapped arrays goes into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
my output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]
Definitely not the best solution, but this would work:
case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)

val data = Seq(TestFormat("abc", Seq("11918", "1233"),
                          Seq("46734", "1234"), "1487530800317")).toDS

val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)

data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
  .show
The problem is that by default you cannot be sure that both Sequences are of equal length.
Probably a better solution would be to reformat the whole data frame into a structure that models the data, e.g.
root
-- a
-- d
-- records
---- b
---- c
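On Spark 2.4+ the built-in arrays_zip can replace the zip UDF entirely; as far as I know it also pads the shorter array with nulls instead of silently dropping elements. A sketch, reusing the data dataset from above:

import org.apache.spark.sql.functions.{arrays_zip, explode}

// Sketch (Spark 2.4+): arrays_zip keeps the original column names as struct field names.
data.select($"a", explode(arrays_zip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp.b" as "b", $"tmp.c" as "c", $"d")
  .show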
Thanks for answering, @swebbo, your answer helped me get this done:
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
  case ((a, b), c) => (a, b, c)
})

val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
  $"col1", $"col2",
  $"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)
I'm looking for a way to apply a function to an RDD using PySpark and put the result in a new column. With DataFrames, it looks easy:
Given:
rdd = sc.parallelize([(u'1751940903', u'2014-06-19', '2016-10-19'), (u'_guid_VubEgxvPPSIb7W5caP-lXg==', u'2014-09-10', '2016-10-19')])
My code can look like this:
df= rdd.toDF(['gigya', 'inscription','d_date'])
df.show()
+--------------------+-------------------------+----------+
| gigya| inscription| d_date|
+--------------------+-------------------------+----------+
| 1751940903| 2014-06-19|2016-10-19|
|_guid_VubEgxvPPSI...| 2014-09-10|2016-10-19|
+--------------------+-------------------------+----------+
Then:
from datetime import datetime
from pyspark.sql.functions import split, udf, col

get_period_day = udf(lambda item: datetime.strptime(item, "%Y-%m-%d").timetuple().tm_yday)
df.select('d_date', 'gigya', 'inscription', get_period_day(col('d_date')).alias('period_day')).show()
+----------+--------------------+-------------------------+----------+
| d_date| gigya|inscription_service_6Play|period_day|
+----------+--------------------+-------------------------+----------+
|2016-10-19| 1751940903| 2014-06-19| 293|
|2016-10-19|_guid_VubEgxvPPSI...| 2014-09-10| 293|
+----------+--------------------+-------------------------+----------+
Is there a way to do the same thing without converting my RDD to a DataFrame? Something with map, for example?
This code only gives me part of the expected result:
rdd.map(lambda x: datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday).cache().collect()
Help?
Try:
rdd.map(lambda x:
    x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, ))
or:
def g(x):
    return x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, )

rdd.map(g)
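A quick usage sketch (assuming the rdd from the question and from datetime import datetime). Note that the DataFrame example computed period_day from d_date, which is index 2 in the tuples, so use x[2] there if that is the intent:

from datetime import datetime

# Sketch: append the day-of-year of d_date (index 2) to each tuple,
# then optionally go back to a DataFrame for display.
result = rdd.map(
    lambda x: x + (datetime.strptime(x[2], '%Y-%m-%d').timetuple().tm_yday, )
)
result.toDF(['gigya', 'inscription', 'd_date', 'period_day']).show()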