Convert pyspark column to a list - apache-spark

I thought this would be easy but can't find the answer :-)
How do I convert the name column into a list? I am hoping I can get isin to work rather than a join against another dataframe column, but isin seems to require a list (if I understand correctly).
Create the dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
data2 = [
    ('George', datetime(2010, 3, 24, 3, 19, 58), 3),
    ('Sally', datetime(2009, 12, 12, 17, 21, 30), 5),
    ('Frank', datetime(2010, 11, 22, 13, 29, 40), 2),
    ('Paul', datetime(2010, 2, 8, 3, 31, 23), 8),
    ('Jesus', datetime(2009, 1, 1, 4, 19, 47), 2),
    ('Lou', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df2 = sqlContext.createDataFrame(data2, ['name', 'trial_start_time', 'purchase_time'])
df2.show(truncate=False)
Should look like:
+------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+------+-------------------+-------------+
|George|2010-03-24 07:19:58|3 |
|Sally |2009-12-12 22:21:30|5 |
|Frank |2010-11-22 18:29:40|2 |
|Paul |2010-02-08 08:31:23|8 |
|Jesus |2009-01-01 09:19:47|2 |
|Lou |2010-03-02 09:33:51|3 |
+------+-------------------+-------------+
I am not sure if collect is the closest I can come to this.
df2.select("name").collect()
[Row(name='George'),
Row(name='Sally'),
Row(name='Frank'),
Row(name='Paul'),
Row(name='Jesus'),
Row(name='Lou')]
Any suggestions on how to output the name column to a list?
It may need to look something like this:
[George, Sally, Frank, Paul, Jesus, Lou]

Use the collect_list function and then collect to get the list into a variable.
Example:
from pyspark.sql.functions import *
df2.agg(collect_list(col("name")).alias("name")).show(10,False)
#+----------------------------------------+
#|name |
#+----------------------------------------+
#|[George, Sally, Frank, Paul, Jesus, Lou]|
#+----------------------------------------+
lst=df2.agg(collect_list(col("name"))).collect()[0][0]
lst
#['George', 'Sally', 'Frank', 'Paul', 'Jesus', 'Lou']
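Since the original goal was to feed the names into isin, here is a minimal sketch of that follow-up step; the second DataFrame df1 and its name column are hypothetical stand-ins for whatever you want to filter:
from pyspark.sql.functions import col
# hypothetical DataFrame to filter against the collected names
df1 = sqlContext.createDataFrame([('Sally',), ('Bob',)], ['name'])
df1.filter(col("name").isin(lst)).show()
#+-----+
#| name|
#+-----+
#|Sally|
#+-----+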

Related

PySpark RDD: Manipulating Inner Array

I have a dataset (for example)
sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply everything in the sub-array by 2 across the RDD. Since I have already parallelized, I can't further break down "y.take(1)" to multiply [2, 3, 4, 5] by 2.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
It will be more efficient to use the DataFrame API instead of RDDs: in that case all of the processing happens without the serialization to Python that the RDD API requires.
For example, you can use the transform function to apply a transformation to the array values:
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
                           schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x*2).alias("arr"))
df2.show()
will give you the desired output:
+---+---------------+
| id| arr|
+---+---------------+
| 1| [4, 6, 8, 10]|
| 2|[4, 14, 16, 20]|
+---+---------------+
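Note that passing a Python lambda to F.transform requires Spark 3.1 or later. On Spark 2.4/3.0 you can get the same result with the SQL higher-order function via expr; a sketch under that assumption:
import pyspark.sql.functions as F
# same element-wise doubling, expressed as a SQL higher-order function
df2 = df.select("id", F.expr("transform(arr, x -> x * 2)").alias("arr"))
df2.show()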

new value is the sum of old values

I have two lists, A and B. List A contains values, and I want list B to be the running sum of the values from list A.
A = [3,5,7,8,9,12,13,20]
#Wanted result
#B = [3, 8, 15, 23,...77]
#so each new value is the sum of all the old values up to that point
# [x1, x1+x2, x1+x2+x3, ..., x1+x2+...+xn]
What methods could I use to get this? Thank you.
The easiest way IMO would be to use numpy.cumsum, to get the cumulative sum of your list:
>>> import numpy as np
>>> np.cumsum(A)
array([ 3, 8, 15, 23, 32, 44, 57, 77])
But you also could do it in a list comprehension like this:
>>> [sum(A[0:x]) for x in range(1, len(A)+1)]
[3, 8, 15, 23, 32, 44, 57, 77]
Another fun way is to use itertools.accumulate, which gives accumulated sums by default:
>>> from itertools import accumulate
>>> list(accumulate(A))
[3, 8, 15, 23, 32, 44, 57, 77]
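If you would rather avoid imports entirely, a plain loop that keeps a running total does the same thing:
>>> B = []
>>> total = 0
>>> for x in A:
...     total += x
...     B.append(total)
...
>>> B
[3, 8, 15, 23, 32, 44, 57, 77]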

Retrieving original data from PyTorch nn.Embedding

I'm passing a dataframe with 5 categories (e.g. car, bus, ...) into nn.Embedding.
When I do embedding.parameters(), I can see that there are 5 tensors, but how do I know which index corresponds to the original input (e.g. car, bus, ...)?
You can't, as tensors are unnamed (only dimensions can be named; see PyTorch's Named Tensors).
You have to keep the names in a separate data container, for example (4 categories here):
import pandas as pd
import torch
df = pd.DataFrame(
    {
        "bus": [1.0, 2, 3, 4, 5],
        "car": [6.0, 7, 8, 9, 10],
        "bike": [11.0, 12, 13, 14, 15],
        "train": [16.0, 17, 18, 19, 20],
    }
)
df_data = df.to_numpy().T
df_names = list(df)
embedding = torch.nn.Embedding(df_data.shape[0], df_data.shape[1])
embedding.weight.data = torch.from_numpy(df_data)
Now you can simply use it with any index you want:
index = 1
embedding(torch.tensor(index)), df_names[index]
This would give you (tensor([6., 7., 8., 9., 10.]), 'car'), i.e. the data and the respective column name.
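If you also need to go the other way (from a category name to its embedding row), a small name-to-index mapping built from the same df_names list is enough; a sketch:
# map each column name to its row index in the embedding weight
name_to_index = {name: i for i, name in enumerate(df_names)}
embedding(torch.tensor(name_to_index["car"]))  # the "car" embedding vector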

Pyspark: create spark data frame from nested dictionary

How do I create a Spark data frame from a nested dictionary? I'm new to Spark, and I do not want to use a pandas data frame.
My dictionary looks like:
{'prathameshsalap#gmail.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 50)},
'vaishusawant143#gmail.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 35)},
'you#example.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 55)}
}
I want to convert this dict to a Spark data frame using PySpark.
My expected output:
                           Date        idle_time
user_name
prathameshsalap#gmail.com  2019-10-21  2019-10-21 01:50:00
vaishusawant143#gmail.com  2019-10-21  2019-10-21 01:35:00
you#example.com            2019-10-21  2019-10-21 01:55:00
You need to redo your dictionary and build rows to properly infer the schema.
import datetime
from pyspark.sql import Row
data_dict = {
    'prathameshsalap#gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 50)
    },
    'vaishusawant143#gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 35)
    },
    'you#example.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 55)
    }
}
data_as_rows = [Row(**{'user_name': k, **v}) for k,v in data_dict.items()]
data_df = spark.createDataFrame(data_as_rows).select('user_name', 'Date', 'idle_time')
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|prathameshsalap#gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#gmail.com|2019-10-21|2019-10-21 01:35:00|
|you#example.com |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
Note: if you already have the schema prepared and don't need to infer, you can just supply the schema to the createDataFrame function:
import pyspark.sql.types as T
schema = T.StructType([
    T.StructField('user_name', T.StringType(), False),
    T.StructField('Date', T.DateType(), False),
    T.StructField('idle_time', T.TimestampType(), False)
])
data_as_tuples = [(k, v['Date'], v['idle_time']) for k,v in data_dict.items()]
data_df = spark.createDataFrame(data_as_tuples, schema=schema)
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|prathameshsalap#gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#gmail.com|2019-10-21|2019-10-21 01:35:00|
|you#example.com |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
Convert the dictionary to a list of tuples; each tuple will then become a row in the Spark DataFrame:
rows = []
for key, value in data.items():
    row = (key, value['Date'], value['idle_time'])
    rows.append(row)
Define schema for your data:
from pyspark.sql.types import *
sch = StructType([
    StructField('user_name', StringType()),
    StructField('date', DateType()),
    StructField('idle_time', TimestampType())
])
Create the Spark DataFrame:
df = spark.createDataFrame(rows, sch)
df.show()
+--------------------+----------+-------------------+
| user_name| date| idle_time|
+--------------------+----------+-------------------+
|prathameshsalap#g...|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#g...|2019-10-21|2019-10-21 01:35:00|
| you#example.com|2019-10-21|2019-10-21 01:55:00|
+--------------------+----------+-------------------+

Combine multiple lists (of equal length) stored in dict into a single list of list

I have this following dictionary (which essentially resembles a table):
tbl = {'col0': [20, 30, 22, 15, 24],
       'col1': [13, 15, 10, 14, 15],
       'col2': [52, 12, 14, 36, 23]}
I want to convert this to a list of lists that combines all the lists across the columns (i.e. elements with the same index become one element of the list of lists).
It should look somewhat like this:
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
It should also work for situations where my dict is something like this:
tbl = {'col0': 1.0,
       'col1': 7.0,
       'col2': 1.3}
# converted into
[[1.0, 7.0, 1.3]]
Is there a pythonic way of doing this? I basically need it to print a table structure row-wise by overriding a __str__ method for a structure which currently stores table values in dict format.
You can always use an unreadable double list comprehension!
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl[list(tbl.keys())[0]]))]
If you might have data without a length, you can use this instead (as long as all columns are the same length):
def len_checker(item):
    try:
        return len(item)
    except:
        return 0
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl[list(tbl.keys())[0]]))] if len_checker(tbl[list(tbl.keys())[0]]) else [[tbl[key] for key in tbl]]
Aren't these fun?
Things are a little cleaner if you can guarantee that the key 'col0' is in your table.
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl['col0']))] if len_checker(tbl['col0']) else [[tbl[key] for key in tbl]]
In all seriousness, though, if you want clean code you should be using something like a Pandas DataFrame.
from pandas import DataFrame
try:
    df = DataFrame(tbl)
except:
    df = DataFrame(tbl, index=[0])
my_list_of_lists = [list(df.iloc[row]) for row in range(df.shape[0])]
You can use numpy too.
import numpy as np
arr = np.vstack([np.array(tbl[key]) for key in tbl])
my_list_of_lists = [list(arr[...,col]) for col in range(arr.shape[1])]
zip is handy for this:
>>> list(zip(*tbl.values()))
[(20, 13, 52), (30, 15, 12), (22, 10, 14), (15, 14, 36), (24, 15, 23)]
For a list of lists instead of tuples, you can use a generator expression:
>>> list(list(x) for x in zip(*tbl.values()))
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
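Note that zip(*tbl.values()) assumes every value is a list; for the second case, where each value is a single float, you would need to wrap the scalars first. A sketch of a helper (hypothetical name to_rows) that handles both shapes:
>>> def to_rows(tbl):
...     # wrap scalar columns in a one-element list so zip can iterate over them
...     cols = [v if isinstance(v, list) else [v] for v in tbl.values()]
...     return [list(row) for row in zip(*cols)]
...
>>> to_rows({'col0': 1.0, 'col1': 7.0, 'col2': 1.3})
[[1.0, 7.0, 1.3]]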
