Pyspark Dataframe Lambda Map Function of SQL Query - apache-spark

Suppose we have a pyspark.sql.dataframe.DataFrame object:
df = sc.parallelize([['John', 'male', 26],
                     ['Teresa', 'female', 25],
                     ['Jacob', 'male', 6]]).toDF(['name', 'gender', 'age'])
I have a function that runs an SQL query for each row of the DataFrame:
def getInfo(data):
    param_name = data['name']
    param_gender = data['gender']
    param_age = data['age']
    sql_query = "SELECT * FROM people_info WHERE name = '{0}' AND gender = '{1}' AND age = {2}".format(param_name, param_gender, param_age)
    info = info.append(spark.sql(sql_query))
    return info
I am trying to run the function on each row using map:
df_info = df.rdd.map(lambda x: getInfo(x))
I got this error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

The error message is telling you exactly what is wrong. Your function is trying to access the SparkContext (spark.sql(sql_query)) from inside a transformation (df.rdd.map(lambda x: getInfo(x))).
Here's what I think you are trying to do:
df = sc.parallelize([['John', 'male', 26],
                     ['Teresa', 'female', 25],
                     ['Jacob', 'male', 6]]).toDF(['name', 'gender', 'age'])
people = spark.table("people_info")
people.join(df, on=[people.name == df.name, people.gender == df.gender, people.age == df.age], how="inner")
Here are a couple of other ways to do a join.
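For example, one alternative is to register df as a temporary view and express the same join in Spark SQL; a minimal sketch, assuming the people_info table is visible to spark.sql:

# Sketch: the same lookup expressed as a Spark SQL join.
df.createOrReplaceTempView("lookup")
matched = spark.sql("""
    SELECT p.*
    FROM people_info p
    JOIN lookup l
      ON p.name = l.name AND p.gender = l.gender AND p.age = l.age
""")
matched.show()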

Related

How to create a list of non-empty dataframe names?

I have created a list of dataframes and want to loop through the list to run some manipulation on some of those dataframes. Note that although I have created this list manually, these dataframes exist in my code.
df_list = [df_1, df_2, df_3, df_4, df_5, ...]
list_df_matching = []
list_non_matching = []
Most of these dataframes are blank, but two of them will have some records in them. I want to find the names of those dataframes and create a new list, list_non_matching.
for df_name in df_list:
    q = df_name.count()
    if q > 0:
        list_non_matching.append(df_name)
    else:
        list_df_matching.append(df_name)
My goal is to get a list of dataframe names like [df_4, df_10], but I am getting the following:
[DataFrame[id: string, nbr: string, name: string, code1: string, code2: string],
DataFrame[id: string, nbr: string, name: string, code3: string, code4: string]]
Is the list approach incorrect? Is there a better way of doing it?
Here is an example to illustrate one way to do it with the help of the empty property and the Python built-in function globals:
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame({"col1": [2, 4], "col2": [5, 9]})
df3 = pd.DataFrame(columns = ["col1", "col2"])
df4 = pd.DataFrame({"col1": [3, 8], "col2": [2, 0]})
df5 = pd.DataFrame({"col1": [], "col2": []})
df_list = [df1, df2, df3, df4, df5]
list_non_matching = [
    name
    for df in df_list
    for name in globals()
    if not df.empty and globals()[name] is df
]
print(list_non_matching)
# Output
['df2', 'df4']
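If these are Spark DataFrames rather than pandas ones (Spark DataFrames have no empty property), a similar result can be obtained without relying on globals() by keeping the names in a dict. A minimal sketch, assuming df_1 through df_5 already exist:

# Sketch: track the names explicitly instead of looking them up in globals().
named_dfs = {"df_1": df_1, "df_2": df_2, "df_3": df_3, "df_4": df_4, "df_5": df_5}

list_non_matching = []
list_df_matching = []
for name, sdf in named_dfs.items():
    # head(1) avoids a full count just to test for emptiness
    if len(sdf.head(1)) > 0:
        list_non_matching.append(name)
    else:
        list_df_matching.append(name)

print(list_non_matching)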

pyspark rdd taking the max frequency with the least age

I have an rdd like the following:
[{'age': 2.18430371791803,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {'age': 2.80033330216659,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {'age': 2.8222365762732,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {...}]
I am trying to reduce each id to just 1 record by taking the highest frequency code using code like:
rdd.map(lambda x: (x["id"], [(x["age"], x["code"])]))\
   .reduceByKey(lambda x, y: x + y)\
   .map(lambda x: [i[1] for i in x[1]])\
   .map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
There is one problem with this implementation: it doesn't consider age, so if, for example, one id had multiple codes with a frequency of 2, it would take the last code.
To illustrate this issue, please consider this reduced id:
(u'"000PZ7S2G"',
[(4.3218651186303, u'"388.400000"'),
(4.34924421126357, u'"388.400000"'),
(4.3218651186303, u'"389.900000"'),
(4.34924421126357, u'"389.900000"'),
(13.3667102491139, u'"794.310000"'),
(5.99897016368982, u'"995.300000"'),
(6.02634923989903, u'"995.300000"'),
(4.3218651186303, u'"V72.19"'),
(4.34924421126357, u'"V72.19"'),
(13.3639723398581, u'"V81.2"'),
(13.3667102491139, u'"V81.2"')])
my code would output:
[(2, u'"V81.2"')]
when I would like for it to output:
[(2, u'"388.400000"')]
because although the frequency is the same for both of these codes, code 388.400000 has a lesser age and appears first.
By adding this line after the .reduceByKey():
.map(lambda x: (x[0], [i for i in x[1] if i[0] == min(x[1])[0]]))
I'm able to filter out those with an age greater than the minimum, but then I'm only considering those with the minimum age, and not all codes, when calculating their frequency. I can't apply the same or similar logic after [max(zip((x.count(i) for i in set(x)), set(x)))], as set(x) is the set of x[1], which doesn't consider the age.
I should add, I don't want to just take the first code with the highest frequency, I'd like to take the highest frequency code with the least age, or the code that appears first, if this is possible, using only rdd actions.
The equivalent SQL of what I'm trying to get would be something like:
SELECT code, COUNT(*) AS code_frequency
FROM (SELECT id, code, age
      FROM (SELECT id, code, MIN(age) AS age, COUNT(*) AS cnt,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY COUNT(*) DESC, MIN(age)) AS seqnum
            FROM tbl
            GROUP BY id, code
           ) t
      WHERE seqnum = 1) a
GROUP BY code
ORDER BY code_frequency DESC
LIMIT 5;
and as a DF (though trying to avoid this):
wc = Window().partitionBy("id", "code").orderBy("age")
wc2 = Window().partitionBy("id")
df = rdd.toDF()
df = df.withColumn("count", F.count("code").over(wc))\
       .withColumn("max", F.max("count").over(wc2))\
       .filter("count = max")\
       .groupBy("id").agg(F.first("age").alias("age"),
                          F.first("code").alias("code"))\
       .orderBy("id")\
       .groupBy("code")\
       .count()\
       .orderBy("count", ascending=False)
I'd really appreciate any help with this.
Based on the SQL equivalent of your code, I converted the logic into the following rdd1 plus some post-processing (starting from the original RDD):
rdd = sc.parallelize([{'age': 4.3218651186303, 'code': '"388.400000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"388.400000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.3218651186303, 'code': '"389.900000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"389.900000"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3667102491139, 'code': '"794.310000"', 'id': '"000PZ7S2G"'},
                      {'age': 5.99897016368982, 'code': '"995.300000"', 'id': '"000PZ7S2G"'},
                      {'age': 6.02634923989903, 'code': '"995.300000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.3218651186303, 'code': '"V72.19"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"V72.19"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3639723398581, 'code': '"V81.2"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3667102491139, 'code': '"V81.2"', 'id': '"000PZ7S2G"'}])
rdd1 = rdd.map(lambda x: ((x['id'], x['code']), (x['age'], 1))) \
          .reduceByKey(lambda x, y: (min(x[0], y[0]), x[1] + y[1])) \
          .map(lambda x: (x[0][0], (-x[1][1], x[1][0], x[0][1]))) \
          .reduceByKey(lambda x, y: x if x < y else y)
# [('"000PZ7S2G"', (-2, 4.3218651186303, '"388.400000"'))]
Where:
(1) use map to initialize the pair-RDD with key=(x['id'], x['code']), value=(x['age'], 1)
(2) use reduceByKey to calculate min_age and count
(3) use map to reset the pair-RDD with key=id and value=(-count, min_age, code)
(4) use reduceByKey to find the min value of tuples (-count, min_age, code) for the same id
The above steps are similar to:
Step (1) + (2): groupby('id', 'code').agg(min('age'), count())
Step (3) + (4): groupby('id').agg(min(struct(negative('count'),'min_age','code')))
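For comparison, here is a minimal DataFrame sketch of those four steps (an illustration only, using F = pyspark.sql.functions and the rdd defined above):

import pyspark.sql.functions as F

df_codes = rdd.toDF()

# steps (1) + (2): per (id, code), compute min(age) and the count
per_code = df_codes.groupBy('id', 'code').agg(F.min('age').alias('min_age'),
                                              F.count(F.lit(1)).alias('cnt'))

# steps (3) + (4): per id, keep the code with the largest cnt, ties broken by the smallest min_age
best = per_code.groupBy('id').agg(
    F.min(F.struct((-F.col('cnt')).alias('neg_cnt'), 'min_age', 'code')).alias('best'))

best.select('id', 'best.min_age', 'best.code').show()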
You can then get the derived table a in your SQL by doing rdd1.map(lambda x: (x[0], x[1][2], x[1][1])), but this step is not necessary. The code can be counted directly from the above rdd1 with another map function plus the countByKey() method, and the result can then be sorted:
sorted(rdd1.map(lambda x: (x[1][2],1)).countByKey().items(), key=lambda y: -y[1])
# [('"388.400000"', 1)]
However, if what you are looking for is the sum(count) across all ids, then do the following:
rdd1.map(lambda x: (x[1][2],-x[1][0])).reduceByKey(lambda x,y: x+y).collect()
# [('"388.400000"', 2)]
If converting the rdd to a dataframe is an option, I think this approach may solve your problem:
from pyspark.sql.functions import row_number, col
from pyspark.sql import Window
df = rdd.toDF()
w = Window.partitionBy('id').orderBy('age')
df = df.withColumn('row_number', row_number().over(w)).where(col('row_number') == 1).drop('row_number')
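Note that, as written, the window orders by age alone, so row_number() == 1 keeps the earliest record per id regardless of how often its code occurs. A sketch closer to the ROW_NUMBER ordering in the question's SQL (count descending, then minimum age), reusing the imports above, could look like this; it is an illustration, not part of the original answer:

from pyspark.sql import functions as F

df2 = rdd.toDF()

# per-(id, code) frequency and minimum age, mirroring the inner GROUP BY in the SQL
counts = df2.groupBy('id', 'code').agg(F.count(F.lit(1)).alias('cnt'),
                                       F.min('age').alias('min_age'))

# rank codes within each id: highest frequency first, ties broken by lowest age
w2 = Window.partitionBy('id').orderBy(F.desc('cnt'), F.asc('min_age'))
top_codes = counts.withColumn('rn', row_number().over(w2)).where(col('rn') == 1).drop('rn')
top_codes.show()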

How to appropriately test Pandas dtypes within dataframes?

Objective: to create a function that can match given dtypes to a predefined data type scenario.
Description: I want to be able to classify given datasets into predefined scenario types based on their attributes.
Below are two example datasets (df_a and df_b). df_a has only dtypes that are equal to 'object' while df_b has both 'object' and 'int64':
# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns=['Name', 'Color'])
df_a['Color'] = df_a['Color'].astype('object')
# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns=['Name', 'Age'])
I want to be able to determine automatically which scenario it is based on a function:
import pandas as pd
import numpy as np
def scenario(data):
    if data.dtypes.str.contains('object'):
        return scenario_a
    if data.dtypes.str.contatin('object', 'int64'):
        return scenario_b
Above is what I have so far, but it isn't getting the results I was hoping for.
When calling scenario(df_a) I am looking for the result to be scenario_a, and when I pass df_b I am looking for the function to correctly determine which scenario it should be.
Any help would be appreciated.
Here is one approach. Create a dict scenarios, with each key a sorted tuple of predefined dtypes and the value being whatever you want the function to return.
Using your example, something like:
# scenario a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns = ['Name','Color'])
df_a['Color'] = df_a['Color'].astype('object')
# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns = ['Name', 'Age'])
scenario_a = tuple(sorted(df_a.dtypes.unique()))
scenario_b = tuple(sorted(df_b.dtypes.unique()))
scenarios = {
    scenario_a: 'scenario_a',
    scenario_b: 'scenario_b'
}
print(scenarios)
# scenarios:
# {(dtype('O'),): 'scenario_a', (dtype('int64'), dtype('O')): 'scenario_b'}
def scenario(data):
    dtypes = tuple(sorted(data.dtypes.unique()))
    return scenarios.get(dtypes, None)
scenario(df_a)
# 'scenario_a'
scenario(df_b)
# 'scenario_b'
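A variant of the same idea (a sketch, not part of the original answer) keys the dict on dtype names as strings, which can make the mapping a bit easier to read:

# Sketch: same lookup, but keyed on sorted dtype-name strings such as 'int64' and 'object'.
scenarios_by_name = {
    ('object',): 'scenario_a',
    ('int64', 'object'): 'scenario_b',
}

def scenario_by_name(data):
    key = tuple(sorted(data.dtypes.astype(str).unique()))
    return scenarios_by_name.get(key, None)

print(scenario_by_name(df_a))  # 'scenario_a'
print(scenario_by_name(df_b))  # 'scenario_b'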

Python dictionary to sqlite database

I'm trying to write a dictionary into an existing SQL database, but without success; it gives me:
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
Based on my minimal example, does anybody have some useful hints? (Python 3)
Command to create the empty db3 anywhere on your machine:
CREATE TABLE "testTable" (
    sID INTEGER NOT NULL UNIQUE PRIMARY KEY,
    colA REAL,
    colB TEXT,
    colC INTEGER);
And the code for putting my dictionary into the database looks like:
import sqlite3

def main():
    path = '***anywhere***/test.db3'
    data = {'sID': [1, 2, 3],
            'colA': [0.3, 0.4, 0.5],
            'colB': ['A', 'B', 'C'],
            'colC': [4, 5, 6]}
    db = sqlite3.connect(path)
    c = db.cursor()
    writeDict2Table(c, 'testTable', data)
    db.commit()
    db.close()
    return

def writeDict2Table(cursor, tablename, dictionary):
    qmarks = ', '.join('?' * len(dictionary))
    cols = ', '.join(dictionary.keys())
    values = tuple(dictionary.values())
    query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, cols, qmarks)
    cursor.execute(query, values)
    return

if __name__ == "__main__":
    main()
I have already had a look at
Python : How to insert a dictionary to a sqlite database?
but unfortunately I did not succeed.
You must not use a dictionary with question marks as parameter markers, because there is no guarantee about the order of the values.
To handle multiple rows, you must use executemany().
And executemany() expects each item to contain the values for one row, so you have to rearrange the data:
>>> print(*zip(data['sID'], data['colA'], data['colB'], data['colC']), sep='\n')
(1, 0.3, 'A', 4)
(2, 0.4, 'B', 5)
(3, 0.5, 'C', 6)
cursor.executemany(query, zip(data['sID'], data['colA'], data['colB'], data['colC']))
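Putting that together, a minimal sketch of how writeDict2Table could be adapted (assuming the dict's keys match the table's columns and all value lists have the same length):

def writeDict2Table(cursor, tablename, dictionary):
    # build "col1, col2, ..." and the matching "?, ?, ..." placeholders
    cols = ', '.join(dictionary.keys())
    qmarks = ', '.join('?' * len(dictionary))
    query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, cols, qmarks)
    # zip(*values) turns the column lists into one tuple per row: (1, 0.3, 'A', 4), ...
    cursor.executemany(query, zip(*dictionary.values()))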

pyspark convert transactions into a list of list

I want to use PrefixSpan sequence mining in pyspark. The format of data that I need to have is the following:
[[['a', 'b'], ['c']], [['a'], ['c', 'b'], ['a', 'b']], [['a', 'b'], ['e']], [['f']]]
where the innermost elements are productIds, then there are orders (containing lists of products), and then there are clients (containing lists of orders).
My data has transactional format:
clientId orderId product
where orderId has multiple rows for separate products and clientId has multiple rows for separate orders.
Sample data:
test = sc.parallelize([[u'1', u'100', u'a'],
                       [u'1', u'100', u'a'],
                       [u'1', u'101', u'b'],
                       [u'2', u'102', u'c'],
                       [u'3', u'103', u'b'],
                       [u'3', u'103', u'c'],
                       [u'4', u'104', u'a'],
                       [u'4', u'105', u'b']]
                      )
My solution so far:
1. Group products in orders:
order_prod = test.map(lambda x: [x[1],([x[2]])])
order_prod = order_prod.reduceByKey(lambda a,b: a + b)
order_prod.collect()
which results in:
[(u'102', [u'c']),
(u'103', [u'b', u'c']),
(u'100', [u'a', u'a']),
(u'104', [u'a']),
(u'101', [u'b']),
(u'105', [u'b'])]
2. Group orders in customers:
client_order = test.map(lambda x: [x[0],[(x[1])]])
df_co = sqlContext.createDataFrame(client_order)
df_co = df_co.distinct()
client_order = df_co.rdd.map(list)
client_order = client_order.reduceByKey(lambda a,b: a + b)
client_order.collect()
which results in:
[(u'4', [u'105', u'104']),
(u'3', [u'103']),
(u'2', [u'102']),
(u'1', [u'100', u'101'])]
Then I want to have a list like this:
[[[u'a', u'a'],[u'b']], [[u'c']], [[u'b', u'c']], [[u'a'],[u'b']]]
Here is the solution using a PySpark dataframe (note that I use PySpark 2.1). First, you have to transform the RDD to a dataframe.
df = test.toDF(['clientId', 'orderId', 'product'])
And this is the snippet to group the dataframe. The basic idea is to group by clientId and orderId first and aggregate the product column together, then group again by clientId only.
import pyspark.sql.functions as func

df_group = df.groupby(['clientId', 'orderId']).agg(func.collect_list('product').alias('product_list'))
df_group_2 = df_group[['clientId', 'product_list']].\
    groupby('clientId').\
    agg(func.collect_list('product_list').alias('product_list_group')).\
    sort('clientId', ascending=True)
df_group_2.rdd.map(lambda x: x.product_list_group).collect()  # collect output here
Result is the following:
[[['a', 'a'], ['b']], [['c']], [['b', 'c']], [['b'], ['a']]]
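Since the end goal is PrefixSpan, here is a minimal sketch of feeding these sequences into the RDD-based MLlib implementation (pyspark.mllib.fpm.PrefixSpan; the minSupport and maxPatternLength values are only illustrative assumptions):

from pyspark.mllib.fpm import PrefixSpan

# one sequence per client: a list of orders, each order a list of products
sequences = df_group_2.rdd.map(lambda x: x.product_list_group)

model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for fs in model.freqSequences().collect():
    print(fs.sequence, fs.freq)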
