pyspark convert transactions into a list of list - apache-spark

I want to use PrefixSpan sequence mining in pyspark. The format of data that I need to have is the following:
[[['a', 'b'], ['c']], [['a'], ['c', 'b'], ['a', 'b']], [['a', 'b'], ['e']], [['f']]]
where the innermost elements are productIds, then there are orders (containing list of products) and then there are clients (containing lists of orders).
My data has transactional format:
clientId orderId product
where orderId has multiple rows for separate products and clientId has multiple rows for separate orders.
Sample data:
test = sc.parallelize([[u'1', u'100', u'a'],
                       [u'1', u'100', u'a'],
                       [u'1', u'101', u'b'],
                       [u'2', u'102', u'c'],
                       [u'3', u'103', u'b'],
                       [u'3', u'103', u'c'],
                       [u'4', u'104', u'a'],
                       [u'4', u'105', u'b']])
My solution so far:
1. Group products in orders:
order_prod = test.map(lambda x: [x[1],([x[2]])])
order_prod = order_prod.reduceByKey(lambda a,b: a + b)
order_prod.collect()
which results in:
[(u'102', [u'c']),
(u'103', [u'b', u'c']),
(u'100', [u'a', u'a']),
(u'104', [u'a']),
(u'101', [u'b']),
(u'105', [u'b'])]
2. Group orders in customers:
client_order = test.map(lambda x: [x[0],[(x[1])]])
df_co = sqlContext.createDataFrame(client_order)
df_co = df_co.distinct()
client_order = df_co.rdd.map(list)
client_order = client_order.reduceByKey(lambda a,b: a + b)
client_order.collect()
which results in:
[(u'4', [u'105', u'104']),
(u'3', [u'103']),
(u'2', [u'102']),
(u'1', [u'100', u'101'])]
Then I want to have a list like this:
[[[u'a', u'a'],[u'b']], [[u'c']], [[u'b', u'c']], [[u'a'],[u'b']]]
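One rough way to stitch the two intermediate RDDs above into that nested list (a sketch only, joining on orderId and reusing the test and order_prod RDDs from steps 1 and 2):
# Sketch: join the per-order product lists back to their clients via orderId
order_client = test.map(lambda x: (x[1], x[0])).distinct()    # (orderId, clientId)
joined = order_client.join(order_prod)                        # (orderId, (clientId, [products]))
client_seqs = (joined
               .map(lambda kv: (kv[1][0], [kv[1][1]]))        # (clientId, [[products]])
               .reduceByKey(lambda a, b: a + b))
client_seqs.values().collect()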

Here is a solution using the PySpark DataFrame API (note that I use PySpark 2.1). First, you have to transform the RDD into a DataFrame.
df = test.toDF(['clientId', 'orderId', 'product'])
And this is the snippet to group the dataframe. The basic idea is to group by clientId and orderId first and aggregate the product column into a list, then group again by clientId only.
import pyspark.sql.functions as func
df_group = df.groupby(['clientId', 'orderId']).agg(func.collect_list('product').alias('product_list'))
df_group_2 = df_group[['clientId', 'product_list']].\
    groupby('clientId').\
    agg(func.collect_list('product_list').alias('product_list_group')).\
    sort('clientId', ascending=True)
df_group_2.rdd.map(lambda x: x.product_list_group).collect() # collect output here
Result is the following:
[[['a', 'a'], ['b']], [['c']], [['b', 'c']], [['b'], ['a']]]
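From here, if the goal is the RDD-based sequence miner, the grouped column can be fed to it directly; a minimal sketch, assuming pyspark.mllib.fpm.PrefixSpan (available since Spark 1.6) and an illustrative minSupport value:
from pyspark.mllib.fpm import PrefixSpan

# Each element is one client's sequence: a list of orders, each a list of products
sequences = df_group_2.rdd.map(lambda row: row.product_list_group)

model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for fs in model.freqSequences().collect():
    print(fs.sequence, fs.freq)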

Related

column comprehension robust to missing values

I have only been able to create a two column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to be able to do, using this same basic format, is to initialize the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict that contains the necessary lookup info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible keys. For my purposes, simply putting it all together and placing 'N/A' or something similar for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)

# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'

# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)

# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
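A shorter, loop-free alternative (just a sketch, using the standard dict.get and Series.map and reusing the question's output and id_lookup names) handles the missing keys at lookup time:
# Using dict.get with a default avoids the KeyError entirely
df_mydata = pd.DataFrame([(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])

# Or, column-wise on an existing frame: map returns NaN for keys missing from the dict
df['id2'] = df['id'].map(id_lookup).fillna('N/A')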

How to divide multilevel columns in Python

I have a df like this:
import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 4), index=['A', 'B', 'C'], columns=index)
df.head()
returning a 3x4 DataFrame of random values with MultiIndex columns (output not shown).
I want to add some columns where the second-level columns are divided by each other: bar/one divided by baz/one, bar/two divided by baz/two, and so on.
df[["bar"]]/df[["baz"]]
and
df[["bar"]].div(df[["baz"]])
both return NaNs.
You can select the first level with a single [] (so only the second level remains), and the division then aligns on the second level:
df1 = df["bar"]/df["baz"]
print (df1)
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
If you want to add a MultiIndex back, use MultiIndex.from_product:
df1.columns = pd.MultiIndex.from_product([['new'], df1.columns], names=df.columns.names)
print (df1)
first new
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
Another idea for getting a MultiIndex in the output is to use your original solution but rename the columns to the same name, here new:
df2 = df[["bar"]].rename(columns={'bar':'new'})/df[["baz"]].rename(columns={'baz':'new'})
print (df2)
first new
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
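Since the question is about adding these as new columns to df, either result can simply be concatenated back along the column axis; a minimal sketch:
# Attach the ratio columns next to the existing MultiIndex columns
df = pd.concat([df, df1], axis=1)
print(df)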

Is there a simple way to manually iterate through existing pandas groupby objects?

Is there a simple way to manually iterate through existing pandas groupby objects?
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'category': ['A', 'A', 'B', 'B', 'B']})
grouped = df.groupby('category')
In the application, a for name, group in grouped: loop follows. For manual testing I would like to do something like group = grouped[0] and run the code inside the loop body. Unfortunately this does not work. The best thing I could find (here) was
group = df[grouped.ngroup()==0]
which relies on the original DataFrame and not solely on the groupby object, and is therefore not optimal in my opinion.
Any iterable (here the GroupBy object) can be turned into an iterator:
group_iter = iter(grouped)
The line below will be the equivalent of selecting the first group (indexed by 0):
name, group = next(group_iter)
To get the next group, just repeat:
name, group = next(group_iter)
And so on...
Source: https://treyhunner.com/2018/02/python-range-is-not-an-iterator/
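A related option, if you already know which group key you want to inspect, is GroupBy.get_group, which returns a single group as a DataFrame:
# Pull out one group by its key without iterating
group_a = grouped.get_group('A')
print(group_a)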

Multi-index pandas dataframes: find an index related to the number of unique values a column has

# import Pandas library
import pandas as pd

idx = pd.MultiIndex.from_product([['A001', 'B001', 'C001'],
                                  ['0', '1', '2']],
                                 names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10, 10, 10]
df.loc['A001', 'B'] = [90, 84, 70]
df.loc['B001', 'A'] = [10, 20, 10]
df.loc['B001', 'B'] = [70, 86, 67]
df.loc['C001', 'A'] = [20, 20, 20]
df.loc['C001', 'B'] = [98, 81, 72]
# df is a dataframe
df
The problem is the following: how do I return the ID that has more than one unique value in column 'A'? In the above dataset, it should return B001.
I would appreciate it if anyone could help me out with performing operations on multi-index pandas dataframes.
Use GroupBy.transform with nunique, filter with boolean indexing, and then get the values of the first level of the MultiIndex with get_level_values and unique:
a = df[df.groupby(level=0)['A'].transform('nunique') > 1].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
Or use duplicated, but first you need the MultiIndex levels as columns via reset_index:
m = df.reset_index().duplicated(subset=['ID','A'], keep=False).values
a = df[~m].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
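A slightly more direct variant of the first approach (just a sketch with the same groupby machinery) computes nunique per ID and filters the resulting Series:
# Number of distinct 'A' values per ID, then keep the IDs with more than one
nunique_per_id = df.groupby(level='ID')['A'].nunique()
print(nunique_per_id[nunique_per_id > 1].index)
# Index(['B001'], dtype='object', name='ID')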

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like the merged result to preserve the column order of the first dataframe.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A, the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the columns in the merge output to be C_1, C_2, B_1, B_2, A_1, D_2, i.e. the column order of the first data frame is preserved, and any column shared with the second data frame is placed next to the corresponding column.
Could there be a setting in merge or can I use sort_index to do this?
EDIT: Maybe a better way to describe the desired ordering is "uncollated": columns that share a base name are placed together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)

c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative involves extracting the first character of each name as a prefix and then calling sorted with a key based on the prefix's position:
# https://stackoverflow.com/a/46839182/4909087
# Note: this keys on the first character of each column name, so it assumes
# single-character base names like 'A', 'B_1', 'C_2'.
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
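If the base names can be longer than one character, a variant of the same idea (a sketch) keys on the part before the underscore and orders by the first frame's columns:
# Order columns by the position of their base name in df1's columns,
# sending columns that exist only in df2 (e.g. 'D') to the end.
base_order = {name: i for i, name in enumerate(df1.columns)}   # {'C': 0, 'B': 1, 'A': 2}
cols = sorted(df3.columns,
              key=lambda col: (base_order.get(col.split('_')[0], len(base_order)), col))
df3 = df3[cols]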
