How to divide multilevel columns in Python - python-3.x

I have a DataFrame like this:
import numpy as np
import pandas as pd
arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 4), index=['A', 'B', 'C'], columns=index)
df.head()
returning:
I want to add columns in which each second-level column of bar is divided by the matching second-level column of baz: bar/one by baz/one, bar/two by baz/two, and so on. I tried
df[["bar"]]/df[["baz"]]
and
df[["bar"]].div(df[["baz"]])
but both return only NaNs.

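Select each top-level group with a single pair of brackets. df["bar"] drops the first column level, so the remaining labels (one, two) align in the division. df[["bar"]] keeps that level, so (bar, one) and (baz, one) never match and every result is NaN: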
df1 = df["bar"]/df["baz"]
print (df1)
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
If you want a MultiIndex in the result, build one with MultiIndex.from_product:
df1.columns = pd.MultiIndex.from_product([['new'], df1.columns], names=df.columns.names)
print (df1)
first new
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
Another way to get a MultiIndex in the output is to keep your double-bracket approach and rename both top-level labels to the same name, here new, so that the columns align:
df2 = df[["bar"]].rename(columns={'bar':'new'})/df[["baz"]].rename(columns={'baz':'new'})
print (df2)
first new
second one two
A 1.564478 -0.115979
B 14.604267 -19.749265
C -0.511788 -0.436637
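The behaviour described above can be checked with a small, self-contained sketch. The seed, the `ratio` label, and the use of `pd.concat` to append the result to the original frame are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
df = pd.DataFrame(rng.standard_normal((3, 4)), index=['A', 'B', 'C'], columns=columns)

# Double brackets keep the 'first' level: no column labels match, so all cells are NaN
all_nan = df[['bar']].div(df[['baz']])

# Single brackets drop the 'first' level, so 'one' and 'two' align
ratio = df['bar'] / df['baz']

# Put the result under a new top-level label ('ratio' is a hypothetical name)
ratio.columns = pd.MultiIndex.from_product(
    [['ratio'], ratio.columns], names=df.columns.names
)

# Append the new columns to the original frame
combined = pd.concat([df, ratio], axis=1)
print(combined.columns.tolist())
```

The combined frame keeps the four original columns and gains two new ones under the new top-level label.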

Related

column comprehension robust to missing values

I have only been able to create a two column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
columns=['id', 'value'])
What I would like to do is use this same basic format to create the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict, called id_lookup, that contains the necessary lookup info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
columns=['id', 'id2','value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible keys. For my purposes, simply putting it all together and placing 'N/A' or something similar where the lookup fails will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        # id is missing from id_lookup, so fall back to a placeholder
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
id value id2
0 1 10 A
1 2 20 B
2 3 30 C
3 4 40 N/A
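A shorter variant that stays close to the comprehension in the question is sketched below. It relies on `dict.get`, which returns a default instead of raising `KeyError`. The contents of `output` and `id_lookup` are made-up sample data.

```python
from collections import defaultdict

import pandas as pd

# Sample data (assumed for illustration): id 4 is missing from id_lookup
output = defaultdict(int, {1: 10, 2: 20, 3: 30, 4: 40})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# dict.get returns 'N/A' when the key is absent, so no KeyError is raised
df_mydata = pd.DataFrame(
    [(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
    columns=['id', 'id2', 'value'],
)
print(df_mydata)
```

If the frame already exists, `df_mydata['id'].map(id_lookup).fillna('N/A')` should produce the same column without a Python-level loop.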

turn three columns into dictionary python

Name = [list(['Amy', 'A', 'Angu']),
list(['Jon', 'Johnson']),
list(['Bob', 'Barker'])]
Other = [list(['Amy', 'Any', 'Anguish']),
list(['Jon', 'Jan']),
list(['Baker', 'barker'])]
import pandas as pd
df = pd.DataFrame({'Other' : Other,
'ID': ['E123','E456','E789'],
'Other_ID': ['A123','A456','A789'],
'Name' : Name,
})
ID Name Other Other_ID
0 E123 [Amy, A, Angu] [Amy, Any, Anguish] A123
1 E456 [Jon, Johnson] [Jon, Jan] A456
2 E789 [Bob, Barker] [Baker, barker] A789
I have the df as seen above. I want to turn the columns ID, Name and Other into a dictionary with ID as the key. I tried this, following "python pandas dataframe columns convert to dict key and value":
todict = dict(zip(df.ID, df.Name))
Which is close to what I want
{'E123': ['Amy', 'A', 'Angu'],
'E456': ['Jon', 'Johnson'],
'E789': ['Bob', 'Barker']}
But I would like to get this output that includes values from Other column
{'E123': ['Amy', 'A', 'Angu','Amy', 'Any','Anguish'],
'E456': ['Jon', 'Johnson','Jon','Jan'],
'E789': ['Bob', 'Barker','Baker','barker']
}
If I add the third column, Other, I get an error:
todict = dict(zip(df.ID, df.Name, df.Other))
How do I get the output I want?
Why not just combine the Name and Other columns before creating the dict? Adding two columns of lists concatenates the lists row by row (note that this overwrites the Name column):
df['Name'] = df['Name'] + df['Other']
dict(zip(df.ID, df.Name))
Gives
{'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
'E789': ['Bob', 'Barker', 'Baker', 'barker']}
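The three-argument `dict(zip(...))` fails because `zip` then yields 3-tuples, while `dict` expects key/value pairs. A minimal sketch of the same idea that leaves the original frame untouched follows; the variable name `combined` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['E123', 'E456', 'E789'],
    'Name': [['Amy', 'A', 'Angu'], ['Jon', 'Johnson'], ['Bob', 'Barker']],
    'Other': [['Amy', 'Any', 'Anguish'], ['Jon', 'Jan'], ['Baker', 'barker']],
})

# Concatenate the two list columns row by row without modifying df
combined = df['Name'] + df['Other']

# Pair each ID with its combined list
todict = dict(zip(df['ID'], combined))
print(todict)
```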

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried this on a plain list:
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the consecutive duplicate entries in every cell of the dataframe?
For example, a,a,b,c should become a,b,c.
From what I understand, you want to drop values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]  # keep values that differ from the previous one
    return ','.join(res.values)
print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Another function can be written with itertools.groupby:
from itertools import groupby
def myfunc(x):
    keys = [key for key, _ in groupby(x.split(','))]  # one key per run of equal values
    return ','.join(keys)
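For reference, here is a self-contained sketch of the groupby approach applied to the sample frame. It maps each column with `Series.map` instead of calling `applymap`, because pandas 2.1 deprecated `applymap` in favor of `DataFrame.map`. The function name `drop_consecutive_dups` is a hypothetical choice.

```python
from itertools import groupby

import pandas as pd


def drop_consecutive_dups(cell: str) -> str:
    """Collapse runs of equal comma-separated values into a single value."""
    return ','.join(key for key, _ in groupby(cell.split(',')))


df = pd.DataFrame({'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']})

# Apply the helper to every cell, one column at a time
result = df.apply(lambda col: col.map(drop_consecutive_dups))
print(result)
```

Only adjacent repeats are removed, so 'a,b,b,a' becomes 'a,b,a' rather than 'a,b'.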
You could define a helper function, then use .applymap to apply it to every cell (or .apply for one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)  # remove all duplicates, not only consecutive ones; set order is arbitrary
    return ','.join(uniques)  # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns:
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different to your expected output, why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the order of the columns in the first dataframe that is merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The code above generates two data frames: the first has columns C, B, and A, and the second has columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. I want the merged columns ordered as C_1, C_2, B_1, B_2, A_1, D_2. That is, the column order of the first data frame is preserved, and each matching column from the second data frame is placed next to its counterpart.
Is there a setting in merge for this, or can I use sort_index to do it?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where columns with the same name are put together.
Using an OrderedDict, as you suggested:
from collections import OrderedDict
from itertools import chain
c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    # group column names by the prefix before the suffix, in first-seen order
    o.setdefault(x.split('_')[0], []).append(x)
c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative extracts the prefixes and then calls sorted on the column list. It assumes single-character column prefixes:
# https://stackoverflow.com/a/46839182/4909087
c = df3.columns.tolist()
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
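The grouping idea can be sketched end to end as follows. A plain dict stands in for OrderedDict, since dicts preserve insertion order from Python 3.7 on, and constant data replaces the random values; both are substitutions made for this illustration. Columns A and D exist in only one frame, so merge leaves them unsuffixed.

```python
import numpy as np
import pandas as pd

dates1 = pd.date_range('2000-01-01 00:00', periods=11, freq='10min')
dates2 = pd.date_range('2000-01-01 01:30', periods=4, freq='10min')
df1 = pd.DataFrame(np.zeros((11, 3)), index=dates1, columns=['C', 'B', 'A'])
df2 = pd.DataFrame(np.ones((4, 3)), index=dates2, columns=['B', 'C', 'D'])

df3 = df1.merge(df2, how='outer', left_index=True, right_index=True,
                suffixes=('_1', '_2'))

# Group column names by base name, keeping the order of first appearance
groups = {}
for col in df3.columns:
    groups.setdefault(col.split('_')[0], []).append(col)

# Flatten the groups back into one column list and reorder the frame
ordered = [col for cols in groups.values() for col in cols]
df3 = df3[ordered]
print(df3.columns.tolist())
```

This assumes base column names contain no underscore of their own; otherwise the split on '_' would need to target only the suffix.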

pyspark convert transactions into a list of list

I want to use PrefixSpan sequence mining in pyspark. The format of data that I need to have is the following:
[[['a', 'b'], ['c']], [['a'], ['c', 'b'], ['a', 'b']], [['a', 'b'], ['e']], [['f']]]
where the innermost elements are productIds, then there are orders (containing list of products) and then there are clients (containing lists of orders).
My data has transactional format:
clientId orderId product
where orderId has multiple rows for separate products and clientId has multiple rows for separate orders.
Sample data:
test = sc.parallelize([[u'1', u'100', u'a'],
[u'1', u'100', u'a'],
[u'1', u'101', u'b'],
[u'2', u'102', u'c'],
[u'3', u'103', u'b'],
[u'3', u'103', u'c'],
[u'4', u'104', u'a'],
[u'4', u'105', u'b']]
)
My solution so far:
1. Group products in orders:
order_prod = test.map(lambda x: [x[1],([x[2]])])
order_prod = order_prod.reduceByKey(lambda a,b: a + b)
order_prod.collect()
which results in:
[(u'102', [u'c']),
(u'103', [u'b', u'c']),
(u'100', [u'a', u'a']),
(u'104', [u'a']),
(u'101', [u'b']),
(u'105', [u'b'])]
2. Group orders in customers:
client_order = test.map(lambda x: [x[0],[(x[1])]])
df_co = sqlContext.createDataFrame(client_order)
df_co = df_co.distinct()
client_order = df_co.rdd.map(list)
client_order = client_order.reduceByKey(lambda a,b: a + b)
client_order.collect()
which results in:
[(u'4', [u'105', u'104']),
(u'3', [u'103']),
(u'2', [u'102']),
(u'1', [u'100', u'101'])]
Then I want to have a list like this:
[[[u'a', u'a'],[u'b']], [[u'c']], [[u'b', u'c']], [[u'a'],[u'b']]]
Here is a solution using a PySpark DataFrame (note that I use PySpark 2.1). First, you have to convert the RDD to a DataFrame.
df = test.toDF(['clientId', 'orderId', 'product'])
And this is the snippet to group the DataFrame. The basic idea is to group by clientId and orderId first and collect the product column into a list, then group again by clientId only.
import pyspark.sql.functions as func
df_group = df.groupby(['clientId', 'orderId']).agg(func.collect_list('product').alias('product_list'))
df_group_2 = df_group[['clientId', 'product_list']].\
    groupby('clientId').\
    agg(func.collect_list('product_list').alias('product_list_group')).\
    sort('clientId', ascending=True)
df_group_2.rdd.map(lambda x: x.product_list_group).collect()  # collect output here
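The result is the following (collect_list does not guarantee the order of collected elements, so the orders of a client may come out in a different sequence):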
[[['a', 'a'], ['b']], [['c']], [['b', 'c']], [['b'], ['a']]]
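To make the target structure concrete, the same two-level grouping can be sketched in plain Python, without Spark. This is only an illustration of the grouping logic on the sample rows. Sorting by orderId is an assumption standing in for the chronological order of orders.

```python
from collections import defaultdict

rows = [
    ('1', '100', 'a'), ('1', '100', 'a'), ('1', '101', 'b'),
    ('2', '102', 'c'),
    ('3', '103', 'b'), ('3', '103', 'c'),
    ('4', '104', 'a'), ('4', '105', 'b'),
]

# First level: clientId, second level: orderId, leaves: list of products
orders = defaultdict(lambda: defaultdict(list))
for client_id, order_id, product in rows:
    orders[client_id][order_id].append(product)

# One list of orders per client, with clients and orders in sorted key order
sequences = [
    [orders[client][order] for order in sorted(orders[client])]
    for client in sorted(orders)
]
print(sequences)
```

Because the keys are sorted explicitly, the orders of client 4 come out as [['a'], ['b']], matching the list requested in the question.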
