Related
Anyone know how can I do theses calculations in pyspark?
data = {
'Name': ['Tom', 'nick', 'krish', 'jack'],
'Age': [20, 21, 19, 18],
'CSP': [2, 6, 8, 7],
'coef': [2, 2, 3, 3]
}
# Create DataFrame
df = pd.DataFrame(data)
colsToRecalculate = ['Age','CSP']
for i in range(len(colsToRecalculate)):
df[colsToRecalculate[i]] =df[colsToRecalculate[i]]/df["coef"]
You can use select() on spark dataframe and include multiple columns (with different calculations) as parameters. In your case:
df2 = spark.createDataFrame(pd.DataFrame(data))
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()
Slight variation to bzu's answer which selects non-listed columns manually within the select. We can use dataframe.columns and check the columns against the colsToRecalculate list - If column is in the list, do the calculation, else leave column as is.
data_sdf. \
select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k for k in data_sdf.columns])
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
You could try this if you have 3 dataframes
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')
alternatively, as mentioned by cwharland
df1.merge(df2,on='name').merge(df3,on='name')
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])
With #zero's data, you could do this:
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
attr11 attr12 attr21 attr22 attr31 attr32
name
a 5 9 5 19 15 49
b 4 61 14 16 4 36
c 24 9 4 9 14 9
In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining:
pd.concat(
objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
axis=1,
join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import pandas as pd
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
df = df.merge(df_, on='join_col_name')
Simple Solution:
If the column names are similar:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the column names are different:
df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
keys = dfDict.keys()
for i in range(len(keys)):
key = keys[i]
df0 = dfDict[key]
cols = list(df0.columns)
valueCols = list(filter(lambda x: x not in (onCols), cols))
df0 = df0[onCols + valueCols]
df0.columns = onCols + [(s + '_' + key) for s in valueCols]
if (i == 0):
outDf = df0
else:
outDf = pd.merge(outDf, df0, how=how, on=onCols)
if (naFill != None):
outDf = outDf.fillna(naFill)
return(outDf)
OK, lets generates data and test this:
def GenDf(size):
df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True),
'col1':np.random.uniform(low=0.0, high=100.0, size=size),
'col2':np.random.uniform(low=0.0, high=100.0, size=size)
})
df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
return(df)
size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
One does not need a multiindex to perform join operations.
One just need to set correctly the index column on which to perform the join operations (which command df.set_index('Name') for example)
The join operation is by default performed on index.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 1) Select the index from column 'Name'
df1 = df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
There is another solution from the pandas documentation (that I don't see here),
using the .append
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
A B
0 1 2
1 3 4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
A B
0 5 6
1 7 8
>>> df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If there are different column names, Nan will be introduced.
I tweaked the accepted answer to perform the operation for multiple dataframes on different suffix parameters using reduce and i guess it can be extended to different on parameters as well.
from functools import reduce
dfs_with_suffixes = [(df2,suffix2), (df3,suffix3),
(df4,suffix4)]
merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)
merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['d', 14, 16]]
),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['c', 4, 36],
['d', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
['a', 15, 49],
['c', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr41', 'attr42']
)
Three ways to join list dataframe
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cant not run if index not unique
dfs = pd.concat(dfs, join='outer', axis = 1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still run with index not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how = 'outer'), dfs)
join
# cant not run if index not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how = 'outer')
Joining together all three can be done using .join() function.
You have three DataFrames lets say
df1, df2, df3.
To join these into one DataFrame you can:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task.
I have a pandas dataframe that contains 2 dimensional vector as a column. I would like to groupby one of the columns and add the vectors up.
I have tried groupby then sum as shown in the code below, but the output column is adding dimensions to the vector rather than adding the vectors (similarly to when using np.add).
import pandas as pd
data = pd.DataFrame({'label': ['A', 'B', 'A'], 'label2' : ['X', 'Y', 'Z'],
'output' : [[[1,2,3,4],[5,6,7,8]] ,[[9,10,11,12],[13,14,15,16]],[[17,18,19,20],[21,22,23,24]]] })
data_grouped = data.groupby('label')['output'].sum()
I would like to groupby 'label' and have the outputs aggregated. Given that the output is two dimensional vector, i would like the vectors to be added and not combined. Therfore, my expectation is to have:
label A: output is [[18,20,22,24],[26,28,30,32]]
label B: output is [[9,10,11,12],[13,14,15,16]]
but I am getting:
label A: [[1, 2, 3, 4], [5, 6, 7, 8], [17, 18, 19, 20],[21,22,23,24]]
label B: [[9, 10, 11, 12], [13, 14, 15, 16]]
The solution
import pandas as pd
import numpy as np
data = pd.DataFrame({'label': ['A', 'B', 'A'], 'label2' : ['X', 'Y', 'Z'],
'output' : [[[1,2,3,4],[5,6,7,8]] ,[[9,10,11,12],[13,14,15,16]],[[17,18,19,20],[21,22,23,24]]] })
data['output'] = data['output'].map(np.array)
data_grouped = data[['label', 'output']].groupby('label').sum()
print(data_group)
>>> output
>>> label
>>> A [[18, 20, 22, 24], [26, 28, 30, 32]]
>>> B [[9, 10, 11, 12], [13, 14, 15, 16]]
The explanation
Your output contains python lists. Operation + on 2 lists concatenates the lists together:
print([1, 2] + [3, 4])
>>> [1, 2, 3, 4]
print([[1], [2]] + [[3], [4]])
>>> [[1], [2], [3], [4]]
data['output'].map(np.array) turns your 2D lists into 2D numpy arrays. Numpy arrays + operation (which is used by sum()) sums the values that are on "the same place" in both arrays.
I have a nested list sorted by the first element already, I want to rearrange the nested list based on the unique identifier ('a','b','c') and do some statistic calculations for other corresponding variables. below is what I have and what I want to get. Any hint? thanks.
What I already have:
sortedlist=[['a',5,7],['a',3,5],['a',6,7],['b',3,7],['b',5,5],['b',6,9],['b',5,5],['c',1,9],['c',4,5]]
What I want to get first:
target_list1[[['a','a','a'],[5,3,6],[7,5,7]],[['b','b','b','b'],[3,5,6,5],[7,5,9,5]],[['c','c'],[1,4],[9,5]]]
What I want to get next:
target_list2[['a',median[5,3,6],median[7,5,7]],['b',median[3,5,6,5],median[7,5,9,5]],['c',median[1,4],median[9,5]]]
First, you can use the itertools.groupby method. This will group your list of lists by a key function, in this case the first element of each list:
groups = itertools.groupby(sortedlist, operator.itemgetter(0))
This gives you tuples of keys and groups (which are iterators themselves). The expression [list(g[1]) for g in groups] will evaluate to:
[[['a', 5, 7], ['a', 3, 5], ['a', 6, 7]],
[['b', 3, 7], ['b', 5, 5], ['b', 6, 9], ['b', 5, 5]],
[['c', 1, 9], ['c', 4, 5]]]
Now you want to transpose that list of lists, and this can be done with the zip() function:
target_list1 = [list(zip(*g[1])) for g in groups]
This evaluates to:
[[('a', 'a', 'a'), (5, 3, 6), (7, 5, 7)],
[('b', 'b', 'b', 'b'), (3, 5, 6, 5), (7, 5, 9, 5)],
[('c', 'c'), (1, 4), (9, 5)]]
Next you want to apply your function to the result. As this is tagged with python3, we can use the statistics module:
>>> import statistics
>>> [ [l[0][0]] + [statistics.median(x) for x in l[1:]] for l in target_list1]
Bringing this all together:
import itertools
import operator
import statistics
target_list1 = [list(zip(*g[1]))
for g in itertools.groupby(sortedlist, operator.itemgetter(0))]
target_list2 = [ [l[0][0]] + [ statistics.median(x) for x in l[1:]]
for l in target_list1]
I have a dataframe with empty columns and a corresponding dictionary which I would like to update the empty columns with based on index, column:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
dataframe[col] = np.nan
x y z a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 4 6 2
4 3 4 1
for row, column in x.iterrows():
#caluclations to return dictionary y
y = {"a": 5, "b": 6, "c": 7}
df.loc[row, :].map(y)
Basically after performing the calculations using columns x, y, z I would like to update columns a, b, c for that same row :)
I could use a function as such but as far as the pandas library and a method for the DataFrame object I am not sure...
def update_row_with_dict(dictionary, dataframe, index):
for key in dictionary.keys():
dataframe.loc[index, key] = dictionary.get(key)
The above answer with correct indent
def update_row_with_dict(df,d,idx):
for key in d.keys():
df.loc[idx, key] = d.get(key)
more short would be
def update_row_with_dict(df,d,idx):
df.loc[idx,d.keys()] = d.values()
for your code snipped the syntax would be:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
dataframe[col] = np.nan
for idx in dataframe.index:
y = {'a':1,'b':2,'c':3}
update_row_with_dict(dataframe,y,idx)