Let's say I have a dataframe df with headers a, b, c, d.
I want to compare the column names of other dfs (df1, df2, df3, ...) with it. I need all the dfs' column names to be exactly identical to df's (please note that a different order of column names should not be considered a difference).
For example:
Original dataframe:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
col = ['a', 'b', 'c']
dfs:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'c', 'b'])
Returns identical columns name;
df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
columns=['a', 'c', 'e', 'b'])
Returns extra columns in dataframe;
df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
columns=['a', 'c'])
Returns missing columns in dataframe;
df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', '*c', 'b'])
Returns errors in dataframe's column names;
df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
columns=['a', 'b', 'b', 'c'])
Returns extra columns in dataframe.
If it's too complicated, it's also OK to return "columns names are incorrect" for all kinds of errors.
How could I do that in pandas? Thanks.
I think a set is a good choice here, because order is not important:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # test whether the set length matches the number of columns (a mismatch means duplicates)
    if len(c) != len(df1.columns):
        return 'extra columns in dataframe'
    # same sets
    elif c == orig:
        return 'identical columns name'
    # compare subsets
    elif c.issubset(orig):
        return 'missing columns in dataframe'
    elif orig.issubset(c):
        return 'extra columns in dataframe'
    else:
        return 'columns names are incorrect'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
For returning the offending column names:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # test whether the set length matches the number of columns (a mismatch means duplicates)
    if len(c) != len(df1.columns):
        col = df1.columns.tolist()
        a = set(str(x) for x in col if col.count(x) > 1)
        return f'duplicated columns: {", ".join(a)}'
    # same sets
    elif c == orig:
        return 'identical columns name'
    # compare subsets
    elif c.issubset(orig):
        a = (str(x) for x in orig - c)
        return f'missing columns: {", ".join(a)}'
    elif orig.issubset(c):
        a = (str(x) for x in c - orig)
        return f'extra columns: {", ".join(a)}'
    else:
        a = (str(x) for x in c - orig)
        return f'incorrect: {", ".join(a)}'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
I wrote a plain Python function which uses pandas to get the columns and compare them; please see if this helps:
def check_errors(original_df, df1):
    original_columns = original_df.columns
    columns1 = df1.columns
    if len(original_columns) > len(columns1):
        print("Columns missing!!")
    elif len(original_columns) < len(columns1):
        print("Extra Columns")
    else:
        for i in columns1:
            if i not in original_columns:
                print("Column names are incorrect")
                break  # report only once
So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the ones that contain a 3, to get something like:
df2 = {'c': ['A','C'],
'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!
df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
c x
0 A [1, 2, 3]
2 C [1, 3]
The code below should help you.
To create the correct dataframe:
df = pd.DataFrame({'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]})
To filter the rows which contain 3:
df[df.x.apply(lambda x: 3 in x)==True]
Output:
c x
0 A [1, 2, 3]
2 C [1, 3]
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
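For instance, a minimal sketch of building that list from CSV files (the file names here are invented for illustration; each file is assumed to have a 'name' column plus attributes):
import functools as ft
import pandas as pd

# Hypothetical files, one per attribute set
filenames = ['people1.csv', 'people2.csv', 'people3.csv']
dfs = [pd.read_csv(f) for f in filenames]

df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)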
You could try this if you have 3 dataframes
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2,on='name').merge(df3,on='name')
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With #zero's data, you could do this:
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
attr11 attr12 attr21 attr22 attr31 attr32
name
a 5 9 5 19 15 49
b 4 61 14 16 4 36
c 24 9 4 9 14 9
In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to use for the joining as the index:
pd.concat(
objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
axis=1,
join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
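For instance, a minimal sketch of producing such a generator (file and column names invented for illustration), so only one extra frame is materialized at a time:
import pandas as pd

# Hypothetical files sharing a 'join_col_name' column
filenames = ['part1.csv', 'part2.csv', 'part3.csv']
df_list = (pd.read_csv(f) for f in filenames)  # generator: frames load lazily

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')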
Simple Solution:
If the join column has the same name in all dataframes:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the join columns have different names:
(df1.merge(df2, left_on='col_name1', right_on='col_name2')
    .merge(df3, left_on='col_name1', right_on='col_name3')
    .drop(columns=['col_name2', 'col_name3'])
    .rename(columns={'col_name1': 'col_name'}))
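A concrete sketch of the different-names case with toy frames (all frame, column, and value names invented for illustration):
import pandas as pd

df1 = pd.DataFrame({'col_name1': ['a', 'b'], 'v1': [1, 2]})
df2 = pd.DataFrame({'col_name2': ['a', 'b'], 'v2': [3, 4]})
df3 = pd.DataFrame({'col_name3': ['a', 'b'], 'v3': [5, 6]})

out = (df1.merge(df2, left_on='col_name1', right_on='col_name2')
          .merge(df3, left_on='col_name1', right_on='col_name3')
          .drop(columns=['col_name2', 'col_name3'])
          .rename(columns={'col_name1': 'col_name'}))
print(out)  # one row per key, with v1, v2, v3 side by side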
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    keys = list(dfDict.keys())  # list() so the keys can be indexed in Python 3
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in onCols, cols))
        df0 = df0[onCols + valueCols]
        # suffix each value column with its dict key to keep names in sync
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if naFill is not None:
        outDf = outDf.fillna(naFill)
    return outDf
OK, let's generate data and test this:
def GenDf(size):
    df = pd.DataFrame({'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                       'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
                       'col1': np.random.uniform(low=0.0, high=100.0, size=size),
                       'col2': np.random.uniform(low=0.0, high=100.0, size=size)})
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return df
size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
One does not need a multiindex to perform join operations.
One just needs to correctly set the index column on which to perform the join operations (with the command df.set_index('Name'), for example).
The join operation is by default performed on the index.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Select the index from column 'Name'
df1 = df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
There is another solution from the pandas documentation (that I don't see here), using .append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If there are different column names, NaN will be introduced.
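Since DataFrame.append was removed in pandas 2.0, here is a minimal equivalent sketch with pd.concat for newer versions (same df and df2 as above):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Same result as df.append(df2, ignore_index=True)
print(pd.concat([df, df2], ignore_index=True))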
I tweaked the accepted answer to perform the operation for multiple dataframes with different suffixes parameters using reduce, and I guess it can be extended to different on parameters as well.
from functools import reduce
dfs_with_suffixes = [(df2,suffix2), (df3,suffix3),
(df4,suffix4)]
merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)
merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['d', 14, 16]]
),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['c', 4, 36],
['d', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
['a', 15, 49],
['c', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr41', 'attr42']
)
Three ways to join a list of dataframes
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# can't run if the index is not unique
dfs = pd.concat(dfs, join='outer', axis = 1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still runs even if the index is not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how = 'outer'), dfs)
join
# can't run if the index is not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how = 'outer')
Joining together all three can be done using the .join() function.
You have three DataFrames, let's say
df1, df2, df3.
To join these into one DataFrame you can:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task.
Suppose that I have a list k = [[1,1,1],[2,2],[3],[4]], with size limit c = 4.
Then I would like to find all possible partitions of k subject to c. Ideally, the result should be:
[ {[[1,1,1],[3]], [[2,2], [4]]}, {[[1,1,1],[4]], [[2,2], [3]]}, {[[1,1,1]], [[2,2], [3], [4]]}, ..., {[[1,1,1]], [[2,2]], [[3]], [[4]]} ]
where I used set notation { } in the above example (the actual case uses [ ]) to make it clearer what a partition is; each partition contains groups of lists grouped together.
I implemented the following algorithm but my results do not tally:
def num_item(l):
    flat_l = [item for sublist in l for item in sublist]
    return len(flat_l)

def get_all_possible_partitions(lst, c):
    p_opt = []
    for l in lst:
        p_temp = [l]
        lst_copy = lst.copy()
        lst_copy.remove(l)
        iterations = 0
        while num_item(p_temp) <= c and iterations <= len(lst_copy):
            for l_ in lst_copy:
                iterations += 1
                if num_item(p_temp + [l_]) <= c:
                    p_temp += [l_]
        p_opt += [p_temp]
    return p_opt
Running get_all_possible_partitions(k, 4), I obtain:
[[[1, 1, 1], [3]], [[2, 2], [3], [4]], [[3], [1, 1, 1]], [[4], [1, 1, 1]]]
I understand that it does not remove duplicates or exhaust the possible combinations, which is where I am stuck.
Some insight would be great! P.S. I did not manage to find similar questions :/
I think this does what you want (explanations in comments):
# Main function
def get_all_possible_partitions(lst, c):
    yield from _get_all_possible_partitions_rec(lst, c, [False] * len(lst), [])

# Produces partitions recursively
def _get_all_possible_partitions_rec(lst, c, picked, partition):
    # If all elements have been picked it is a complete partition
    if all(picked):
        yield tuple(partition)
    else:
        # Get all possible subsets of unpicked elements
        for subset in _get_all_possible_subsets_rec(lst, c, picked, [], 0):
            # Add the subset to the partition
            partition.append(subset)
            # Generate all partitions that complete the current one
            yield from _get_all_possible_partitions_rec(lst, c, picked, partition)
            # Remove the subset from the partition
            partition.pop()

# Produces all possible subsets of unpicked elements
def _get_all_possible_subsets_rec(lst, c, picked, current, idx):
    # If we have gone over all elements, finish
    if idx >= len(lst):
        return
    # If the current element is available and fits in the subset
    if not picked[idx] and len(lst[idx]) <= c:
        # Mark it as picked
        picked[idx] = True
        # Add it to the subset
        current.append(lst[idx])
        # Generate the subset
        yield tuple(current)
        # Generate all possible subsets extending this one
        yield from _get_all_possible_subsets_rec(lst, c - len(lst[idx]), picked, current, idx + 1)
        # Remove current element
        current.pop()
        # Unmark as picked
        picked[idx] = False
    # Only allow skip if it is not the first available element
    if len(current) > 0 or picked[idx]:
        # Get all subsets resulting from skipping the current element
        yield from _get_all_possible_subsets_rec(lst, c, picked, current, idx + 1)
# Test
k = [[1, 1, 1], [2, 2], [3], [4]]
c = 4
partitions = list(get_all_possible_partitions(k, c))
print(*partitions, sep='\n')
Output:
(([1, 1, 1],), ([2, 2],), ([3],), ([4],))
(([1, 1, 1],), ([2, 2],), ([3], [4]))
(([1, 1, 1],), ([2, 2], [3]), ([4],))
(([1, 1, 1],), ([2, 2], [3], [4]))
(([1, 1, 1],), ([2, 2], [4]), ([3],))
(([1, 1, 1], [3]), ([2, 2],), ([4],))
(([1, 1, 1], [3]), ([2, 2], [4]))
(([1, 1, 1], [4]), ([2, 2],), ([3],))
(([1, 1, 1], [4]), ([2, 2], [3]))
If all elements in the list are unique, then you can use bits.
Assume k = [a, b, c], whose length is 3; then there are 2^3 - 1 = 7 non-empty subsets.
If you use bits to represent a, b, c, you get:
001 -> [c]
010 -> [b]
011 -> [b, c]
100 -> [a]
101 -> [a,c]
110 -> [a,b]
111 -> [a,b,c]
So the key to solving this question should be obvious now.
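As a minimal sketch of that idea (the helper below is hypothetical, not part of the original answer): each mask from 1 to 2^n - 1 selects one non-empty subset.
def all_subsets(k):
    n = len(k)
    # Bit i of mask set means k[i] is in the subset
    return [[k[i] for i in range(n) if mask & (1 << i)]
            for mask in range(1, 1 << n)]

print(all_subsets(['a', 'b', 'c']))
# [['a'], ['b'], ['a', 'b'], ['c'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]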
Note: This answer is actually for a closed linked question.
If you only want to return the bipartitions of the list you can utilize more_itertools.set_partitions:
>>> from more_itertools import set_partitions
>>>
>>> def get_bipartions(lst):
... half_list_len = len(lst) // 2
... if len(lst) % 2 == 0:
... return list(
... map(tuple, [
... p
... for p in set_partitions(lst, k=2)
... if half_list_len == len(p[0])
... ]))
... else:
... return list(
... map(tuple, [
... p
... for p in set_partitions(lst, k=2)
... if abs(half_list_len - len(p[0])) < 1
... ]))
...
>>> get_bipartions(['A', 'B', 'C'])
[(['A'], ['B', 'C']), (['B'], ['A', 'C'])]
>>> get_bipartions(['A', 'B', 'C', 'D'])
[(['A', 'B'], ['C', 'D']), (['B', 'C'], ['A', 'D']), (['A', 'C'], ['B', 'D'])]
>>> get_bipartions(['A', 'B', 'C', 'D', 'E'])
[(['A', 'B'], ['C', 'D', 'E']), (['B', 'C'], ['A', 'D', 'E']), (['A', 'C'], ['B', 'D', 'E']), (['C', 'D'], ['A', 'B', 'E']), (['B', 'D'], ['A', 'C', 'E']), (['A', 'D'], ['B', 'C', 'E'])]
So I've got this DataFrame:
df = pd.DataFrame({'A': ['ex1|context1', 1, 'ex3|context3', 3], 'B': [5, 'ex2|context2', 6, 'data']})
I want to get the column that has '|' in its first element, which in my example would be A, because ex1|context1 is the first element and contains '|'.
If at least one | value always exists in the data:
s = df.stack().reset_index(level=0, drop=True)
out = s.str.contains('|', na=False, regex=False).idxmax()
print (out)
A
General solution that works even if no data match:
df = pd.DataFrame({'A': ['ex1context1', 1, 'ex3ontext3', 3],
'B': [5, 'ex2ontext2', 6, 'data']})
print (df)
             A           B
0  ex1context1           5
1            1  ex2ontext2
2   ex3ontext3           6
3            3        data
s = df.stack().reset_index(level=0, drop=True)
out = next(iter(s.index[s.str.contains('|', na=False, regex=False)]), 'no match')
print (out)
no match
I have a dataframe with empty columns and a corresponding dictionary with which I would like to update the empty columns, based on (index, column):
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
   x  y  z   a   b   c
0  1  2  3 NaN NaN NaN
1  4  5  6 NaN NaN NaN
2  7  8  9 NaN NaN NaN
3  4  6  2 NaN NaN NaN
4  3  4  1 NaN NaN NaN
for row, column in x.iterrows():
    # calculations to return dictionary y
    y = {"a": 5, "b": 6, "c": 7}
    df.loc[row, :].map(y)
Basically after performing the calculations using columns x, y, z I would like to update columns a, b, c for that same row :)
I could use a function such as the one below, but as far as the pandas library and a method on the DataFrame object go, I am not sure...
def update_row_with_dict(dictionary, dataframe, index):
for key in dictionary.keys():
dataframe.loc[index, key] = dictionary.get(key)
The above answer with correct indentation:
def update_row_with_dict(df, d, idx):
    for key in d.keys():
        df.loc[idx, key] = d.get(key)
Shorter would be:
def update_row_with_dict(df, d, idx):
    # list() so pandas accepts the dict views as indexers/values
    df.loc[idx, list(d.keys())] = list(d.values())
For your code snippet the syntax would be:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
for idx in dataframe.index:
    y = {'a': 1, 'b': 2, 'c': 3}
    update_row_with_dict(dataframe, y, idx)
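Printing the frame afterwards serves as a quick sanity check; since a, b, and c were created as NaN (float) columns, the assigned values show up as floats:
print(dataframe)
   x  y  z    a    b    c
0  1  2  3  1.0  2.0  3.0
1  4  5  6  1.0  2.0  3.0
2  7  8  9  1.0  2.0  3.0
3  4  6  2  1.0  2.0  3.0
4  3  4  1  1.0  2.0  3.0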