So I've got this DataFrame:
df = pd.DataFrame({'A': ['ex1|context1', 1, 'ex3|context3', 3], 'B': [5, 'ex2|context2', 6, 'data']})
I want to get the column that has '|' in its first element, which in my example would be A, because ex1|context1 is its first element and contains '|'.
If at least one value containing | always exists in the data:
s = df.stack().reset_index(level=0, drop=True)
out = s.str.contains('|', na=False, regex=False).idxmax()
print (out)
A
General solution working also if no data match:
df = pd.DataFrame({'A': ['ex1context1', 1, 'ex3ontext3', 3],
'B': [5, 'ex2ontext2', 6, 'data']})
print (df)
             A           B
0  ex1context1           5
1            1  ex2ontext2
2   ex3ontext3           6
3            3        data
s = df.stack().reset_index(level=0, drop=True)
out = next(iter(s.index[s.str.contains('|', na=False, regex=False)]), 'no match')
print (out)
no match
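If you need this in more than one place, a small helper can wrap the general solution; this is just a sketch, and find_pipe_column is a hypothetical name:
import pandas as pd

def find_pipe_column(df, default='no match'):
    # stack to one Series keyed by column label, then return the label of
    # the first value that contains a literal '|', or the default
    s = df.stack().reset_index(level=0, drop=True)
    mask = s.str.contains('|', na=False, regex=False)
    return next(iter(s.index[mask]), default)

print(find_pipe_column(df))   # 'A' for the first example, 'no match' for the second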
Let's say I have a dataframe df with headers a, b, c, d.
I want to compare the column names of other dfs (df1, df2, df3, ...) with it. I need each df's column names to be exactly identical to df's (note that a different order of column names should not be considered a difference).
For example:
Original dataframe:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
col = ['a', 'b', 'c']
dfs:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'c', 'b'])
Returns identical column names;
df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
columns=['a', 'c', 'e', 'b'])
Returns extra columns in dataframe;
df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
columns=['a', 'c'])
Returns missing columns in dataframe;
df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', '*c', 'b'])
Returns errors in dataframe's column names;
df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
columns=['a', 'b', 'b', 'c'])
Returns extra columns in dataframe.
If that's too complicated, it's also OK to return 'column names are incorrect' for all kinds of errors.
How could I do that in Pandas? Thanks.
I think a set is a good choice here, because order is not important:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # if the set is shorter than the number of columns, there are duplicated names
    if len(c) != len(df1.columns):
        return 'extra columns in dataframe'
    # same sets -> identical names
    elif c == orig:
        return 'identical columns name'
    # subset -> columns are missing
    elif c.issubset(orig):
        return 'missing columns in dataframe'
    # superset -> extra columns
    elif orig.issubset(c):
        return 'extra columns in dataframe'
    else:
        return 'columns names are incorrect'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
For more detailed return values:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # if the set is shorter than the number of columns, there are duplicated names
    if len(c) != len(df1.columns):
        col = df1.columns.tolist()
        a = set(str(x) for x in col if col.count(x) > 1)
        return f'duplicated columns: {", ".join(a)}'
    # same sets -> identical names
    elif c == orig:
        return 'identical columns name'
    # subset -> columns are missing
    elif c.issubset(orig):
        a = (str(x) for x in orig - c)
        return f'missing columns: {", ".join(a)}'
    # superset -> extra columns
    elif orig.issubset(c):
        a = (str(x) for x in c - orig)
        return f'extra columns: {", ".join(a)}'
    else:
        a = (str(x) for x in c - orig)
        return f'incorrect: {", ".join(a)}'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
I wrote a plain Python function which uses pandas to get the columns and compare them; please see if this helps:
def check_errors(original_df, df1):
    original_columns = original_df.columns
    columns1 = df1.columns
    if len(original_columns) > len(columns1):
        print("Columns missing!!")
    elif len(original_columns) < len(columns1):
        print("Extra Columns")
    else:
        # same number of columns: flag any name not present in the original
        for i in columns1:
            if i not in original_columns:
                print("Column names are incorrect")
So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the ones that contain a 3, to get something like:
df2 = {'c': ['A','C'],
'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!
df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
c x
0 A [1, 2, 3]
2 C [1, 3]
The code below should help you.
To create the correct DataFrame:
df = pd.DataFrame({'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]})
To filter the rows which contain 3:
df[df.x.apply(lambda x: 3 in x)==True]
Output:
c x
0 A [1, 2, 3]
2 C [1, 3]
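If the column may also hold non-list values (e.g. NaN), a slightly more defensive sketch builds the mask with a plain list comprehension:
# keep only rows whose 'x' entry is a list that contains 3
mask = [isinstance(v, list) and 3 in v for v in df['x']]
df2 = df[mask]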
Let's consider the below code:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
x=0
print(df)
x=df.loc[df['A'] == 3, 'B', ''].iloc[0]
print(x)
While printing x I get 4 as the output. That's fine. But if the condition fails, as in the below code:
x=df.loc[df['A'] == 33, 'B', ''].iloc[0]
I want to print x's initial value 0, and I want to avoid the below error:
IndexError: single positional indexer is out-of-bounds
Guide me to avoid the error and display the initial value of x. Thanks in advance.
You can have a look at try and except for exception handling.
Use:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
x = 0
print(df)
try:
    x = df.loc[df['A'] == 3, 'B', ''].iloc[0]
    print(x)
except Exception as e:
    print(e)
    print(x)
Output:
A B
0 1 2
1 3 4
2 5 6
3 7 8
Too many indexers #the exception
0 #the initial value
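If you prefer not to catch every exception, a tighter sketch (with the stray '' indexer removed, so the failure is the IndexError raised by .iloc) keeps the initial value only on that specific error:
x = 0
try:
    x = df.loc[df['A'] == 33, 'B'].iloc[0]
except IndexError:
    # no row matched, so x keeps its initial value
    pass
print(x)   # prints 0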
Use next with a default parameter to get a value if no condition matches.
Notice: there is also another error; you need only the value of column B, so remove the '' (probably a typo).
default=0
x = next(iter(df.loc[df['A'] == 30, 'B']), default)
print (x)
0
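The same pattern returns the matched value when the condition does succeed, e.g. (a quick sketch reusing df and default from above):
x = next(iter(df.loc[df['A'] == 3, 'B']), default)
print(x)   # prints 4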
I have the following data frame my_df:
name numbers
----------------------
A [4,6]
B [3,7,1,3]
C [2,5]
D [1,2,3]
I want to combine all numbers to a new list, so the output should be:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
And here is my code:
def combine_list(my_lists):
new_list = []
for x in my_lists:
new_list.append(x)
return new_list
new_df = my_df.agg({'numbers': combine_list})
but new_df still looks the same as the original:
numbers
----------------------
0 [4,6]
1 [3,7,1,3]
2 [2,5]
3 [1,2,3]
What did I do wrong? How do I make new_df like:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
Thanks!
You need to flatten the values and then create a new DataFrame with the constructor:
flatten = [item for sublist in df['numbers'] for item in sublist]
Or:
flatten = np.concatenate(df['numbers'].values).tolist()
Or:
from itertools import chain
flatten = list(chain.from_iterable(df['numbers'].values.tolist()))
df1 = pd.DataFrame({'numbers':[flatten]})
print (df1)
numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
Timings are here.
You can use df['numbers'].sum(), which returns a combined list, to create the new dataframe:
new_df = pd.DataFrame({'new_numbers': [df['numbers'].sum()]})
new_numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
This should do:
newdf = pd.DataFrame({'numbers':[[x for i in my_df['numbers'] for x in i]]})
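Another option, if your pandas version is 0.25 or newer (where Series.explode is available), is a sketch like:
# flatten the lists with explode, then wrap the result into a one-row DataFrame
flatten = my_df['numbers'].explode().tolist()
new_df = pd.DataFrame({'new_numbers': [flatten]})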
Check this: pandas groupby and join lists.
What you are looking for is:
my_df = my_df.groupby(['name']).agg(sum)
I have a dataframe with empty columns and a corresponding dictionary, and I would like to update the empty columns from the dictionary based on index and column:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
   x  y  z   a   b   c
0  1  2  3 NaN NaN NaN
1  4  5  6 NaN NaN NaN
2  7  8  9 NaN NaN NaN
3  4  6  2 NaN NaN NaN
4  3  4  1 NaN NaN NaN
for row, column in x.iterrows():
    # calculations to return dictionary y
    y = {"a": 5, "b": 6, "c": 7}
    df.loc[row, :].map(y)
Basically after performing the calculations using columns x, y, z I would like to update columns a, b, c for that same row :)
I could use a function such as the one below, but I am not sure whether the pandas library offers a DataFrame method for this...
def update_row_with_dict(dictionary, dataframe, index):
    for key in dictionary.keys():
        dataframe.loc[index, key] = dictionary.get(key)
The above function with correct indentation:
def update_row_with_dict(df, d, idx):
    for key in d.keys():
        df.loc[idx, key] = d.get(key)
Shorter would be:
def update_row_with_dict(df, d, idx):
    # list() keeps the keys/values usable as .loc indexers across pandas versions
    df.loc[idx, list(d.keys())] = list(d.values())
For your code snippet the usage would be:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan

for idx in dataframe.index:
    y = {'a': 1, 'b': 2, 'c': 3}
    update_row_with_dict(dataframe, y, idx)
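If the per-row calculation can be written as a function returning a dict, another sketch (compute_row is a hypothetical placeholder for the real calculation) builds all the dictionaries first and then fills a, b, c in one pass with DataFrame.update:
def compute_row(row):
    # hypothetical placeholder for the real calculation on x, y, z
    return {'a': row['x'] + 1, 'b': row['y'] + 1, 'c': row['z'] + 1}

# one dict per row, aligned on the original index
updates = pd.DataFrame([compute_row(r) for _, r in dataframe.iterrows()],
                       index=dataframe.index)
# fills the NaN columns a, b, c in place, aligning on index and column names
dataframe.update(updates)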