I have two Excel files which I read with pandas. I am comparing the index in file 1 with the index in file 2 (they are not the same length, e.g. 10 and 100); when they match, the row at that index in the second file should be set to zeros, and otherwise left unchanged. I am using for and if loops for this, but the more data I want to process (1e3, 5e3 rows), the longer the run time becomes. Is there a better way to perform such a comparison? Here's an example of what I am using.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([['w'], ['y'], ['z']],
                   index=[4, 5, 1])

for j in df1.index:
    for i in df.index:
        if i == j:
            df.loc[i, :] = 0
        else:
            df.loc[i, :] = df.loc[i, :]

print(df)
Loops are not necessary here; you can set whole rows to 0 with DataFrame.mask combined with Series.isin (the index must be converted to a Series to avoid ValueError: Array conditional must be same shape as self):
df = df.mask(df.index.to_series().isin(df1.index), 0)
Or use Index.isin with numpy.where if you want to improve performance:
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
A B C
4 0 0 0
5 0 0 0
6 10 20 30
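For the sizes mentioned in the question (1e3, 5e3 rows), a minimal timing sketch along these lines can confirm the speedup over the nested loops; the random frames and the val column are placeholders, not your real Excel files:

import time
import numpy as np
import pandas as pd

# hypothetical stand-ins for the two files, roughly the sizes from the question
df = pd.DataFrame(np.random.rand(5000, 3), columns=['A', 'B', 'C'])
df1 = pd.DataFrame({'val': 'w'}, index=np.random.choice(5000, 100, replace=False))

start = time.perf_counter()
masked = df.mask(df.index.to_series().isin(df1.index), 0)
print(f'DataFrame.mask: {time.perf_counter() - start:.5f} s')

start = time.perf_counter()
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(f'np.where:       {time.perf_counter() - start:.5f} s')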
Let's say I have a dataframe df with headers a, b, c, d.
I want to compare other dfs' (df1, df2, df3, ...) column names with it. All the dfs' column names should be exactly identical to df's (note that a different order of the column names should not be considered a difference).
For example:
Original dataframe:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
col = ['a', 'b', 'c']
dfs:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'c', 'b'])
should return "identical columns name";
df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
                   columns=['a', 'c', 'e', 'b'])
should return "extra columns in dataframe";
df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
                   columns=['a', 'c'])
should return "missing columns in dataframe";
df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', '*c', 'b'])
should return "errors in dataframe's column names";
df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
                   columns=['a', 'b', 'b', 'c'])
should return "extra columns in dataframe".
If that's too complicated, it's also OK to return "columns names are incorrect" for all kinds of errors.
How could I do that in pandas? Thanks.
I think a set is a good choice here, because order is not important:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # test whether the length of the set matches the number of column names
    if len(c) != len(df1.columns):
        return ('extra columns in dataframe')
    # same sets
    elif (c == orig):
        return ('identical columns name')
    # compare subsets
    elif c.issubset(orig):
        return ('missing columns in dataframe')
    # compare subsets
    elif orig.issubset(c):
        return ('extra columns in dataframe')
    else:
        return ('columns names are incorrect')
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
To also return the offending column names:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # test whether the length of the set matches the number of column names
    if len(c) != len(df1.columns):
        col = df1.columns.tolist()
        a = set([str(x) for x in col if col.count(x) > 1])
        return f'duplicated columns: {", ".join(a)}'
    # same sets
    elif (c == orig):
        return ('identical columns name')
    # compare subsets
    elif c.issubset(orig):
        a = (str(x) for x in orig - c)
        return f'missing columns: {", ".join(a)}'
    # compare subsets
    elif orig.issubset(c):
        a = (str(x) for x in c - orig)
        return f'extra columns: {", ".join(a)}'
    else:
        a = (str(x) for x in c - orig)
        return f'incorrect: {", ".join(a)}'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
I wrote a plain Python function which uses pandas to get the columns and compare them; please see if this helps:
def check_errors(original_df, df1):
    original_columns = original_df.columns
    columns1 = df1.columns
    if len(original_columns) > len(columns1):
        print("Columns missing!!")
    elif len(original_columns) < len(columns1):
        print("Extra Columns")
    else:
        for i in columns1:
            if i not in original_columns:
                print("Column names are incorrect")
So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
      'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the ones that contain a 3, to get something like:
df2 = {'c': ['A','C'],
       'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!
df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
c x
0 A [1, 2, 3]
2 C [1, 3]
The code below should help you.
To create the correct dataframe:
df = pd.DataFrame({'c': ['A','B','C','D'],
                   'x': [[1,2,3],[2],[1,3],[1,2,5]]})
To filter the rows whose lists contain 3:
df[df.x.apply(lambda x: 3 in x)==True]
Output:
c x
0 A [1, 2, 3]
2 C [1, 3]
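As a side note, a plain list comprehension builds the same boolean mask and can be worth timing against apply on larger data; a hedged alternative rather than a claim about which is faster:

# equivalent boolean mask built with a list comprehension
df[[3 in lst for lst in df.x]]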
I'm using Python 3.6 and I'm a newbie so thanks in advance for your patience.
I have a function that sums the difference between 3 points. It should then take the 'differences' and concatenate them with another DataFrame called labels. k and length are integers. I expected the resulting DataFrame to have two columns but it only has one.
Sample Code:
def distance(df1, df2, labels, k, length):
    total_dist = 0
    for i in range(length):
        dist_dif = df1.iloc[:,i] - df2.iloc[:,i]
        sq_dist = dist_dif ** 2
        root_dist = sq_dist ** 0.5
        total_dist = total_dist + root_dist
    return total_dist
    distance_df = pd.concat([total_dist, labels], axis=1)
    distance_df.sort(ascending=False, axis=1, inplace=True)
    top_knn = distance_df[:k]
    return top_knn.value_counts().index.values[0]
Sample Data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': [0, 1,1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
I expected the data to look something like this:
total_dist labels
0 1.715349 0
1 2.872991 1
2 4.344087 1
but instead it looks like this:
0 1.715349
1 4.344087
2 2.872991
dtype: float64
The output doesn't do the following:
1. Return the labels column data
2. Sort the data in descending order
If someone could point me in the right direction, I'd truly appreciate it.
Given two DataFrames, df1 - df2 performs the subtraction element-wise. Use abs() to take the absolute value of that difference, and finally sum each row. That explains the first command in the following function; the other lines are similar to your code.
import numpy as np
import pandas as pd
def calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels):
    diff = np.sum(np.abs(df1 - df2), axis=1)  # np.sum(..., axis=1) sums the rows
    diff.name = 'total_abs_distance'  # not really necessary, but lets us refer to it later
    diff = pd.concat([diff, labels], axis=1)
    diff.sort_values(by='total_abs_distance', axis=0, ascending=True, inplace=True)
    return diff
So for your example data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': ['a', 'b', 'c']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels)
We get hopefully what you wanted:
total_abs_distance Survived
0 1.71 a
2 3.87 c
1 4.34 b
A few notes:
Did you really want the L1-norm? If you wanted the L2-norm (Euclidean distance), then replace the first command in that function above by np.sqrt(np.sum(np.square(df1-df2),axis=1)).
What's the purpose of those labels? Consider using the index of the DataFrames instead. Maybe it will fit your purposes better? For example:
# lbl_series = pd.Series(['a','b','c'], name='Survived') # Try this later instead of lbl_list, to further explore the wonders of Pandas indexes :)
lbl_list = ['a', 'b', 'c']
df1.index = lbl_list
df2.index = lbl_list
# Then the L1-norm is simply this:
np.sum(np.abs(df1 - df2), axis=1).sort_values()
# Whose output is the Series: (with the labels as its index)
a 1.71
c 3.87
b 4.34
dtype: float64
I'm looking for a concise way to do arithmetics on a single dimension of a DataArray, and then have the result returned as a new DataArray (both the changed and unchanged parts). In pandas, I would do this using df.subtract(), but I haven't found the way to do this with xarray.
Here's how I would subtract the value 2 from the x dimension in pandas:
data = np.arange(0,6).reshape(2,3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
For xarray though I don't know:
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x' , 'y'])
da2 = ?
In xarray, you can subtract from the rows or columns of an array by using broadcasting by dimension name.
For example:
>>> foo = xarray.DataArray([[1, 2, 3], [4, 5, 6]], dims=['x', 'y'])
>>> bar = xarray.DataArray([1, 4], dims='x')
# subtract along 'x'
>>> foo - bar
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
[0, 1, 2]])
Dimensions without coordinates: x, y
>>> baz = xarray.DataArray([1, 2, 3], dims='y')
# subtract along 'y'
>>> foo - baz
<xarray.DataArray (x: 2, y: 3)>
array([[0, 0, 0],
[3, 3, 3]])
Dimensions without coordinates: x, y
This works similar to axis='columns' vs axis='index' options that pandas provides, except the desired dimension is referenced by name.
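For comparison, a hedged sketch of the pandas counterpart (assuming the same numbers as foo above; subtract accepts a Series plus an axis keyword):

import pandas as pd

foo_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
# like `foo - bar` above: subtract [1, 4] row-wise, aligned on the index
print(foo_df.subtract(pd.Series([1, 4]), axis='index'))
# like `foo - baz` above: subtract [1, 2, 3] column-wise
print(foo_df.subtract(pd.Series([1, 2, 3]), axis='columns'))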
When you do:
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
You really are just subtracting 2 from the entire dataset...
Here is your output from above:
In [15]: df1
Out[15]:
0 1 2
0 0 1 2
1 3 4 5
In [16]: df2
Out[16]:
0 1 2
0 -2 -1 0
1 1 2 3
Which is equivalent to:
df3 = df1.subtract(2)
In [20]: df3
Out[20]:
0 1 2
0 -2 -1 0
1 1 2 3
And equivalent to:
df4 = df1 -2
In [22]: df4
Out[22]:
0 1 2
0 -2 -1 0
1 1 2 3
Therefore, for an xarray data array:
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x' , 'y'])
da2 = da1-2
In [24]: da1
Out[24]:
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
[3, 4, 5]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 0 1
In [25]: da2
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[-2, -1, 0],
[ 1, 2, 3]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 0 1
Now, if you would like to subtract from a specific column, that's a different problem, which I believe would require assignment indexing.
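A minimal sketch of that idea, assuming da1 from above and that the target is the column whose y coordinate equals 0; label-based assignment with .loc works on DataArrays:

da3 = da1.copy()
# subtract 2 only from the column whose 'y' coordinate is 0
da3.loc[{'y': 0}] = da3.loc[{'y': 0}] - 2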
Is there a way to make my loop work without errors when there is no next value? Or to avoid using a for loop for this entirely?
Inside this function below I have another function with a for loop:
def funcA(self, perc, bloc):
    def funcA1(self):
        maxIndex = len(self)
        localiz = self.loc
        for x in range(0, maxIndex - 1):
            if localiz[x, bloc] == localiz[x + 1, bloc]:
                localiz[x, "CALC"] = True
            else:
                localiz[x, "CALC"] = False
        return self
I got it working by first creating the "CALC" column filled with False, since the last row of my df will always be False. But surely there is a better way.
EDIT
I'm basically using pandas and numpy for this code.
The bloc argument that I'm using in the function is the ID column.
The data structure I'm working with is like this:
ID NUMBER
2 100
2 150
3 500
4 100
4 200
4 250
And the expected results are:
ID NUMBER CALC
2 100 True
2 150 False
3 500 False
4 100 True
4 200 True
4 250 False
A Pythonic way is this:
lst = [char for char in 'abcdef']
print(lst)
for i, (cur, nxt) in enumerate(zip(lst, lst[1:])):
    print(i, cur, nxt)
Just note that cur only runs to the second-to-last element of lst.
This will print:
['a', 'b', 'c', 'd', 'e', 'f']
0 a b
1 b c
2 c d
3 d e
4 e f
i is the index in lst of the cur element.
lst[1:] creates a new list excluding the first element; if your lists are very long, you may consider replacing that part with itertools.islice, so that no additional copy is made.
This also works if your arr is an n-dimensional numpy array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], np.int32)
print(arr)
for i, (cur, nxt) in enumerate(zip(arr, arr[1:])):
    print(i, cur, nxt)
with output:
[[1 2 3]
[4 5 6]
[7 8 9]]
0 [1 2 3] [4 5 6]
1 [4 5 6] [7 8 9]
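Applied to the DataFrame from the question, the same pairwise idea might look like this sketch (assuming the ID/NUMBER frame shown above; the trailing False covers the last row, which has no successor):

import pandas as pd

df = pd.DataFrame({'ID': [2, 2, 3, 4, 4, 4],
                   'NUMBER': [100, 150, 500, 100, 200, 250]})
ids = df['ID'].tolist()
# compare each ID with the next one; pad the last row with False
df['CALC'] = [cur == nxt for cur, nxt in zip(ids, ids[1:])] + [False]
print(df)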
Because I'm not familiar with the vector-style solutions that numpy offers, I don't think I was able to make the most of the proposed solution.
I did find a way to overcome the loop I was using though:
def funcA(self, perc, bloc):
    def new_funcA1(self):
        df = self[[bloc]]
        self['shift'] = df.shift(-1)
        self['CALC'] = self[bloc] == self['shift']
        self.drop('shift', axis=1, inplace=True)
        return self
With pandas.DataFrame.shift(-1), the last row becomes NaN, and comparing against NaN yields False. This way I don't have to make any adjustments for the first or last row, and I got rid of the loop!
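For what it's worth, the helper column could be skipped entirely; a hedged one-line variant of the same idea (NaN from the shift compares as unequal, so the last row comes out False):

# compare each value with the next directly, no temporary column
self['CALC'] = self[bloc].eq(self[bloc].shift(-1))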