How can I delete unwanted string rows by index from a Pandas DataFrame using a function? - python-3.x

I have a DataFrame, namely 'traj', as follows:
     x  y  z
0    5  3  4
1    4  2  8
2    1  1  7
3    Some string here
4    This is spam
5    5  7  8
6    9  9  7
...  # the same strings as at indices 3 and 4 repeat many more times
79   4  3  3
80   Some string here
I'm defining a function to drop these unwanted string rows, by index, from the DataFrame. Here is what I'm trying:
def spam(names, df):  # names is a list containing, for instance, "Some" and "This" for 'traj'
    return df.drop(index=([traj[(traj.iloc[:, 0] == n)].index for n in names]))
But when I call it, it returns an error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try a single drop on its own:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.

I solved it in a different way:
df = traj[~traj.isin(names)].dropna()
where names is the list of terms you wish to delete; df will then contain only the rows without those terms.
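For reference, here is a minimal sketch of the drop-by-index approach from the question; the tiny sample frame and the names list are hypothetical. The original call most likely failed because the list comprehension produced a list of Index objects rather than a flat list of row labels; collecting the matching labels into a single Index first avoids that:

import pandas as pd

# Hypothetical reconstruction of a small 'traj' with junk string rows
traj = pd.DataFrame(
    [[5, 3, 4], [4, 2, 8], [1, 1, 7],
     ["Some", "string", "here"], ["This", "is", "spam"],
     [5, 7, 8]],
    columns=["x", "y", "z"],
)

def spam(names, df):
    # One flat Index of the row labels whose first column matches any name,
    # dropped in a single call.
    bad_rows = df[df.iloc[:, 0].isin(names)].index
    return df.drop(index=bad_rows)

traj_clean = spam(["Some", "This"], traj)
print(traj_clean)
#    x  y  z
# 0  5  3  4
# 1  4  2  8
# 2  1  1  7
# 5  5  7  8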

Related

Pandas DataFrame concat returns the same data as the first dataframe

I have this DataFrame:
PNN_sh  NN_shap  PNN_corr  NN_corr
     1    25005         1    25005
     2    25012         2    25001
     3    25011         3    25009
     4    25397         4    25445
     5    25006         5    25205
Then I made two DataFrames from this one.
NN_sh = data[['PNN_sh', 'NN_shap']]
NN_corr = data[['PNN_corr', 'NN_corr']]
Thereafter, I sorted them and saved the results in new DataFrames.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'])
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'])
Now I want to combine two columns from the two DataFrames above.
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
But what I got is just the first column copied into the second one as well:
PNN_sh  PNN_corr
     1         1
     5         5
     3         3
     2         2
     4         4
The second column should be
PNN_corr
2
1
3
5
4
Any idea how to fix it? Thanks in advance
Pass ignore_index=True to sort_values():
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'], ignore_index=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'], ignore_index=True)
Then the result after concat will be:
   PNN_sh  PNN_corr
0       1         2
1       5         1
2       3         3
3       2         5
4       4         4
I think when you sort you are preserving the original indices of the example DataFrames. Therefore, it is joining the PNN_corr value that was originally in the same row (at same index). Try resetting the index of each DataFrame after sorting, then join/concat.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index()
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index()
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
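A side note on the reset_index variant: by default, reset_index keeps the old index as an extra column. Since only 'PNN_sh' and 'PNN_corr' are selected in the concat this does no harm here, but passing drop=True keeps the sorted frames clean:

NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index(drop=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index(drop=True)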

Printing Pattern in Python

1. The Problem
Given a positive integer n, print the pattern shown in the sample outputs.
Code has already been provided. You have to understand its logic on your own and change it so that it gives the correct output.
1.1 The Specifics
Input: a positive integer n, 1 <= n <= 9
Output: Pattern as shown in examples below
Sample input:
4
Sample output:
4444444
4333334
4322234
4321234
4322234
4333334
4444444
Sample input:
5
Sample output:
555555555
544444445
543333345
543222345
543212345
543222345
543333345
544444445
555555555
2. My Answer
2.1 My Code
n = int(input())
answer = [[1]]
for i in range(2, n+1):
    t = [i]*((2*i)-3)
    answer.insert(0, t)
    answer.append(t)
    for a in answer:
        a.insert(0, i)
        a.append(i)
print(answer)
outlst = [' '.join([str(c) for c in lst]) for lst in answer]
for a in outlst:
    print(a)
2.2 My Output
Input: 4
4 4 4 4 4 4 4 4 4
4 4 3 3 3 3 3 3 3 4 4
4 4 3 3 2 2 2 2 2 3 3 4 4
4 3 2 1 2 3 4
4 4 3 3 2 2 2 2 2 3 3 4 4
4 4 3 3 3 3 3 3 3 4 4
4 4 4 4 4 4 4 4 4
2.3 Desired Output
4444444
4333334
4322234
4321234
4322234
4333334
4444444
Your answer isn't as expected because you add the same object t to the answer list twice:
answer.insert(0, t)
answer.append(t)
More specifically, when you assign t = [i]*(2*i - 3), a new data structure is created, [i, ..., i], and t just points to that data structure. Then you put the pointer t in the answer list twice.
In the for a in answer loop, when you use a.insert(0, i) and a.append(i), you update the data structure a is pointing to. Since you call insert(0, i) and append(i) on both pointers that point to the same data structure, you effectively insert and append i to that data structure twice. That's why you end up with more digits than you need.
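A minimal illustration of that aliasing behaviour, with made-up values:

t = [3, 3, 3]
rows = [t, t]               # both entries refer to the same list object
rows[0].append(9)           # mutating through one reference...
print(rows[1])              # ...shows up through the other: [3, 3, 3, 9]
print(rows[0] is rows[1])   # True

rows = [t, t.copy()]        # t.copy() creates an independent list
rows[0].append(7)           # only the first entry (the original t) changes
print(rows[1])              # still [3, 3, 3, 9]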
Instead, you could run the for a in answer loop over only the top half of the rows in the answer list (plus the middle row, which was created without a pair), e.g. for a in answer[:(len(answer)+1)//2].
Other things you could do:
using literals as the arguments instead of reusing the reference, e.g. append([i]*(2*i-3)). The literal expression will create a new data structure every time.
using a copy in one of the calls, e.g. append(t.copy()). The copy method creates a new list object with a "shallow" copy of the data structure.
Also, your output digits are space-separated, because you used a non-empty string in ' '.join(...). You should use the empty string: ''.join(...).
n = 5
answer = [[1]]
for i in range(2, n+1):
    t = [i]*((2*i)-3)
    answer.insert(0, t)
    answer.append(t.copy())   # append a copy so the two border rows are independent objects
    for a in answer:
        a.insert(0, i)
        a.append(i)
answerfinal = []
for a in answer:
    answerfinal.append(str(a).replace(' ', '').replace(',', '').replace(']', '').replace('[', ''))
for a in answerfinal:
    print(a)
n = int(input())
for i in range(1, n*2):
    for j in range(1, n*2):
        if i <= j <= n*2-i: print(n-i+1, end='')
        elif i > n and i >= j >= n*2-i: print(i-n+1, end='')
        elif j <= n: print(n-j+1, end='')
        else: print(j-n+1, end='')
    print()
n = int(input())
k = 2*n - 1
for i in range(k):
    for j in range(k):
        # a is the distance from (i, j) to the nearest edge of the k x k grid
        a = i if i < j else j
        a = a if a < k-i else k-i-1
        a = a if a < k-j else k-j-1
        print(n-a, end='')
    print()

delete columns based on index name string operation [duplicate]

This question already has answers here:
Drop columns whose name contains a specific string from pandas DataFrame
(11 answers)
Closed 3 years ago.
I have a large dataframe with a lot of columns and want to delete some based on string operations on the column names.
Consider the following example:
df_tmp = pd.DataFrame(data=[(1, 2, 3, "foo"), ("bar", 4, 5, 6), (7, "baz", 8, 9)],
                      columns=["test", "anothertest", "egg", "spam"])
Now, I would like to delete all columns where the column name contains test; I have tried to adapt answers given here (string operations on column content) and here (on addressing the name) to no avail.
df_tmp = df_tmp[~df_tmp.index.str.contains("test")]
# AttributeError: Can only use .str accessor with string values!
df_tmp[~df_tmp.name.str.contains("test")]
# AttributeError: 'DataFrame' object has no attribute 'name'
Can someone point me in the right direction?
Thanks a ton in advance. :)
This is better done with df.filter(). Given:
>>> df_tmp
  test anothertest egg spam
0    1           2   3  foo
1  bar           4   5    6
2    7         baz   8    9
Result:
1-
>>> df_tmp.loc[:, ~df_tmp.columns.str.contains("test")]
  egg spam
0   3  foo
1   5    6
2   8    9
2-
>>> df_tmp.drop(df_tmp.filter(like='test').columns, axis=1)
  egg spam
0   3  foo
1   5    6
2   8    9
3-
>>> df_tmp.drop(df_tmp.filter(regex='test').columns, axis=1)
  egg spam
0   3  foo
1   5    6
2   8    9
4-
>>> df_tmp.filter(regex='^((?!test).)*$')
  egg spam
0   3  foo
1   5    6
2   8    9
Regex explanation
'^((?!test).)*$'
^          # start matching at the beginning of the string
(?!test)   # at this position, the text ahead must not be "test"
.          # match any single character (except line breaks, unless DOTALL is on)
( ... )*   # repeat the lookahead-plus-character group for every character
$          # all the way to the end of the string
Nice explanation about regex negative lookahead
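As a quick sanity check of that pattern outside pandas (a hypothetical snippet, not from the answer above):

import re

cols = ["test", "anothertest", "egg", "spam"]
keep = [c for c in cols if re.match(r'^((?!test).)*$', c)]
print(keep)  # ['egg', 'spam']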

Join rows based on particular column value in python [duplicate]

I have a dataframe like this:
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !
Calling
In [10]: print(df.groupby("A")["B"].sum())
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
In [4]: df = read_csv(StringIO(data),sep='\s+')
In [5]: df
Out[5]:
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby:
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
   A         B           C
A
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random
sum applied to strings concatenates them by default:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can get pretty much exactly what you want:
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
To do this on the whole frame, one group at a time, the key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
   A         B               C
A
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
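Since the question's desired output is literally a set per group, the same apply idea also works with set (a small addition along the lines of the answer above):

>>> d.groupby('A')['B'].apply(set)
# each group becomes a Python set, e.g. {'This', 'string'} for group 1
# (sets are unordered, so the element order in the repr may vary)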
If you want something else, just write a function that does what you want and then apply that.
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
You could try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
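Run against the question's DataFrame, that aggregation should give roughly the following (a sketch of the expected result, not output copied from the original post):

out = df.groupby('A').agg({'B': 'sum', 'C': '-'.join})
print(out)
#           B            C
# A
# 1  1.615586  This-string
# 2  0.421821         is-!
# 3  0.463468            a
# 4  0.643961       random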
Named aggregations with pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won't get the MultiIndex columns, and the column names make more sense given the data they contain:
aggregate and get a list of strings
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', list)).reset_index()
print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]
aggregate and join the strings
grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random
A simple solution would be:
>>> df.groupby(['A','B'])['C'].unique().reset_index()
If you'd like to overwrite column B in the dataframe, this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Following @Erfan's good answer: most of the time, when analysing aggregated values, you want the unique combinations of the existing string values:
unique_chars = lambda x: ', '.join(x.unique())
(df
.groupby(['A'])
.agg({'C': unique_chars}))

Pandas Python: How do I use common data from one data frame to write to a different data frame?

I am trying to use df4's LineNum column to fill in the GeneralDescription column of df1: match on LineNum and write df4's GeneralDescription into the corresponding cell in df1. I am looking for a solution that scales to DataFrames with thousands of rows and several other inconsequential columns. I would rather not merge if it isn't absolutely necessary; I just want to write to df1's GeneralDescription column and leave the original structure of the two DataFrames unchanged. Thanks.
df1
   LineNum          Warehouse  GeneralDescription
0        2              Empty               Empty
1        3              Empty               Empty
2        4                PBS               Empty
3        5              Empty               Empty
4        6              Empty               Empty
5        7  General Liability               Empty
6        8              Empty               Empty
7        9              Empty               Empty
df4
   LineNum      GeneralDescription
0        4                TRUCKING
1        6  TRUCKING-GREENVILLE,TN
2        7         Human Resources
Desired result
   LineNum          Warehouse      GeneralDescription
0        2              Empty                   Empty
1        3              Empty                   Empty
2        4                PBS                TRUCKING
3        5              Empty                   Empty
4        6              Empty  TRUCKING-GREENVILLE,TN
5        7  General Liability         Human Resources
6        8              Empty                   Empty
7        9              Empty                   Empty
This is the code I have so far, with packages that might be helpful. As it is, I'm getting this error: KeyError: 'the label [LineNum] is not in the [index]'.
import pandas as pd
import openpyxl
import numpy as np
data= [[2,'Empty','Empty'],[3,'Empty','Empty'],[4,'PBS','Empty'],[5,'Empty','Empty'],[6,'Empty','Empty'],[7,'General Liability','Empty'],[8,'Empty','Empty'],[9,'Empty','Empty']]
df1=pd.DataFrame(data,columns=['LineNum','Warehouse','GeneralDescription'])
data4 = [[4,'TRUCKING'],[6,'TRUCKING-GREENVILLE,TN'],[7,'Human Resources']]
df4=pd.DataFrame(data4,columns=['LineNum','GeneralDescription'])
for i in range(len(df1.index)):
    if df1.loc[i,'LineNum'] == df4.loc['LineNum']:
        df1.loc[i,'GeneralDescription'] = df4.loc['GeneralDescription']
Use map with a Series created from df4, with fillna to keep the original column values where there is no match:
s = df4.set_index('LineNum')['GeneralDescription']
df1['GeneralDescription'] = df1['LineNum'].map(s).fillna(df1['GeneralDescription'])
print(df1)
   LineNum          Warehouse      GeneralDescription
0        2              Empty                   Empty
1        3              Empty                   Empty
2        4                PBS                TRUCKING
3        5              Empty                   Empty
4        6              Empty  TRUCKING-GREENVILLE,TN
5        7  General Liability         Human Resources
6        8              Empty                   Empty
7        9              Empty                   Empty
Solution with DataFrame.merge:
df = df1.merge(df4, how='left', on='LineNum', suffixes=('', '_'))
df['GeneralDescription'] = df['GeneralDescription_'].combine_first(df['GeneralDescription'])
df = df.drop('GeneralDescription_', axis=1)
print(df)
   LineNum          Warehouse      GeneralDescription
0        2              Empty                   Empty
1        3              Empty                   Empty
2        4                PBS                TRUCKING
3        5              Empty                   Empty
4        6              Empty  TRUCKING-GREENVILLE,TN
5        7  General Liability         Human Resources
6        8              Empty                   Empty
7        9              Empty                   Empty
