Create duplicate column in pandas dataframe - python-3.x

I want to duplicate a column whose name starts with a numeric character, e.g. 1stfloor.
In simple terms, I want to convert column 1stfloor to FirstFloor.
df
1stfloor
456
784
746
44
9984
I tried the code below:
df['FirstFloor'] = df['1stfloor']
but got the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Expected output:
df
FirstFloor
456
784
746
44
9984

df['FirstFloor'] = df['1stfloor']
df['FirstFloor'] = df.loc[:, '1stfloor']
Both worked!
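The message quoted in the question is pandas' SettingWithCopyWarning, which usually means df was itself produced by slicing another DataFrame. A minimal sketch, assuming the data from the question (the names source, other and renamed are made up for illustration): take an explicit copy before adding the column, or use rename if the goal is only to change the column name.

```python
import pandas as pd

# Hypothetical parent frame; in the question, df was presumably sliced from something similar
source = pd.DataFrame({"1stfloor": [456, 784, 746, 44, 9984], "other": [1, 2, 3, 4, 5]})

# An explicit copy makes df independent of source before a column is added
df = source[["1stfloor"]].copy()
df["FirstFloor"] = df["1stfloor"]

# If only the name should change, rename returns a new frame without the old column
renamed = source[["1stfloor"]].rename(columns={"1stfloor": "FirstFloor"})
```

The first variant keeps both columns; the second leaves a single column called FirstFloor, which matches the expected output in the question.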

Related

Separate content the single column into multiple column?

I am working on a project to convert a PDF file into a table using tabula in Python. While scanning, tabula detects the table, but one column comes out merged as shown below, while the actual table looks like picture_2.
Is there any method in Python to split the single column into separate columns, as in the second picture?
You need to use str.split with expand=True.
example:
>>> import pandas as pd
>>> df = pd.DataFrame([["Purchase Balance"],["138 303"]])
>>> df
                  0
0  Purchase Balance
1           138 303
>>> df[0].str.split(" ", expand=True)
          0        1
0  Purchase  Balance
1       138      303
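As a small follow-up sketch, the split pieces can be given names and attached back to the original frame. The column names raw, left and right are assumptions chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"raw": ["Purchase Balance", "138 303"]})

# expand=True returns one column per piece instead of a column of lists
parts = df["raw"].str.split(" ", expand=True)
parts.columns = ["left", "right"]

# Attach the new columns next to the original one
result = pd.concat([df, parts], axis=1)
```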

Pandas: new column using data from multiple other file

I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO
with open("mydf.txt") as f:
    next(f)
    List=[]
    for line in f:
        df = pd.read_fwf(StringIO(line), header=None)
        df.rename(columns = {df.columns[1]: "Pos"}, inplace=True)
        f2basename = df.iloc[:, 0].values[0]
        f2 = f2basename + "_depth-file.txt"
        df2 = pd.read_csv(f2, sep='\t')
        df = pd.merge(df, df2, on="Pos", how="left")
        List.append(df)
df = pd.concat(List, sort=False)
with open("mydf.txt") as f: opens the file to which I wish to add data
next(f) skips the header
List=[] creates a new empty list called List
for line in f: goes over mydf.txt line by line, reading each line with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) restores the lost header name for the Pos column, used later when merging the line with its associated file f2
f2basename = df.iloc[:, 0].values[0] gets the basename of the associated file f2 from the 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt" builds the full name of the associated file f2
df2 = pd.read_csv(f2, sep='\t') reads file f2
df = pd.merge(df, df2, on="Pos", how="left") merges the two on column Pos, essentially adding the Depth column to the line from mydf.txt
List.append(df) adds the modified line to List
df = pd.concat(List, sort=False) concatenates the elements of List into a dataframe df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but it should be faster if you also read your 'mydf.txt' file into a dataframe using read_csv and then use groupby with apply.
If you know in advance that you have 3 samples and 3 corresponding files storing the depth, you can build a dictionary that reads and stores the three dataframes up front and use them when needed.
df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {basename: pd.read_csv(basename + "_depth-file.txt", sep=r'\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x: pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
          Sample      Pos  Depth
Sample
A      0       A     5602   52.0
       1       A  3069483   40.0
B      0       B    51948    NaN
C      0       C      231    NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
  Sample      Pos  Depth
0      A     5602   52.0
1      A  3069483   40.0
2      B    51948    NaN
3      C      231    NaN
EDIT after comments
Since you have a lot of files, you can use the following solution. It is the same idea, but it does not require reading all the files in advance: each file is read only when it is needed.
def merging_depth(x):
    # x.name is the group key, i.e. the sample name
    td = pd.read_csv(x.name + "_depth-file.txt", sep=r'\s+')
    return pd.merge(x, td, on="Pos", how="left")
res = df.groupby('Sample').apply(merging_depth)
The result is the same.
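A variation on the same idea, offered as a sketch rather than a tested replacement: read each depth file once, keep only the positions that sample needs, and finish with a single merge on both Sample and Pos. The helper name add_depth, its directory parameter and the tiny depth files written to a temporary directory are all assumptions made for illustration; only the file naming pattern and the example values come from the question.

```python
import os
import tempfile
import pandas as pd

def add_depth(df, directory="."):
    """Attach a Depth column by reading each sample's depth file once."""
    pieces = []
    for sample, group in df.groupby("Sample"):
        path = os.path.join(directory, f"{sample}_depth-file.txt")
        depth = pd.read_csv(path, sep=r"\s+")
        # Keep only the positions this sample actually needs
        depth = depth[depth["Pos"].isin(group["Pos"])]
        pieces.append(depth.assign(Sample=sample))
    wanted = pd.concat(pieces, ignore_index=True)
    # A single left merge on both keys keeps the row order of df
    return df.merge(wanted, on=["Sample", "Pos"], how="left")

# Demonstration with tiny made-up depth files in a temporary directory
df = pd.DataFrame({"Sample": ["A", "A", "B", "C"],
                   "Pos": [5602, 3069483, 51948, 231]})
toy_files = {"A": "Pos Depth\n1 31\n5602 52\n3069483 40\n",
             "B": "Pos Depth\n1 30\n51948 32\n",
             "C": "Pos Depth\n231 47\n232 45\n"}
with tempfile.TemporaryDirectory() as tmp:
    for name, text in toy_files.items():
        with open(os.path.join(tmp, f"{name}_depth-file.txt"), "w") as handle:
            handle.write(text)
    result = add_depth(df, directory=tmp)
```

Filtering before concatenating keeps memory use close to the size of df, which matters when there are several hundred files of millions of lines each.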

Using a function to replace cell values in a column

I have a fairly large DataFrame (22000 x 29). I want to clean up one particular column for data aggregation. A number of cell values can be replaced by a single value. I would like to write a function to accomplish this using the replace function. How do I pass the column name to the function?
I tried passing the column name as a variable to the function.
Of course, I could do this variable by variable, but that would be tedious.
#replace in df from list
def replaceCell(mylist,myval,mycol,mydf):
    for i in range(len(mylist)):
        mydf.mycol.replace(to_replace=mylist[i],value=myval,inplace=True)
    return mydf
replaceCell((c1,c2,c3,c4,c5,c6,c7),c0,'SCity',cimsBid)
cimsBid is the DataFrame and SCity is the column in which I want values to be changed.
Error message:
AttributeError: 'DataFrame' object has no attribute 'mycol'
Try accessing your column as:
mydf[mycol]
in this command:
mydf.mycol.replace(to_replace=mylist[i],value=myval,inplace=True)
Attribute access (mydf.mycol) looks for a column literally named mycol; it does not use the value stored in the variable. You need to go through the indexing operator [] instead:
mydf[mycol].replace(to_replace=mylist[i],value=myval,inplace=True)
There are a few more caveats, quoted from the pandas documentation:
Warning
You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See the Python language reference for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
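The difference between the two access styles can be seen in a short sketch. The frame below is made up for illustration; only the names mycol and SCity come from the question.

```python
import pandas as pd

df = pd.DataFrame({"SCity": ["A", "B"], "min": [1, 2]})
mycol = "SCity"

# Bracket indexing uses the value stored in the variable
by_variable = df[mycol]

# Attribute access looks for a column literally called "mycol" and fails
try:
    df.mycol
    attribute_worked = True
except AttributeError:
    attribute_worked = False

# A column named like a method is hidden by that method under attribute access
min_is_method = callable(df.min)
min_column = df["min"]
```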
Try this function; hopefully it will work:
def replace_values(replace_dict,mycol,mydf):
    mydf = mydf.replace({mycol: replace_dict})
    return mydf
Pass the replacement values as a dictionary.
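A usage sketch for the dictionary approach, with a made-up frame: the nested form {column: {old: new}} limits the replacement to the named column, so matching values elsewhere are left alone.

```python
import pandas as pd

def replace_values(replace_dict, mycol, mydf):
    # Nested dict form: {column: {old_value: new_value}}
    return mydf.replace({mycol: replace_dict})

# Illustrative data; "A" also appears in the second column on purpose
df = pd.DataFrame({"SCity": ["A", "D", "B"], "value": ["A", "x", "y"]})
result = replace_values({"A": "c0", "B": "c0"}, "SCity", df)
```

Because replace returns a new frame here, the original df is unchanged and the result has to be assigned to a name.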
Address the column as a string.
You should pass the whole list of values you want to replace (to_replace) and a list of new values (value). (Don't use tuples.)
If you want to replace all values with the same new value, it might be best to do it like this:
def replaceCell(mylist,myval,mycol,mydf):
    mydf[mycol].replace(to_replace=mylist,value=myval,inplace=True)
    return mydf
# example dataframe
df = pd.DataFrame( {'SCity':['A','D','D', 'B','C','A','B','D'] ,
                    'value':[23, 42,76,34,87,1,52,94]})
# replace the 'SCity' column with a new value
mylist = list(df['SCity'])
myval = ['c0']*len(mylist)
df = replaceCell(mylist,myval,'SCity',df)
# the output
df
  SCity  value
0    c0     23
1    c0     42
2    c0     76
3    c0     34
4    c0     87
5    c0      1
6    c0     52
7    c0     94
This returns the df, with the replaced values.
If you intend to only change a few values, you can do this in a loop.
def replaceCell2(mylist,myval,mycol,mydf):
    for i in range(len(mylist)):
        mydf[mycol].replace(to_replace=mylist[i],value=myval,inplace=True)
    return mydf
# example dataframe
df = pd.DataFrame( {'SCity':['A','D','D', 'B','C','A','B','D'] ,
                    'value':[23, 42,76,34,87,1,52,94]})
# Only entries with value 'A' or 'B' will be replaced by 'c0'
mylist = ['A','B']
myval = 'c0'
df = replaceCell2(mylist,myval,'SCity',df)
# the output
df
  SCity  value
0    c0     23
1     D     42
2     D     76
3    c0     34
4     C     87
5    c0      1
6    c0     52
7     D     94
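The functions above call replace with inplace=True on a selected column. Recent pandas versions warn about that pattern, and with copy-on-write it may not update the original frame. A sketch of an assignment-based alternative follows; the name replace_cells is made up here, and the loop is dropped because replace accepts a list of values to replace together with a single new value.

```python
import pandas as pd

def replace_cells(mylist, myval, mycol, mydf):
    # Assign the result back instead of using inplace=True on a selection
    mydf[mycol] = mydf[mycol].replace(to_replace=mylist, value=myval)
    return mydf

df = pd.DataFrame({"SCity": ["A", "D", "D", "B", "C", "A", "B", "D"],
                   "value": [23, 42, 76, 34, 87, 1, 52, 94]})

# Only entries with value 'A' or 'B' are replaced by 'c0'
df = replace_cells(["A", "B"], "c0", "SCity", df)
```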

Python3: Adding multi column rows in empty pandas dataframe

My Code
import pandas as pd
data = pd.read_csv('input_file', header = None, delimiter="\t", names = ['chr', 'sTSS', 'eTSS', 'gene', 'clust1', 'clust2'])
row_filter_column_clust2_1 = pd.DataFrame(columns = data.columns, index=data.index)
row_filter_column_clust2_1.append(data.loc[0]) #Row is not appended
print(row_filter_column_clust2_1) #Nothing is printed
Problem description
I want to add multi-column rows from the imported file (input_file, see below) to an empty pandas DataFrame using .loc.
input_file
chr2 166760255 166760255 Cse1l_tss10 52 5426
chr2 166760282 166760282 Cse1l_tss9 52 5426
chr2 166885599 166886548 IRF8 150.18 5431
chr2 166885925 166885925 Znfx1_tss1 52 5433
Expected Output
chr2 166760255 166760255 Cse1l_tss10 52 5426
append does not modify the DataFrame in place; it returns a new one. What you are most probably after is that returned result, so store it in some (probably the same) variable:
row_filter_column_clust2_1 = row_filter_column_clust2_1.append(data.loc[0])
One more detail, regarding the following line:
row_filter_column_clust2_1 = pd.DataFrame(columns = data.columns, index=data.index)
Do not pass the index when creating an empty dataframe; doing so adds a row of NaN entries for every index value.
Also, .loc selects by index label, so data.loc[0] fails if no row has the label 0. To select the first row irrespective of the index, use .iloc.
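Note that DataFrame.append was removed in pandas 2.0, so on current versions the same result needs pd.concat. A minimal sketch, using an inline string as a stand-in for input_file (the variable names text, selected and result are made up for illustration):

```python
from io import StringIO
import pandas as pd

# Inline stand-in for input_file: the first two rows from the question
text = ("chr2\t166760255\t166760255\tCse1l_tss10\t52\t5426\n"
        "chr2\t166760282\t166760282\tCse1l_tss9\t52\t5426\n")
data = pd.read_csv(StringIO(text), header=None, delimiter="\t",
                   names=["chr", "sTSS", "eTSS", "gene", "clust1", "clust2"])

# Collect the wanted rows as one-row DataFrames, then concatenate once
selected = [data.iloc[[0]]]
result = pd.concat(selected)
```

Collecting the pieces in a list and concatenating once also avoids starting from an empty frame, and data.iloc[[0]] (with double brackets) keeps the row as a DataFrame with the original column layout.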

How do I make an int object into something that is subscriptable?

My specific question is whether anyone can recognize why I get the error below when I run this code or, better yet, how to fix it.
I'm trying to map the department description for a department number in df5 to the TrueDepartment column of a second data frame (df2). df2 has two columns: Department, which I want to iterate through, and TrueDepartment, which I want to write to. df5 has three columns: Dept_Nbr, Dept_Desc_HR, and Dept_Desc_AD. Dept_Nbr values run in ascending order from 1 to over 10000 with no blank rows, and there is a Dept_Desc_HR for every Dept_Nbr.
The Department column of df2 has many blank cells and many cells with values. Some values contain no numbers, others contain several numbers, and some mix digits, letters, and special characters. I want to use the cells that contain 4 or 5 consecutive digits to identify a Dept_Nbr and then write the description for that Dept_Nbr to the TrueDepartment column of df2. If the Dept_Nbr has a value in Dept_Desc_AD, I want to use that value; otherwise I want to write the contents of Dept_Desc_HD.
My code works on a sample data set, but on the full Excel spreadsheet it gives the error shown at the bottom. I appreciate any help in solving this problem and will be happy to provide the spreadsheets or any other information if needed. Thanks.
import pandas as pd
import numpy as np
import re
#reading my two data frames from 2 excel files
excel_file='/Users/j0t0174/anaconda3/Depts_sheets_withonlyAD_4columns.xlsx'
df2 = pd.read_excel(excel_file)
excel_file='/Users/j0t0174/anaconda3/dept_nbr.xlsx'
df5=pd.read_excel(excel_file)
df2=df2.replace(np.nan, "Empty",regex=True)
df5=df5.replace(np.nan, "Empty",regex=True)
numbers = df5['Dept_Nbr'].tolist()#-->adding dept_nbr's to list
df5['Dept_Nbr'] = [int(i) for i in df5['Dept_Nbr']]
df5 = df5.set_index('Dept_Nbr') #<--setting data frame 5 (df5) to the new index
for n in numbers:
    for i in range(len(df5.index)): #<--iterate through the number of elements not the elements themselves
        if str(n) == df2.loc[i, 'Department'][-4:]: #<-- convert n to str and slice df2 string for the last 4 chars
            if df5.loc[n, 'Dept_Desc_AD'] != "Empty": #<--checking against a string, not a NaN
                df2.loc[i, 'TrueDepartment'] = df5.loc[n, 'Dept_Desc_AD'] #<-- use .loc not .at
            else:
                df2.loc[i, 'TrueDepartment'] = df5.loc[n, 'Dept_Desc_HD']
TypeError Traceback (most recent call last)
<ipython-input-5-aa578c4c334c> in <module>()
17 for n in numbers:
18 for i in range(len(df5.index)): #<-- you want to iterate through the number of elements not the elements themselves
---> 19 if str(n) == df2.loc[i, 'Department'][-4:]: #<-- convert n to str and slice df2 string for the last 4 chars
20 if df5.loc[n, 'Dept_Desc_AD'] != "Empty": #<-- you're actually checking against a string, not a NaN
21 df2.loc[i, 'TrueDepartment'] = df5.loc[n, 'Dept_Desc_AD'] #<-- use .loc not .at
TypeError: 'int' object is not subscriptable
Your error is raised because
df2.loc[i, 'Department']
returns an int, which is not subscriptable. If you want the last 4 digits of this integer, convert it to a str first:
str(df2.loc[i, 'Department'])
and only then subscript it:
str(df2.loc[i, 'Department'])[-4:]
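The same conversion can be applied to the whole column at once. This is a sketch on made-up data, not the asker's spreadsheet: astype(str) turns every cell into a string so that slicing is safe, and str.extract with a regular expression pulls out the first run of 4 or 5 digits, which is closer to the matching rule described in the question.

```python
import pandas as pd

# Illustrative Department column mixing strings and a bare integer
df2 = pd.DataFrame({"Department": ["Sales 1234", 5678, "no number"]})

# Convert every cell to str first; slicing then works for all rows
as_text = df2["Department"].astype(str)
last4 = as_text.str[-4:]

# Alternatively, extract the first run of 4 or 5 consecutive digits (NaN if none)
digits = as_text.str.extract(r"(\d{4,5})", expand=False)
```

Rows without such a run come back as missing values, so they can be skipped before looking up the department description.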
