Accumulate column through pandas - python-3.x

I have multiple tab-delimited files, all with the same entries. I want to read each file and use the first column as the index. My final table should have that first column as the index, mapped against the last column from each of the files. I wrote some pandas code for this, but it is not great. Is there an alternative way to do this?
import pandas as pd
df1 = pd.read_csv("FB_test.tsv",sep='\t')
df1_idx = df1.set_index('target_id')
df1_idx.drop(df1_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df1_idx)
df2 = pd.read_csv("Myc_test.tsv",sep='\t')
df2_idx = df2.set_index('target_id')
df2_idx.drop(df2_idx[['length','eff_length','est_counts']],inplace=True, axis=1)
print(df2_idx)
frames = [df1_idx, df2_idx]
results = pd.concat(frames, axis=1)
results
The output it generated was,
tpm
target_id
A 0
B 0
C 0
D 0
E 0
tpm
target_id
A 1
B 1
C 1
D 1
E 1
Out[18]:
target_id tpm tpm
A 0 1
B 0 1
C 0 1
D 0 1
E 0 1
How can I loop over the files so that I read each one and get this same output?
Thanks,
AP

I think you can use the index_col and usecols parameters of read_csv in a list comprehension. This gives duplicate column names, which makes selecting columns a problem, so it is better to add the keys parameter to concat; after flattening the resulting MultiIndex you get unique column names:
files = ["FB_test.tsv", "Myc_test.tsv"]
dfs = [pd.read_csv(f, sep='\t', index_col=['target_id'], usecols=['target_id','tpm'])
       for f in files]
results = pd.concat(dfs, axis=1, keys=('a','b'))
results.columns = results.columns.map('_'.join)
results = results.reset_index()
print (results)
target_id a_tpm b_tpm
0 A 0 1
1 B 0 1
2 C 0 1
3 D 0 1
4 E 0 1

To clean the code and use a looping mechanism, you can put both your file names and the columns you are dropping in two separate lists, and then use list comprehension on the file names to import each dataset. Subsequently, you concatenate the output of the list comprehension into one dataframe:
import pandas as pd
drop_cols = ['length','eff_length','est_counts']
filenames = ["FB_test.tsv", "Myc_test.tsv"]
results = pd.concat([pd.read_csv(filename,sep='\t').set_index('target_id').drop(drop_cols, axis=1) for filename in filenames], axis=1)
I hope this helps.
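As a rough sketch of the same loop that also labels each column by the file it came from: the helper names `collect_column` and `fake_file`, and the in-memory sample data, are assumptions made for illustration and stand in for the real `FB_test.tsv` and `Myc_test.tsv`.

```python
import io
from pathlib import Path

import pandas as pd


def collect_column(sources, index_col="target_id", value_col="tpm"):
    """Read value_col from each tab-delimited source, aligned on index_col.

    sources maps a column label to a path or file-like object.
    """
    columns = {
        label: pd.read_csv(
            src, sep="\t", index_col=index_col, usecols=[index_col, value_col]
        )[value_col]
        for label, src in sources.items()
    }
    # Concatenating a dict of Series uses the dict keys as column names.
    return pd.concat(columns, axis=1)


def fake_file(tpm):
    """Build an in-memory stand-in for one input file (illustrative data only)."""
    header = "\t".join(["target_id", "length", "eff_length", "est_counts", "tpm"])
    rows = ["\t".join([name, "10", "5", "0", str(tpm)]) for name in "ABCDE"]
    return io.StringIO("\n".join([header, *rows]) + "\n")


files = ["FB_test.tsv", "Myc_test.tsv"]
# With real files on disk this would be {Path(f).stem: f for f in files}.
sources = {Path(f).stem: fake_file(tpm) for tpm, f in enumerate(files)}
results = collect_column(sources)
print(results)
```

Using the file stem as the column label avoids the duplicate `tpm` column names without a separate renaming step.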

Related

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent inclusive ranges: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484
In "row 0" and "row 1", the range for one of the sets is backwards (1931,1930,1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, since I will sometimes need to query more than just two numbers. Using Python 3.8.
Example Dataframe
x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
'1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
'2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
'2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
'2409:2414,2377:2378,1478:1484',
'2474:2476',
]
y = [6.48,7.02,7.02,6.55,5.99,6.39,]
df = pd.DataFrame(list(zip(x, y)), columns =['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs two steps:
1. Given the ids_string that contains the ranges of ids, list all the ids as ids_num_list
2. Check whether the query_id is in ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to a set of every id it covers
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # Compare the bounds as integers, not strings, so that reversed
            # ranges and bounds with different digit counts sort correctly
            lower, upper = sorted(int(n) for n in ids_range.split(':'))
            ids_num_list.update(range(lower, upper + 1))
        else:
            ids_num_list.add(int(ids_range))
    # Check if the query number is in the set
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x: check_num_in_ids_string(x, query_id))
    )
which returns what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0
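When many numbers need to be queried, one possible variation is to expand each row into a set once and reuse it for every query, rather than re-parsing the string per query. The sketch below assumes a hypothetical helper `expand_ids` and a shortened, made-up dataframe; the range `998:1001` is included to show that bounds are compared as integers.

```python
import pandas as pd


def expand_ids(ids_string):
    """Expand a string such as '1931:1929,1310' into the set of integers it covers."""
    ids = set()
    for part in ids_string.split(","):
        # Convert to int before ordering so that reversed ranges and
        # bounds with different digit counts are handled.
        bounds = sorted(int(p) for p in part.split(":"))
        # A single id yields one bound, so first and last are the same value.
        ids.update(range(bounds[0], bounds[-1] + 1))
    return ids


df = pd.DataFrame({
    "ids": ["1331:1332,1931:1928", "2340:2341,1310,1311", "998:1001"],
    "val": [6.48, 7.02, 5.99],
})

# Expand each row once, then test every query against the cached sets.
expanded = df["ids"].apply(expand_ids)
for query_id in [2340, 1930, 1000]:
    df[f"n{query_id}"] = expanded.apply(lambda s: int(query_id in s))
print(df)
```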

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, the row index 2 would be removed because row index 0 has the information in columns A and B, and X in column C
As this data is slightly large, I hope to avoid iterating over rows, if possible. Ignore Index is the closest thing I've found to the built-in drop_duplicates().
If there is no X in column C then the row should require that C is identical to be deduplicated.
In the case in which there are matching A and B in a row, but have multiple versions of having an X in C, the following would be expected.
df = pd.DataFrame(columns=["A","B","C"],
data = [[1,2,"0X0"],
[1,2,"X00"],
[1,2,"0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that marks rows whose (A, B) pair has not appeared before. Then combine Series.str.contains and Series.duplicated on column C to create a mask m2 that marks rows where C contains the string X and that value of C has not appeared before. Finally, filter the rows of df with m1 | m2.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
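To check the two masks against both sample frames from the question, they can be wrapped in a small function. This is only a sketch: the name `dedup_keep_x` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def dedup_keep_x(df, key_cols=("A", "B"), flag_col="C", marker="X"):
    """Keep the first row per key, plus marker rows with a not-yet-seen flag value."""
    first_of_key = ~df[list(key_cols)].duplicated()
    has_marker = df[flag_col].str.contains(marker, regex=False)
    first_of_value = ~df[flag_col].duplicated()
    return df[first_of_key | (has_marker & first_of_value)]


df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]])
df2 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]])

print(dedup_keep_x(df1))
print(dedup_keep_x(df2))
```

Because the first row of every (A, B) group is always kept, the result depends on row order: if the row without X came first in its group, it would be kept alongside the X row.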
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

Pandas: new column using data from multiple other file

I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO
with open("mydf.txt") as f:
next(f)
List=[]
for line in f:
df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True)
f2basename = df.iloc[:, 0].values[0]
f2 = f2basename + "_depth-file.txt"
df2 = pd.read_csv(f2, sep='\t')
df = pd.merge(df, df2, on="Pos", how="left")
List.append(df)
df = pd.concat(List, sort=False)
with open("mydf.txt") as f: to open the file to which I wish to add data
next(f) to pass the header
List=[] to create a new empty array called List
for line in f: to go over mydf.txt line by line and reading them with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) to rename lost header name for Pos column, used later when merging line with associated file f2
f2basename = df.iloc[:, 0].values[0] getting basename of associated file f2 based on 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt" to get full associated file f2 name
df2 = pd.read_csv(f2, sep='\t') to read file f2
df = pd.merge(df, df2, on="Pos", how="left") to merge the two files on column Pos, essentially adding Depth column to mydf.txt
List.append(df) adding modified line to the array List
df = pd.concat(List, sort=False) to concatenate elements of the List array into a dataframe df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but it should be faster if you also read your 'mydf.txt' file into a dataframe using read_csv and then use groupby with apply.
If you know in advance that you have 3 samples and 3 relative files storing the depth, you can make a dictionary to read and store the three respective dataframes in advance and use them when needed.
df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {basename : pd.read_csv(basename + "_depth-file.txt", sep=r'\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x : pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
Sample Pos Depth
Sample
A 0 A 5602 52.0
1 A 3069483 40.0
B 0 B 51948 NaN
C 0 C 231 NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
Sample Pos Depth
0 A 5602 52.0
1 A 3069483 40.0
2 B 51948 NaN
3 C 231 NaN
EDIT after comments
Since you have a lot of files, you can use the following solution: same idea, but it does not require reading all the files in advance. Each file is read only when needed.
def merging_depth(x):
    # x.name is the Sample value of the current group
    td = pd.read_csv(x.name + "_depth-file.txt", sep=r'\s+')
    return pd.merge(x, td, on="Pos", how="left")
res = df.groupby('Sample').apply(merging_depth)
The result is the same.
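A different technique from the groupby approach above, offered only as a sketch, is to build one lookup table from the depth files and do a single merge on both Sample and Pos. The function name `add_depth`, its `depth_dir` and `suffix` parameters, and the tiny temporary files are assumptions for illustration; the depth values are the ones shown in the question.

```python
import tempfile
from pathlib import Path

import pandas as pd


def add_depth(df, depth_dir, suffix="_depth-file.txt"):
    """Attach a Depth column by reading each sample's depth file once."""
    tables = []
    for sample in df["Sample"].unique():
        depth = pd.read_csv(Path(depth_dir) / f"{sample}{suffix}", sep=r"\s+")
        # Keep only the positions requested for this sample to limit memory use.
        wanted = df.loc[df["Sample"] == sample, "Pos"]
        tables.append(depth[depth["Pos"].isin(wanted)].assign(Sample=sample))
    lookup = pd.concat(tables, ignore_index=True)
    # A left merge keeps every row of df, in its original order.
    return df.merge(lookup, on=["Sample", "Pos"], how="left")


# Tiny stand-in files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "A_depth-file.txt").write_text("Pos\tDepth\n1\t31\n5602\t52\n3069483\t40\n")
    Path(tmp, "B_depth-file.txt").write_text("Pos\tDepth\n51948\t32\n")
    Path(tmp, "C_depth-file.txt").write_text("Pos\tDepth\n231\t47\n")
    df = pd.DataFrame({"Sample": ["A", "A", "B", "C"],
                       "Pos": [5602, 3069483, 51948, 231]})
    res = add_depth(df, tmp)
print(res)
```

Each depth file is read once per sample rather than once per row, and filtering before concatenation keeps the lookup table small even when the files have millions of positions.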

Remove duplicated permuted rows in Pandas

I have one Pandas DF with three columns like below:
City1 City2 Totalamount
0 A B 1000
1 A C 2000
2 B A 1000
3 B C 500
4 C A 2000
5 C B 500
I want to delete the duplicated rows where (city1,city2) =(city2,city1). The result should be
City1 City2 Totalamount
0 A B 1000
1 A C 2000
2 B C 500
I tried
res=DFname.drop(DFname[(DFname.City1,DFname.City2) == (DFname.City2,DFname.City1)].index)
but it's giving an error.
Could you please help?
Thanks
You sort first, then drop the duplicates:
import numpy as np
cols = ['City1', 'City2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
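If the original City1/City2 order of the kept rows should be preserved, a possible variant is to sort the pairs in a separate frame and use it only as a mask. This sketch rebuilds the sample from the question; note that, unlike the snippet above, it ignores Totalamount when deciding what counts as a duplicate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City1": list("AABBCC"),
    "City2": list("BCACAB"),
    "Totalamount": [1000, 2000, 1000, 500, 2000, 500],
})

cols = ["City1", "City2"]
# Order each pair alphabetically in a separate frame so df itself is not modified.
pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index, columns=cols)
# Keep the first row seen for every unordered pair.
result = df[~pairs.duplicated()].reset_index(drop=True)
print(result)
```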
If the entire dataframe follows the pattern you show in your sample, where:
All rows are duplicated like (A, B) and (B, A)
There are no unpaired entries
City1 and City2 are always different (no instances of (A, A))
then you can simply do
df = df[df['City1'] < df['City2']]
If the sample is not representative of your whole dataframe, please include a sample that is.

Pandas writing in csv file as columns not rows-Python

This is my code:
import os
import pandas as pd
file=[]
directory ='/Users/xxxx/Documents/sample/'
for i in os.listdir(directory):
    file.append(i)
Com = list(file)
df=pd.DataFrame(data=Com)
df.to_csv('com.csv', index=False, header=True)
print('done')
At the moment I am getting all the values of i in one column, one per row. Does anyone know how to put all the i values in a single row, as column headers?
You need to transpose the df first using .T prior to writing out to csv:
In [44]:
l = list('abc')
df = pd.DataFrame(l)
df
Out[44]:
0
0 a
1 b
2 c
compare with:
In [45]:
df = pd.DataFrame(l).T
df
Out[45]:
0 1 2
0 a b c
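One thing to watch when writing the transposed frame with header=True is that to_csv also writes the integer column labels (0, 1, 2) as a first line. The sketch below shows two ways to get only the names on a single line; the list `names` is a made-up stand-in for the result of os.listdir(directory), and an in-memory buffer replaces 'com.csv'.

```python
import io

import pandas as pd

names = ["a.txt", "b.txt", "c.txt"]  # stand-in for os.listdir(directory)

# Option 1: the names become the header of an otherwise empty table.
header_only = io.StringIO()
pd.DataFrame(columns=names).to_csv(header_only, index=False)

# Option 2: the names become a single data row with no header line.
single_row = io.StringIO()
pd.DataFrame([names]).to_csv(single_row, index=False, header=False)

print(header_only.getvalue())
print(single_row.getvalue())
```

Both buffers end up holding one comma-separated line of names; passing a file path instead of the buffer writes the same content to disk.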

Resources