I have various files containing data. I want to extract one specific column from each file and create a new dataframe with one column containing all the extracted data.
So for example I have 3 files:
File 1:
A B C
1 2 3
4 5 6

File 2:
A B C
7 8 9
8 7 6

File 3:
A B C
5 4 3
2 1 0
The new dataframe should only contain the values from column C:
C
3
6
9
6
3
0
So the column of the first file should be copied to the new dataframe, the column from the second file should be appended to the new dataframe, and so on.
My code looks like this so far:
import pandas as pd
import glob
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6)
    df1 = df["Bias"]
    print(df)
Now df1 is overwritten in each loop step. Would it be a good idea to create a temporary dataframe in each loop step and then copy the data to the new dataframe?
Any input is appreciated!
Use a list comprehension, or a for loop with append, to build a list of DataFrames; if you only need some columns, add the usecols parameter. Finally, concat everything together into one big DataFrame:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in glob.glob('*.dat')]
Or:
dfs = []
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6, usecols=['C'])
    # if you need all columns:
    # df = pd.read_csv(filename, delimiter="\t", header=6)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
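One caveat worth noting: glob.glob makes no guarantee about file order, so if the row order of the combined DataFrame matters, wrapping the pattern in sorted keeps it deterministic:

dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in sorted(glob.glob('*.dat'))]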
I have an excel file which I need to convert using python pandas.
I want to create a file for every 5 rows, i.e. if I have 29 rows in an Excel file, I want to create 6 files in total: the first 5 files with 5 rows each and the last file with 4 rows. Can anyone help please?
You can read the whole excel file like this:
df = pd.read_excel(filename)
Then, you can split this df in batches of 5 rows like this:
n = 5 #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]
list_df will have 6 chunks in your case: 5 of them with 5 rows each and the 6th with 4 rows.
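If you then want each chunk written to its own Excel file, as the question asks, a minimal sketch (the file names file_1.xlsx and so on are just a guess at what you want):

for i, chunk in enumerate(list_df, start=1):
    chunk.to_excel(f'file_{i}.xlsx', index=False)  # one file per 5-row chunk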
You can use the code below. c is just a counter, x is the number of files you will need, and the output files will be named file_1.xlsx and so on:
import pandas as pd
import numpy as np
import math
df = pd.read_excel('path_to_your_file.xlsx') # create original df
c = 1
x = math.ceil(df.shape[0]/5)
for i in np.array_split(df, x):
    filename = 'file_' + str(c)
    pd.DataFrame(i).to_excel(filename + '.xlsx', index=False)
    c += 1
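As a quick sanity check on the arithmetic: for 29 rows, x = math.ceil(29/5) = 6, and np.array_split(df, 6) hands back five chunks of 5 rows and one chunk of 4 rows, exactly the split the question asks for.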
I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO
with open("mydf.txt") as f:
    next(f)
    List = []
    for line in f:
        df = pd.read_fwf(StringIO(line), header=None)
        df.rename(columns={df.columns[1]: "Pos"}, inplace=True)
        f2basename = df.iloc[:, 0].values[0]
        f2 = f2basename + "_depth-file.txt"
        df2 = pd.read_csv(f2, sep='\t')
        df = pd.merge(df, df2, on="Pos", how="left")
        List.append(df)

df = pd.concat(List, sort=False)
with open("mydf.txt") as f: to open the file to which I wish to add data
next(f) to skip the header
List=[] to create a new empty list called List
for line in f: to go over mydf.txt line by line, reading each one with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) to restore the lost header name for the Pos column, used later when merging the line with the associated file f2
f2basename = df.iloc[:, 0].values[0] to get the basename of the associated file f2 from the 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt" to build the full name of the associated file f2
df2 = pd.read_csv(f2, sep='\t') to read file f2
df = pd.merge(df, df2, on="Pos", how="left") to merge the two files on column Pos, essentially adding the Depth column to mydf.txt
List.append(df) to add the modified line to the list List
df = pd.concat(List, sort=False) to concatenate the elements of List into a dataframe df
Additional NOTES
In reality, I may need to search not just three files but several hundred.
I didn't test the execution time, but it should be faster if you read your 'mydf.txt' file into a dataframe too, using read_csv, and then use groupby with apply.
If you know in advance that you have 3 samples and 3 corresponding files storing the depth, you can build a dictionary that reads and stores the three respective dataframes up front, and use them when needed:
df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {basename : pd.read_csv(basename + "_depth-file.txt", sep=r'\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x : pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
Sample Pos Depth
Sample
A 0 A 5602 52.0
1 A 3069483 40.0
B 0 B 51948 NaN
C 0 C 231 NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
Sample Pos Depth
0 A 5602 52.0
1 A 3069483 40.0
2 B 51948 NaN
3 C 231 NaN
EDIT after comments
Since you have a lot of files, you can use the following solution: the same idea, but it does not require reading all the files in advance. Each file is read only when needed.
def merging_depth(x):
    td = pd.read_csv(x.name + "_depth-file.txt", sep=r'\s+')
    return pd.merge(x, td, on="Pos", how="left")
res = df.groupby('Sample').apply(merging_depth)
The result is the same.
I am sorry for asking a naive question, but it's driving me crazy at the moment. I have a dataframe df1 and create a new dataframe df2 from it, as follows:
import pandas as pd
def NewDF(df):
    df['sum'] = df['a'] + df['b']
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df1)
df2 = NewDF(df1)
print(df1)
which gives
a b
0 1 4
1 2 5
2 3 6
a b sum
0 1 4 5
1 2 5 7
2 3 6 9
Why am I losing df1's original shape and getting a third column? How can I avoid this?
DataFrames are mutable so you should either explicitly pass a copy to your function, or have the first step in your function copy the input. Otherwise, just like with lists, any modifications your functions make also apply to the original.
Your options are:
def NewDF(df):
    df = df.copy()
    df['sum'] = df['a'] + df['b']
    return df

df2 = NewDF(df1)
or
df2 = NewDF(df1.copy())
Here we can see that everything in your original implementation refers to the same object:

import pandas as pd

def NewDF(df):
    print(id(df))
    df['sum'] = df['a'] + df['b']
    print(id(df))
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(id(df1))
#2242099787480

df2 = NewDF(df1)
#2242099787480
#2242099787480

print(id(df2))
#2242099787480
The third column that you are getting is the Index column, each pandas DataFrame will always maintain an Index, however you can choose if you don't want it in your output.
import pandas as pd
def NewDF(df):
    df['sum'] = df['a'] + df['b']
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df1.to_string(index=False))
df2 = NewDF(df1)
print(df1.to_string(index=False))
Gives the output as
a b
1 4
2 5
3 6
a b sum
1 4 5
2 5 7
3 6 9
Now you might wonder why the index exists at all. The index is backed by a hash table, which speeds up lookups and is a highly desirable feature in many contexts. If this was just a one-off question, the above should be enough; but if you are looking to learn more about pandas, I would advise you to read up on indexing. You can begin here: https://stackoverflow.com/a/27238758/10953776
I have a large file, imported into a single dataframe in Pandas.
I'm using pandas to split up a file into many segments, by the number of rows in the dataframe.
eg: 10 rows:
file 1 gets [0:4]
file 2 gets [5:9]
Is there a way to do this without having to create more dataframes?
Assign a new column g here; you just need to specify how many items you want in each group. Here I am using 3.
df.assign(g=df.index//3)
Out[324]:
0 g
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 3
and you can call df[df.g==1] to get the group you need
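If the end goal is one file per group, a short sketch building on the same idea (the chunk size 3 and the file names are placeholders to adapt):

import pandas as pd

df = pd.DataFrame({0: range(1, 11)})        # same 10-row example frame
for g, chunk in df.groupby(df.index // 3):  # group labels 0, 1, 2, 3
    chunk.to_csv(f'group_{g}.csv', index=False)  # write each group to its own file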
There are two ways of doing this. I believe you are looking for the former. Basically, we open a series of csv writers, then we write to the correct csv writer by using some basic math with the index, then we close all files.
A single DataFrame evenly divided into N number of CSV files
import pandas as pd
import csv, math
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
fileOpens = [open(f"out{i}.csv","w") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
for i, row in df.iterrows():
    fileWriters[math.floor((i/df.shape[0])*NUMBER_OF_SPLITS)].writerow(row.tolist())

for file in fileOpens:
    file.close()
More than one DataFrame evenly divided into N number of CSV files
import pandas as pd
import numpy as np
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values for 10 rows
NUMBER_OF_SPLITS = 2
for i, new_df in enumerate(np.array_split(df, NUMBER_OF_SPLITS)):
    with open(f"out{i}.csv", "w") as fo:
        fo.write(new_df.to_csv())
Use numpy.array_split to split your dataframe dfX and save it in N csv files of roughly equal size, dfX_1.csv to dfX_N.csv:
import numpy as np

N = 10
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)
Iterating over iloc's arguments will do the trick.
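A minimal sketch of that idea, assuming a chunk size of 5 and output names part_0.csv onwards:

import pandas as pd

df = pd.DataFrame({'x': range(10)})  # placeholder 10-row frame
n = 5                                # rows per output file
for i, start in enumerate(range(0, len(df), n)):
    df.iloc[start:start + n].to_csv(f'part_{i}.csv', index=False)  # slice rows with iloc, write each slice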
This is my code:
import os
import pandas as pd

file = []
directory = '/Users/xxxx/Documents/sample/'
for i in os.listdir(directory):
    file.append(i)

Com = list(file)
df = pd.DataFrame(data=Com)
df.to_csv('com.csv', index=False, header=True)
print('done')
At the moment I am getting all the values of i in one column, one value per row. Does anyone know how to make each i value a column header, i.e. all of them in one row?
You need to transpose the df first using .T prior to writing out to csv:
In [44]:
l = list('abc')
df = pd.DataFrame(l)
df
Out[44]:
0
0 a
1 b
2 c
compare with:
In [45]:
df = pd.DataFrame(l).T
df
Out[45]:
0 1 2
0 a b c
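Applied to the original code, the same transpose trick would look roughly like this (a sketch; the directory is the asker's placeholder path):

import os
import pandas as pd

directory = '/Users/xxxx/Documents/sample/'
Com = list(os.listdir(directory))  # one entry per file name
# transpose so the names form a single row, which becomes the first (header) line of the csv
pd.DataFrame(Com).T.to_csv('com.csv', index=False, header=False)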