Merge data frames based on column with different rows - python-3.x

I have multiple csv files that I read into individual data frames based on their name in the directory, like so
# ask user for path
path = input('Enter the path for the csv files: ')
os.chdir(path)
# loop over filenames and read into individual dataframes
for fname in os.listdir(path):
if fname.endswith('Demo.csv'):
demoRaw = pd.read_csv(fname, encoding = 'utf-8')
if fname.endswith('Key2.csv'):
keyRaw = pd.read_csv(fname, encoding = 'utf-8')
Then I filter to only keep certain columns
# filter to keep desired columns only
demo = demoRaw.filter(['Key', 'Sex', 'Race', 'Age'], axis=1)
key = keyRaw.filter(['Key', 'Key', 'Age'], axis=1)
Then I create a list of the above dataframes and use reduce to merge them on Key
# create list of data frames for combined sheet
dfs = [demo, key]
# merge the list of data frames on the Key
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
Then I drop the auto generated column, create an Excel writer and write to a csv
# drop the auto generated index colulmn
combined.set_index('RecordKey', inplace=True)
# create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('final.xlsx', engine='xlsxwriter')
# write to csv
combined.to_excel(writer, sheet_name='Combined')
meds.to_excel(writer, sheet_name='Meds')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
The problem is some files have keys that aren't in others. For example
Demo file
Key Sex Race Age
1 M W 52
2 F B 25
3 M L 78
Key file
Key Key2 Age
1 7325 52
2 4783 25
3 1367 78
4 9435 21
5 7247 65
Right now, it will only include rows if there is a matching key in each (in other words it just leaves out the rows with keys not in the other files). How can I combine all rows from all files, even if keys don't match? So the end result will look like this
Key Sex Race Age Key2 Age
1 M W 52 7325 52
2 F B 25 4783 25
3 M L 78 1367 78
4 9435 21
5 7247 65
I don't care if the empty cells are blanks, NaN, #N/A, etc. Just as long as I can identify them.

Replace combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs) With: combined=pd.merge(demo,key, how='outer', on='Key') You will have to specificy the 'outer' to join both the full table of Key and Demo

Related

Pandas load data frame in to txt file with column value pair

I have loaded csv file into pandas data frame and it looks like below example
c1 c2 c3
0 12 22 66
1 44 78 AB
2 DF WE 13
I want this data frame to be loaded in a .txt file with below format
c1="12",c2="22",c3="66"
c1="44",c2="78",c3="AB"
c1="DF",c2="WE",c3="13"
Each row should be on new line in the text file.
Please help me out to find out how it can be done in a best optimal way.
Convert values to strings, append " by DataFrame.add, prepend = by DataFrame.radd and also prepend columns names:
df = df.astype(str).add('"').radd('="').radd(df.columns.to_series())
#alternative
#df = ('="' + df.astype(str) + '"').radd(df.columns.to_series())
print (df)
c1 c2 c3
0 c1="12" c2="22" c3="66"
1 c1="44" c2="78" c3="AB"
2 c1="DF" c2="WE" c3="13"
And then use DataFrame.to_csv:
import csv
df.to_csv('file.txt', header=None, index=False, quoting=csv.QUOTE_NONE)
c1="12",c2="22",c3="66"
c1="44",c2="78",c3="AB"
c1="DF",c2="WE",c3="13"

Is there a way to export multiple pandas Dataframes in different sheet names using "to_csv" [duplicate]

I need to Export or save pandas Multiple Dataframe in an excel in different tabs?
Let's suppose my df's is:
df1:
Id Name Rank
1 Scott 4
2 Jennie 8
3 Murphy 1
df2:
Id Name Rank
1 John 14
2 Brown 18
3 Claire 11
df3:
Id Name Rank
1 Shenzen 84
2 Dass 58
3 Ghouse 31
df4:
Id Name Rank
1 Zen 104
2 Ben 458
3 Susuie 198
These are my four Dataframes and I need to Export as an Excel with 4 tabs i.e, df1,df2,df3,df4.
A simple method would be to hold your items in a collection and use the pd.ExcelWriter Class
Lets use a dictionary.
#1 Create a dictionary with your tab name and dataframe.
dfs = {'df1' : df1, 'df2' : df2...}
#2 create an excel writer object.
writer = pd.ExcelWriter('excel_file_name.xlsx')
#3 Loop over your dictionary write and save your excel file.
for name,dataframe in dfs.items():
dataframe.to_excel(writer,name,index=False)
writer.save()
adding a path.
from pathlib import Path
trg_path = Path('your_target_path')
writer = pd.ExcelWriter(trg_path.joinpath('excel_file.xlsx'))
Using xlsxwriter, you could do something like the following:
import xlsxwriter
import pandas as pd
### Create df's here ###
writer = pd.ExcelWriter('C:/yourFilePath/example.xslx', engine='xlsxwriter')
workbook = writer.book
### First df tab
worksheet1 = workbook.add_worksheet({}.format('df1') # The value in the parentheses is the tab name, so you can make that dynamic or hard code it
row = 0
col = 0
for Name, Rank in (df1):
worksheet.write(row, col, Name)
worksheet.write(row, col + 1, Rank)
row += 1
### Second df tab
worksheet2 = workbook.add_worksheet({}.format('df2')
row = 0
col = 0
for Name, Rank in (df2):
worksheet.write(row, col, Name)
worksheet.write(row, col + 1, Rank)
row += 1
### as so on for as many tabs as you want to create
workbook.close()
xlsxwriter allows you to do a lot of formatting as well. If you want to do that check out the docs

combine rows of single column over multiple .csv files in pandas

I have a bunch of .csv files with the same column headers and data types in the columns.
c1 c2 c3
1 5 words
2 6 words
3 7 words
4 8 words
is there a way to combine all the text in c3 in each .csv file then combine them into one csv?
I combined them this way
path = r'C:\\Users\\...\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
i'm not sure how to combine the text rows first then bring them together.
There are many way to do it. One way:
path = r'C:\\Users\\...\**\*.csv'
all_rec = iglob(path, recursive=True)
# Extract only c3 column from files
dataframes = {f: pd.read_csv(f, usecols=['c3']) for f in all_rec}
# Group all dataframes then combine text rows of each dataframe
big_dataframe = pd.concat(dataframes).groupby(level=0)['c3'] \
.apply(lambda x: ' '.join(x.tolist())).reset_index(drop=True)
Output:
>>> big_dataframe
0 words words words words
1 words2 words2 words2 words2
2 words3 words3 words3 words3
Name: c3, dtype: object

Pandas: new column using data from multiple other file

I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO
with open("mydf.txt") as f:
next(f)
List=[]
for line in f:
df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True)
f2basename = df.iloc[:, 0].values[0]
f2 = f2basename + "_depth-file.txt"
df2 = pd.read_csv(f2, sep='\t')
df = pd.merge(df, df2, on="Pos", how="left")
List.append(df)
df = pd.concat(List, sort=False)
with open("mydf.txt") as f: to open the file to which I wish to add data
next(f) to pass the header
List=[] to create a new empty array called List
for line in f: to go over mydf.txt line by line and reading them with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) to rename lost header name for Pos column, used later when merging line with associated file f2
f2basename = df.iloc[:, 0].values[0] getting basename of associated file f2 based on 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt"to get full associated file f2 name
df2 = pd.read_csv(f2, sep='\t') to read file f2
df = pd.merge(df, df2, on="Pos", how="left")to merge the two files on column Pos, essentially adding Depth column to mydf.txt
List.append(df)adding modified line to the array List
df = pd.concat(List, sort=False) to concatenate elements of the List array into a dataframe df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but should be faster if you read your 'mydf.txt' file in a dataframe too using read_csv and then use groupby and groupby apply.
If you know in advance that you have 3 samples and 3 relative files storing the depth, you can make a dictionary to read and store the three respective dataframes in advance and use them when needed.
df = pd.read_csv('mydf.txt', sep='\s+')
files = {basename : pd.read_csv(basename + "_depth-file.txt", sep='\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x : pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
Sample Pos Depth
Sample
A 0 A 5602 52.0
1 A 3069483 40.0
B 0 B 51948 NaN
C 0 C 231 NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
Sample Pos Depth
0 A 5602 52.0
1 A 3069483 40.0
2 B 51948 NaN
3 C 231 NaN
EDIT after comments
Since you have a lot of files, you can use the following solution: same idea, but it does not require to read all the files in advance. Each file will be read when needed.
def merging_depth(x):
td = pd.read_csv(x.name + "_depth-file.txt", sep='\s+')
return pd.merge(x, td, on="Pos", how="left")
res = df.groupby('Sample').apply(merging_depth)
The result is the same.

Appending values to a column in a loop

I have various files containing data. I want to extract one specific column from each file and create a new dataframe with one column containing all the extracted data.
So for example I have 3 files:
A B C
1 2 3
4 5 6
A B C
7 8 9
8 7 6
A B C
5 4 3
2 1 0
The new dataframe should only contain the values from column C:
C
3
6
9
6
3
0
So the column of the first file should be copied to the new dataframe, the column from the second file should be appendend to the new dataframe.
My code looks like this so far:
import pandas as pd
import glob
for filename in glob.glob('*.dat'):
df= pd.read_csv(filename, delimiter="\t", header=6)
df1= df["Bias"]
print(df)
Now df1 is overwritten in each loop step. Would it be a good idea to create a temporary dataframe in each loop step and then copy the data to the new dataframe?
Any input is appreciated!
Use list comprehension or for loop with append for list of DataFrames and if need only some columns add parameter usecols, last concat all together for big DataFrame:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in glob.glob('*.dat')]
Or:
dfs = []
for filename in glob.glob('*.dat'):
df = pd.read_csv(filename, delimiter="\t", header=6, usecols=['C'])
#if need all columns
#df = pd.read_csv(filename, delimiter="\t", header=6)
dfs.append(df)
df = pd.concat(dfs, ignore_index=True)

Resources