I have loaded a CSV file into a pandas DataFrame, and it looks like the example below:
c1 c2 c3
0 12 22 66
1 44 78 AB
2 DF WE 13
I want to write this DataFrame to a .txt file in the following format:
c1="12",c2="22",c3="66"
c1="44",c2="78",c3="AB"
c1="DF",c2="WE",c3="13"
Each row should be on a new line in the text file.
Please help me find the best way to do this.
Convert the values to strings, append " with DataFrame.add, prepend =" with DataFrame.radd, and then prepend the column names the same way:
df = df.astype(str).add('"').radd('="').radd(df.columns.to_series())
#alternative
#df = ('="' + df.astype(str) + '"').radd(df.columns.to_series())
print (df)
c1 c2 c3
0 c1="12" c2="22" c3="66"
1 c1="44" c2="78" c3="AB"
2 c1="DF" c2="WE" c3="13"
And then use DataFrame.to_csv:
import csv
df.to_csv('file.txt', header=None, index=False, quoting=csv.QUOTE_NONE)
c1="12",c2="22",c3="66"
c1="44",c2="78",c3="AB"
c1="DF",c2="WE",c3="13"
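If you would rather not depend on the CSV quoting settings, a minimal sketch of an alternative is to build each line with plain string formatting and write the lines yourself. This is not the approach above, just another way to get the same file; the sample frame and the file name out.txt are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Sample data mirroring the question (illustrative values only).
df = pd.DataFrame(
    {"c1": ["12", "44", "DF"], "c2": ["22", "78", "WE"], "c3": ["66", "AB", "13"]}
)

# Build one name="value" pair per cell and join the pairs with commas.
lines = [
    ",".join(f'{col}="{val}"' for col, val in row.items())
    for row in df.astype(str).to_dict("records")
]

# Write one row per line; "out.txt" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.txt")
    with open(out_path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    with open(out_path) as fh:
        written = fh.read()
```

For a few thousand rows either approach should be fine; the vectorised add/radd version is likely the better fit for very large frames.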
I have a bunch of .csv files with the same column headers and data types in the columns.
c1 c2 c3
1 5 words
2 6 words
3 7 words
4 8 words
Is there a way to combine all the text in c3 within each .csv file, and then combine the results into one CSV?
I combined the files this way:
path = r'C:\\Users\\...\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
I'm not sure how to combine the text rows first and then bring them together.
There are many ways to do it. One way:
from glob import iglob
import pandas as pd
path = r'C:\\Users\\...\**\*.csv'
all_rec = iglob(path, recursive=True)
# Extract only c3 column from files
dataframes = {f: pd.read_csv(f, usecols=['c3']) for f in all_rec}
# Group all dataframes then combine text rows of each dataframe
big_dataframe = pd.concat(dataframes).groupby(level=0)['c3'] \
.apply(lambda x: ' '.join(x.tolist())).reset_index(drop=True)
Output:
>>> big_dataframe
0 words words words words
1 words2 words2 words2 words2
2 words3 words3 words3 words3
Name: c3, dtype: object
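To try this without touching real data, here is a small self-contained sketch of the same idea. It writes two throwaway CSV files into a temporary directory (the file names and colour words are made up) and joins c3 per file; combine_c3 is a hypothetical helper name.

```python
import os
import tempfile
from glob import iglob

import pandas as pd


def combine_c3(pattern):
    """Return one joined c3 string per CSV file matched by the glob pattern."""
    files = sorted(iglob(pattern, recursive=True))
    # Keying the dict by file name gives pd.concat an outer index level per file.
    frames = {f: pd.read_csv(f, usecols=["c3"]) for f in files}
    joined = (
        pd.concat(frames)
        .groupby(level=0)["c3"]
        .apply(lambda s: " ".join(s.astype(str)))
    )
    return joined.reset_index(drop=True)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    pd.DataFrame({"c1": [1, 2], "c2": [5, 6], "c3": ["red", "green"]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False
    )
    pd.DataFrame({"c1": [3, 4], "c2": [7, 8], "c3": ["blue", "pink"]}).to_csv(
        os.path.join(tmp, "sub", "b.csv"), index=False
    )
    result = combine_c3(os.path.join(tmp, "**", "*.csv"))
```

If you want to keep track of which file each string came from, drop the reset_index call so the file path stays as the index.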
My Excel spreadsheet has the form below.
A B
Part 1- 20210910 55
Part 2- 20210829 45
Part 3- 20210912 2
I would like to take the strings from column A, such as "Part 1- 20210910", but have pandas read them as dates, e.g. "2021/09/10". How could I implement this?
IIUC:
df['A'] = pd.to_datetime(df['A'].str.extract(r'(\d{8})', expand=False), format='%Y%m%d')
print(df)
# Output:
A B
0 2021-09-10 55
1 2021-08-29 45
2 2021-09-12 2
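The question asks for the 2021/09/10 form specifically. A minimal sketch, assuming the eight digits are always a valid YYYYMMDD date: parse them into real datetimes first, then format them as text only where you need that display form. The A_text column name is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"A": ["Part 1- 20210910", "Part 2- 20210829", "Part 3- 20210912"], "B": [55, 45, 2]}
)

# Pull out the eight-digit run; expand=False returns a Series, not a DataFrame.
digits = df["A"].str.extract(r"(\d{8})", expand=False)

# Parse with an explicit format so nothing is guessed.
df["A"] = pd.to_datetime(digits, format="%Y%m%d")

# Optional: a text column in YYYY/MM/DD form for display or export.
df["A_text"] = df["A"].dt.strftime("%Y/%m/%d")
```

Keeping column A as datetimes means sorting and date arithmetic still work; the slash-separated text is only a presentation detail.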
My beginner way of doing it:
import pandas as pd
df = pd.read_excel('file_name.xlsx')
# take the part after the dash, strip the leading space, then parse it as a date
df['A'] = pd.to_datetime(df['A'].str.split('-').str[1].str.strip(), format='%Y%m%d')
I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO
with open("mydf.txt") as f:
    next(f)
    List=[]
    for line in f:
        df = pd.read_fwf(StringIO(line), header=None)
        df.rename(columns = {df.columns[1]: "Pos"}, inplace=True)
        f2basename = df.iloc[:, 0].values[0]
        f2 = f2basename + "_depth-file.txt"
        df2 = pd.read_csv(f2, sep='\t')
        df = pd.merge(df, df2, on="Pos", how="left")
        List.append(df)
df = pd.concat(List, sort=False)
with open("mydf.txt") as f: opens the file to which I wish to add data
next(f) skips the header
List=[] creates a new empty list called List
for line in f: goes over mydf.txt line by line, reading each line with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) restores the lost header name for the Pos column, which is used later when merging the line with the associated file f2
f2basename = df.iloc[:, 0].values[0] gets the basename of the associated file f2 from the 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt" builds the full name of the associated file f2
df2 = pd.read_csv(f2, sep='\t') reads file f2
df = pd.merge(df, df2, on="Pos", how="left") merges the two on column Pos, essentially adding the Depth column to the line from mydf.txt
List.append(df) adds the modified line to List
df = pd.concat(List, sort=False) concatenates the elements of List into a DataFrame df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but it should be faster to read your 'mydf.txt' file into a DataFrame with read_csv as well, and then use groupby with apply.
If you know in advance that you have 3 samples and 3 corresponding depth files, you can build a dictionary that reads and stores the three DataFrames up front and use them when needed.
import pandas as pd
# raw strings avoid the invalid-escape warning for '\s'
df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {basename : pd.read_csv(basename + "_depth-file.txt", sep=r'\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x : pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
Sample Pos Depth
Sample
A 0 A 5602 52.0
1 A 3069483 40.0
B 0 B 51948 NaN
C 0 C 231 NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
Sample Pos Depth
0 A 5602 52.0
1 A 3069483 40.0
2 B 51948 NaN
3 C 231 NaN
EDIT after comments
Since you have a lot of files, you can use the following solution. It is the same idea, but it does not require reading all the files in advance: each file is read only when it is needed.
def merging_depth(x):
    # x.name is the group key, i.e. the sample name
    td = pd.read_csv(x.name + "_depth-file.txt", sep=r'\s+')
    return pd.merge(x, td, on="Pos", how="left")
res = df.groupby('Sample').apply(merging_depth)
The result is the same.
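If you prefer to avoid groupby().apply() altogether, the same read-each-file-once idea can be sketched with a plain loop over the groups. This is only a sketch under the file naming shown in the question; add_depth is a made-up helper name, and the tiny depth files are generated here just so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def add_depth(df, directory):
    """Left-merge each sample's rows with that sample's depth file."""
    parts = []
    for sample, group in df.groupby("Sample", sort=False):
        # Each depth file is read exactly once, when its sample comes up.
        path = os.path.join(directory, f"{sample}_depth-file.txt")
        depth = pd.read_csv(path, sep=r"\s+")
        parts.append(group.merge(depth, on="Pos", how="left"))
    return pd.concat(parts, ignore_index=True)


# Tiny stand-in depth files, using the values from the desired output.
depths = {
    "A": {5602: 52, 3069483: 40},
    "B": {51948: 32},
    "C": {231: 47},
}
df = pd.DataFrame(
    {"Sample": ["A", "A", "B", "C"], "Pos": [5602, 3069483, 51948, 231]}
)

with tempfile.TemporaryDirectory() as tmp:
    for sample, table in depths.items():
        with open(os.path.join(tmp, f"{sample}_depth-file.txt"), "w") as fh:
            fh.write("Pos\tDepth\n")
            fh.writelines(f"{pos}\t{depth}\n" for pos, depth in table.items())
    res = add_depth(df, tmp)
```

The speed-up over the original loop comes from reading each ~4 million line file once per sample rather than once per row of mydf.txt.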
I ran into a problem when trying to convert JSON data to a DataFrame using the pandas and json packages.
My raw data from the JSON file looks like:
{"Number":41,"Type":["A1","A2","A3","A4","A5"],"Percent":{"Very Good":1.2,"Good":2.1,"OK":1.1,"Bad":1.3,"Very Bad":1.7}}
And my code is:
import pandas as pd
import json
with open('Test.json', 'r') as filename:
    json_file = json.load(filename)
df = pd.DataFrame(json_file['Type'], columns=['Type'])
When I read only Type from the JSON file, I get the correct result:
Type
0 A1
1 A2
2 A3
3 A4
4 A5
However, when I read only Number from the JSON file:
df =pd.DataFrame(json_file['Number'],columns=['Number'])
I get the error: ValueError: DataFrame constructor not properly called!
Also, if I use:
df = pd.DataFrame.from_dict(json_file)
I get the error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I have done some research on Google, but still cannot figure out why.
My goal is to break this JSON data into two DataFrames. The first one combines Number and Type:
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
The other DataFrame I want holds the data in Percent, which may look like:
Very Good 1.2
Good 2.1
OK 1.1
Bad 1.3
Very Bad 1.7
This should give your desired output:
with open('Test.json', 'r') as filename:
    json_file = json.load(filename)
df = pd.DataFrame({'Number': json_file.get('Number'), 'Type': json_file.get('Type')})
df2 = pd.DataFrame({'Percent': json_file.get('Percent')})
Output (df, then df2):
Number Type
0 41 A1
1 41 A2
2 41 A3
3 41 A4
4 41 A5
Percent
Bad 1.3
Good 2.1
OK 1.1
Very Bad 1.7
Very Good 1.2
You can generalize this into a function:
def json_to_df(d, ks):
    return pd.DataFrame({k: d.get(k) for k in ks})
df = json_to_df(json_file, ['Number', 'Type'])
If you wish to avoid using the json package, you can do it directly in pandas:
_df = pd.read_json('Test.json', typ='series')
df = pd.DataFrame.from_dict(dict(_df[['Number', 'Type']]))
df2 = pd.DataFrame.from_dict(_df['Percent'], orient='index', columns=['Percent'])
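As for why the original attempt failed: the first error comes from passing a bare scalar (41) as the data, whereas the DataFrame constructor expects something list-like or dict-like. A minimal sketch that parses the sample JSON from the question as an inline string (so no file is needed) and shows the working forms; number_only is just an illustrative name:

```python
import json

import pandas as pd

# The JSON document from the question, inlined as a string.
raw = (
    '{"Number":41,"Type":["A1","A2","A3","A4","A5"],'
    '"Percent":{"Very Good":1.2,"Good":2.1,"OK":1.1,"Bad":1.3,"Very Bad":1.7}}'
)
data = json.loads(raw)

# A bare scalar is not valid DataFrame data; wrapping it in a list is.
number_only = pd.DataFrame([data["Number"]], columns=["Number"])

# In a dict of columns, the scalar is broadcast to the length of the list.
df = pd.DataFrame({"Number": data["Number"], "Type": data["Type"]})

# A Series keeps the Percent keys as the index, in their original order.
df2 = pd.Series(data["Percent"], name="Percent").to_frame()
```

The broadcast only works because Type is list-like; a dict made up entirely of scalars would need an explicit index.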
I have multiple csv files that I read into individual data frames based on their name in the directory, like so
import os
from functools import reduce
import pandas as pd
# ask user for path
path = input('Enter the path for the csv files: ')
os.chdir(path)
# loop over filenames and read into individual dataframes
for fname in os.listdir(path):
    if fname.endswith('Demo.csv'):
        demoRaw = pd.read_csv(fname, encoding = 'utf-8')
    if fname.endswith('Key2.csv'):
        keyRaw = pd.read_csv(fname, encoding = 'utf-8')
Then I filter to only keep certain columns
# filter to keep desired columns only
demo = demoRaw.filter(['Key', 'Sex', 'Race', 'Age'], axis=1)
key = keyRaw.filter(['Key', 'Key2', 'Age'], axis=1)
Then I create a list of the above dataframes and use reduce to merge them on Key
# create list of data frames for combined sheet
dfs = [demo, key]
# merge the list of data frames on the Key
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
Then I drop the auto generated index column, create an Excel writer and write to an Excel file
# drop the auto generated index column
combined.set_index('Key', inplace=True)
# create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('final.xlsx', engine='xlsxwriter')
# write to Excel
combined.to_excel(writer, sheet_name='Combined')
meds.to_excel(writer, sheet_name='Meds')
# Close the Pandas Excel writer and output the Excel file.
# (writer.save() was removed in pandas 2.0; close() saves the file)
writer.close()
The problem is some files have keys that aren't in others. For example
Demo file
Key Sex Race Age
1 M W 52
2 F B 25
3 M L 78
Key file
Key Key2 Age
1 7325 52
2 4783 25
3 1367 78
4 9435 21
5 7247 65
Right now, it will only include rows if there is a matching key in each (in other words it just leaves out the rows with keys not in the other files). How can I combine all rows from all files, even if keys don't match? So the end result will look like this
Key Sex Race Age Key2 Age
1 M W 52 7325 52
2 F B 25 4783 25
3 M L 78 1367 78
4 9435 21
5 7247 65
I don't care if the empty cells are blanks, NaN, #N/A, etc. Just as long as I can identify them.
Replace
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)
with:
combined = pd.merge(demo, key, how='outer', on='Key')
You have to specify how='outer' to keep every row from both the Key and Demo tables. The default is an inner join, which keeps only the keys present in both.
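If you later have more than two frames and want to keep the reduce pattern from the question, a minimal sketch (using the sample tables from the question, typed in by hand) is to pass how='outer' inside the lambda:

```python
from functools import reduce

import pandas as pd

# Sample tables copied from the question.
demo = pd.DataFrame(
    {"Key": [1, 2, 3], "Sex": ["M", "F", "M"], "Race": ["W", "B", "L"], "Age": [52, 25, 78]}
)
key = pd.DataFrame(
    {"Key": [1, 2, 3, 4, 5], "Key2": [7325, 4783, 1367, 9435, 7247], "Age": [52, 25, 78, 21, 65]}
)

dfs = [demo, key]

# An outer join keeps keys that appear in any frame; missing cells become NaN.
combined = reduce(
    lambda left, right: pd.merge(left, right, on="Key", how="outer"), dfs
)
```

Because both tables have an Age column, pandas renames them Age_x and Age_y in the result; pass suffixes=... to pd.merge if you want different names.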