How to prepend a string to a column in csv - python-3.x

I have a CSV file, in.csv, with 5 columns, and I want to prepend "text" to all the data in column 2 of this same file, such that the row
data1 data2 data3 data4 data5
becomes
data1 textdata2 data3 data4 data5
I thought using regex might be a good idea, but I'm not sure how to proceed
Edit:
After proceeding according to bigbounty's answer, I used the following script:
import pandas as pd
df = pd.read_csv("in.csv")
df["id_str"] = str("text" + str(df["id_str"]))
df.to_csv("new_in.csv", index=False)
My in.csv file looks like this:
s_no,id_str,screen_name
1,1.15017060743203E+018,lorem
2,1.15015544419693E+018,ipsum
3,1.15015089995785E+018,dolor
4,1.15015054311063E+018,sit
After running the script, the new_in.csv file is:
s_no,id_str,screen_name
1,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",lorem
2,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",ipsum
3,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",dolor
4,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",sit
Whereas it should be:
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit

This can easily be done using pandas. Note that str(df["id_str"]) turns the whole Series into one string (its printed representation), which is why every row of new_in.csv ended up containing that blob; .astype(str) converts the values element-wise instead.
import pandas as pd
df = pd.read_csv("in.csv")
df["data2"] = "text" + df["data2"].astype(str)
df.to_csv("new_in.csv", index=False)

Using the csv module
import csv
with open('test.csv', 'r+', newline='') as f:
    data = list(csv.reader(f))  # produces a list of lists
    for i, r in enumerate(data):
        if i > 0:  # presumes the first list is a header and skips it
            r[1] = 'text' + r[1]  # add 'text' to the front of the value at index 1
    f.seek(0)  # go back to the beginning of the file
    writer = csv.writer(f)
    writer.writerows(data)  # write the new data back to the file
# the resulting text file
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
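If rewriting the file in place feels risky (the seek(0) rewrite above only works cleanly because prepending makes every row longer, so no stale bytes are left behind), a variant that streams the rows into a new file is a minimal sketch; the file names are placeholders:
import csv

# read rows from the original file and write the modified rows to a new file
with open('test.csv', newline='') as src, open('test_out.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))    # copy the header row unchanged
    for row in reader:
        row[1] = 'text' + row[1]     # prepend to the column at index 1
        writer.writerow(row)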
Using pandas
This solution is agnostic of column names because it uses the column index.
pandas.DataFrame.iloc
import pandas as pd
# read the file, setting the column at index 1 as str
df = pd.read_csv('test.csv', dtype={1: str})
# add text to the column at index 1
df.iloc[:, 1] = 'text' + df.iloc[:, 1]
# write to csv
df.to_csv('test.csv', index=False)
# resulting csv
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit

Related

How to avoid writing an empty row when I save a multi-header DataFrame into Excel file?

I would like to save a multi-header DataFrame as Excel file. Following is the sample code:
import pandas as pd
import numpy as np
header = pd.MultiIndex.from_product([['location1','location2'],
['S1','S2','S3']],
names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6),
index=['a','b','c','d','e'],
columns=header)
df.to_excel('result.xlsx')
There are two issues in the excel file as can be seen below:
Issue 1:
There is an empty row under headers. Please let me know how to avoid Pandas to write / insert an empty row in the Excel file.
Issue 2:
I want to save DataFrame without index. However, when I set index=False, I get the following error:
df.to_excel('result.xlsx', index=False)
Error:
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
You can create two DataFrames - one holding only the headers and one with default (integer) column names - and write both to the same sheet, using the startrow parameter for the second:
header = df.columns.to_frame(index=False)
header.loc[header['loc'].duplicated(), 'loc'] = ''
header = header.T
print (header)
             0   1   2          3   4   5
loc  location1          location2
S           S1  S2  S3         S1  S2  S3
df1 = df.set_axis(range(len(df.columns)), axis=1)
print (df1)
0 1 2 3 4 5
a -1.603958 1.067986 0.474493 -0.352657 -2.198830 -2.028590
b -0.989817 -0.621200 0.010686 -0.248616 1.121244 0.727779
c -0.851071 -0.593429 -1.398475 0.281235 -0.261898 -0.568850
d 1.414492 -1.309289 -0.581249 -0.718679 -0.307876 0.535318
e -2.108857 -1.870788 1.079796 0.478511 0.613011 -0.441136
with pd.ExcelWriter('output.xlsx') as writer:
    header.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False)
    df1.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False, startrow=2)

Iterate through excel files' sheets and append if sheet names share common part in Python

Let's say we have many excel files with the multiple sheets as follows:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I wish to append sheets into one if the last part of the sheet name (split by _) is the same across all Excel files. I.e., sheet 1 will be appended to sheet 2 since they share the common suffix bj; if another Excel file also has sheets whose names end in bj, they will be appended as well. The same logic applies to sheet 3 and sheet 4.
How could I achieve that with Pandas or other Python packages?
The expected result for current excel file would be:
bj:
a b c d
0 1 2 23 2
1 2 3 45 5
2 1 2 23 6
3 2 3 45 7
sh:
a b c
0 1 2 23
1 2 3 45
2 1 2 23
3 2 3 40
Code for reference:
import os, glob
import pandas as pd
files = glob.glob("*.xlsx")
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    df_out = pd.concat(dfs.values(), keys=dfs.keys())
    for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
        g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx', index=False)
Update:
You may download test excel files and final expected result from this link.
Try:
dfs = pd.read_excel('Downloads/WS_1.xlsx', sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx')
Update
import os, glob
import pandas as pd
files = glob.glob("Downloads/test_data/*.xlsx")
writer = pd.ExcelWriter('Downloads/test_data/Output_file.xlsx', engine='xlsxwriter')
excel_dict = {}
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    excel_dict.update(dfs)
df_out = pd.concat(excel_dict.values(), keys=excel_dict.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, index=False, sheet_name=f'{n}')
writer.save()
writer.close()
I have implemented the whole process and obtained the final expected result with the code below.
Please feel free to suggest alternative, more concise solutions or give me some advice if possible:
import os, glob
import pandas as pd
from pandas import ExcelWriter
from datetime import datetime
def save_xls(dict_df, path):
    writer = ExcelWriter(path)
    for key in dict_df:
        dict_df[key].to_excel(writer, key, index=False)
    writer.save()

root_dir = './original/'
for root, subFolders, files in os.walk(root_dir):
    # print(subFolders)
    for file in files:
        if '.xlsx' in file:
            file_path = os.path.join(root_dir, file)
            print(file)
            f = pd.ExcelFile(file_path)
            dict_dfs = {}
            for sheet_name in f.sheet_names:
                df_new = f.parse(sheet_name=sheet_name)
                print(sheet_name)
                # get the year, quarter and city from the sheet name
                year, quarter, city = sheet_name.split("_")
                df_new["year"] = year
                df_new["quarter"] = quarter
                df_new["city"] = city
                dict_dfs[sheet_name] = df_new
            save_xls(dict_df=dict_dfs, path='./add_columns_from_sheet_name/' + "new_" + file)
root_dir = './add_columns_from_sheet_name/'
list1 = []
df = pd.DataFrame()
for root, subFolders, files in os.walk(root_dir):
    # print(subFolders)
    for file in files:
        if '.xlsx' in file:
            # print(file)
            city = file.split('_')[0]
            file_path = os.path.join(root_dir, file)
            # print(file_path)
            dfs = pd.read_excel(file_path, sheet_name=None)
            df_out = pd.concat(dfs.values(), keys=dfs.keys())
            for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
                print(n)
                timestr = datetime.utcnow().strftime('%Y%m%d-%H%M%S%f')[:-3]
                g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'./output/{n}_{timestr}.xlsx', index=False)
file_set = set()
file_dir = './output/'
file_list = os.listdir(file_dir)
for file in file_list:
    data_type = file.split('_')[0]
    file_set.add(data_type)
print(file_set)
file_dir = './output'
file_list = os.listdir(file_dir)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
file_set = set()
for file in file_list:
    if '.xlsx' in file:
        # print(file)
        df_temp = pd.read_excel(os.path.join(file_dir, file))
        if 'bj' in file:
            df1 = df1.append(df_temp)
        elif 'sh' in file:
            df2 = df2.append(df_temp)
        elif 'gz' in file:
            df3 = df3.append(df_temp)
        elif 'sz' in file:
            df4 = df4.append(df_temp)
# function
def dfs_tabs(df_list, sheet_list, file_name):
    writer = pd.ExcelWriter(file_name, engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0, startcol=0, index=False)
    writer.save()
# list of dataframes and sheet names
dfs = [df1, df2, df3, df4]
sheets = ['bj', 'sh', 'gz', 'sz']
# run function
dfs_tabs(dfs, sheets, './final/final_result.xlsx')
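As one piece of the requested advice: the four hard-coded frames df1..df4 (and the matching if/elif chain) could be replaced by a dictionary keyed by the city prefix, so new cities need no code changes. A minimal sketch, assuming the same './output' layout and file naming as above:
import os
from collections import defaultdict
import pandas as pd

# group the exported files by their city prefix instead of hard-coding df1..df4
frames = defaultdict(list)
file_dir = './output'
for file in os.listdir(file_dir):
    if file.endswith('.xlsx'):
        city = file.split('_')[0]
        frames[city].append(pd.read_excel(os.path.join(file_dir, file)))

# one combined DataFrame per city, e.g. combined['bj'], combined['sh'], ...
combined = {city: pd.concat(dfs, ignore_index=True) for city, dfs in frames.items()}
The combined dict (keys as sheet names, values as DataFrames) could then be written sheet by sheet without maintaining a fixed list of cities.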

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that
doesn't have headers
has header and data occur alternately in every row (headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data (which I could directly read using pd.read_csv()) would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc and set the new column names from the even-positioned cells of the first row (assuming the header cells hold the same names in every row, as in the sample data):
#default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print (df1)
imageId feat1 feat2 feat
0 0 30 34 90
1 1 0 4 89
2 2 3 3 80
I didn't measure it, but I would expect that reading the entire file (redundant headers plus actual data) before filtering out the interesting columns could be a problem. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd
def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)
file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)
# --- Actual answer ---
import pandas as pd
# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)
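If the file were too large to hold even the selected columns in memory at once, the same usecols idea could be combined with chunked reading. A sketch, assuming each chunk can be processed independently (file is the test file generated above):
import pandas as pd

# column names come from the even-positioned cells of the first row, as before
dfi = pd.read_csv(file, header=None, nrows=1)
names = dfi.iloc[0, ::2].to_list()

# read only the data columns, a block of rows at a time
for chunk in pd.read_csv(file, header=None, usecols=range(1, dfi.size, 2), chunksize=100_000):
    chunk.columns = names
    # ... process or aggregate each chunk here ...
    print(chunk.head())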

How to read with Pandas txt file with column names in each row

I'm a beginner with Python, and I need to read a txt file where the column names appear in each row, the columns are in no particular order, and not all columns are present in every row. Is there any way to read this kind of file with Pandas?
This is a example (3 rows):
pepe01#mail.com:{ssha}fiy9XI6d:created="1575487257" fwd="" spf_block="" quota="1024mb" full_name="Full Name" mailaccess="envia" mailstatus="cancelled"
pepe02#mail.com:{ssha}Q0H90Rf9:created="1305323967" mailaccess="1" mailstatus="active" admin_access="" quota="" expire="0" full_name="Full Name" pais="CO"
pepe03#mail.com:{ssha}sCPC3HOE:created="1550680636" fwd="" pass_question="" pass_answer="" disabled="Y" mailstatus="cancelled" full_name="Name"
You can use the re module to parse the file.
For example:
import re
import pandas as pd
all_data = []
with open('<YOUR FILE>', 'r') as f_in:
    for line in f_in:
        m = re.search(r'^(.*?):(.*?):', line)
        if not m:
            continue
        data = dict(re.findall(r'([^\s]+)="([^"]+)"', line.split(':', maxsplit=2)[-1]))
        data['mail'] = m.group(1)
        data['password'] = m.group(2)
        all_data.append(data)
df = pd.DataFrame(all_data).fillna('')
print(df)
Prints the dataframe:
      created   quota  full_name  mailaccess  mailstatus             mail        password  expire  pais  disabled
0  1575487257  1024mb  Full Name       envia   cancelled  pepe01#mail.com  {ssha}fiy9XI6d
1  1305323967          Full Name           1      active  pepe02#mail.com  {ssha}Q0H90Rf9       0    CO
2  1550680636               Name               cancelled  pepe03#mail.com  {ssha}sCPC3HOE                       Y

Parsing data with variable number of columns

I have several .txt files with 140k+ lines each. They all have three types of data, which are a mix of string and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7), but that obviously cuts out the 14- and 18-column data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check whether the number of columns in each row equals one of the expected counts and append the line to the corresponding list. For further analysis/modification of the data, we can then convert each list to a Pandas DataFrame (or NumPy array) as desired; below I show the conversion to DataFrames. The numbers of columns in my dataset are 7, 14 and 18. I want my data labeled, so I pass label lists to Pandas' columns parameter.
import pandas as pd

filename = "textfile.txt"

labels_array1 = []  # the 7 column labels go here
labels_array2 = []  # the 14 column labels go here
labels_array3 = []  # the 18 column labels go here

array1 = []  # lines with 7 columns
array2 = []  # lines with 14 columns
array3 = []  # lines with 18 columns

with open(filename, "r") as f:
    lines = f.readlines()
    for line in lines:
        num_items = len(line.split())
        if num_items == 7:
            array1.append(line.rstrip())
        elif num_items == 14:
            array2.append(line.rstrip())
        elif num_items == 18:
            array3.append(line.rstrip())
        else:
            print("Detected a line with a different number of columns:", num_items)

df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)
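An alternative sketch (not part of the answer above): pandas can also read the whole file in one pass if it is given the maximum number of column names up front; shorter rows are padded with NaN, and the frame can then be split by how many fields each row actually has. The column labels here are placeholders:
import pandas as pd

# read whitespace-separated lines, padding short rows with NaN up to 18 columns
df = pd.read_csv("textfile.txt", sep=r"\s+", header=None, names=list(range(18)))

# the number of populated fields per row decides which group a row belongs to
counts = df.notna().sum(axis=1)
df7 = df[counts == 7].dropna(axis=1, how="all")
df14 = df[counts == 14].dropna(axis=1, how="all")
df18 = df[counts == 18]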
