pandas read_csv with data and headers in alternate columns - python-3.x

I have a generated CSV file that
- doesn't have a header row
- has header and data alternating in every row (the headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data, which I can read directly using pd.read_csv(), would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?

Select the data columns by position with DataFrame.iloc, then set the new column names from the label values in the first row (this assumes the label columns hold the same values in every row, as in the sample data):
#default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print (df1)
   imageId  feat1  feat2  feat
0        0     30     34    90
1        1      0      4    89
2        2      3      3    80
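The approach above relies on the label columns (positions 0, 2, 4, ...) holding the same value in every row. A minimal sketch of checking that assumption before discarding them, using an in-memory copy of the sample data; the raw string and the read_alternating helper name are illustrative and not part of the original answer:

```python
import io
import pandas as pd

# In-memory copy of the sample rows from the question.
raw = (
    "imageId,0,feat1,30,feat2,34,feat,90\n"
    "imageId,1,feat1,0,feat2,4,feat,89\n"
    "imageId,2,feat1,3,feat2,3,feat,80\n"
)

def read_alternating(source):
    """Read a header-less CSV whose rows alternate label, value, label, value, ..."""
    df = pd.read_csv(source, header=None)
    labels = df.iloc[:, ::2]          # columns 0, 2, 4, ... hold the repeated labels
    values = df.iloc[:, 1::2].copy()  # columns 1, 3, 5, ... hold the data
    # Every label column must contain exactly one distinct value.
    if not (labels.nunique() == 1).all():
        raise ValueError("labels differ between rows")
    values.columns = labels.iloc[0].tolist()
    return values

df1 = read_alternating(io.StringIO(raw))
print(df1)
```

If the labels ever differ between rows, the helper raises instead of silently mislabeling columns.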

I didn't measure it, but I would expect that reading the entire file (redundant headers plus actual data) before filtering out the interesting part could be a problem for large inputs. So I tried to use the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd
def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)
file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)
# --- Actual answer ---
import pandas as pd
# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

Related

How to avoid writing an empty row when I save a multi-header DataFrame into Excel file?

I would like to save a multi-header DataFrame as Excel file. Following is the sample code:
import pandas as pd
import numpy as np
header = pd.MultiIndex.from_product([['location1','location2'],
                                     ['S1','S2','S3']],
                                    names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a','b','c','d','e'],
                  columns=header)
df.to_excel('result.xlsx')
There are two issues in the excel file as can be seen below:
Issue 1:
There is an empty row under the headers. Please let me know how to keep pandas from writing an empty row to the Excel file.
Issue 2:
I want to save DataFrame without index. However, when I set index=False, I get the following error:
df.to_excel('result.xlsx', index=False)
Error:
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
You can create two DataFrames, one holding only the header rows and one holding the data with default integer column labels, and write both to the same sheet using the startrow parameter:
header = df.columns.to_frame(index=False)
header.loc[header['loc'].duplicated(), 'loc'] = ''
header = header.T
print (header)
             0   1   2          3   4   5
loc  location1          location2
S           S1  S2  S3         S1  S2  S3
df1 = df.set_axis(range(len(df.columns)), axis=1)
print (df1)
          0         1         2         3         4         5
a -1.603958  1.067986  0.474493 -0.352657 -2.198830 -2.028590
b -0.989817 -0.621200  0.010686 -0.248616  1.121244  0.727779
c -0.851071 -0.593429 -1.398475  0.281235 -0.261898 -0.568850
d  1.414492 -1.309289 -0.581249 -0.718679 -0.307876  0.535318
e -2.108857 -1.870788  1.079796  0.478511  0.613011 -0.441136
with pd.ExcelWriter('output.xlsx') as writer:
    header.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False)
    df1.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False, startrow=2)

How to clean CSV file for a coordinate system using pandas?

I wanted to create a program to convert CSV files to DXF (AutoCAD). The CSV file sometimes comes with a header and sometimes without, and some cells, such as the coordinates, cannot be empty. I also noticed that after excluding some of the inputs the value is nan or NaN, and those rows had to be removed. Here is my answer; please share your opinions on how to implement a better method.
sample input
output
solution
import string
import sys
import pandas
def pandas_clean_csv(csv_file):
    """
    Function pandas_clean_csv Documentation
    - I got help from this site; it may help you as well:
    Get the row with the largest number of missing data. For more documentation:
    https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
    """
    try:
        if not csv_file.endswith('.csv'):
            raise TypeError("Be sure you select .csv file")
        # get punctuation marks as a list: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        punctuations_list = list(string.punctuation)
        # import csv file and read it by pandas
        data_frame = pandas.read_csv(
            filepath_or_buffer=csv_file,
            header=None,
            skip_blank_lines=True,
            on_bad_lines='error',  # pandas >= 1.3; replaces the removed error_bad_lines=True
            encoding='utf8',
            na_values=punctuations_list
        )
        # if elevation column is NaN convert it to 0
        data_frame[3] = data_frame[3].fillna(0)
        # if Description column is NaN convert it to -
        data_frame[4] = data_frame[4].fillna('-')
        # select coordinates columns
        coord_columns = data_frame.iloc[:, [1, 2]]
        # convert coordinates columns to numeric type
        coord_columns = coord_columns.apply(pandas.to_numeric, errors='coerce', axis=1)
        # Find rows with missing data
        index_with_nan = coord_columns.index[coord_columns.isnull().any(axis=1)]
        # Remove rows with missing data (axis is no longer accepted positionally)
        data_frame.drop(index=index_with_nan, inplace=True)
        # iterate data frame as tuple data
        output_clean_csv = data_frame.itertuples(index=False)
        return output_clean_csv
    except Exception as E:
        print(f"Error: {E}")
        sys.exit(1)
out_data = pandas_clean_csv('csv_files/version2_bad_headers.csv')
for i in out_data:
    print(i[0], i[1], i[2], i[3], i[4])
Here you can Download my test CSV files
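The core of the cleaning step, coercing the coordinate columns to numbers and dropping rows where that fails, can be tried on in-memory data. This is only a sketch: the column layout (name, x, y, elevation, description) is an assumption inferred from the code above, and the sample rows are invented.

```python
import io
import pandas as pd

# Invented sample: a stray header row, a row with a missing X coordinate,
# and a row with missing elevation and description.
raw = (
    "Point,X,Y,Z,Desc\n"
    "p1,10.5,20.1,3.0,tree\n"
    "p2,,15.0,1.0,pole\n"
    "p3,7.2,8.8,,\n"
)

df = pd.read_csv(io.StringIO(raw), header=None)

# Non-numeric coordinates (including header text) become NaN.
coords = df[[1, 2]].apply(pd.to_numeric, errors="coerce")

# Keep only the rows where both coordinates parsed.
clean = df[coords.notna().all(axis=1)].copy()
clean[1] = coords[1]  # assignment aligns on the row index
clean[2] = coords[2]
clean[3] = pd.to_numeric(clean[3], errors="coerce").fillna(0)  # elevation default
clean[4] = clean[4].fillna("-")                                 # description default

print(clean)
```

Because the header row fails numeric conversion, it is dropped by the same filter that removes rows with missing coordinates, so files with and without a header go through one code path.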

How to get the full text file after merge?

I’m merging two text files file1.tbl and file2.tbl with a common column. I used pandas to make data frames of each and merge function to have the output.
The problem is the output file does not show me the whole data and there is a row of "..." instead and at the end it just prints [9997 rows x 5 columns].
I need a file containing the whole 9997 rows.
import pandas
with open("file1.tbl") as file:
    d1 = file.read()
with open("file2.tbl") as file:
    d2 = file.read()
df1 = pandas.read_table('file1.tbl', delim_whitespace=True, names=('ID', 'chromosome', 'strand'))
df2 = pandas.read_table('file2.tbl', delim_whitespace=True, names=('ID', 'NUClen', 'GCpct'))
merged_table = pandas.merge(df1, df2)
with open('merged_table.tbl', 'w') as f:
    print(merged_table, file=f)
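Printing a DataFrame goes through its display representation, which pandas truncates for large frames; that is where the row of "..." and the [9997 rows x 5 columns] summary come from. One way around it, sketched here with small made-up frames standing in for file1.tbl and file2.tbl, is to write the merged frame with DataFrame.to_csv, or to render it with DataFrame.to_string() when an aligned text table is wanted:

```python
import io
import pandas as pd

# Made-up stand-ins for the two input tables (values are illustrative only).
df1 = pd.DataFrame({"ID": range(200), "chromosome": "chr1", "strand": "+"})
df2 = pd.DataFrame({"ID": range(200), "NUClen": range(1000, 1200), "GCpct": 0.5})

merged_table = pd.merge(df1, df2)  # joins on the shared 'ID' column

# to_csv writes every row; sep='\t' keeps the output tab-delimited.
# A real path such as 'merged_table.tbl' can be passed instead of the buffer.
buffer = io.StringIO()
merged_table.to_csv(buffer, sep="\t", index=False)
full_text = buffer.getvalue()

# to_string() renders the whole frame as aligned text, without truncation.
aligned_text = merged_table.to_string(index=False)
```

The unused d1 and d2 reads in the question's code are not needed for either approach.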

Parsing data with variable number of columns

I have several .txt files with 140k+ lines each. They all have three types of data, which are a mix of string and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7), but that obviously cuts off the 14- and 18-column data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check whether the number of columns in each row equals one of the expected counts and append the row to the matching list. For further analysis or modification, the lists can then be converted to a Pandas DataFrame or a NumPy array as desired; below I show the conversion to a DataFrame. The numbers of columns in my dataset are 7, 14 and 18. I want my data labeled, so I pass a list of labels as the DataFrame's columns.
import pandas as pd
filename = "textfile.txt"
# Column labels for each layout; fill these in before building the DataFrames
labels_array1 = [] # 7 labels
labels_array2 = [] # 14 labels
labels_array3 = [] # 18 labels
# Rows collected per layout
array1, array2, array3 = [], [], []
with open(filename, "r") as f:
    for line in f:
        num_items = len(line.split())
        if num_items == 7:
            array1.append(line.rstrip())
        elif num_items == 14:
            array2.append(line.rstrip())
        elif num_items == 18:
            array3.append(line.rstrip())
        else:
            print("Detected a line with different columns.", num_items)
df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)
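The same idea can be written without one list per layout by grouping the split rows under their field count. A minimal sketch, assuming whitespace-separated fields; the sample lines and the group_by_width helper are invented for illustration and use widths 2 and 3 rather than 7, 14 and 18 to stay short:

```python
import io
from collections import defaultdict

import pandas as pd

def group_by_width(lines):
    """Return {field_count: DataFrame} for whitespace-separated lines."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            groups[len(fields)].append(fields)
    return {width: pd.DataFrame(rows) for width, rows in groups.items()}

# Any iterable of lines works, including an open file object.
sample = io.StringIO("a 1\nb 2 x\n\nc 3\nd 4 y\n")
frames = group_by_width(sample)
```

All values stay as strings here; numeric columns can be converted afterwards with pd.to_numeric, and column labels can be assigned per width once they are known.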

How to classify a large CSV file of signals without headers in Python?

I have a large CSV file (3000 x 20000) of data without headers, to which I added one column representing the classes. How can I fit the data to a model when the features have no headers and they cannot be added manually because of the large number of columns?
Is there a way to automatically iterate over each column in a row?
When I had a small file of 4 columns, I used the following code:
import pandas as pd
xls = pd.ExcelFile("bcs.xlsx")  # renamed so the pandas alias pd is not overwritten
col = [0, 1, 2, 3]
data = xls.parse(xls.sheet_names[0], usecols=col)  # usecols replaces the removed parse_cols
pdc = list(data["pdc"])
pds = list(data["pds"])
pdsh = list(data["pdsh"])
pd_class = list(data["class"])
features = []
for i in range(len(pdc)):
    features.append([pdc[i], pds[i], pdsh[i]])
labels = pd_class
But with a 3000 by 20000 file I don't know how to identify the features and the labels/target.
Let's say you have a CSV like this:
1,2,3,4,0
1,2,3,4,1
1,2,3,4,1
1,2,3,4,0
where the first 4 columns are features and the last one is the label or class you want. You can read the file with pandas.read_csv and create one dataframe for your features and one for your labels, which you can then fit to your model.
import pandas as pd
#CSV localPath
mypath ='C:\\...'
#The names of the columns you want to have in your dataframe
colNames = ['Feature1','Feature2','Feature3','Feature4','class']
#Read the data as dataframe
df = pd.read_csv(filepath_or_buffer=mypath,
                 names=colNames, sep=',', header=None)
#Get the first four columns as features
features = df.iloc[:, :4]
#and last columns as label
labels = df['class']
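With thousands of unnamed columns, naming each one is unnecessary: header=None gives integer column labels, and the features and labels can be split by position. A sketch under the assumption that the class is the last column (adjust the slices if it was added as the first column); the in-memory raw data is illustrative:

```python
import io
import pandas as pd

raw = "1,2,3,4,0\n1,2,3,4,1\n1,2,3,4,1\n1,2,3,4,0\n"

# header=None: pandas labels the columns 0, 1, 2, ... automatically.
df = pd.read_csv(io.StringIO(raw), header=None)

features = df.iloc[:, :-1]  # every column except the last
labels = df.iloc[:, -1]     # the last column holds the class

X = features.to_numpy()  # 2-D array, one row per sample
y = labels.to_numpy()    # 1-D array of class values
```

The same two slices work unchanged for a 3000 by 20000 file, since nothing depends on the number of columns.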
