How to clean CSV file for a coordinate system using pandas? - python-3.x

I wanted to create a program to convert CSV files to DXF (AutoCAD), but the CSV file sometimes comes with a header and sometimes without one, and there are cells that cannot be empty, such as the coordinates. I also noticed that after excluding some of the inputs the value is nan or NaN, and it was necessary to get rid of them. So I offer my answer below; please share your opinions on how to implement a better method.
(sample input and output were attached as images in the original post)

solution
import string
import pandas

def pandas_clean_csv(csv_file):
    """
    Function pandas_clean_csv Documentation
    - I got help from this site, it may help you as well:
      Get the row with the largest number of missing data, for more documentation:
      https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
    """
    try:
        if not csv_file.endswith('.csv'):
            raise TypeError("Be sure you select a .csv file")
        # get punctuation marks as a list: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        punctuations_list = [mark for mark in string.punctuation]
        # import the csv file and read it with pandas
        data_frame = pandas.read_csv(
            filepath_or_buffer=csv_file,
            header=None,
            skip_blank_lines=True,
            error_bad_lines=True,
            encoding='utf8',
            na_values=punctuations_list
        )
        # if the elevation column is NaN, convert it to 0
        data_frame[3] = data_frame[3].fillna(0)
        # if the description column is NaN, convert it to '-'
        data_frame[4] = data_frame[4].fillna('-')
        # select the coordinate columns
        coord_columns = data_frame.iloc[:, [1, 2]]
        # convert the coordinate columns to a numeric type
        coord_columns = coord_columns.apply(pandas.to_numeric, errors='coerce', axis=1)
        # find rows with missing coordinate data
        index_with_nan = coord_columns.index[coord_columns.isnull().any(axis=1)]
        # remove rows with missing coordinate data
        data_frame.drop(index_with_nan, axis=0, inplace=True)
        # return the cleaned data frame as an iterator of tuples
        output_clean_csv = data_frame.itertuples(index=False)
        return output_clean_csv
    except Exception as E:
        print(f"Error: {E}")
        exit(1)

out_data = pandas_clean_csv('csv_files/version2_bad_headers.csv')
for i in out_data:
    print(i[0], i[1], i[2], i[3], i[4])
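For comparison, here is a more compact sketch of the same cleaning (not a tested drop-in replacement; it assumes the same column layout as above: 0 = point id, 1/2 = coordinates, 3 = elevation, 4 = description). Coercing the coordinate columns to numeric and using dropna(subset=...) removes header rows and incomplete points in one pass, without building the punctuation list:

import pandas

def pandas_clean_csv_compact(csv_file):
    """Sketch: the same cleaning expressed with to_numeric + dropna(subset=...)."""
    df = pandas.read_csv(csv_file, header=None, skip_blank_lines=True, encoding='utf8')
    # coordinates must be numeric; header text or bad cells become NaN here
    df[[1, 2]] = df[[1, 2]].apply(pandas.to_numeric, errors='coerce')
    # drop any row whose coordinates are missing (this also drops a header row)
    df = df.dropna(subset=[1, 2])
    # elevation defaults to 0, description defaults to '-'
    df[3] = pandas.to_numeric(df[3], errors='coerce').fillna(0)
    df[4] = df[4].fillna('-')
    return df.itertuples(index=False)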

Related

How to check which row in producing LangDetectException error in LangDetect?

I have a dataset of tweets that contains mainly English tweets but also several tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English-language tweets and remove rows with tweets in other languages.
I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I tried it on my dataset it showed an error:
LangDetectException: No features in text.
Also, I have already checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandasand] where the accepted answer talks about this error and mentions that empty rows might be the reason, so I have already cleaned my dataset to remove all the empty rows.
Simple code which worked on sample data but not on original data:
from langdetect import detect
import pandas as pd
df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new)
How can I check which row is producing error?
Thanks in Advance!
Use a custom function that returns True if detect fails:
df = pd.read_csv('Sample.csv')

def f(x):
    try:
        detect(x)
        return False
    except:
        return True

s = df.loc[df.text.apply(f), 'text']
Another idea is to create a new column filled by detect and return NaN if it fails; then filter the rows with missing values into df1, and also build df_new from the new column filled by the output of detect:
import numpy as np

df = pd.read_csv('Sample.csv')

def f1(x):
    try:
        return detect(x)
    except:
        return np.nan

df['new'] = df.text.apply(f1)

df1 = df[df.new.isna()]
df_new = df[df.new.eq('en')]
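A small usage sketch for the first approach: s contains exactly the rows where detect raised, so its index answers the original question of which rows are producing the error:

# indices of the rows that langdetect cannot process
print(s.index.tolist())
# and the offending texts themselves
print(s)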

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that doesn't have a header row; instead, headers and data occur alternately in every row (the headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data (which I could read directly using pd.read_csv()) would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the value columns by position with DataFrame.iloc and set the new column names from the label values in the first row (assuming the label columns have the same values in every row, as in the sample data):
#default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print (df1)
imageId feat1 feat2 feat
0 0 30 34 90
1 1 0 4 89
2 2 3 3 80
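Since this relies on the label columns being identical in every row, a quick sanity check before trusting the first row is cheap (a small sketch using the df read above):

# every label column should contain exactly one distinct value
label_cols = df.iloc[:, ::2]
assert (label_cols.nunique() == 1).all(), "label columns differ between rows"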
I didn't measure but I would expect that it could be a problem to read the entire file (redundant headers and actual data) before filtering for the interesting stuff. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd

# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size

# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()

print(dfd)

Cleaning a text file for csv export

I'm trying to cleanup a text file from a URL using pandas and my idea is to split it into separate columns, add 3 more columns and export to a csv.
I have tried cleaning up the file (I believe it is separated by " "), so far to no avail.
# script to check and clean text file for 'aberporth' station
import io
import pandas as pd
import requests

# api-endpoint for current weather
URLH = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"

with requests.session() as s:
    # sending get for history text file
    r = s.get(URLH)
    df1 = pd.read_csv(io.StringIO(r.text), sep=" ", skiprows=5, error_bad_lines=False)
    df2 = pd.read_csv(io.StringIO(r.text), nrows=1)
    # df1['location'] = df2.columns.values[0]
    # _, lat, _, lon = df2.index[0][1].split()
    # df1['lat'], df1['lon'] = lat, lon
    df1.dropna(how='all')
    df1.to_csv('Aberporth.txt', sep='|', index=True)
What makes it worse is that the file itself has uneven columns, and somewhere around line 944 it adds one more column, which is why I have to skip bad lines. At this point I'm a bit lost as to how I should proceed and whether I should look at something beyond pandas.
You don't really need pandas for this. The built-in csv module does just fine.
The data comes in fixed-width format (which is not the same as "delimited format"):
Aberporth
Location: 224100E 252100N, Lat 52.139 Lon -4.570, 133 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
yyyy mm tmax tmin af rain sun
degC degC days mm hours
1942 2 4.2 -0.6 --- 13.8 80.3
1942 3 9.7 3.7 --- 58.0 117.9
...
So we can either split it at predefined indexes (which we would have to count & hard-code, and which probably are subject to change), or we can split on "multiple spaces" using regex, in which case it does not matter where the exact column positions are:
import requests
import re
import csv

def get_values(url):
    resp = requests.get(url)
    for line in resp.text.splitlines():
        values = re.split(r"\s+", line.strip())
        # skip all lines that do not have a year as first item
        if not re.match(r"^\d{4}$", values[0]):
            continue
        # replace all '---' by None
        values = [None if v == '---' else v for v in values]
        yield values

url = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"

with open('out.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(get_values(url))
You can do writer.writerow(['yyyy','mm','tmax','tmin','af','rain','sun']) to get a header row if you need one.
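If the parsed rows are needed for analysis rather than just export, they can also go straight into a DataFrame by reusing the get_values generator above. This is only a sketch: the column names are copied from the file's own header line, rows are trimmed to seven fields because (as noted in the question) some lines carry one extra trailing column, and estimated values still keep their '*' / '#' markers as plain text:

import pandas as pd

columns = ["yyyy", "mm", "tmax", "tmin", "af", "rain", "sun"]
# keep only the first seven fields of each row; some rows have an extra flag column
rows = [values[:7] for values in get_values(url)]
df = pd.DataFrame(rows, columns=columns)
print(df.head())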

how to set numeric format of a cell in XLSX file using python

I want to set numeric format for a column or cell in XLSX file using python script.
The conversion script takes a CSV file and converts it to XLSX. I deliberately treat the header as a regular line, because the final script handles it at the end of the conversion, in various ways according to the specified command-line parameters.
The example below shows only my attempt to set a numeric format for a column or cell.
What am I doing wrong?
With this code I manage to set the alignment to the right, but every way I have tried to set a numeric format fails. The XLSX file still keeps that green triangle in the upper-left corner of the cell and refuses to treat it as a numeric cell.
The attached screenshot shows the "wrong" result.
---- data file ----
a,b,c,d,e
q,1,123,0.4,1
w,2,897346,.786876,-1.1
e,3,9872346,7896876.098098,2.098
r,4,65,.3,1322
t,5,1,0.897897978,-786
---- python script ----
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import os
import pandas
import xlsxwriter

def is_type(value):
    '''Function to identify the true type of the value passed
    Input parameters: value - some value whose type needs to be identified
    Returned values: type of the value
    '''
    try:
        int(value)
        return "int"
    except:
        try:
            float(value)
            return "float"
        except:
            return "str"

csv_file_name = "test37.csv"
xls_file_name = "test37.xlsx"

# Read CSV file to DataFrame
df = pandas.read_csv(csv_file_name, header=None, low_memory=False, quotechar='"', encoding="ISO-8859-1")
# Output DataFrame to Excel file
df.to_excel(xls_file_name, header=None, index=False, encoding="utf-8")

# Create writer object for output of XLSX file
writer = pandas.ExcelWriter(xls_file_name, engine="xlsxwriter")
# Write our DataFrame object to the newly created file
xls_sheet_name = os.path.basename(xls_file_name).split(".")[0]
df.to_excel(writer, header=None, index=False, sheet_name=xls_sheet_name, float_format="%0.2f")

# get objects for workbook and worksheet
wb = writer.book
ws = writer.sheets[xls_sheet_name]
ws.set_zoom(120)

num_format1 = wb.add_format({
    'align': 'right'
})
num_format2 = wb.add_format({
    'align': 'right',
    'num_format': '0.00'
})
num_format3 = wb.add_format()
num_format3.set_num_format('0.00')

ws.set_column('D:D', None, num_format1)
ws.set_column('D:D', None, num_format2)

for column in df.columns:
    for row in range(1, len(df[column])):
        if is_type(df[column][row]) == "int":
            # print("int " + str(df.iloc[row][column]))
            ws.write(row, column, df.iloc[row][column], num_format2)
        elif is_type(df[column][row]) == "float":
            # print("float " + str(df.iloc[row][column]))
            ws.write(row, column, df.iloc[row][column], num_format2)
        else:
            pass

wb.close()
writer.save()
exit(0)
The problem has nothing to do with your xlsxwriter script, but lies in the way you import the csv in Pandas. Your csv-file has a header, but you specify in pd.read_csv() that there isn't a header. Therefore, Pandas also parses the header row as data. Because the header is a string, the entire column gets imported as a string (instead of integer or float).
Just remove the 'header=None' in pd.read_csv and df.to_excel() and it should work fine.
so:
...<first part of your code>
# Read CSV file to DataFrame
df = pandas.read_csv(csv_file_name, low_memory=False, quotechar='"', encoding="ISO-8859-1")
# Output DataFrame to Excel file
df.to_excel(xls_file_name, index=False, encoding="utf-8")
<rest of your code>...
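If the header really has to stay in the data (as the question intends), another option is to read everything as text and convert the numeric-looking columns yourself before handing the frame to xlsxwriter, so the cells are written as real numbers. This is only a sketch under that assumption, not a tested change to the full conversion script:

import pandas

# read everything as strings so pandas does not guess types from the header row
df = pandas.read_csv(csv_file_name, header=None, dtype=str,
                     quotechar='"', encoding="ISO-8859-1")

header_row = df.iloc[[0]]        # keep the original header as a data row
body = df.iloc[1:].copy()

for col in body.columns:
    converted = pandas.to_numeric(body[col], errors='coerce')
    # adopt the conversion only when every value in the column parsed cleanly
    if converted.notna().all():
        body[col] = converted

df = pandas.concat([header_row, body])
# ...then pass df to the existing ExcelWriter / xlsxwriter code unchanged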

How to iterate through a folder of csv files

I'm really new to python, so please bear with me!
I have a folder on my desktop that contains a few csv files named "File 1.csv", "File 2.csv" and so on. In each file, there is a table that looks like:
Animal Level
Cat 1
Dog 2
Bird 3
Snake 4
But each one of the files has a few differences in the "Animal" column. I wrote the following code that compares only two files at a time and returns the animals that match:
def matchlist(file1, file2):
    new_df = pd.DataFrame()
    file_one = pd.read_csv(file1)
    file_two = pd.read_csv(file2)
    for i in file_one["Animal"]:
        df_temp = file_two[file_two["Animal"] == i]
        new_df = new_df.append(df_temp)
        df_temp = pd.DataFrame()
    return new_df
But that only compares two files at a time. Is there a way that will iterate through all the files in that single folder and return all the ones that match to the new_df above?
For example, new_df compares file 1 and file 2. Then, I am looking for code that compares new_df to file 3, file 4, file 5, and so on.
Thanks!
I'm not sure if this is really what you want; I can't comment yet on your questions, so:
This function returns a dataframe with the animals that can be found in all of the csv files (it can be very small). It uses the animal name as the key, so the Level value is not considered.
import pandas as pd
import os, sys

def matchlist_iter(folder_path):
    # get a list of the filenames in the folder and throw away all non-csv files
    files = [file_path for file_path in os.listdir(folder_path) if file_path.endswith('.csv')]
    # init the result df with the first csv
    df = pd.read_csv(os.path.join(folder_path, files[0]))
    for file_path in files[1:]:
        print('compare: {}'.format(file_path))
        df_other = pd.read_csv(os.path.join(folder_path, file_path))
        # only keep the animals that are in both frames
        df = df[df['Animal'].isin(df_other['Animal'])]
    return df

if __name__ == '__main__':
    matched = matchlist_iter(sys.argv[1])
    print(matched)
I've found a similar question with more answers regarding the matching here: Compare Python Pandas DataFrames for matching rows
EDIT: added csv and sample output
csv1.csv
Animal, Level
Cat, 1
Dog, 2
Bird, 3
Snake, 4
csv2.csv
Animal, Level
Cat, 1
Parrot, 2
Bird, 3
Horse, 4
output
compare: csv2.csv
Animal Level
0 Cat 1
2 Bird 3
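The same intersection can also be written without the explicit loop by reducing an inner merge over all files — a sketch that assumes every file has an 'Animal' column with unique animal names, and keeps the Level values from the first file:

import os
from functools import reduce
import pandas as pd

def matchlist_merge(folder_path):
    files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
    frames = [pd.read_csv(os.path.join(folder_path, f)) for f in files]
    # inner merge on the animal name keeps only animals present in every file
    return reduce(lambda left, right: left.merge(right[['Animal']], on='Animal'), frames)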
I constructed a set of six files in which only the first columns of File 1.csv and File 6.csv are identical.
You need only the first column of each csv for comparison purposes, so I arrange to extract only those columns from each file.
>>> import pandas as pd
>>> from pathlib import Path
>>> column_1 = pd.read_csv('File 1.csv', sep='\s+')['Animal'].tolist()
>>> column_1
['Cat', 'Dog', 'Bird', 'Snake']
>>> for filename in Path('.').glob('*.csv'):
... if filename.name == 'File 1.csv':
... continue
... next_column = pd.read_csv(filename.name, sep='\s+')['Animal'].tolist()
... if column_1 == next_column:
... print (filename.name)
...
File 6.csv
As expected, File 6.csv is the only file found to be identical (in the first column) to File 1.csv.
