Why is pandas.to_csv dropping numbers when trying to preserve NaN?

Given a pandas dataframe
df = pd.DataFrame([(290122, 0.20, np.nan),
                   (1900, 1.20, "ABC")],
                  columns=("number", "x", "class"))
   number    x class
0  290122  0.2   NaN
1    1900  1.2   ABC
Then exporting it to a csv, I would like to keep the NaN, e.g. as "NULL" or "NaN",
df.to_csv("df.csv", encoding="utf-8", index=False, na_rep="NULL")
Yet, opening the csv in a text editor, I get the following:
number,x,class
2901,0.20,NULL
1900,1.20,ABC
That is, the last two digits of number in the first cell are dropped.
As mentioned, when dropping the na_rep argument, I obtain the expected output:
number,x,class
290122,0.20,
1900,1.20,ABC

Yes, this is in fact a bug in pandas 1.0.0, fixed in 1.0.1. See the release notes and https://github.com/pandas-dev/pandas/issues/25099.
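On a fixed version you can confirm the behaviour directly; a minimal sketch using the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame([(290122, 0.20, np.nan), (1900, 1.20, "ABC")],
                  columns=("number", "x", "class"))
df.to_csv("df.csv", encoding="utf-8", index=False, na_rep="NULL")

# On pandas >= 1.0.1 the file contains the full 290122, not a truncated 2901
print(open("df.csv").read())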
Depending on your data, a quick workaround could be:
import numpy as np
import pandas as pd

na_rep = 'NULL'
if pd.__version__ == '1.0.0':
    # pad na_rep so the buggy truncation no longer clips the data values
    na_rep_wrk = 8 * na_rep
else:
    na_rep_wrk = na_rep

data = [(290122, 0.20, 'NULL'), (2**40 - 1, 3.20, 'NULL'), (1900, 1.20, "ABC")]
df = pd.DataFrame(data, columns=("number", "x", "class"))
df.to_csv("df.csv", encoding="utf-8", index=False, na_rep=na_rep_wrk)

df2 = pd.read_csv('df.csv', keep_default_na=False)
assert np.all(df == df2)
This gives the csv file:
number,x,class
290122,0.2,NULL
1099511627775,3.2,NULL
1900,1.2,ABC

When reading or writing .csv files with pandas, np.nan is stored in the file as '' (an empty string), so don't use na_rep='NULL'; instead, when reading the file back after saving, do this:
for col in df.columns:
    df[col] = df[col].apply(lambda x: np.nan if x == '' else x)
Admittedly, empty strings are already read as NaN by default, so this is just to be safe.
I have faced this issue myself and found no other way around it, but if I do find one (saving NaN or NULL flawlessly), I'll update here.
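For reference, a quick round trip (a minimal sketch with made-up data, not from the answer above) confirms that read_csv maps empty fields back to NaN by default:
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"number": [290122, 1900], "class": [np.nan, "ABC"]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # the NaN in 'class' is written as an empty field
buf.seek(0)

df2 = pd.read_csv(buf)        # the empty field is read back as NaN
assert df2["class"].isna().iloc[0]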

Related

Python Pandas Writing Value to a Specific Row & Column in the data frame

I have a Pandas df of stock tickers with specific dates, and I want to add the adjusted close for each date using Yahoo Finance. I iterate through the dataframe, make the Yahoo call for that ticker and date, and the correct information is returned. However, I am not able to add that information back to the original df. I have tried various loc, iloc, and join methods, and none of them are working for me. The df shows the initialized zero values instead of the new value.
import pandas as pd
import yfinance as yf
from datetime import timedelta

# Build the dataframe
df = pd.DataFrame({'Ticker': ['BGFV', 'META', 'WIRE', 'UG'],
                   'Date': ['5/18/2021', '5/18/2021', '4/12/2022', '6/3/2019'],
                   })
# Change the Date to Datetime
df['Date'] = pd.to_datetime(df.Date)
# Initialize the adjusted close
df['Adj_Close'] = 0.00  # You'll get a column of all 0s

# Iterate through the rows of the df and retrieve the Adjusted Close from Yahoo
for i in range(len(df)):
    ticker = df.iloc[i]['Ticker']
    start = df.iloc[i]['Date']
    end = start + timedelta(days=1)
    # YF call
    data = yf.download(ticker, start=start, end=end)
    # Get just the adjusted close
    adj_close = data['Adj Close']
    # Write the adjusted close to the dataframe on the correct row
    df.iloc[i]['Adj_Close'] = adj_close
    print(f'i value is {i} and adjusted close value is {adj_close} \n')
print(df)
The simplest way to do this is to use loc, as below:
# change this line
df.loc[i, 'Adj_Close'] = adj_close.values[0]
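In context, the loop body would look like this (a sketch reusing the question's imports and df). The point is that a single .loc call with the row label and column name writes into the original frame, whereas the chained df.iloc[i]['Adj_Close'] = ... assigns to a temporary copy:
for i in range(len(df)):
    ticker = df.loc[i, 'Ticker']
    start = df.loc[i, 'Date']
    end = start + timedelta(days=1)
    data = yf.download(ticker, start=start, end=end)
    # one indexing operation, so the write lands in df itself
    df.loc[i, 'Adj_Close'] = data['Adj Close'].values[0]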
You can use:
def get_adj_close(x):
    # You needn't specify the end param because period is already set to 1 day
    data = yf.download(x['Ticker'], start=x['Date'], progress=False)
    return data['Adj Close'][0].squeeze()

df['Adj_Close'] = df.apply(get_adj_close, axis=1)
Output:
>>> df
  Ticker       Date   Adj_Close
0   BGFV 2021-05-18   27.808811
1   META 2021-05-18  315.459991
2   WIRE 2022-04-12  104.320045
3     UG 2019-06-03   16.746983

Getting `A value is trying to be set on a copy of a slice from a DataFrame.` when setting a column

I know a value should not be set on a view of a pandas dataframe, and I'm not doing that, but I'm still getting this warning. I have a function like this:
def do_something(df):
    # id(df) is xxx240
    idx = get_skip_idx(df)  # another function that returns a boolean series
    if any(idx):
        df = df[~idx]
        # id(df) is xxx744; df is now a local variable which is a copy of the input argument
    assert not df._is_view  # This doesn't fail, so I'm not holding a view
    df['date_fixed'] = pd.to_datetime(df['old_date'].str[:10], format='%Y-%m-%d')
    # I'm getting the warning here, which doesn't make any sense to me
I'm using pandas 1.4.1. This sounds like a bug to me; I wanted to confirm I'm not missing anything before filing a ticket.
My understanding is that _is_view can return false negatives and that you are actually working on a view of the original dataframe.
One workaround is to replace df[~idx] with df[~idx].copy():
import pandas as pd

df = pd.DataFrame(
    {
        "value": [1, 2, 3],
        "old_date": ["2022-04-20 abcd", "2022-04-21 efgh", "2022-04-22 ijkl"],
    }
)

def do_something(df, idx):
    if any(idx):
        df = df[~idx].copy()
    df["date_fixed"] = pd.to_datetime(df["old_date"].str[:10], format="%Y-%m-%d")
    return df

print(do_something(df, pd.Series({0: True, 1: False, 2: False})))
# No warning
   value         old_date date_fixed
1      2  2022-04-21 efgh 2022-04-21
2      3  2022-04-22 ijkl 2022-04-22
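As an aside (not part of the answer above), pandas can escalate this warning to an exception, which makes the offending line easy to locate:
import pandas as pd

# Valid values are 'warn' (the default), 'raise', and None (silence).
# With 'raise', chained-assignment problems produce a traceback pointing
# at the exact line instead of a warning.
pd.set_option("mode.chained_assignment", "raise")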

Python - Pandas CSV file with error when converting mixed data types string to number

So I have a large csv file with lots of data. The main column I am interested in, 'Result', contains integers, floats, NaN, and also numbers stored as text. I need to aggregate 'Result', but before I do, I want to convert the column to the float data type. The values stored as text have trailing spaces, like the following:
["1.07 ", "8.22 ", "8.6 ", "11.41 ", "7.93 "]
The error I get is...
AttributeError: Can only use .str accessor with string values!
import pandas as pd
import os
import numpy as np

csv_file = 'c:/path/to/file/big.csv'
# ... more lines of code ...
df = pd.read_csv(csv_file, usecols=my_cols, parse_dates=['Date'])
df = df[df['Company ID'].str.contains(my_company)]
print('df of csv created')
# Above code works great.

# The below 2 tries did not work for me:
# df['Result'] = pd.to_numeric(df['Result'].str.replace(' ', ''), errors='ignore')
# df['Result'] = df['Result'].str.strip()  # causes an error

# Now let's try np.where...
# The below causes: AttributeError: Can only use .str accessor with string values!
df['Result'] = np.where(df['Result'].dtype == np.str, df['Result'].str.strip(),
                        df['Result'])
df['Result'] = pd.to_numeric(df['Result'], downcast="float", errors='raise')
How should I resolve this?
Why don't you try this code? It explicitly converts all the values to strings using astype(str):
import pandas as pd

df = pd.DataFrame({
    'Result': [' a ', ' b', 'c ']
})
df['Result'] = df['Result'].astype(str).str.strip()
print(df['Result'])
# 0    a
# 1    b
# 2    c
# Name: Result, dtype: object
Sometimes I use this approach when NaN or numbers are mixed into a Series, to avoid getting that error message.
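Applied to the question's 'Result' column, the full conversion might look like this (a sketch with made-up sample values; note that astype(str) turns NaN into the literal string 'nan', which to_numeric with errors='coerce' maps back to NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Result': ["1.07 ", "8.22 ", np.nan, 11.41, "7.93 "]})

# astype(str) makes .str.strip() safe on the mixed column; to_numeric then
# converts everything, including the 'nan' strings, back to float
df['Result'] = pd.to_numeric(df['Result'].astype(str).str.strip(),
                             downcast="float", errors='coerce')
print(df['Result'].dtype)  # float32 because of the downcast, NaN preserved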

Add two columns to a pandas DataFrame based on condition

I am trying to add two new columns with different values based on two conditions.
Source sample data for left and right DataFrames:
      id rec_type  end_date
0  13759        U  20210113
1  23806        N       NaN
2  21347        U  20210113
3  36904        N       NaN

      id
0  23806
1  21347
Expected output:
      id rec_type  end_date     _merge  error_code                           error_description
0  13759        U  20210113  left_only         601  update record not available in right table
1  23806        N       NaN       both           0                                           0
2  21347        U  20210113       both           0                                           0
3  36904        N       NaN  left_only         602     New record not available in right table
I am using numpy's select to achieve this, as in the code below, but I am getting an error.
import pandas as pd
import numpy as np

merged_df = pd.merge(left_df, right_df,
                     how='outer',
                     on=['id'],
                     indicator=True)
merged_df = merged_df.query('_merge != "right_only"')

conditions = [((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "U") &
               (merged_df['end_date'].notnull())),
              ((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "N") &
               (merged_df['end_date'].isnull()))]

error_codes = dict()
error_codes['error_code'] = [601, 602]
error_codes['error_description'] = ['update record not available in right table',
                                    'New record not available in right table']

merged_df['error_code'] = np.select(conditions, error_codes['error_code'])
merged_df['error_description'] = np.select(conditions, error_codes['error_description'])
I am getting the warning below; please share suggestions to resolve it.
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  validate_df['error_code'] = np.select(conditions, error_codes['error_code'])

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  validate_df['error_description'] = np.select(conditions, error_codes['error_description'])
Thanks,
Raghunath.
Note: the code works fine with the sample data, but with more data I get the above warning.
I was able to resolve the issue by changing
merged_df = merged_df.query('_merge != "right_only"')
to the following:
merged_df = merged_df[merged_df._merge != "right_only"]
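Another option that should work whichever way the filter is written (a suggestion, not from the answer above) is to take an explicit copy before adding the columns:
# An explicit copy makes the later column assignments unambiguously target
# a fresh frame, so no SettingWithCopyWarning is raised.
merged_df = merged_df.query('_merge != "right_only"').copy()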

How to clean CSV file for a coordinate system using pandas?

I wanted to create a program to convert CSV files to DXF (AutoCAD), but the CSV file sometimes comes with a header and sometimes without one, and there are cells that cannot be empty, such as coordinates. I also noticed that after excluding some of the inputs the value is nan or NaN, which had to be removed. So I offer my answer below; please share your opinions on how to implement a better method.
The sample input and expected output were shown as screenshots in the original post.
Solution:
import string
import pandas

def pandas_clean_csv(csv_file):
    """
    Function pandas_clean_csv Documentation
    - I got help from this site; it may help you as well
      ("Get the row with the largest number of missing data" for more documentation):
      https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
    """
    try:
        if not csv_file.endswith('.csv'):
            raise TypeError("Be sure you select a .csv file")
        # get punctuation marks as a list: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        punctuations_list = [mark for mark in string.punctuation]
        # import the csv file and read it with pandas
        data_frame = pandas.read_csv(
            filepath_or_buffer=csv_file,
            header=None,
            skip_blank_lines=True,
            error_bad_lines=True,
            encoding='utf8',
            na_values=punctuations_list
        )
        # if the elevation column is NaN, convert it to 0
        data_frame[3] = data_frame.iloc[:, 3].fillna(0)
        # if the Description column is NaN, convert it to '-'
        data_frame[4] = data_frame.iloc[:, 4].fillna('-')
        # select the coordinate columns
        coord_columns = data_frame.iloc[:, [1, 2]]
        # convert the coordinate columns to a numeric type
        coord_columns = coord_columns.apply(pandas.to_numeric, errors='coerce', axis=1)
        # find rows with missing coordinate data
        index_with_nan = coord_columns.index[coord_columns.isnull().any(axis=1)]
        # remove rows with missing coordinate data
        data_frame.drop(index_with_nan, axis=0, inplace=True)
        # iterate over the data frame as tuples
        output_clean_csv = data_frame.itertuples(index=False)
        return output_clean_csv
    except Exception as E:
        print(f"Error: {E}")
        exit(1)

out_data = pandas_clean_csv('csv_files/version2_bad_headers.csv')
for i in out_data:
    print(i[0], i[1], i[2], i[3], i[4])
Here you can download my test CSV files.
