How to read comma separated data into data frame without column header? - python-3.x

I am getting a comma-separated data set as bytes which I need to:
Convert from bytes into a string
Create a csv file (this step can be skipped if there is a way to jump straight to step 3)
Read it as a data frame without treating the first row as column names.
(Later I will be using this df to compare with Oracle DB output.)
Input data:
val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
type(val)
I managed to get to step 3, but my first row is becoming the header. I am fine with using arbitrary column headers, e.g. a, b, c, ...
#1 Code I tried to convert bytes to str
strval = val.decode('ascii').strip()
#2 Code to create the csv. First I created a blank csv and later appended the data
import csv
import pandas as pd
abc = ""
with open('csvfile.csv', 'w') as csvOutput:
    testData = csv.writer(csvOutput)
    testData.writerow(abc)
with open('csvfile.csv', 'a') as csvAppend:
    csvAppend.write(val)
#3 now converting it into dataframe
df = pd.read_csv('csvfile.csv')
# hdf = pd.read_csv('csvfile.csv', column=none) -- this gives NameError: name 'none' is not defined
Output: df (its first data row ends up as the column header)

According to the read_csv documentation, it should be enough to add header=None as a parameter:
df = pd.read_csv('csvfile.csv', header=None)
This way an existing header line would be interpreted as a row of data. If you want to exclude that line instead, add the skiprows=1 parameter:
df = pd.read_csv('csvfile.csv', header=None, skiprows=1)
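If you would rather have readable placeholder column names (the asker mentioned a, b, c, ... would be fine), you can pass them explicitly with the names parameter; a small sketch, with the ten letters chosen arbitrarily to match the ten fields in the sample data:
df = pd.read_csv('csvfile.csv', header=None,
                 names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])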

You can do this without saving to a csv file at all; you don't need to convert the bytes to a string or write them to disk.
Here val is of type bytes. If it is a string, as in your example, you can use io.StringIO instead of io.BytesIO.
import pandas as pd
import io
val = b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
buf_bytes = io.BytesIO(val)
pd.read_csv(buf_bytes, header=None)
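If val is already a decoded string rather than bytes, the same idea works with io.StringIO; a minimal sketch (only one sample row shown for brevity):
import io
import pandas as pd

val_str = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n'
df = pd.read_csv(io.StringIO(val_str), header=None)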

Related

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a python script that:
reads in an HDFS parquet file
converts it to a pandas dataframe
loops through specific columns and changes some values
writes the dataframe back to a parquet file
Then the parquet file is imported back into hdfs using impala-shell.
The issue I'm having appears to be with step 2. I have it print out the contents of the dataframe immediately after it reads it in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes it back to a parquet file. Examples:
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
fields that should be an Int with a value of 0 in the database are changed to "0.00000" and turned into a float in the dataframe.
It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:
WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]
I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.
I have been using this reference page:
http://arrow.apache.org/docs/python/parquet.html
It uses
pq.read_table(in_file)
to read the parquet file and then
df = table2.to_pandas()
to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
If I query the database, the data would look like this:
tport
0
1
My print(df) line for the same thing looks like this:
tport
0.00000
nan
nan
1.00000
Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path
# <CLI PROCESSING SECTION HERE>
# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
field_file = open('scrambler_columns.txt', 'r')
contents = field_file.read()
scrambler_columns = contents.split('\n')
def scramble_str(xstr):
    #print(xstr + '_scrambled!')
    return xstr + '_scrambled!'
parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)
df = table2.to_pandas() #dataframe
print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
#df.fillna(value='', inplace=True) # np.nan # \xa0
print(df) # print before making any changes
cols = list(df)
# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
    #print(cols[index])
    if col_name in scrambler_columns:
        print('scrambling values in column ' + col_name)
        for i, val in col_data.items():
            df.at[i, col_name] = scramble_str(str(val))
print(df) # print after making changes
print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))
# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
for i in range(1):
    writer.write_table(new_table)
writer.close()
if os.path.isfile(out_file) == True:
    print('wrote ' + out_file)
else:
    print('error writing file ' + out_file)
# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas() #dataframe
print(df)
EDIT
Here are the data types for the first few columns in HDFS, and here are the same ones as they appear in the pandas dataframe:
id object
col1 float64
col2 object
col3 object
col4 float64
col5 object
col6 object
col7 object
It appears to convert
String to object
Int to float64
bigint to float64
How can I tell pandas what data types the columns should be?
Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
This is expected. It's just how the pandas print function is defined.
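If you want to confirm that these are genuine missing values rather than the literal strings "None" or "nan", you can check the column directly (using the tport column from the question):
# Count the missing entries; real NaN/None values show up here, literal strings would not
print(df['tport'].isna().sum())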
It appears to convert String to object
This is also expected. Numpy/pandas does not have a dtype for variable length strings. It's possible to use a fixed-length string type but that would be pretty unusual.
It appears to convert Int to float64
This is also expected, since the column has nulls and numpy's int64 is not nullable. If you would like to use pandas' nullable integer column type you can do...
def lookup(t):
    if pa.types.is_integer(t):
        return pd.Int64Dtype()
df = table.to_pandas(types_mapper=lookup)
Of course, you could create a more fine-grained lookup if you wanted to use both Int32Dtype and Int64Dtype; this is just a template to get you started.
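As a rough sketch of what such a finer-grained mapper could look like (assuming table is the pyarrow Table returned by read_table, as above):
import pyarrow as pa
import pandas as pd

def lookup_fine(t):
    # Map Arrow integer types to the matching pandas nullable integer dtypes
    if pa.types.is_int32(t):
        return pd.Int32Dtype()
    if pa.types.is_int64(t):
        return pd.Int64Dtype()
    # Returning None falls back to the default conversion for everything else
    return None

df = table.to_pandas(types_mapper=lookup_fine)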

Pandas: how to consider the content of certain columns as lists

Let's say I have a simple pandas dataframe named df:
0 1
0 a [b, c, d]
I save this dataframe into a CSV file as follows:
df.to_csv("test.csv", index=False, sep="\t", encoding="utf-8")
Then later in my script I read this csv:
df = pd.read_csv("test.csv", index_col=False, sep="\t", encoding="utf-8")
Now what I want to do is use explode() on column '1', but it does not work because the content of column '1' is no longer a list, since I saved df to a CSV file.
What I have tried so far is changing the type of column '1' to a list with astype(), without any success.
Thank you in advance.
Try this. Since you are reading from a csv file, your dataframe values in column A (column '1' in your case) are essentially strings, so you need to parse them back into lists.
import pandas as pd
import ast
df=pd.DataFrame({"A":["['a','b']","['c']"],"B":[1,2]})
df["A"]=df["A"].apply(lambda x: ast.literal_eval(x))
Now, the following works!
df.explode("A")
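Alternatively, you can do the parsing at read time with read_csv's converters argument (a sketch using the test.csv from the question; this assumes the list column was written out in Python literal form, e.g. "['b', 'c', 'd']"):
import ast
import pandas as pd

df = pd.read_csv("test.csv", index_col=False, sep="\t", encoding="utf-8",
                 converters={"1": ast.literal_eval})
df = df.explode("1")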

Python3: How to keep the leading zero when exporting a dataframe to a csv or text file

I am using the following command: (python3)
Mydataframe__df.to_csv(string_io, sep=',', quoting=csv.QUOTE_ALL, header=True, index=False , encoding='utf-8')
df_writer = Mydata_Output.get_writer('/MYFILE_TEST.csv')
df_string = string_io.getvalue()
# save the string as bytes with the writer
df_writer.write(df_string.encode('utf-8'))
# close the writer connection
df_writer.close()
The issue is that for columns with a format like "012345", the leading 0 is removed in the output file, even when opening it with Notepad and even when the column type is set to string in the dataframe.
I'm kind of new here too, so I don't have the street cred to comment.
You can preserve leading zeros by converting to a string before outputting the data. Let's say, for example, you want eight digits in your data columns; you could use zfill to left-pad the string with zeros so it is eight digits long.
outvar = str(numvar).zfill(8)
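Applied to a whole dataframe column before writing the csv, that could look something like this (a sketch; 'account_code' is a made-up column name):
# Left-pad every value to eight characters with zeros before export
Mydataframe__df['account_code'] = Mydataframe__df['account_code'].astype(str).str.zfill(8)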
The issue with the leading 0 is that when we load the dataframe into pandas before writing the csv, pandas by default infers its own data types.
Using .get_dataframe(infer_with_pandas=False) forces it to keep the source data types.
The catch is that when there are nulls in non-string fields, pandas does not like it, so we need to change everything into strings or clean the data beforehand.
I found the .get_dataframe(infer_with_pandas=False) tip in one of the publications here. I will try to reference it later.
# Read recipe inputs
Mydataframe = dataiku.Dataset("TESTING_for_leading0")
Mydataframe_df = Mydataframe.get_dataframe(infer_with_pandas=False)
Mydataframe_df.to_csv(string_io, sep=',', quoting=csv.QUOTE_ALL, header=True, index=False , encoding='utf-8')
df_writer = Mydata_Output.get_writer('/MYFILE_TEST.csv')
df_string = string_io.getvalue()
# save the string as bytes with the writer
df_writer.write(df_string.encode('utf-8'))
# close the writer connection
df_writer.close()
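Outside of Dataiku, the plain-pandas way to stop this inference when reading a csv is to declare the column as a string up front (a sketch; the file and column names are made up):
import pandas as pd

# dtype keeps the column as text, so "012345" keeps its leading zero
df = pd.read_csv('input.csv', dtype={'code': str})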

How to read a big csv file and read each column separately?

I have a big CSV file that I read as dataframe.
But I cannot figure out how to read each column separately.
I have tried to use sep = '\' but it gives me an error.
I read my file with this code:
filename = "D:\\DLR DATA\\status maritime\\nperf-data_2019-01-01_2019-09-10_bC5HIhAfJWqdQckz.csv"
#Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv(filename)
df1 = df.head()
When I print my dataframe head, everything ends up in a single column: in the variable explorer my dataframe consists of only 1 column with all the data inside.
I tried to set sep to a space and to a comma, but it didn't work.
How can I read all data with appropriate different columns?
I would appreciate any help.
You have a tab-separated file; use \t:
df = dd.read_csv(filename, sep='\t')
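If you are not sure which delimiter a file uses, a quick sanity check (independent of dask) is to print the raw first line:
# The real delimiter (tab, comma, semicolon, ...) is visible in the repr of the raw line
with open(filename, encoding='utf-8') as f:
    print(repr(f.readline()))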

Why is read_sas converting strings to float?

I am trying to read a .sas7bdat file using pandas, and I am having a hard time because pandas is converting string values that look like numbers into floats.
For example, if I have a telephone number like '348386789' and I read it with the following code:
import pandas as pd
df = pd.read_sas('test.sas7bdat', format='sas7bdat', encoding='utf-8')
The output would be 348386789.0!
I can convert every single column with something like df['number'].astype(int).astype(str), but this would be very inefficient.
There is the same problem in the read_csv function, but there you can use the dtype argument to set the type for the required column (e.g. dtype={'number': str}).
Is there a better way to read values in the desired format and use it in a dataframe?
UPDATE
I even tried sas7bdat.py and pyreadstat, with the same results. You might say that the problem is in the data, but when I use an online tool to read the sas7bdat file, the data looks correct.
Code for the other two libraries:
# pyreadstat module
import pyreadstat
df2, meta = pyreadstat.read_sas7bdat('test.sas7bdat')
# sas7bdat module
from sas7bdat import SAS7BDAT
reader = SAS7BDAT('test.sas7bdat')
df_sas = reader.to_data_frame()
If you want to try (and you have a SAS license), you can create a .sas7bdat file with the following content:
column_1,column_2,column_3
11,20190129,5434
19,20190228,5236
59,20190328,10448
76,20190129,5434
Use sas7bdat.py instead. That typically preserves the dataset formats better.
If a particular column is defined as character in the SAS dataset, then sas7bdat will read it as a string regardless of what the contents look like. As a lazy example, I created this dataset in SAS:
data test;
    id = '1111111'; val = 1; output;
    id = '2222222'; val = 2; output;
run;
And then ran the following Python code on it:
from sas7bdat import SAS7BDAT
reader = SAS7BDAT('test.sas7bdat')
df = reader.to_data_frame()
print(df)
cols = reader.columns
for col in cols:
    print(str(col.name) + " " + str(col.type))
Here is what I see:
id val
0 1111111 1.0
1 2222222 2.0
b'id' string
b'val' number
If you are looking to 'intelligently' convert numbers to strings based on context, then you may need to look elsewhere. Any SAS dataset reader is just going to read based on the format specified within the dataset at best.
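If you already know which columns are identifiers rather than measurements (like the telephone numbers in the question), a small post-processing loop avoids converting every column by hand; a rough sketch assuming a reasonably recent pandas, with a hypothetical column list:
import pandas as pd

df = pd.read_sas('test.sas7bdat', format='sas7bdat', encoding='utf-8')

id_columns = ['number']  # hypothetical: columns stored as SAS numerics that are really IDs
for col in id_columns:
    # Nullable Int64 copes with missing values that plain int cannot; str drops the ".0"
    df[col] = df[col].astype('Int64').astype(str)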
