AWS Redshift Spectrum not working with Apache Parquet files - python-3.x

Following is my sample CSV file.
id,name,gender
1,isuru,male
2,perera,male
3,kasun,male
4,ann,female
I converted the above CSV file into Apache Parquet using the pandas library. Following is my code.
import pandas as pd
df = pd.read_csv('./data/students.csv')
df.to_parquet('students.parquet')
After that I uploaded the Parquet file to S3 and created an external table like below.
create external table imp.s1 (
    id integer,
    name varchar(255),
    gender varchar(255)
)
stored as PARQUET
location 's3://sample/students/';
After that I just ran a select query, but I got the following error.
select * from imp.s1
Spectrum Scan Error. File 'https://s3.ap-southeast-2.amazonaws.com/sample/students/students.parquet'
has an incompatible Parquet schema for column 's3://sample/students.id'.
Column type: INT, Parquet schema:
optional int64 id [i:0 d:1 r:0] (s3://sample/students.parquet)
Could you please help me figure out what the problem is here?

For nullable integer values, pandas uses the Int64 dtype, which corresponds to BIGINT in Parquet on Amazon S3.
Parquet Amazon S3 File Data Type | Transformation | Description
Int32 | Integer | -2,147,483,648 to 2,147,483,647 (precision of 10, scale of 0)
Int64 | Bigint | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (precision of 19, scale of 0)
You need to explicitly set the column type of id when calling pandas.read_csv.
df = pd.read_csv('./data/students.csv', dtype={'id': 'int32'})
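A minimal end-to-end sketch of the fix, assuming the same file names as above and pyarrow installed for the schema check:
import pandas as pd
import pyarrow.parquet as pq

# force the id column to 32-bit so it maps to Redshift Spectrum's INTEGER
df = pd.read_csv('./data/students.csv', dtype={'id': 'int32'})
df.to_parquet('students.parquet')

# sanity check: the physical type written for id should now be int32
print(pq.read_schema('students.parquet'))
Alternatively, leaving the pandas default alone and declaring the column as bigint in the external table DDL should also line up with the int64 that gets written.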

Related

pandas: dtype issue when comparing a dataframe created from a DB with one created from a CSV file

I have a requirement to compare DB-migrated data with the CSV file created in S3 for the same table, using a Python script with the pandas library.
While doing this, I am facing a dtype issue, because the data types change when the data moves to the CSV file. For example: the dataframe created from the table has dtype object, whereas the CSV dataframe has dtype float,
so when I run df1table.equals(df2csv) the result is False.
I also tried to change the dtype of the table dataframe, but got an error saying a string can't be converted to float. I am also facing an issue with NULL values in the table dataframe compared to the CSV dataframe.
I need a generic solution which works for every table and its respective CSV file.
Is there a better way to compare them? For example: convert both dataframes to the same type and compare.
Looking forward to your reply. Thanks!
To prevent pandas from inferring the data types, you can pass dtype=object to pd.read_csv:
df2csv = pd.read_csv('file.csv', dtype=object)  # plus any other params
Example:
df1 = pd.read_csv('data.csv')
df2 = pd.read_csv('data.csv', dtype=object)
# df1
          A         B     C
0  0.988888  1.789871  12.7
# df2
                A            B     C
0  0.988887546565  1.789871131  12.7
CSV file:
A,B,C
0.988887546565,1.789871131,12.7
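A rough sketch of the generic comparison, assuming both frames have the same columns and row order (df_table below is a stand-in for the dataframe read from the DB, and keep_default_na=False is added so empty cells stay empty strings rather than NaN):
import pandas as pd

# stand-in for the dataframe built from the DB table (hypothetical data)
df_table = pd.DataFrame({'A': ['0.988887546565'], 'B': ['1.789871131'], 'C': ['12.7']})

# read the CSV without type inference
df_csv = pd.read_csv('data.csv', dtype=object, keep_default_na=False)

# normalise both sides to strings so dtype differences (object vs float) no longer matter
same = df_table.fillna('').astype(str).equals(df_csv.fillna('').astype(str))
print(same)  # True when the cell values match textually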

Extracting column name and datatype from parquet file with python

I have hundreds of parquet files, I want to get the column name and associated data type into a list in Python. I know I can get the schema, it comes in this format:
COL_1: string
-- field metadata --
PARQUET:field_id: '34'
COL_2: int32
-- field metadata --
PARQUET:field_id: '35'
I just want:
COL_1 string
COL_2 int32
In order to go from Parquet to Arrow (and vice versa), some metadata is added to the schema under the PARQUET key.
You can remove the meta data easily:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array(['foo', 'bar'])],
    schema=pa.schema({'COL1': pa.int32(), 'COL2': pa.string()})
)
pq.write_table(table, '/tmp/table.pq')

parquet_file = pq.ParquetFile('/tmp/table.pq')
schema = pa.schema(
    [f.remove_metadata() for f in parquet_file.schema_arrow])
schema
This will print:
COL1: int32
COL2: string
Bear in mind that if you start writing your own metadata, you'll want to remove only the metadata under the PARQUET key.
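To get just the name/type pairs the question asks for as a Python list, you can iterate over the arrow schema from the snippet above; a small sketch:
# list of (column name, data type) pairs
columns = [(field.name, str(field.type)) for field in parquet_file.schema_arrow]
print(columns)  # [('COL1', 'int32'), ('COL2', 'string')]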

Create External Partition Table GCP Bucket

We want to create an external table on top of our GCP bucket where we store Parquet data.
Currently our bucket has this structure:
Buckets/MY BUCKET/DATA/2020/07/11/ -- this location will have the Parquet files
How can we create an external table on top of that, partitioned by year/month/day?
The Parquet files contain a field TIME which has the required value.
Sample value: 2020-07-11T15:13:52.032Z
I am using this command:
CREATE TABLE IF NOT EXISTS TESTING (
    ID VARCHAR,
    SOURCE VARCHAR,
    TIME VARCHAR
)
PARTITIONED BY (TIME)
WITH (format = 'parquet', external_location = 'My Bucket Location')
You can use the pseudo column _FILE_NAME, which gives you the full path of your file: gs://MY BUCKET/DATA/2020/07/11/filename
You can do this (the table name below is a placeholder for your external table):
with part as (
    select
        *,
        REGEXP_EXTRACT(_FILE_NAME, r"^.*/DATA/([0-9]{4})/.*") as year,
        REGEXP_EXTRACT(_FILE_NAME, r"^.*/DATA/[0-9]{4}/([0-9]{1,2})/.*") as month,
        REGEXP_EXTRACT(_FILE_NAME, r"^.*/DATA/[0-9]{4}/[0-9]{1,2}/([0-9]{1,2})/.*") as day
    from `my_dataset.my_external_table`
)
select *
from part
where year = "2020" and month = "07"
Or create a view and query only the view, not the raw external table.
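If you are driving this from Python, a rough sketch of running that partition-filtered query with the google-cloud-bigquery client could look like the following (project, dataset, and table names are placeholders, not from the question):
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

sql = """
with part as (
    select *,
        REGEXP_EXTRACT(_FILE_NAME, r"^.*/DATA/([0-9]{4})/.*") as year,
        REGEXP_EXTRACT(_FILE_NAME, r"^.*/DATA/[0-9]{4}/([0-9]{1,2})/.*") as month
    from `my_dataset.my_external_table`
)
select * from part where year = "2020" and month = "07"
"""

# run the query and pull the filtered rows back as a dataframe
rows = client.query(sql).result()
df = rows.to_dataframe()
print(df.head())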

Table in Pyspark shows headers from CSV File

I have a CSV file with the contents below, which has a header in the first line.
id,name
1234,Rodney
8984,catherine
Now I was able to create a table in Hive that skips the header and reads the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But when I use the same table in PySpark (running the same query), I even see the header row from the file in the PySpark results, as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I keep the header row from showing up in the PySpark results?
I'm aware that we can read the CSV file and add .option("header", True) to achieve this, but I want to know if there's a way to do something similar in PySpark while querying tables.
Can someone suggest a way? Thanks 🙏 in advance!
You can use the two kinds of properties below: SerDe properties and table properties. With them you will be able to access the table from both Hive and Spark, skipping the header in both environments.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is a known issue, SPARK-11374, and it was closed as "won't fix".
In the query you can add a where clause to select all records except 'id' and 'name'.
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be to read the files from HDFS directly with .option("header", "true").
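A rough sketch of that last approach (the path is taken from the question's Hive table location; adjust the options to your data):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the raw CSV directly and treat the first line as the header
df = (spark.read
      .option("header", "true")
      .csv("s3://some-testing/test/data/"))

df.show(10, False)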

DateTime in USQL automatically convert to Unix Timestamp in parquet file

I have a problem with DateTime values generated by U-SQL.
I wrote some U-SQL to save data into a Parquet file, but all the DateTime columns were automatically converted to Int64 (Unix timestamps).
I tried to investigate and found some information at https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2018/2018_Spring/USQL_Release_Notes_2018_Spring.md#data-driven-output-partitioning-with-output-fileset-is-in-private-preview: all DateTime values will be converted to int64.
Why does MS need to do that, and how can I keep the original DateTime values generated by U-SQL in the Parquet file?
I put a simple example below:
SET @@FeaturePreviews = "EnableParquetUdos:on";

@I =
    SELECT DateTime.Parse("2015-05-10T12:15:35.1230000Z") AS c_timestamp
    FROM (VALUES(1)) AS T(x);

OUTPUT @I
TO @"/data/Test_TimeStamp.parquet"
USING Outputters.Parquet();
The result in the parquet file is: c_timestamp - Int64 datatype - 1431260135123000.
But I expected the parquet file to contain: c_timestamp - DateTime datatype - 2015-05-10T12:15:35.1230000Z.
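No answer is recorded here, but one practical workaround on the consuming side: assuming the int64 values are microseconds since the Unix epoch (which matches the sample value, 1431260135123000 µs = 2015-05-10T12:15:35.123Z), you can convert them back to timestamps when reading the file in Python, for example:
import pandas as pd

df = pd.read_parquet('Test_TimeStamp.parquet')
# interpret the int64 column as microseconds since the Unix epoch
df['c_timestamp'] = pd.to_datetime(df['c_timestamp'], unit='us', utc=True)
print(df['c_timestamp'])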
