parquet to spark dataframe with location column - apache-spark

I am trying to read a parquet file into a DataFrame using Spark. My requirement is to create another column in the DataFrame derived from the path of the parquet file.
Eg: I have parquet files in the following path in hdfs:
/event/2018-01-01/abc/xyz=10/parquet1.parquet
/event/2018-01-01/abc/xyz=10/parquet2.parquet
I want to read all the files under /event/2018-01-01/abc and create a column "dt" in the DataFrame that holds the date from the path. How do I extract the date from the path and add it as a column in the Spark DataFrame?
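A minimal sketch of one way to do this, using input_file_name() plus regexp_extract to pull the yyyy-MM-dd segment out of each file's path; the column name "dt" and the path layout come from the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, to_date

spark = SparkSession.builder.getOrCreate()

# Read everything under the base path; xyz=10 is picked up as a partition column.
df = spark.read.parquet("/event/2018-01-01/abc")

# input_file_name() returns the full source path for each row; regexp_extract
# pulls out the date segment and to_date casts it to a date.
df_with_dt = df.withColumn(
    "dt",
    to_date(regexp_extract(input_file_name(), r"/event/(\d{4}-\d{2}-\d{2})/", 1)),
)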

Related

pyspark csv format - mergeschema

I have a large dump of data spanning TBs. The files contain daily activity data.
Day 1 can have 2 columns and Day 2 can have 3 columns. The dump is in CSV format. Now I need to read all these files and load them into a table. The problem is that the format is CSV and I am not sure how to merge the schemas so as not to lose any columns. I know this can be achieved in parquet through mergeSchema, but I can't convert these files one by one into parquet because the data is huge. Is there any way to merge schemas with CSV as the format?
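One hedged approach, assuming Spark 3.1+ and per-day folders (the paths below are hypothetical): read each day's files with their own inferred schema, then union them by column name so missing columns are filled with nulls.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical per-day folders; adjust to the real layout.
day_paths = ["dump/day1", "dump/day2"]

daily_dfs = [
    spark.read.option("header", "true").option("inferSchema", "true").csv(p)
    for p in day_paths
]

# unionByName with allowMissingColumns=True (Spark 3.1+) keeps the superset
# of columns and fills the missing ones with nulls.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), daily_dfs)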

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is in ISO 8601 format.
The values in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, PySpark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop PySpark from doing this. I tried to use a custom schema with string for the dates, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but that didn't work for me.
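A sketch of one possible workaround (an assumption, not confirmed by the question): let Spark read the INT96 column as a timestamp, then render it back to an ISO 8601 string with date_format. The column name "time" comes from the question; the session time zone and output pattern are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = SparkSession.builder.getOrCreate()
# Assumption: keep the session in UTC so the rendered offset stays meaningful.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.read.parquet("file.snappy.parquet")

# Format the timestamp back into an ISO 8601 string, e.g. 2021-03-13T05:34:27.100Z.
df_iso = df.withColumn("time_iso", date_format(col("time"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))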

Pyspark: how to filter by date and read parquet files which is partitioned by date

I have a huge dataset of partitioned parquet files stored in AWS S3 in the folder format data-store/year=<>/month=<>/day=<>/hour=<>/, e.g. data-store/year=2020/month=06/day=01/hour=05.
I want to read the files only for a specific date range, e.g. 2020/06/01 to 2020/08/30, or something like all records with a date greater than or equal to 2020/06/01.
How do I do this efficiently so that only the required data is loaded into Spark memory?
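A minimal sketch, assuming the year/month/day/hour partition layout from the question and a hypothetical bucket name. Filtering directly on the partition columns keeps the predicates prunable, so Spark only scans the matching folders (building a derived date column and filtering on that can defeat pruning).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical bucket; the partition columns year/month/day/hour are
# discovered automatically from the key=value folder names.
df = spark.read.parquet("s3a://my-bucket/data-store/")

# All records with date >= 2020-06-01, expressed against the partition columns.
df_range = df.filter(
    (col("year") > 2020)
    | ((col("year") == 2020) & (col("month") > 6))
    | ((col("year") == 2020) & (col("month") == 6) & (col("day") >= 1))
)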

PySpark parquet datatypes

I am using PySpark to read a relatively large CSV file (~10GB):
ddf = spark.read.csv('directory/my_file.csv')
All the columns have the datatype string.
After changing the datatype of, for example, column_a, I can see the datatype change to integer. But if I write the ddf to a parquet file and read the parquet file back, I notice that all columns have the datatype string again. Question: How can I make sure the parquet file contains the correct datatypes, so that I do not have to change them again when reading the parquet file?
Notes:
I write the ddf as a parquet file as follows:
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')
I use:
PySpark version 2.0.0.2
Python 3.x
I read my large files with pandas and do not have this problem, so you could try pandas:
http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html
In [1]: import pandas as pd
In [2]: df = pd.read_csv('directory/my_file.csv')
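A Spark-side sketch of an alternative (an assumption, not the answerer's approach): cast the column before writing, or let Spark infer types when reading the CSV. Parquet stores the DataFrame's schema, so the types should come back intact on read. The column name column_a and the paths are taken from the question.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Infer types up front instead of reading everything as string...
ddf = spark.read.csv('directory/my_file.csv', header=True, inferSchema=True)
# ...or cast explicitly where inference is not reliable.
ddf = ddf.withColumn('column_a', col('column_a').cast(IntegerType()))

ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')

# Reading back: the parquet schema preserves the integer type.
spark.read.parquet('directory/my_parquet_file').printSchema()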

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files.
I changed also the type in the spark table to bigint.
However, when I try to read an old parquet file (with int) by spark as external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?
Using pyspark, couldn't you just do
df = spark.read.parquet('path to parquet files')
then cast the column type in the dataframe
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

new_df = df.withColumn('col_name', col('col_name').cast(LongType()))
and then save the new dataframe to the same location with overwrite mode?
