Read a partitioned Hive table in PySpark instead of a Parquet path

I have a partitioned Parquet dataset. It is partitioned by date, like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The dataset is huge, so I do not want to read all of it at once. I need only the August partitions, so I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However, I am forced to move from reading the Parquet files directly to reading from the corresponding Hive table, something like:
spark.read.table("schema.my_dataset")
I still want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?

Try filter with the like operator.
Example:
from pyspark.sql.functions import col
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can collect the August partition values into a variable, then filter with isin.
Example (Scala):
// collect the distinct partition values and keep only the required (August) ones
val lst = df.select("dt").distinct.collect().map(x => x(0).toString).filter(_.startsWith("2021-08"))
// then use isin to keep only those partitions
df.filter(col("dt").isin(lst:_*)).show()
Python equivalent:
lst = ["2021-08-01", "2021-08-02"]  # example partition values
df.filter(col("dt").isin(*lst)).show()
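If the partition values follow the calendar, the list for isin can also be built without querying the table at all. A minimal sketch, assuming dt holds ISO strings such as 2021-08-01; month_partition_values is a hypothetical helper, not a PySpark function:

```python
from datetime import date, timedelta


def month_partition_values(year, month):
    """Return every 'YYYY-MM-DD' string that falls in the given month."""
    day = date(year, month, 1)
    values = []
    while day.month == month:
        values.append(day.isoformat())
        day += timedelta(days=1)
    return values


august = month_partition_values(2021, 8)

# With a live SparkSession (not created here), the list could be used as:
#   from pyspark.sql.functions import col
#   df = spark.read.table("schema.my_dataset").filter(col("dt").isin(*august))
```

Calling .explain() on the filtered dataframe should show whether the condition was applied as a partition filter.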

Related

How does table data get loaded into a dataframe in Databricks: row by row or in bulk?

I am new to Databricks notebooks and dataframes. I need to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on the values in two existing columns.
I want to write the logic for the new column together with the select while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
      .select(columnsList)
      .withColumn('newColumnName', 'logic'))  # 'logic' stands for a Column expression
Will this have any performance impact? Is it better to first load the few columns into the df and then perform the column manipulation on the loaded df?
Is the table data loaded into the df all at once or row by row? If row by row, am I causing any performance degradation by including the column manipulation logic while reading the table?
Thanks in advance!
This really depends on the underlying format of the table: whether it is backed by Parquet or Delta, or is an interface to an actual database, etc. In general, Spark tries to read only the necessary data. With Parquet (or Delta) this is easier because it is a column-oriented file format, so the data for each column is stored together.
Regarding how the data is read: Spark is lazy by default, so even if you assign df = spark.read.table(....) to a separate variable, then add .select, and then add .withColumn, it won't do anything until you call an action, for example .count, or write your results. Until then, Spark will only check that the table exists, that your operations are valid, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
P.S. I recommend grabbing a free copy of Learning Spark, 2nd edition, provided by Databricks. It will give you a foundation for developing code for Spark/Databricks.

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by @pasha701 would involve loading the entire Spark dataframe with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to find the max partition date directly and load only that.
One way is to use hdfs or s3fs to list the contents of the dataset path, find the max partition in that list, and then load only that partition. That would be more efficient.
Assuming the data is on AWS S3, something like this:
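import s3fs

datelist = []
inpath = "s3://bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
# list the partition directories under the dataset root
dirs = fs.ls(inpath)
for path in dirs:
    # keep only the value after "batch_date="
    date = path.split('=')[1]
    datelist.append(date)
# ISO dates sort chronologically as plain strings
maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)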
This would do all the work in lists without loading anything into memory until it finds the one you want to load.
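The loop above assumes every entry in the listing contains an "=". A slightly more defensive sketch of the same idea, in plain Python so it can be tried without S3: latest_partition is a hypothetical helper, and the listing is made-up sample data standing in for the result of fs.ls.

```python
def latest_partition(paths, key="batch_date"):
    """Return (value, path) for the greatest `key=value` partition in paths."""
    prefix = key + "="
    found = []
    for path in paths:
        # The partition directory is the last path component,
        # e.g. "batch_date=2020-01-24".
        leaf = path.rstrip("/").rsplit("/", 1)[-1]
        if leaf.startswith(prefix):
            found.append((leaf[len(prefix):], path))
    if not found:
        raise ValueError(f"no {key}= partitions found")
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return max(found)


# Sample listing (assumed), including a marker file that is not a partition.
listing = [
    "bucket_path/data/batch_date=2020-01-20",
    "bucket_path/data/batch_date=2020-01-24",
    "bucket_path/data/_SUCCESS",
    "bucket_path/data/batch_date=2020-01-22",
]
value, path = latest_partition(listing)
```

Entries such as _SUCCESS are skipped instead of raising an IndexError, and an empty listing fails with a clear message.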
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Use show partitions to get all the partitions of a table:
show partitions TABLENAME
The output will look like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can then get data from a specific partition with a query such as
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
An additional filter or group by can be applied on top of it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a table partitioned on a single date column; I haven't tried it on a table with more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns a dataframe with a single column called 'partition', with values like partitioned_col=2022-10-31. Now we create a 'value' column by extracting just the date part as a string. This is then converted to a date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.
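For a table with more than one partition column, each string returned by show partitions contains several name=value pairs separated by "/". A minimal sketch of parsing them in plain Python; parse_partition_spec is a hypothetical helper and the rows are made-up sample values (in PySpark they would come from collecting the 'partition' column):

```python
def parse_partition_spec(spec):
    """Turn 'pt=2012.07.28.08/is_complete=1' into a dict of column -> value."""
    result = {}
    for part in spec.split("/"):
        name, _, value = part.partition("=")
        result[name] = value
    return result


# Sample rows (assumed) in the format printed by `show partitions`.
rows = [
    "pt=2012.07.28.08/is_complete=1",
    "pt=2012.07.28.10/is_complete=0",
    "pt=2012.07.28.09/is_complete=1",
]
parsed = [parse_partition_spec(r) for r in rows]

# Keep only complete partitions, then take the greatest pt value.
complete = [p for p in parsed if p["is_complete"] == "1"]
latest = max(complete, key=lambda p: p["pt"])
```

The string comparison works here because the pt values are zero-padded, so lexical order matches chronological order.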

PySpark: how to read in partitioning columns when reading parquet

I have data stored in Parquet files and a Hive table partitioned by year, month, day. Thus, each Parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have a list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, my data then does not include the year, month and day, as these are not part of the data per se; the information is stored only in the path to the file.
I could use the SQL context and send a Hive query with a select statement and a where clause on the year, month and day columns to select only the partitions I am interested in. However, I'd rather avoid constructing SQL queries in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in data stored as Parquet, where the year, month and day are not present in the Parquet file but only in the path to the file? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Pointing the reader at the parent directory of the year partitions should be enough for a dataframe to determine that there are partitions under it. However, it wouldn't know what to name the partition columns without a directory structure such as /year=2018/month=10.
Therefore, if you have Hive, going via the metastore would be better: the partitions are named there, Hive stores extra useful information about your table, and your Spark code no longer relies on knowing the direct path to the files on disk.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g.
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a layout that suits partitioned Parquet, so you would have to load the files one by one and add the dates yourself.
Alternatively, you can move the files to a directory structure fit for Parquet
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
Then you can read the parent directory (table) and filter on year, month, and day. Spark will read only the relevant directories, and you also get these as columns in your dataframe.
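The "load files one by one and add the dates" route can be sketched as follows. The path parsing is plain Python; date_parts_from_path is a hypothetical helper, and the Spark part is shown only as comments because it needs a live session.

```python
def date_parts_from_path(path):
    """Return (year, month, day) from a path ending in .../YYYY/MM/DD."""
    year, month, day = path.rstrip("/").split("/")[-3:]
    return int(year), int(month), int(day)


paths_to_files = [
    "hdfs://data/table_name/2018/10/29",
    "hdfs://data/table_name/2018/10/30",
]
parts = {p: date_parts_from_path(p) for p in paths_to_files}

# With a live SparkSession (not created here), each path could be loaded
# separately, tagged with its values via lit(), and the results unioned:
#   from functools import reduce
#   from pyspark.sql.functions import lit
#   dfs = [
#       spark.read.parquet(p)
#            .withColumn("year", lit(y))
#            .withColumn("month", lit(m))
#            .withColumn("day", lit(d))
#       for p, (y, m, d) in parts.items()
#   ]
#   df = reduce(lambda a, b: a.unionByName(b), dfs)
```

This keeps the existing directory layout, at the cost of one read per path.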

Partition column is moved to end of row when saving a file to Parquet

For a given DataFrame, just before it is saved to Parquet, here is the schema. Notice that centroid0 is the first column and is StringType:
However when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
and with the partitionCols as centroid0:
then there is a (to me) surprising result:
the centroid0 partition column has been moved to the end of the Row
the data type has been changed to Integer
I confirmed the output path via println :
path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
And here is the schema upon reading back from the saved parquet:
Why are those two modifications to the input schema occurring, and how can they be avoided while still keeping centroid0 as a partitioning column?
Update: A preferred answer should mention why/when the partitions are added to the end (vs the beginning) of the column list. We need to understand the deterministic ordering.
In addition, is there any way to make Spark "change its mind" about the inferred column types? I have had to change the partition values from 0, 1 etc. to c0, c1 etc. in order to get the inference to map to StringType. Maybe that was required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.
When you write.partitionBy(...), Spark saves the partition field(s) as folder(s).
This can be beneficial when reading the data later, because (with some file types, Parquet included) Spark can read just the partitions you use (i.e. if you read and filter for centroid0==1, Spark won't read the other partitions).
The effect is that the partition fields (centroid0 in your case) are not written into the Parquet files, only as folder names (centroid0=1, centroid0=2, etc.).
This has two side effects. First, the type of the partition column is inferred at run time (since its schema is not saved in the Parquet files); in your case you only had integer values, so it was inferred as integer.
Second, the partition field is added at the end of the schema: Spark reads the schema from the Parquet files as one chunk and then appends the partition field(s) as another (again, they are no longer part of the schema stored in the Parquet files).
You can actually pretty easily make use of ordering of the columns of a case class that holds the schema of your partitioned data. You will need to read the data from the path, inside which the partitioning columns are stored underneath to make Spark infer the values of these columns. Then simply apply re-ordering by using the case class schema with a statement like:
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
.schema(encoder.schema)
.format("parquet")
.option("mergeSchema", "true")
.load(myPath)
// reorder columns, since reading from partitioned data, the partitioning columns are put to end
.select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
.as[RecordType]
The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of that column. It is therefore useless to write the same value everywhere in the file, which is why Spark does not. When the data is read, Spark uses the information contained in the partition directory names to reconstruct the partitioning column, and puts it at the end of the schema. The type of the column is not stored; it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason as to why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary choice of implementation.
To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...").
You will waste space though. Also, spark uses the partitioning to optimize filters for instance. Don't forget to use the X column for filters after reading the data and not your column or Spark won't be able to perform any optimizations.
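The effect of the inference can be illustrated with a simplified stand-in. This is only a sketch, not Spark's implementation: infer_partition_type is a hypothetical function, and Spark's own inference covers more types than the three shown. The Spark SQL documentation also describes a setting, spark.sql.sources.partitionColumnTypeInference.enabled, which when set to false leaves partition columns as strings.

```python
def infer_partition_type(values):
    """Simplified stand-in for partition-value inference: int, double or string."""

    def fits(cast):
        # True only if every directory value parses with the given cast.
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


# Directory values as they would appear after "centroid0=" (sample data).
numeric_dirs = ["0", "1", "2"]
prefixed_dirs = ["c0", "c1", "c2"]

# With a live SparkSession (not created here), inference could be disabled via:
#   spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
```

This matches what the question observed: values 0, 1 were read back as integers, while c0, c1 stayed strings.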

Writing Spark Dataframe directly to HIVE is taking too much time

I am writing 2 dataframes from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 minutes to write the first df (1 row) and around 30 minutes to write 1M rows of the second df. I don't know how long it would take to write the entire 20M; I killed the job before it could complete.
I have tried two approaches to writing the df. I also cached the df to see if that would make the write faster, but it didn't seem to have any effect:
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
#df_log.registerTempTable("temp2")
#df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried both registerTempTable() and createOrReplaceTempView(), but there was no difference in the run time.
Is there a way to write it faster or more efficiently? Thanks.
Are you sure the final tables are cached? The issue might be that Spark computes the whole pipeline before writing the data. You can check that in the terminal/console where Spark runs.
Also, please check that the table you append to in Hive is not a temporary view; if it is, the view may be recomputed before the new rows are appended.
When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')
It is a bad idea (or design) to insert into a Hive table this way. Save the data as files and create a table on top of them, or add them as a partition to an existing table.
Can you please try that route?
Try repartitioning to a smaller number of files, for example .repartition(2000), and then write to Hive. A large number of partitions in Spark sometimes makes the write slow.
