I have a bunch of Hive tables.
I want to:
Pull the tables into PySpark DataFrames.
Apply a UDF to them.
Join 4 tables based on customer ID.
Is there a concept of indexing in Spark to speed up the operation?
If so, what's the command?
How do I create an index on a DataFrame?
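Roughly, the workflow I have in mind looks like this (table names, column names, and the UDF body are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# placeholder UDF - just normalizes a string column
clean_name = udf(lambda s: s.strip().lower() if s else None, StringType())

t1 = spark.table("db.customers").withColumn("name", clean_name("name"))
t2 = spark.table("db.orders")
t3 = spark.table("db.payments")
t4 = spark.table("db.shipments")

joined = (t1.join(t2, "customer_id")
            .join(t3, "customer_id")
            .join(t4, "customer_id"))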
I understand your problem, but the thing is that you acquire the data at the same time as you process it. Therefore, calculating an index before joining is not useful, as it would take more time to first create the index.
If you have several write operations, you may want to cache your data to speed things up, but otherwise an index is not the solution to investigate.
There is another thing you can try: df.repartition().
This will partition your df according to one column (for example, the join key). I can't guarantee it will help, but it is worth a look.
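If you do try caching and repartitioning, a minimal sketch might look like this (table and column names are made up, and spark is assumed to be an existing SparkSession):
# repartition on the join key so rows with the same customer_id are co-located,
# and cache a dataframe that is reused more than once
t1 = spark.table("db.customers").repartition("customer_id").cache()
t2 = spark.table("db.orders").repartition("customer_id")

joined = t1.join(t2, "customer_id")
joined.count()   # the cache is only populated when an action actually runs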
I am new to Databricks notebooks and dataframes. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on the values present in two columns.
I want to write the logic for the new column along with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
        .select(columnsList)
        .withColumn('newColumnName', 'logic'))
Will it have any performance impact? Is it better to first load the few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, then by including the column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, or is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, then it's easier because it's a column-oriented file format, so the data for each column is stored together.
Regarding the question about reading - Spark is lazy by default, so even if you assign df = spark.read.table(....) to a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
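For example (table and column names are made up), none of the lines below read any data until the final count:
from pyspark.sql.functions import col

df = (spark.read.table("db.some_table")      # lazy - nothing is read yet
          .select("col_a", "col_b")
          .withColumn("new_col", col("col_a") + col("col_b")))

df.explain()   # prints the plan, including which columns will actually be scanned
df.count()     # the first action - only now does Spark read the table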
P.S. I recommend grabbing the free copy of Learning Spark, 2nd edition, that is provided by Databricks - it will give you a foundation for developing code for Spark/Databricks.
I have a dataset that was partitioned by column ID and written to disk. This results in each partition getting its own folder in the filesystem. Now I am reading this data back in and would like to call groupBy('ID') followed by calling a pandas_udf function. My question is, since the data was partitioned by ID, is groupBy('ID') any faster than if it hadn't been partitioned? Would it be better to e.g. read one ID at a time using the folder structure? I worry the groupBy operation is looking through every record even though they've already been partitioned.
You have partitioned by ID and saved to disk
You read it again and want to groupby and apply a pandas udf
It is obvious that the groupby will look through every record, and so will most functions. But using a pandas_udf with groupBy("ID") is going to be expensive because it will go through an unnecessary shuffle.
You can optimize performance by grouping by spark_partition_id() instead, since you have already partitioned the data by the column you want to group on.
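A rough sketch of that idea (ID and value are made-up column names, ID is assumed to be a string, and the per-ID work is done in pandas inside each partition):
import pandas as pd
from pyspark.sql.functions import spark_partition_id

def per_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # one Spark partition may still hold several IDs, so group inside pandas
    out = pdf.groupby("ID", as_index=False)["value"].sum()
    return out.rename(columns={"value": "total"}).astype({"total": "float64"})

result = (df.withColumn("pid", spark_partition_id())
            .groupBy("pid")
            .applyInPandas(per_partition, schema="ID string, total double"))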
EDIT:
If you want file names, you can try:
from pyspark.sql.functions import input_file_name
df.withColumn("filename", input_file_name())
I have a Spark SQL query that groups by multiple columns. I was wondering if the order of the columns matters to query performance.
Does placing the column with more distinct values earlier help? I assume the group by is based on some hash/shuffle algorithm. If the first group by can distribute the data into small subsets that can be held on one machine, the later group bys can be done locally. Is this true?
What is the best practice for group by?
group by, as you assumed, uses a hash function on the columns to decide which set of group-by keys ends up in which partition.
You can use distribute by to tell spark which columns to use - https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation of the data (like placing columns with more distinct values earlier), note that if you have 2 group by statements in your query, you end up with 2 shuffles, and the result of the first one is obviously quite big (as it's not the final aggregation). So I would try to have as few group by statements as possible.
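In the DataFrame API, the rough equivalent of distribute by is a repartition on the group-by keys; a hedged sketch (table and column names are made up):
df = spark.table("db.sales")

agg = (df.repartition("customer_id", "country")   # plays the role of DISTRIBUTE BY
         .groupBy("customer_id", "country")
         .count())

agg.explain()   # check how many exchanges (shuffles) the plan actually contains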
I am trying to remove duplicates in Spark dataframes by using dropDuplicates() on a couple of columns. But the job is getting hung due to the amount of shuffling involved and data skew. I have used 5 cores and 30GB of memory to do this. The data on which I am performing dropDuplicates() is about 12 million rows.
Please suggest the most optimal way to remove duplicates in Spark, considering the data skew and shuffling involved.
Removing duplicates is an expensive operation, as it compares values from each partition against all the others and tries to consolidate the results. Considering the size of your data, this can be time-consuming.
I would recommend a groupBy transformation on the columns of your dataframe followed by an action. This way only the consolidated results from each partition are compared with the other partitions, and that lazily; you then request the result through an action like count / show, etc.
transactions.groupBy("col1", "col2").count.sort($"count".desc).show
distinct():
df.select(['id', 'name']).distinct().show()
dropDuplicates()
df.dropDuplicates(['id', 'name']).show()
dropDuplicates() is the way to go if you want to drop duplicates over a subset of columns, but at the same time you want to keep all the columns of the original structure.
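For the skew/shuffle part of the question, a hedged sketch: keep only the columns you need, raise the shuffle parallelism, and then deduplicate (the table name, columns, and numbers are only illustrative):
# more shuffle partitions spread a skewed shuffle across more, smaller tasks
spark.conf.set("spark.sql.shuffle.partitions", "400")

deduped = (spark.table("db.transactions")
                .select("col1", "col2", "amount")
                .dropDuplicates(["col1", "col2"]))

deduped.write.mode("overwrite").parquet("/tmp/transactions_deduped")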
I am writing 2 dataframes from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 minutes to write the first df (1 row) and around 30 minutes to write 1M rows of the second df. I don't know how long it would take to write the entire 20M rows; I killed the job before it could complete.
I have tried two approaches to write the df. I also cached the df to see if it would make the write faster, but it didn't seem to have any effect:
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
# df_log.registerTempTable("temp2")   # older API, same effect
df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried using both registerTempTable() as well as createOrReplaceTempView() but there was no difference in the run time.
Is there a way to write it faster or more efficiently? Thanks.
Are you sure the final tables are cached? It might be that before writing the data Spark recalculates the whole pipeline. You can check that in the terminal/console where Spark runs.
Also, please check whether the table you append to in Hive is a temporary view - if so, it could be recalculating the view before appending new rows.
When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')
It's a bad idea (or design) to do an insert into a Hive table. You should save the data as files and create a table on top of them, or add them as a partition to an existing table.
Can you please try that route?
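A hedged sketch of that route, assuming the target table is partitioned by a date column (the column name ds, the partition value, and the path below are all made up):
out_path = "/warehouse/project_alpha/sends_log_test/ds=2021-06-01"
df_log.write.mode("overwrite").parquet(out_path)

# attach the freshly written files to the table as a partition
spark.sql("""
    ALTER TABLE project_alpha.sends_log_test
    ADD IF NOT EXISTS PARTITION (ds='2021-06-01')
    LOCATION '/warehouse/project_alpha/sends_log_test/ds=2021-06-01'
""")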
Try repartitioning to a smaller number of files, say .repartition(2000), and then write to Hive. A large number of partitions in Spark sometimes takes a long time to write.
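For example (the partition count is only a starting point to tune):
(df_log.repartition(2000)                    # control the number of write tasks/files
       .write.mode("append")
       .insertInto("project_alpha.sends_log_test"))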