I use HBase in single-node mode. The rows in my table are huge. I have to read all columns in all rows sequentially using the Java API, but Get and Scan operations return the entire row (a Result that contains ALL columns of the row), which requires a lot of RAM. What should I do when some rows are larger than the available RAM? Is it possible to read the columns sequentially, one by one?
I have solved this problem by using the setBatch(int batch) method of the Scan class. Each call to the scanner's next() method then returns no more than the given number of columns, so a wide row arrives as several consecutive Result objects.
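To make the batching behaviour concrete, here is a minimal plain-Python model of it. It is not the HBase API: scan_in_batches and wide_row are hypothetical names, and the only assumption carried over from HBase is that the cells of a row come back sorted by column qualifier.

```python
from typing import Dict, Iterator, List, Tuple

Cell = Tuple[str, bytes]  # (column qualifier, value)


def scan_in_batches(row: Dict[str, bytes], batch: int) -> Iterator[List[Cell]]:
    """Yield the cells of one wide row in chunks of at most `batch` cells."""
    if batch <= 0:
        raise ValueError("batch must be positive")
    chunk: List[Cell] = []
    for qualifier in sorted(row):  # assumption: cells arrive sorted by qualifier
        chunk.append((qualifier, row[qualifier]))
        if len(chunk) == batch:
            yield chunk
            chunk = []
    if chunk:  # last, possibly shorter, chunk
        yield chunk


# Hypothetical row with 7 columns, read 3 columns at a time.
wide_row = {f"col{i:03d}": b"x" for i in range(7)}
batches = list(scan_in_batches(wide_row, batch=3))
batch_sizes = [len(b) for b in batches]
```

Only one chunk has to be held in memory at a time, which is the property that makes rows larger than the available RAM readable.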
I am new to Databricks notebooks and dataframes. I need to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on the values in two existing columns.
I want to write the logic for the new column together with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
      .select(columnsList)
      .withColumn('newColumnName', logic))  # logic: a Column expression built from the two columns
Will this have any performance impact? Is it better to first load the few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded into the df all at once or row by row? If row by row, am I causing any performance degradation by including the column manipulation logic while reading the table?
Thanks in advance!!
This really depends on the underlying format of the table: is it backed by Parquet or Delta, is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, this is easier because it is a column-oriented file format, so the data for each column is stored together.
Regarding the question about reading: Spark is lazy by default, so even if you put df = spark.read.table(....) in a separate variable, then add .select, and then add .withColumn, it won't do anything until you call an action, for example .count, or write your results. Until then, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
P.S. I recommend grabbing a free copy of Learning Spark, 2nd edition, which is provided by Databricks. It will give you a foundation for developing code for Spark/Databricks.
If my goal is to collect distinct values in a column as a list, is there a performance difference or pros/cons using either of these?
df.select(column).distinct().collect()...
vs
df.select(collect_set(column)).first()...
collect_set is an aggregate function that is normally used after a groupBy. When no grouping is provided, it treats the entire data as one big group.
1. collect_set
df.select(collect_set(column)).first()...
This will send the data of column column to a single task, which performs the collect_set operation (removing duplicates). If your data is big, it will swamp the single executor where all the data goes.
2. distinct
df.select(column).distinct().collect()...
This will partition all data of column column based on its value (the partition key); the number of partitions will be the value of spark.sql.shuffle.partitions (say 200). So 200 tasks will execute to remove duplicates, one for each shuffle partition. Only the deduplicated data will then be sent to the driver for the .collect() operation. This will fail if the data is still huge after removing duplicates, because the driver will run out of memory.
TLDR:
.distinct() is better than collect_set for your specific need
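The difference between the two plans can be sketched in plain Python. This is only a model, not Spark code: distinct_by_shuffle and distinct_single_group are hypothetical helpers, Python's hash() stands in for Spark's hash function, and 4 partitions stand in for spark.sql.shuffle.partitions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set


def distinct_by_shuffle(values: Iterable[Hashable], num_partitions: int) -> List[Set[Hashable]]:
    """Model of distinct(): route each value to a partition by hash, dedup per partition."""
    partitions: Dict[int, Set[Hashable]] = defaultdict(set)
    for v in values:
        partitions[hash(v) % num_partitions].add(v)
    return [partitions[i] for i in range(num_partitions)]


def distinct_single_group(values: Iterable[Hashable]) -> Set[Hashable]:
    """Model of collect_set without groupBy: one task builds the whole set."""
    return set(values)


values = [i % 10 for i in range(1000)]  # 1000 rows, 10 distinct values
parts = distinct_by_shuffle(values, num_partitions=4)
largest_task = max(len(p) for p in parts)  # most values any one task must hold
single = distinct_single_group(values)     # everything held by one task
```

Both approaches produce the same set of values; what differs is how many distinct values the busiest task has to hold (here 3 versus 10). In both cases the final list still has to fit on the driver.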
Here is an example of it. The output of cell 44 shows the count of distinct keys, but when I look at the partition sizes in cell 45, the keys 3 and 5 are combined into one partition. Also, when saving, the size of one partition is still zero. Any help would be appreciated.
By default, Spark applies a HashPartitioner to the values of the column overint. Apparently, the values 3 and 5 fall into the same partition after being hashed.
You may want to choose the RangePartitioner instead. Or, if you need full flexibility, you could write your own custom Partitioner class. However, custom partitioners are only available in the RDD API and not in the Structured API.
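A small plain-Python model shows why two keys can share a partition while another partition stays empty, and what an explicit key-to-partition mapping changes. The names here are hypothetical, the keys and the partition count are made up, and Python's hash() is only a stand-in: Spark's real hash function differs, so the concrete partition numbers will not match Spark's.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def hash_partition(key: int, num_partitions: int) -> int:
    """Stand-in for a hash partitioner (not Spark's actual hash function)."""
    return hash(key) % num_partitions


def partition_sizes(keys: Iterable[int], num_partitions: int,
                    partitioner: Callable[[int], int]) -> Dict[int, int]:
    """Count how many rows land in each partition, including empty ones."""
    counts = Counter(partitioner(k) for k in keys)
    return {p: counts.get(p, 0) for p in range(num_partitions)}


keys = [3, 3, 5, 5, 5]

# Hash partitioning: 3 and 5 collide, partition 0 stays empty.
hashed = partition_sizes(keys, 2, lambda k: hash_partition(k, 2))

# Custom partitioning: an explicit mapping gives each key its own partition.
explicit = {3: 0, 5: 1}
custom = partition_sizes(keys, 2, lambda k: explicit[k])
```

In the RDD API, the explicit mapping is the kind of logic that would go into the partitioning function.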
I had a question related to pyspark's partitionBy() function, which I originally posted in a comment on this question. I was asked to post it as a separate question, so here it is:
I understand that df.write.partitionBy(COL) will write all the rows with each value of COL to their own folder, and that each folder will (assuming the rows were previously distributed across all the partitions by some other key) have roughly the same number of files as were previously in the entire table. I find this behavior annoying. If I have a large table with 500 partitions and I use partitionBy(COL) on some attribute column, I now have, for example, 100 folders which each contain 500 (now very small) files.
What I would like is the partitionBy(COL) behavior, but with roughly the same file size and number of files as I had originally.
As demonstration, the previous question shares a toy example where you have a table with 10 partitions and do partitionBy(dayOfWeek) and now you have 70 files because there are 10 in each folder. I would want ~10 files, one for each day, and maybe 2 or 3 for days that have more data.
Can this be easily accomplished? Something like df.repartition(COL).write.partitionBy(COL) seems like it might work, but I worry that (in the case of a very large table which is about to be partitioned into many folders) having to first combine it into some small number of partitions before doing the partitionBy(COL) is a bad idea.
Any suggestions are greatly appreciated!
You've got several options. In my code below I'll assume you want to write in parquet, but of course you can change that.
(1) df.repartition(numPartitions, *cols).write.partitionBy(*cols).parquet(writePath)
This will first use hash-based partitioning to ensure that a limited number of values from COL make their way into each partition. Depending on the value you choose for numPartitions, some partitions may be empty while others may be crowded with values -- for anyone not sure why, read this. Then, when you call partitionBy on the DataFrameWriter, each unique value in each partition will be placed in its own individual file.
Warning: this approach can lead to lopsided partition sizes and lopsided task execution times. This happens when values in your column are associated with many rows (e.g., a city column -- the file for New York City might have lots of rows), whereas other values are less numerous (e.g., values for small towns).
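The effect of option (1) on the number of files can be checked with a small plain-Python model. This is not Spark code: files_written is a hypothetical helper that assumes the writer produces one file per (in-memory partition, partition-column value) pair that holds at least one row, and Python's hash() stands in for Spark's hash function. The data mirrors the toy example from the question, with 10 partitions and a dayOfWeek column.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # (row_id, day_of_week)


def files_written(rows: List[Row], partition_of: Callable[[Row], int]) -> int:
    """One file per (in-memory partition, partition-column value) pair that holds rows."""
    return len({(partition_of(row), row[1]) for row in rows})


rows: List[Row] = [(i, i % 7) for i in range(700)]

# Rows spread over 10 partitions by an unrelated key: every partition sees every day.
files_before = files_written(rows, lambda r: r[0] % 10)

# Rows hash-repartitioned by day first: each day lives in exactly one partition.
files_after = files_written(rows, lambda r: hash(r[1]) % 10)
```

Here files_before is 70 and files_after is 7. The model also shows the downside from the warning above: a day with many rows still ends up as a single, larger file.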
(2) df.sort(sortCols).write.parquet(writePath)
This option works great when you want (1) the files you write to be of nearly equal sizes and (2) exact control over the number of files written. This approach first globally sorts your data and then finds splits that break up the data into k evenly-sized partitions, where k is specified in the Spark config spark.sql.shuffle.partitions. This means that all rows with the same values of your sort key are adjacent to each other, but sometimes they'll span a split and be in different files. Thus, if your use case requires all rows with the same key to be in the same partition, then don't use this approach.
There are two extra bonuses: (1) by sorting your data, its size on disk can often be reduced (e.g., sorting all events by user_id and then by time will lead to lots of repetition in column values, which aids compression) and (2) if you write to a file format that supports it (like Parquet), then subsequent readers can read data in optimally by using predicate push-down, because the Parquet writer will write the MAX and MIN values of each column in the metadata, allowing the reader to skip rows if the query specifies values outside of the partition's (min, max) range.
Note that sorting in Spark is more expensive than just repartitioning and requires an extra stage. Behind the scenes Spark will first determine the splits in one stage, and then shuffle the data into those splits in another stage.
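The two properties of option (2), even splits and min/max skipping, can be illustrated with a plain-Python model. It is a simplification, not Spark or Parquet code: range_split cuts by row count where Spark samples the data to pick range boundaries, chunks_to_read stands in for a reader that checks min/max metadata, and the events sorted by (user_id, time) are made up.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, int]  # (user_id, time)


def range_split(sorted_rows: Sequence[Event], k: int) -> List[List[Event]]:
    """Split already-sorted rows into k contiguous, near-equal chunks."""
    n = len(sorted_rows)
    bounds = [n * i // k for i in range(k + 1)]
    return [list(sorted_rows[bounds[i]:bounds[i + 1]]) for i in range(k)]


def chunks_to_read(stats: List[Tuple[str, str]], user_id: str) -> List[int]:
    """Indices of chunks whose (min, max) user_id range could contain user_id."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= user_id <= hi]


# Sort by user_id, then time: 5 events for u1, 2 for u2, 2 for u3.
events = sorted((user, t) for user, n in [("u1", 5), ("u2", 2), ("u3", 2)] for t in range(n))
chunks = range_split(events, k=3)

# Per-chunk (min, max) of user_id, like the column statistics a writer could store.
user_stats = [(chunk[0][0], chunk[-1][0]) for chunk in chunks]
```

All three chunks hold 3 rows, but the events of u1 are spread over chunks 0 and 1. A query for u3 only needs chunk 2, because the other chunks' (min, max) ranges exclude it.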
(3) df.rdd.partitionBy(customPartitioner).toDF().write.parquet(writePath)
If you're using Spark on Scala, then you can write a custom partitioner, which can get around the annoying gotchas of the hash-based partitioner. Not an option in pySpark, unfortunately. If you really want to write a custom partitioner in pySpark, I've found this is possible, albeit a bit awkward, by using rdd.repartitionAndSortWithinPartitions:
(df.rdd
    .keyBy(sort_key_function)  # convert to key-value pairs
    .repartitionAndSortWithinPartitions(
        numPartitions=N_WRITE_PARTITIONS,
        partitionFunc=part_func)
    .values()  # get rid of the keys
    .toDF()
    .write.parquet(writePath))
Maybe someone else knows an easier way to use a custom partitioner on a dataframe in pyspark?
df.repartition(COL).write().partitionBy(COL)
will write out one file per partition. This will not work well if one of your partitions contains a lot of data, e.g. if one partition contains 100GB of data, Spark will try to write out a 100GB file and your job will probably blow up.
df.repartition(2, COL).write().partitionBy(COL)
will write out a maximum of two files per partition, as described in this answer. This approach works well for datasets that are not very skewed (because the optimal number of files per partition is roughly the same for all partitions).
This answer explains how to write out more files for the partitions that have a lot of data and fewer files for the small partitions.
We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. Given the internal storage structure, we can have wide rows. We almost always need to access data by date, and often within a date range, as we have financial data. If we use the date as the partition key to support filtering by date, we end up with few rows, each with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, given that we process millions of transactions every day?
Do we need to change the access pattern so that we have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra. There are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if some dates are accessed more frequently than others (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however. Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
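As a rough illustration of that last point, the following plain-Python sketch compares the size of the largest partition for two partition-key choices. It is not Cassandra code: the transactions are made up, and account_id is a hypothetical field standing in for whatever you always filter on by exact match.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple

Txn = Tuple[str, str]  # (date, account_id); account_id is a hypothetical field


def largest_partition(rows: List[Txn], key_fn: Callable[[Txn], Hashable]) -> int:
    """Number of rows in the biggest partition for a given partition-key choice."""
    return max(Counter(key_fn(r) for r in rows).values())


# 1000 made-up transactions over 2 dates and 5 accounts.
txns: List[Txn] = [(f"2024-01-0{1 + i % 2}", f"acct{i % 5}") for i in range(1000)]

by_date = largest_partition(txns, lambda r: r[0])                       # key: date
by_date_and_account = largest_partition(txns, lambda r: (r[0], r[1]))   # key: (date, account_id)
```

With the composite key, the largest partition shrinks from 500 to 100 rows. The trade-off is that every query then has to supply both the date and the account to address a partition.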