LoadIncrementalHFiles: Split occurred while grouping HFiles - apache-spark

I implemented a Spark (v2.4) application that processes raw data and stores it in a containerized HBase (v2.1).
I would like to bulk load the data into HBase, and for that purpose I use apache/hbase-connectors. I followed this example.
I pre-split the HBase table into 10 regions and transformed each key by hashing it, taking the hash value modulo 10, and concatenating the result as a prefix to the key.
For example: key = a123, newKey = 0_a123 (assuming hash(a123) mod 10 = 0).
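For illustration, here is a minimal Python sketch of that salting scheme (not my actual connector code; hashlib.md5 just stands in for whichever stable hash function is used):
import hashlib

NUM_BUCKETS = 10  # matches the 10 pre-split regions

def salt_key(key: str) -> str:
    # A stable hash so the same key always maps to the same bucket prefix.
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
    return "{}_{}".format(bucket, key)

print(salt_key("a123"))  # e.g. "0_a123" when the hash of "a123" mod 10 is 0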
When I run my Spark app I can see that the HFiles have been created, but when I try to doBulkLoad with LoadIncrementalHFiles I get the following error:
LoadIncrementalHFiles: Split occurred while grouping HFiles, retry
attempt 12 with 10 files remaining to group or split
I saw the following solution, which I think is similar to what I have already done:
Why does LoadIncrementalHFiles fail?
In a containerized environment, should the HFiles also be present in the container before running LoadIncrementalHFiles?
Should I pre-split the HBase regions differently?
Is there a formula to calculate the number of regions?
In the HBase logs I can see the following error:
regionserver.SecureBulkLoadManager: Failed to complete bulk load
java.io.FileNotFoundException: File ... does not exist

The problem was with the location of the HFiles.
Reading the HBase logs, I saw that HBase was looking for the HFiles, which existed only on my host machine, so the FileNotFoundException was thrown.
I mounted the HFiles output directory into the HBase container and the problem was solved.

Since you already hash your rows for better distribution, you might want to pre-split your table using the HexStringSplit region splitter utility. It automatically figures out how to split the table evenly across the hexadecimal key space for the number of regions you request, which may help you avoid the unnecessary splitting on the fly. You can use it from the command line like this:
hbase org.apache.hadoop.hbase.util.RegionSplitter TableName HexStringSplit -c 10 -f CF
TableName is your table name
10 is the number of regions to create (often chosen based on the number of region servers in the cluster)
CF is the name of the column family to create
The table must not already exist when you run this.

Related

How do I workaround the 5GB s3 copy limit with pyspark/hive?

I am trying to run a Spark SQL job against an EMR cluster. My create table operation contains many columns, but I'm getting an S3 error:
The specified copy source is larger than the maximum allowable size for a copy source: 5368709120
Is there a hive/spark/pyspark setting that can be set so that the _temporary files do not reach the 5 GB threshold when writing to S3?
This is working (only 1 column):
create table as select b.column1 from table a left outer join verysmalltable b on ...
This is not working (many columns):
create table as select b.column1, a.* from table a left outer join verysmalltable b on ...
In both cases, select statements alone work (see below).
Working:
select b.column1 from table a left outer join verysmalltable b on ...
select b.column1, a.* from table a left outer join verysmalltable b on ...
I'm wondering if it is memory related, but I'm unsure. If it were a memory issue, I would expect to hit a memory error before a copy error (and I would also expect the select statement with multiple columns to fail).
Only when create table is called do I run into the S3 error. I don't have the option of not using S3 for saving tables, and I was wondering if there is a way around this issue. The 5 GB limit seems to be a hard limit. If anyone has any information about what I can do on the hive/spark end, it would be greatly appreciated.
I'm wondering if there is a specific setting that can be included in the spark-defaults.conf file to limit the size of temporary files.
Extra information: the _temporary file is 4.5 GB after the error occurs.
In the past few months, something changed in how S3A uses the parameter
fs.s3a.multipart.threshold
This setting needs to be under 5 GB for queries of a certain size to work. Previously I had it set to a large number in order to save larger files, but apparently the behavior has changed.
The default value for this setting is about 2 GB; the Spark documentation gives different definitions depending on the Hadoop version being used.
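For reference, here is a hedged PySpark sketch of setting it (the exact values are assumptions to tune per workload); the same keys can also go into spark-defaults.conf as spark.hadoop.fs.s3a.multipart.threshold and spark.hadoop.fs.s3a.multipart.size:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-multipart-threshold")
    # switch to multipart uploads/copies well below the 5 GB single-object copy limit
    .config("spark.hadoop.fs.s3a.multipart.threshold", str(2 * 1024 * 1024 * 1024))  # 2 GB, in bytes
    .config("spark.hadoop.fs.s3a.multipart.size", str(128 * 1024 * 1024))            # 128 MB parts
    .getOrCreate()
)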

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs, so I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs by key. I need to write all the logs grouped under the same key to one table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED disposition gives an error because it would have to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and creating them if they don't.
If I load older data, I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link: it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting which table to write to (use "[prefix]_YYYYMMDD"-style names) and make each individual table column-partitioned.
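A minimal sketch of that approach (the project, dataset, and field names and the read step below are placeholders, not taken from your pipeline):
import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery


def route_to_table(element):
    # Build a "[prefix]_YYYYMMDD"-style table name from values in the record,
    # with no "$" partition decorator; each such table can itself be column-partitioned.
    return "my_project:my_dataset.{}_{}".format(
        element["log_type"], element["event_date"].replace("-", ""))


with beam.Pipeline() as p:
    (
        p
        | "ReadLogs" >> beam.io.ReadFromText("gs://my-bucket/logs/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Write" >> WriteToBigQuery(
            table=route_to_table,
            create_disposition=BigQueryDisposition.CREATE_NEVER,  # tables created in an earlier step
            write_disposition=BigQueryDisposition.WRITE_APPEND,
        )
    )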

Databricks Spark CREATE TABLE takes forever for 1 million small XML files

I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to use CREATE TABLE with the expectation of one record per file.
The Experiment
The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.
<OBSERVATION>
<HEADER>...</HEADER>
<RESULT>
<ID>...</ID>
<VALUES>...</VALUES>
</RESULT>
</OBSERVATION>
For parsing/deserialization I am using spark-xml by Databricks. At the moment I am expecting records with two columns, HEADER and RESULT, which is what I am getting.
CREATE TABLE Observations
USING XML
OPTIONS (
path "/mnt/blobstorage/records/*.xml",
rowTag "RESULT",
rootTag "OBSERVATION",
excludeAttribute True
)
The Problem
The CREATE TABLE statement runs for 5.5 hours (a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).
I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of the time. At first I thought it was a scaling problem in the storage connector, but I can run the same command on ~500K JSON files of similar size in ~25 s (a problem with XML vs JSON?).
I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To rule this out, I tried to:
predefine a schema (from only the first XML file)
ingest as plain text without parsing (using the TEXT provider).
The same problem persists in both cases.
The same statement runs within 20 s for 10K records and in 30 minutes for 200K records. With linear scaling (which is obviously not happening), 1 million records would be done in ~33 minutes.
My Databricks cluster has 1 driver node and 3 worker nodes, each with 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue in multiple runs over 4 days.
The Question
What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?
My guess is that you are running into a small-file problem, since you are processing only about 15 GB. I would merge the small files into bigger files of ca. 250 MB each.
As your dataset is still small, you could do this on the driver. The following code shows a merge on the driver node (without considering the optimal file size):
1. Copy the files from Blob storage to the driver-local file system and generate a shell script for the merge:
# copy files from mounted storage to driver-local storage
dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)
unzipdir = 'temp/records/'
gzipdir = 'temp/gzip/'
# generate a shell script (create the target dir, concatenate all XML files, gzip the result)
# and write it to the driver-local filesystem
script = ("mkdir -p " + gzipdir + "\n"
          + "cat " + unzipdir + "*.xml > " + gzipdir + "all.xml\n"
          + "gzip " + gzipdir + "all.xml")
dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)
2. Run the shell script
%sh
sudo sh ./scripts/makeone.sh
3. Copy the merged file back to the mounted storage
dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/blobstorage/recordsopt/", recurse=True)
Another important point is that the spark-xml library takes a two-step approach:
It parses the data to infer the schema. If the parameter samplingRatio is not changed, it does this for the whole dataset; often it is enough to do it on a smaller sample, or you can predefine the schema (use the schema parameter for this), in which case this step is not needed.
It reads the data.
Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run on a column-based format rather than directly on the XML files, and using the spark-xml library only for this preprocessing step.
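A rough sketch of that last step (paths are assumptions; rowTag as in your CREATE TABLE statement; the predefined schema lets spark-xml skip the inference pass):
from pyspark.sql.types import StringType, StructField, StructType

# assumed schema matching the <RESULT> element, so no schema-inference pass is needed
result_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("VALUES", StringType(), True),
])

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "RESULT")
    .schema(result_schema)
    .load("dbfs:/mnt/blobstorage/recordsopt/all.xml.gz")
)

# store as Parquet and run the heavier queries against the columnar copy
df.write.mode("overwrite").parquet("dbfs:/mnt/blobstorage/observations_parquet/")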

Spark - Performing read of large file results in multiplication of dataset

I'm currently running a script that performs a very simple read of a rather large pipe-delimited file (~870,000 records with 28 columns). Code below for reference:
readFile = spark.read.option("delimiter", inputFileDemiliter) \
    .csv(inputPath, mode=readMode, header=True, inferSchema=False, schema=schema)
The issue is that if I perform a simple count on the DataFrame readFile, I get a record count of about 14 million (16.59 times the initial record count, to be exact).
I imagine it has something to do with replication. We could dedup on the primary key column, but we shouldn't be getting this issue in the first place, so we want to avoid that.
Does anyone know how to prevent this? Thanks in advance.
It turned out that the issue was due to an encryption service that was active on our HDFS directory. The encryption messed with the number of delimiters in our file, hence the wacky record count.
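For anyone debugging something similar, a quick sanity check (a sketch; inputPath and readFile as in the question) is to compare the raw line count with the parsed row count; if the raw count is also inflated, the file content itself is the culprit rather than the CSV options:
raw_lines = spark.read.text(inputPath).count()   # physical lines in the file
parsed_rows = readFile.count()                   # rows after CSV parsing
print(raw_lines, parsed_rows)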

Cassandra Database Problem

I am using the Cassandra database for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I created the columns using the Cassandra Command Line Interface (CLI). When I then copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column: I get the message "zero rows present", although the files are there. The files have the extensions XXXX-Data.db, XXXX-Filter.db, and XXXX-Index.db. Can anyone tell me how to access the columns for these existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart it if you do any manual surgery like that.
(b) If you didn't also copy the schema definition, it will ignore data files for unknown column families.
For what you are trying to achieve, it would probably be better to export and import your SSTables.
Have a look at bin/sstable2json and bin/json2sstable.
Documentation is here (near the end of the page): Cassandra Operations
