Null values in some DataFrame columns while reading from HBase - apache-spark

I am reading data from HBase using Spark SQL. One column contains XML data. When the XML is small I am able to read the correct data, but as soon as it grows too large, some of the columns in the DataFrame become null, even though the XML column itself still comes through correctly.
While loading the data from SQL into HBase I applied this setting in my Sqoop job:
hbase.client.keyvalue.maxsize=0

Related

HBase: filter specific rows

I have a Java Spark (v2.4.7) job that currently reads an entire table from HBase.
The table has millions of rows, and reading all of it is very expensive (memory).
My process doesn't need all the data from the HBase table, so how can I avoid reading rows with specific keys?
Currently, I read from HBase as follows:
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext()
        .newAPIHadoopRDD(DataContext.getConfig(), TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class)
        .toJavaRDD();
I saw the answer in this post, but I couldn't figure out how to filter out specific keys.
Any help?
Thanks!

Write a large data set (around 100 GB) with just one partition to Hive using Spark

I am trying to write a large dataset to a Hive table (partitioned by date) using Spark. The data set contains only one date, so there is just one partition. Writing to the table takes a long time and also causes a shuffle. My code does not contain any joins; it only has some map functions, filters and a union. How can I write this kind of data to a Hive table efficiently? Check the image of the Spark UI here.
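One thing that often helps here (a sketch only; the partition column dt, the table names and the repartition count below are assumptions) is to spread the single date partition across many write tasks and enable Hive dynamic partitioning, so the 100 GB is not funnelled through one task:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Needed when inserting into a partitioned Hive table whose partition value
# comes from the data rather than being given statically.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df = spark.table("db.source_table")   # placeholder for the real map/filter/union pipeline

# All rows share one date (one Hive partition), so repartition first to spread
# the write across many tasks and files instead of one huge task.
(df.repartition(200)                  # tune to cluster size / desired file count
   .write
   .insertInto("db.target_table"))    # column order must match the table, partition column last

The repartition does add a shuffle, but that is usually far cheaper than writing the whole 100 GB from a single task.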

Cassandra data modelling - Select subset of rows in large table for batch Spark processing

I am working on a project in which we ingest large volumes of structured raw data into a Cassandra table and then transform it to a target schema using Spark.
Since our raw data table is getting pretty large, we would like to process it in batches. That means Spark has to look at the raw data table to identify the rows that have not yet been processed (by partition key) and then load that subset into Spark.
Being new to Cassandra, I am wondering how to implement this. Using Spark, I can quite efficiently load the raw data keys and compare them with the keys in the transformed table to identify the subset. But what is the best strategy for loading this subset of rows from my raw data table?
Here is an example. If my schema looks like this:
CREATE TABLE raw_data (
    dataset_id text PRIMARY KEY,
    some_json_data text
);
...and if I have dataset_ids 1, 2, 3, 4, 5 in my table and know that I now need to process the rows with ids 4 and 5, how can I efficiently select those rows, given that the list of ids can be pretty long in practice?
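One way to do this with the spark-cassandra-connector's DataFrame API (a sketch only; the keyspace, connection host and the ids_to_process DataFrame are assumptions) is to push an IN predicate on the partition key down to Cassandra for short lists, or join the key list against the raw table for long ones:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")  # your contact point
         .getOrCreate())

raw = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_ks", table="raw_data")
       .load())

# For a short, known list of keys the predicate is pushed down to Cassandra:
subset = raw.where(col("dataset_id").isin("4", "5"))

# For a long list, keep the unprocessed keys in a DataFrame and join on the
# partition key; recent connector versions can turn this into targeted reads
# ("direct join") instead of a full table scan.
ids_to_process = spark.createDataFrame([("4",), ("5",)], ["dataset_id"])
subset = raw.join(ids_to_process, "dataset_id")

For very long key lists it is worth checking subset.explain() to confirm whether the reads are actually targeted or the table is still being scanned.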

Transfer time series data from PySpark to Cassandra

I have a Spark cluster and a Cassandra cluster. In PySpark I read a CSV file and transform it into an RDD. I then go through every row in the RDD and apply a mapper and a reducer function. I end up with the following output (I've shortened the list for demonstration purposes):
[(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'), (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'), (u'20170114', u'BA', u'EDU')]
I want to go through each row in the array above and store each tuple in a single table in Cassandra, with the date as the unique key. Now I know that I can turn this array into a DataFrame and then store it in Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md#saving-a-dataframe-in-python-to-cassandra). If I turn the list into a DataFrame and then store it in Cassandra, will Cassandra still be able to handle it? I guess I'm not fully understanding how Cassandra stores values. In my array the dates are repeated, but the other values are different.
What is the best way for me to store the data above in Cassandra? Is there a way to store data directly from Spark into Cassandra using Python?
Earlier versions of DSE 4.x supported RDDs, but the current connector for DSE and open-source Cassandra is "limited to DataFrame only operations"; see PySpark with Data Frames.
You stated "I want the unique key to be the date". I assume you mean the partition key, since the date is not unique in your example. It's OK to use the date as the partition key (assuming the partitions will not get too large), but your primary key needs to be unique.
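As a concrete sketch of that DataFrame route (the column names, keyspace and table are assumptions, and the target table is expected to already exist in Cassandra with a unique primary key):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")  # your Cassandra contact point
         .getOrCreate())

rows = [(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'),
        (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'),
        (u'20170114', u'BA', u'EDU')]

df = spark.createDataFrame(rows, ["day", "actor", "category"])

# The Cassandra table needs a primary key that is actually unique, e.g.
#   PRIMARY KEY ((day), actor, category)
# so rows sharing a date land in the same partition without overwriting each other.
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="my_ks", table="events")
   .mode("append")
   .save())

Repeated dates are fine: they simply become rows within the same partition, as long as the clustering columns make each row distinct.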

How to store a Spark DataFrame as a dynamically partitioned Hive table in Parquet format?

The raw data currently lives in Hive. I want to join several partitioned, terabyte-scale Hive tables and then output the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark DataFrames and then doing the joins, group-bys, and so on. Is this the right way to do it?
Finally I will need to save the data. Can we save a Spark DataFrame as a dynamically partitioned Hive table in Parquet format, and how do we deal with the metadata?
If one of the data sets is sufficiently smaller than the others, you may want to consider using a broadcast join for data-transfer efficiency.
Depending on the nature of the data, you could also try grouping first and then joining, so that each machine only needs to process a specific subset of the data, reducing the amount of data transferred during the task run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given it a try?
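Putting those suggestions together in PySpark (a sketch only; the table names, the join key and the partition column dt are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

big = spark.table("db.big_partitioned_table")
small = spark.table("db.small_table")

# Broadcast the smaller table so the join avoids shuffling the large one.
joined = big.join(broadcast(small), "join_key")

# Save the result as a Parquet-backed, dynamically partitioned Hive table;
# saveAsTable registers the schema and partitioning in the Hive metastore.
(joined.write
   .format("parquet")
   .partitionBy("dt")
   .mode("overwrite")
   .saveAsTable("db.result_table"))

If you instead need to insert into an already-defined Hive table, use insertInto after enabling hive.exec.dynamic.partition and setting hive.exec.dynamic.partition.mode to nonstrict.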
