I have a DataFrame with a large number of columns (more than 100) and I want to insert it into HBase.
What's the best way to do that?
The HBase connector (SHC) can be used; more details here:
https://github.com/hortonworks-spark/shc
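With that many columns, SHC expects a JSON catalog describing how each DataFrame column maps to an HBase column family. A minimal pyspark sketch that generates the catalog programmatically rather than by hand (the table name, column family, row-key column, and string types here are assumptions; adjust to your schema):

```python
import json

def build_hbase_catalog(table_name, columns, column_family="cf", rowkey_col="key"):
    """Build the JSON catalog SHC expects, mapping every DataFrame column
    to the same column family (an assumed layout for illustration)."""
    catalog_columns = {rowkey_col: {"cf": "rowkey", "col": "key", "type": "string"}}
    for col in columns:
        catalog_columns[col] = {"cf": column_family, "col": col, "type": "string"}
    return json.dumps({
        "table": {"namespace": "default", "name": table_name},
        "rowkey": "key",
        "columns": catalog_columns,
    })

def write_df_to_hbase(df, catalog):
    # Requires the shc-core jar on the classpath, e.g.
    # --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11
    (df.write
       .options(catalog=catalog, newtable="5")
       .format("org.apache.spark.sql.execution.datasources.hbase")
       .save())
```

The newtable option asks SHC to create the table with that many region splits; drop it if the table already exists.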
I have a Java Spark (v2.4.7) job that currently reads the entire table from HBase.
The table has millions of rows, and reading all of it is very expensive (in memory).
My process doesn't need all the data from the HBase table. How can I avoid reading rows with specific keys?
Currently, I read from HBase as follows:
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext()
        .newAPIHadoopRDD(DataContext.getConfig(), TableInputFormat.class,
                         ImmutableBytesWritable.class, Result.class);
I saw the answer in this post, but I couldn't find how to filter out specific keys.
Any help?
Thanks!
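One common approach (a sketch, not an official connector recipe) is to restrict the scan before it reaches Spark: TableInputFormat honours row-start/row-stop properties, so HBase skips whole key ranges server-side, and anything finer-grained can be dropped with an RDD filter afterwards. Shown in pyspark for brevity; from the Java API the same properties can be set with conf.set(TableInputFormat.SCAN_ROW_START, ...) before calling newAPIHadoopRDD. The table name, key range, and excluded keys below are placeholders:

```python
def hbase_read_conf(table, row_start=None, row_stop=None):
    """Build the Hadoop conf dict for newAPIHadoopRDD, restricting the
    scan to a row-key range so HBase skips rows server-side. The keys
    are TableInputFormat's standard configuration property names."""
    conf = {"hbase.mapreduce.inputtable": table}
    if row_start is not None:
        conf["hbase.mapreduce.scan.row.start"] = row_start
    if row_stop is not None:
        conf["hbase.mapreduce.scan.row.stop"] = row_stop
    return conf

def read_filtered(sc, table, excluded_keys):
    # Range-restrict the scan, then drop the remaining unwanted keys
    # client-side. Key/value converter classes would be needed in
    # practice to turn HBase Results into Python objects.
    conf = hbase_read_conf(table, row_start="2017", row_stop="2018")
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf)
    return rdd.filter(lambda kv: kv[0] not in excluded_keys)
```

If the keys to exclude don't form contiguous ranges, the server-side option is a RowFilter on a serialized Scan (the hbase.mapreduce.scan property), which is easier to set up from the Java side.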
I have created a HIVE table through pyspark in ORC format and everything is working as per the requirement.
However, when I inspect the details of the Hive table with
describe formatted <tbl_name>;
I get the output below:
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL FALSE
numFiles 99
numRows -1
rawDataSize -1
How can I change the value of "COLUMN_STATS_ACCURATE" while writing the code in pyspark? Is there any way to do that? If not, is there a way to change it after the table has been created?
You can call ANALYZE TABLE:
spark.sql("ANALYZE TABLE foo COMPUTE STATISTICS")
but please remember that Spark output, in general, provides only partial Hive compatibility.
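Note that COLUMN_STATS_ACCURATE specifically reflects column-level statistics, so the FOR COLUMNS variant of the statement may also be needed. A small pyspark sketch (the table and column names are placeholders):

```python
def analyze_statements(table, columns=None):
    """Build the ANALYZE statements: table-level stats always,
    plus column-level stats when columns are given."""
    stmts = ["ANALYZE TABLE {} COMPUTE STATISTICS".format(table)]
    if columns:
        stmts.append("ANALYZE TABLE {} COMPUTE STATISTICS FOR COLUMNS {}"
                     .format(table, ", ".join(columns)))
    return stmts

def analyze_table(spark, table, columns=None):
    # Run each statement through the active SparkSession.
    for stmt in analyze_statements(table, columns):
        spark.sql(stmt)
```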
I have an RDD containing the primary keys of a table. I need to delete the rows in a Cassandra table that match the values in the RDD.
I see that there is deleteFromCassandra in spark-cassandra-connector, but I'm unable to use it; deleteFromCassandra is unresolved.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/streaming/DStreamFunctions.scala
Thanks for your help.
If I understand your use case, you can follow this question; maybe it will help you:
Delete from cassandra Table in Spark
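For reference, deleteFromCassandra comes from the connector's Scala RDDFunctions and needs import com.datastax.spark.connector._ in scope, so "unresolved" usually means that import is missing. It is not exposed to pyspark at all; if you are in Python, one workaround (a sketch using the separate DataStax cassandra-driver package rather than the connector; keyspace, table, and column names are placeholders) is to issue the deletes per partition:

```python
def delete_cql(keyspace, table, key_column):
    """Build the parameterised DELETE statement (names are placeholders)."""
    return "DELETE FROM {}.{} WHERE {} = ?".format(keyspace, table, key_column)

def delete_keys_rdd(rdd, cassandra_hosts, keyspace, table, key_column):
    # Sketch: open one driver session per partition and delete each key.
    # Assumes the cassandra-driver package is installed on the executors.
    def delete_partition(keys):
        from cassandra.cluster import Cluster  # lazy import on executors
        cluster = Cluster(cassandra_hosts)
        session = cluster.connect()
        stmt = session.prepare(delete_cql(keyspace, table, key_column))
        for key in keys:
            session.execute(stmt, (key,))
        cluster.shutdown()
    rdd.foreachPartition(delete_partition)
```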
I have a Spark Cluster and a Cassandra cluster. In pyspark I read a csv file then transform it to an RDD. I then go through every row in my RDD and use a mapper and reducer function. I end up getting the following output (I've made this list short for demonstration purposes):
[(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'), (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'), (u'20170114', u'BA', u'EDU')]
I want to go through each row in the array above and store each tuple into one table in Cassandra. I want the unique key to be the date. Now I know that I can turn this array into a dataframe and then store it into Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md#saving-a-dataframe-in-python-to-cassandra). If I turn the list into a dataframe and then store it into Cassandra will Cassandra still be able to handle it? I guess I'm not fully understanding how Cassandra stores values. In my array the dates are repeated, but the other values are different.
What is the best way for me to store the data above in Cassandra? Is there a way for me to store data directly from Spark to Cassandra using python?
Earlier versions of DSE 4.x supported RDDs, but the current connector for DSE and open source Cassandra is "limited to DataFrame only operations."
PySpark with Data Frames
You stated "I want the unique key to be the date". I assume you mean partition key, since the date is not unique in your example. It's OK to use the date as the partition key (assuming partitions will not be too large), but your primary key needs to be unique.
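For the write itself, a pyspark sketch (the column names date, actor, sector and the keyspace/table are assumptions; it presumes the target table was created with a compound primary key such as PRIMARY KEY ((date), actor, sector), since Cassandra silently upserts rows whose full primary key repeats):

```python
def tuple_to_columns(t):
    """Name the (date, actor, sector) tuple fields
    (column names are assumptions for illustration)."""
    return {"date": t[0], "actor": t[1], "sector": t[2]}

def save_to_cassandra(spark, rdd, keyspace, table):
    # Requires the connector on the classpath, e.g.
    # --packages com.datastax.spark:spark-cassandra-connector_2.11:2.x
    from pyspark.sql import Row  # pyspark only needed at call time
    df = spark.createDataFrame(rdd.map(lambda t: Row(**tuple_to_columns(t))))
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace=keyspace, table=table)
       .mode("append")
       .save())
```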
The Hive table was created with 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very very slow system. Also, I have tried the DataFrame.coalesce method but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not support writing to bucketed Hive tables natively using spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the specified CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could use the
.bucketBy(numBuckets, colName)
method on the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
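For completeness, a pyspark sketch of the bucketBy route (available on DataFrameWriter; note it works with saveAsTable, not insertInto, and the clustering column below is a placeholder since the original DDL doesn't name one):

```python
def write_bucketed(df, table_name, bucket_col, num_buckets=4):
    """Write a partitioned, bucketed table via saveAsTable.
    bucket_col must be whatever column the Hive table is CLUSTERED BY
    (a placeholder here, since the question's DDL omits it)."""
    (df.write
       .partitionBy("traffic_date_hour")
       .bucketBy(num_buckets, bucket_col)
       .sortBy(bucket_col)
       .format("orc")
       .saveAsTable(table_name))
```

Be aware that Spark's bucket file layout is not identical to Hive's, so Hive may not recognise the result as a properly bucketed table.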