PySpark: GET a row from HBase using a row key

I have a use case that requires reading from HBase inside a PySpark job, and I am currently doing a scan on the HBase table like this:
conf = {"hbase.zookeeper.quorum": host, "hbase.cluster.distributed": "true", "hbase.mapreduce.inputtable": "table_name", "hbase.mapreduce.scan.row.start": start, "hbase.mapreduce.scan.row.stop": stop}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", keyConverter=keyConv, valueConverter=valueConv, conf=conf)
I am unable to find a conf setting that does a GET on the HBase table. Can someone help me? I found that filters are not supported with PySpark, but is it really not possible to do a simple GET?
Thanks!
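One workaround that is sometimes used (a minimal sketch, reusing the same TableInputFormat configuration and converters from the question, and assuming string row keys) is to narrow the scan window to a single row key. The stop row is exclusive, so appending a zero byte makes the scan return at most that one row, which behaves like a GET:
row_key = "my_row_key"  # hypothetical example key
single_row_conf = {"hbase.zookeeper.quorum": host,
                   "hbase.cluster.distributed": "true",
                   "hbase.mapreduce.inputtable": "table_name",
                   "hbase.mapreduce.scan.row.start": row_key,
                   "hbase.mapreduce.scan.row.stop": row_key + "\x00"}  # smallest key after row_key
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                         "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                         "org.apache.hadoop.hbase.client.Result",
                         keyConverter=keyConv, valueConverter=valueConv,
                         conf=single_row_conf)
result = rdd.collect()  # at most one (key, value) pair if the row exists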

Related

Getting duplicate values for each key while querying parquet file using PySpark

We are getting duplicated values while querying data from a parquet file using PySpark, but we get the correct data when querying the same file from Presto.
Spark Version: 3.1
Configuration setup so far:
from scbuilder.kubernetes import Kubernetes
kobj = Kubernetes(kubernetes = True)
kobj.setExecutorCores(5)
kobj.setExecutorMemory("5g")
kobj.addAdditionalConf("spark.driver.memory", "8g")
kobj.setNumberOfExecutor(2)
sc = kobj.buildSparkSession()
sc.getActiveSession()
sc.conf.set('sc.hadoopConfiguration.setClass',"mapreduce.input.pathFilter.class")
sc.conf.set("hive.convertMetastoreParquet",False)
sc.conf.set("hive.input.format","org.apache.hadoop.hive.ql.io.HiveInputFormat")
Actual data count: 17722
After querying from parquet file: 1036320
Can anyone help me understand why the parquet file shows this behavior and how we can fix it?
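As a first diagnostic step (a minimal sketch with a hypothetical parquet path and table name), it can help to compare a direct parquet read against the Hive-table read, and to check the distinct count, to see which code path introduces the duplicates:
# sc here is the SparkSession returned by buildSparkSession() in the snippet above
parquet_path = "hdfs:///warehouse/db/my_table"  # hypothetical location of the parquet files
direct_count = sc.read.parquet(parquet_path).count()
distinct_count = sc.read.parquet(parquet_path).distinct().count()
table_count = sc.sql("select count(*) from db.my_table").collect()[0][0]  # hypothetical table name
print(direct_count, distinct_count, table_count)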

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark cluster and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a DataFrame out of this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am still getting an invalid column name error, even with withColumnRenamed.
I also tried the following code, but got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have ways to change the column names manually (e.g. with pandas) or to use a different file entirely, but I want to know whether there is a way to solve this problem directly.
What am I doing wrong?
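Since the question already mentions falling back to pandas, here is a minimal sketch of that route (assuming the file fits in driver memory and that pandas/pyarrow can reach the HDFS path; the path and column name are the ones from the question):
import pandas as pd
# reading hdfs:// paths from pandas typically requires pyarrow plus an HDFS-capable filesystem library
pdf = pd.read_parquet("hdfs://localhost:9000/data")
pdf = pdf.rename(columns={"Wafer ID": "Wafer_ID"})
df = spark.createDataFrame(pdf)  # back to a Spark DataFrame with a valid column name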

How to Read Hive Table in Spark without header

I am trying to read a Hive table in PySpark, but I am also getting the header row, which I do not want.
File.csv
Id,Name
1,A
2,B
3,C
4,D
Hive Table
I built the Hive table with tblproperties("skip.header.line.count"="1"), and in Hive I get the data correctly, so there is no issue on the Hive side.
The issue appears only when I read this table in PySpark.
There is a JIRA for this issue, SPARK-11374, which was closed as Won't Fix.
Possible ways to work around it are:
1. Read the HDFS file directly:
spark.read.option("header","true").option("delimiter",",").csv("<hdfs_path>").show()
2. Filter out the header row with a Hive query:
spark.sql("select * from <table_name> where <col_name1> != 'id'").show()

How to save a dataframe into HBase?

I have a DataFrame with a schema, and I have also created a table in HBase with Phoenix. What I want is to save this DataFrame to HBase using Spark. I followed the steps described in the link below and ran spark-shell with the Phoenix plugin dependencies.
spark-shell --jars ./phoenix-spark-4.8.0-HBase-1.2.jar,./phoenix-4.8.0-HBase-1.2-client.jar,./spark-sql_2.11-2.0.1.jar
However, I get the following error even when I just run the read function:
val df = sqlContext.load("org.apache.phoenix.spark",
  Map("table" -> "INPUT_TABLE", "zkUrl" -> hbaseConnectionString))
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I have a feeling that I am on the wrong track, so if there is another way of putting data generated in Spark into HBase, I would appreciate it if you could share it with me.
https://phoenix.apache.org/phoenix_spark.html
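For reference, the linked page documents a DataFrame write path through the phoenix-spark plugin. A minimal PySpark-flavored sketch (assuming Phoenix jars that match your Spark version are on the classpath; OUTPUT_TABLE and hbase_connection_string are placeholders) looks like this:
# write an existing DataFrame df into a Phoenix/HBase table via the phoenix-spark plugin
df.write \
  .format("org.apache.phoenix.spark") \
  .mode("overwrite") \
  .option("table", "OUTPUT_TABLE") \
  .option("zkUrl", hbase_connection_string) \
  .save()
The NoClassDefFoundError for org/apache/spark/sql/DataFrame is usually a sign that the Phoenix jars were built against a different Spark major version than the spark-sql 2.0.1 jar on the command line, since DataFrame stopped being a standalone class in Spark 2.x.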

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and pyspark should be able to use this interface in order to create an RDD by using the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?
Google now has an example of how to use the BigQuery connector with Spark.
There does seem to be a problem with the GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working:
import json
import pyspark
sc = pyspark.SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
        "mapred.bq.input.project.id": "publicdata", "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}
tableData = sc.newAPIHadoopRDD("com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
                               "org.apache.hadoop.io.LongWritable",
                               "com.google.gson.JsonObject",
                               conf=conf) \
    .map(lambda k: json.loads(k[1])) \
    .map(lambda x: (x["word"], int(x["word_count"]))) \
    .reduceByKey(lambda x, y: x + y)
print(tableData.take(10))
