Load only part of HBase/Phoenix table as Spark Datafrom - apache-spark

I am using the following code in Spark to load specified columns of my HBase/Phoenix table into a Spark Dataframe. I can specify the columns I want to load, but can I specify which rows? Or do I have to load all rows?
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
sc.stop()
val sc = new SparkContext("local", "phoenix-test")
val df = sqlContext.phoenixTableAsDataFrame(
"TABLENAME", Array("ROWKEY", "CF.COL1","CF.COL2","CF.COL3"), conf = configuration
)

Related

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
conf = SparkConf().setAppName("app")
spark = SparkContext(conf=conf)
config = {
"val_path" : "s3://forecasting/data/validation.csv"
}
data1_df = spark.read.table("db1.data_dest”)
data2_df = spark.read.table("db2.data_source”)
print(data1_df.count())
print(data2_df.count())
if __name__ == "__main__":
validate_data()
Now this code works fine when run on jupyter notebook on sagemaker ( connecting to EMR )
but when we are running as a python script on terminal, its throwing this error
Error message
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to python scripts
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf)
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

Write results from Kafka to csv in pyspark

I have setup a Kafka broker and I manage to read the records with pyspark.
import os
from pyspark.sql import SparkSession
import pyspark
import sys
from pyspark import SparkConf, SparkContext, SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
conf = SparkConf().setMaster("my-master").setAppName("Kafka_Spark")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc,5)
kvs = KafkaUtils.createDirectStream(ssc,
['enriched_messages'],
{"metadata.broker.list":"my-kafka-broker","auto.offset.reset" : "smallest"},
keyDecoder=lambda x: x,
valueDecoder=lambda x: x)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination(10)
Example of returning data (timestamp, name, lastname, height):
2020-05-07 09:16:38, JoHN, Doe, 182.5
I want to write these records into a csv file. lines is of type KafkaTransformedDStream and classic solution with rdd is not working.
Has anyone a solution to this?
converting DStreams to single rdd is not possible, as DStreams are continuous streams. You can use the following, which results many files, and later merge them to single file.
lines.saveAsTextFiles("prefix", "suffix")

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a table with a map field with data that looks as follows from Cassandra,
test_id test_map
1 {tran_id=99, tran_type=sample}
I am attempting to add these fields to the existing RDD that I am pulling this data from as new fields to the exact same key which would look as follows,
test_id test_map tran_id tran_type
1 {tran_id=99, trantype=sample} 99 sample
I'm able to pull the fields fine using spark context but I can't find a good method to transform this field into the RDD as expected above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
table_df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=table_name, keyspace=keys_space_name)\
.load()
return table_df
df_test = test_df("test", "test")
Then to query data I use Spark SQL in such format:
df_test.registerTempTable("dftest")
df = sqlContext.sql(
"""
select * from dftest
"

I am able to connect to the Hive database using pyspark but when i run my program data is not showing

I have written the below code to read the data from HIVE table and when I am trying to run no compilation errors and no data displaying.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars hive-jdbc-2.1.0.jar
pyspark-shell'
sparkConf = SparkConf().setAppName("App")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc);
source_df = hiveContext.read.format('jdbc').options(
url='jdbc:hive2://localhost:10000/sample',
driver='org.apache.hive.jdbc.HiveDriver',
dbtable='abc',
user='root',
password='root').load()
print source_df.show()
When i run this, I am getting below output and not able to fetch the
data from table.
+--------+------+
|abc.name|abc.id|
+--------+------+
+--------+------+
Just try
df = hiveContext.read.table("your_hive_table") //reads from default db
df = hiveContext.read.table("your_db.your_hive_table") //reads from your db
you could also do
df = hiveContext.sql("select * from your_table")

sc is not defined in SparkContext

My Spark package is spark-2.2.0-bin-hadoop2.7.
I exported spark variables as
export SPARK_HOME=/home/harry/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
I opened spark notebook by
pyspark
I am able to load packages from spark
from pyspark import SparkContext, SQLContext
from pyspark.ml.regression import LinearRegression
print(SQLContext)
output is
<class 'pyspark.sql.context.SQLContext'>
But my error is
print(sc)
"sc is undefined"
plz can anyone help me out ...!
In pysparkShell, SparkContext is already initialized as SparkContext(app=PySparkShell, master=local[*]) so you just need to use getOrCreate() to set the SparkContext to a variable as
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
For coding purpose in simple local mode, you can do the following
from pyspark import SparkConf, SparkContext, SQLContext
conf = SparkConf().setAppName("test").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

Resources