Spark: how to save pair rdd to json files? - apache-spark

My RDD looks like this:
[('f1',1), ('f2',2)]
How can I save it to JSON files?

You can convert the RDD to a DataFrame and write it out as JSON:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
sc = spark.sparkContext
# Name the two tuple positions so they become the JSON field names
df = sc.parallelize(
    [('f1', 1), ('f2', 2)]).toDF(["key", "value"])
# Spark writes a directory of part files under output_path
df.write.format('json').save('output_path')
The output in the JSON file looks like this:
{"key":"f1","value":1}
{"key":"f2","value":2}
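If you would rather skip the DataFrame conversion, one possible alternative is to turn each pair into a JSON string yourself and save the RDD as text. This is only a sketch: the helper name pair_to_json_line and the field names key/value are my own choices, picked to mirror the output above, and the Spark call is left as a comment because it needs a live SparkContext.

```python
import json

def pair_to_json_line(pair, key_name="key", value_name="value"):
    """Serialize one (key, value) tuple as a compact, single-line JSON object."""
    key, value = pair
    return json.dumps({key_name: key, value_name: value}, separators=(",", ":"))

pairs = [("f1", 1), ("f2", 2)]
lines = [pair_to_json_line(p) for p in pairs]
print("\n".join(lines))

# With a SparkContext `sc`, the same function can be mapped over the RDD:
# sc.parallelize(pairs).map(pair_to_json_line).saveAsTextFile("output_path")
```

Going through a DataFrame is usually the simpler route; mapping to strings is mainly useful when the values do not fit a fixed schema.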

Related

Fetch dbfs files as a stream dataframe in databricks

I have a problem where I need to create an external table in Databricks for each CSV file that lands in an ADLS Gen2 storage account.
My idea was to get a streaming DataFrame from the output of dbutils.fs.ls() and then call a function that creates a table inside foreachBatch().
I have the function ready, but I can't figure out how to stream directory information into a streaming DataFrame. Does anyone have an idea how this could be achieved?
Have a look at the code block below, which reads a directory as a streaming source.
package com.sparkbyexamples.spark.streaming
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object SparkStreamingFromDirectory {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    // Streaming file sources need an explicit schema
    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )
    // Every new file that appears in the directory is picked up as new rows
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")
    df.printSchema()
    val groupDF = df.select("Zipcode")
      .groupBy("Zipcode").count()
    groupDF.printSchema()
    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
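For the table-per-file part of the question, a rough PySpark-side sketch could look like the following. Everything named here is hypothetical: table_name_for, create_table_sql, make_batch_handler, the source_file column and the naming rule are my own illustrations, not something from the posts above. It assumes the streaming DataFrame carries the originating file path in a source_file column (for example via pyspark.sql.functions.input_file_name()), and whether LOCATION may point at a single file rather than a directory should be checked against your runtime.

```python
import posixpath
import re

def table_name_for(path):
    """Derive a SQL-safe table name from a file path (hypothetical naming rule)."""
    stem = posixpath.splitext(posixpath.basename(path))[0]
    return re.sub(r"[^0-9A-Za-z_]", "_", stem).lower()

def create_table_sql(path, database="default"):
    """Build a CREATE TABLE statement for an external CSV table over one path."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table_name_for(path)} "
        f"USING CSV OPTIONS (header 'true') LOCATION '{path}'"
    )

def make_batch_handler(spark, database="default"):
    """Return a foreachBatch callback that creates one table per new file."""
    def handle_batch(batch_df, batch_id):
        # Collect the distinct source files seen in this micro-batch
        files = batch_df.select("source_file").distinct().collect()
        for row in files:
            spark.sql(create_table_sql(row["source_file"], database))
    return handle_batch

# Wiring it up needs a live SparkSession, so it is only outlined here:
# stream = (spark.readStream.schema(schema).csv(landing_dir)
#           .withColumn("source_file", input_file_name()))
# stream.writeStream.foreachBatch(make_batch_handler(spark)).start()
```

The point of the sketch is that the file source itself already streams newly arrived files, so listing the directory with dbutils.fs.ls() may not be needed.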

Create table in hive through spark

I am trying to connect to Hive through Spark using the code below, but it fails with NoSuchDatabaseException: Database 'raw' not found. A database named 'raw' does exist in Hive. What am I missing here?
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Connecting to hive")
.config("hive.metastore.uris", "thrift://myserver.domain.local:9083")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
frame.show()
frame.write.mode("overwrite").saveAsTable("raw.temp1")
Output for spark.sql("SHOW DATABASES")

Reading Excel (.xlsx) file in pyspark

I am trying to read an .xlsx file from a local path in PySpark.
I've written the code below:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local') \
.appName('Planning') \
.enableHiveSupport() \
.config('spark.executor.memory', '2g') \
.getOrCreate()
df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
Error:
TypeError: 'DataFrameReader' object is not callable
You can use pandas to read the .xlsx file and then convert the result to a Spark DataFrame.
from pyspark.sql import SparkSession
import pandas
spark = SparkSession.builder.appName("Test").getOrCreate()
# pandas infers column types itself and needs an Excel engine such as openpyxl installed
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)
df.show()
Alternatively, you could use the crealytics spark-excel package.
It has to be added to Spark, either through its Maven coordinates or when starting the Spark shell as below (pyspark accepts the same --packages flag).
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
Databricks users need to add it as a library by navigating to
Cluster - 'clusterName' - Libraries - Install New, and providing 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.
# Parentheses let the chained calls span lines; the raw string keeps the backslashes literal
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(r"C:\P_DATA\tyco_93_A.xlsx"))
More options are listed on the GitHub page below.
https://github.com/crealytics/spark-excel
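One more thing worth checking, separately from the TypeError: the Windows path used in this thread contains \t, which Python reads as a tab character inside a normal string literal. The snippet below is a small illustration of that pitfall and of the usual ways around it; the variable names are mine.

```python
# Raw string: backslashes are kept exactly as typed
raw_path = r"C:\P_DATA\tyco_93_A.xlsx"

# Doubled backslashes: the same path without a raw string
escaped_path = "C:\\P_DATA\\tyco_93_A.xlsx"

# Forward slashes are usually accepted on Windows as well
forward_path = "C:/P_DATA/tyco_93_A.xlsx"

# Here only the first backslash is escaped, so "\t" turns into a TAB
broken_path = "C:\\P_DATA\tyco_93_A.xlsx"

print(raw_path == escaped_path)   # the two spellings are the same string
print("\t" in broken_path)        # the broken one contains a tab character
```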

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a Cassandra table with a map field whose data looks as follows:
test_id  test_map
1        {tran_id=99, tran_type=sample}
I am attempting to add the entries of the map as new fields on the existing RDD that I pull this data from, under the same key, so that it would look as follows:
test_id  test_map                        tran_id  tran_type
1        {tran_id=99, tran_type=sample}  99       sample
I'm able to pull the fields fine using the Spark context, but I can't find a good method to transform this field into the RDD as shown above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
df_test = test_df("test", "test")
Then to query data I use Spark SQL in such format:
df_test.registerTempTable("dftest")
df = sqlContext.sql(
"""
select * from dftest
"

Convert csv files to parquet on s3 using Spark structured streaming

I'm trying to create a Spark application that reads my CSV files from S3, converts them to Parquet files, and writes the results back to S3.
Every minute I get 8 new gzip-compressed CSV files (~60 MB each). Each row has ~200 columns, and ~99% of the rows share the same date (my partition column).
The cluster has 3 workers, each with 10 cores and 20 GB of memory.
Here is my code:
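import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._
val spark = SparkSession
.builder()
.appName("Csv2Parquet")
.config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("fs.s3a.access.key", "MY ACCESS KEY")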
.config("fs.s3a.secret.key", "MY SECRET")
.config("spark.executor.memory", "15G")
.config("spark.driver.memory", "5G")
.getOrCreate()
import spark.implicits._
val schema= StructType(Array(
StructField("myDate", DateType, nullable=false),
StructField("myTimestamp", TimestampType, nullable=true),
...
...
...
StructField("myColumn200", StringType, nullable=true)
))
val df = spark.readStream
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "false")
.option("mode", "DROPMALFORMED")
.option("delimiter","\t")
.load("s3a://my-bucket/raw-data/*.gz")
.withColumn("myPartitionDate", $"myDate")
val query = df.repartition($"myPartitionDate").writeStream
.option("checkpointLocation", "/shared/checkpoints/csv2parquet")
.trigger(Trigger.ProcessingTime(60000))
.format("parquet")
.option("path", "s3a://my-bucket/parquet-data")
.partitionBy("myPartitionDate")
.start("s3a://my-bucket/parquet-data")
query.awaitTermination()
The problem is that only one task is responsible for writing the "main" partition (the one that holds 99% of the events) to S3, and that task takes ~4 minutes. How can I improve this?
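One commonly used mitigation for this kind of skew, which the post itself does not propose, is to repartition on the date plus a small "salt" value instead of on the date alone, so that the dominant date is spread over several tasks. The toy simulation below only illustrates the effect. The sample dates, SALT_BUCKETS and the byte-sum toy_partition function are made up for the illustration (Spark's real hash partitioner works differently), and NUM_PARTITIONS is just 3 workers x 10 cores taken from the question.

```python
from collections import Counter

NUM_PARTITIONS = 30  # 3 workers x 10 cores, as in the question
SALT_BUCKETS = 10    # assumption: how many ways to split each date

def toy_partition(key, num_partitions=NUM_PARTITIONS):
    """Stand-in for a hash partitioner: byte sum of the key modulo partitions."""
    return sum(key.encode("utf-8")) % num_partitions

def rows_per_partition(keys):
    """Count how many rows each partition would receive."""
    return Counter(toy_partition(k) for k in keys)

# Made-up sample: 99% of the rows share one date, like the data described above
dates = ["2018-01-01"] * 990 + ["2017-12-31"] * 10

# Partition on the date alone
unsalted = rows_per_partition(dates)

# Partition on the date plus a salt that cycles through SALT_BUCKETS values
salted = rows_per_partition(
    f"{d}#{i % SALT_BUCKETS}" for i, d in enumerate(dates)
)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

The trade-off is that each date directory would then receive several Parquet files per trigger instead of one, so the number of salt buckets is worth keeping small.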
