I use PySpark and I have a DStream like the one below:
mystream = dstream.map(lambda y: (y[0], y[1])).distict().groupByKey()
mystream.pprint()
But unfortunately it fails with AttributeError: 'TransformedDStream' object has no attribute 'distict'. Why is distinct() supported as an RDD operation but not on a DStream? What is the distinct() equivalent for a DStream?
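For what it's worth, a common workaround (a sketch, assuming dstream is defined as above) is to apply distinct() to each micro-batch RDD through transform(), which DStream does support; note this deduplicates within each batch, not across batches:
mystream = dstream.map(lambda y: (y[0], y[1])).transform(lambda rdd: rdd.distinct()).groupByKey()
mystream.pprint()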
I have a file in Spark with the following data:
Property ID|Location|Price|Bedrooms|Bathrooms|Size|Price SQ Ft|Status
I have read this file as an RDD using
a=sc.textFile("/FileStore/tables/realestate.txt")
Now I need to convert this RDD into a DataFrame. I am using the following command:
d=spark.createDataFrame(a).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")
But I am getting this error:
TypeError: Can not infer schema for type: <class 'str'>
You can split the column first:
d = spark.createDataFrame(a.map(lambda x: x.split('|'))).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")
Or equivalently, by calling toDF on the RDD directly:
d = a.map(lambda x: x.split('|')).toDF(["Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status"])
In fact, I'd recommend using the Spark CSV reader for this purpose, which also handles the header row appropriately:
df = spark.read.csv('/FileStore/tables/realestate.txt', header=True, inferSchema=True, sep='|')
How can I store a DataFrame value in a Scala variable?
I need to store values from the DataFrame below (assuming the "timestamp" column produces the same values) in a variable, and later I need to use this variable somewhere.
I have tried the following:
val spark = SparkSession.builder().appName("micro").
enableHiveSupport().config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db").
getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id",unix_timestamp($"time_xml", "dd/mm/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
It's throwing the following error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way I can store some DataFrame values in a variable and use them later?
Is there any way I can store some DataFrame values in a variable and use them later?
That's not possible in Spark Structured Streaming: a streaming query never ends, so collect cannot be expressed.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join together and produce a result.
How can I convert a Spark Streaming DStream to RDDs so that they can be used with the SparkContext rather than the StreamingContext? I am using Python.
Here is the error I am getting:
AttributeError: 'TransformedDStream' object has no attribute 'foreach'
def result(y):
    return y
d_stream = input_stream.map(lambda x : mainReduce(x)).map(lambda q: q)
rdd = d_stream.foreach(result)
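For context, a sketch of the usual approach: a DStream exposes foreachRDD (not foreach), which hands each micro-batch to your function as an ordinary RDD that you can use with regular SparkContext operations. The process name and the print are placeholders.
def process(rdd):
    # rdd is a plain RDD here; normal RDD operations apply
    print(rdd.collect())

d_stream = input_stream.map(lambda x: mainReduce(x)).map(lambda q: q)
d_stream.foreachRDD(process)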
I am running Spark 1.5.1. On startup I have a HiveContext available as sqlContext, but I set
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings as JSON:
data = points.map(lambda line: json.loads(line))
I then try to convert this into a DataFrame using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but when I run type(DF) it says that it is a list.
How is this possible? How can a list come out of createDataFrame()?
That's because when you apply collect() to a DataFrame, it returns a list containing all of the elements (Rows) of that DataFrame.
If you want just a DataFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.
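To make the distinction concrete, a short sketch using the names from the question and answer:
df = sqlContext.createDataFrame(data)   # a DataFrame
rows = df.collect()                     # a Python list of Row objects
type(df)    # <class 'pyspark.sql.dataframe.DataFrame'>
type(rows)  # <class 'list'>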
A common Spark processing flow we have is something like this:
Loading:
rdd = sqlContext.parquetFile("mydata/")
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
Processing by id (this is why we do the partitionBy above!):
rdd.reduceByKey(....)
rdd.join(...)
However, Spark 1.3 changed sqlContext.parquetFile to return DataFrame instead of RDD, and it no longer has the partitionBy, getNumPartitions, and reduceByKey methods.
What do we do now with partitionBy?
We can replace the loading code with something like
rdd = sqlContext.parquetFile("mydata/").rdd
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
df = rdd.map(lambda ...: Row(...)).toDF(???)
and use groupBy instead of reduceByKey.
Is this the right way?
PS. Yes, I understand that partitionBy is not necessary for groupBy et al. However, without a prior partitionBy, each join, groupBy, etc. may have to do cross-node operations. I am looking for a way to guarantee that all operations requiring grouping by my key will run locally.
It appears that, since version 1.6, repartition(self, numPartitions, *cols) does what I need:
.. versionchanged:: 1.6
Added optional arguments to specify the partitioning columns.
Also made numPartitions optional if partitioning columns are specified.
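As a sketch of how that could fit the original flow (the id column comes from the question; the partition count and the aggregation are illustrative placeholders):
from pyspark.sql import functions as F

df = sqlContext.parquetFile("mydata/")   # a DataFrame since Spark 1.3
df = df.repartition(100, "id")           # hash-partition by the id column (Spark >= 1.6)
# later groupBy/join operations keyed on "id" can reuse this partitioning
result = df.groupBy("id").agg(F.count("*").alias("cnt"))   # placeholder aggregation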
Since DataFrame provides an abstraction of Table and Column over RDD, the most convenient way to manipulate a DataFrame is to use these abstractions together with the table-manipulation methods that DataFrame offers.
On a DataFrame, we could:
transform the table schema with select() / udf() / as()
filter rows out with filter() or where()
fire an aggregation through groupBy() and agg()
or run other analytic jobs using sample() / join() / union()
persist your results using saveAsTable() / saveAsParquet() / insertIntoJDBC()
Please refer to the Spark SQL and DataFrame Guide for more details.
Therefore, a common job looks like:
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
And for your specific requirements, this could look like:
val t = sqlContext.parquetFile("mydata/")
t.filter(...).select(...).groupBy(...).agg(...)
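And since the original pipeline was written in PySpark, here is a rough equivalent sketch; the filter condition, the some_column name, and the aggregation are placeholders, not taken from the question:
from pyspark.sql import functions as F

t = sqlContext.parquetFile("mydata/")
result = (t.filter(t["id"].isNotNull())                   # placeholder filter
           .groupBy("id")                                 # the question's grouping key
           .agg(F.avg("some_column").alias("avg_val")))   # placeholder aggregation
result.show()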