I am writing the following JSON to the path "/home/host/test" so that the program can pick it up with Spark Streaming and run queries against it.
{"id": "1", "description": "test"}
{"id": "1", "description": "test"}
But when I run the query, the schema looks like this:
root
 |-- word: string (nullable = true)
and I get the following result:
+-------------------+
|word               |
+-------------------+
|{"id": "1", "test"}|
|{"id": "1", "test"}|
+-------------------+
I need the schema to look like this:
root
 |-- id: string (nullable = true)
 |-- description: string (nullable = true)
and I need a result like the following:
+---+-----------+
|id |description|
+---+-----------+
|"1"|"test"     |
|"1"|"test"     |
+---+-----------+
This is my PySpark code:
from __future__ import print_function
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import col, explode
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext, Row
if __name__ == "__main__":
    sc = SparkContext(appName="PythonSqlNetworkWordCount")
    ssc = StreamingContext(sc, 3)
    sqlcontextoriginal = SQLContext(sc)
    # Watch the directory for new text files; each line of a new file
    # becomes one record in the stream
    lines = ssc.textFileStream("/home/host/test")
    # Convert RDDs of the words DStream to DataFrame and run SQL query
    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            # Get the singleton instance of SQLContext
            sqlContext = SQLContext(rdd.context)
            # Convert RDD[String] to RDD[Row] to DataFrame
            rowRdd = rdd.map(lambda w: Row(word=w))
            wordsDataFrame = sqlContext.createDataFrame(rowRdd).toJSON()
            json = sqlContext.read.json(wordsDataFrame)
            # Register as table
            json.createOrReplaceTempView("words")
            json.printSchema()
            wordCountsDataFrame = sqlContext.sql("select * from words ")
            wordCountsDataFrame.show()
        except:
            pass
    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()
OK, I found the solution.
I had to call sqlContext.read.json and pass it the RDD directly as the parameter:
json = sqlContext.read.json(rdd)
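The reason this works is that read.json expects one JSON document per string, so it can parse the raw lines of the stream itself. Wrapping each line in Row(word=...) first turns the whole JSON text into the value of a single word column. Below is a minimal sketch of the corrected callback, assuming Spark 2.x or later with a SparkSession. The helper parse_json_lines is hypothetical plain Python, included only to show the per-line parsing that Spark performs, and the isEmpty guard is an addition that is not in the original code.

```python
import json


def parse_json_lines(lines):
    """Plain-Python model: one JSON object per line -> one dict per record."""
    return [json.loads(line) for line in lines if line.strip()]


def process(time, rdd):
    """foreachRDD callback: parse each micro-batch of JSON lines into columns."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    print("========= %s =========" % str(time))
    if rdd.isEmpty():
        # Nothing arrived in this batch, so there is no schema to infer.
        return
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json(rdd)  # pass the RDD of JSON strings directly
    df.createOrReplaceTempView("words")
    df.printSchema()
    spark.sql("select * from words").show()


# Illustrative input only: the two lines from the question.
records = parse_json_lines(['{"id": "1", "description": "test"}'] * 2)
print(records)
```

The callback would be registered exactly as before, with lines.foreachRDD(process).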
I have a DataFrame called incito, and its Supplier Inv No column contains comma-separated values. I need to recreate the DataFrame so that each comma-separated value gets its own row, with the other columns repeated accordingly. I am currently using the following pandas code for that. Can this be done in PySpark?
import numpy as np
import pandas as pd
from itertools import chain
def chainer(s):
    # flatten the lists produced by splitting each cell on commas
    return list(chain.from_iterable(s.str.split(',')))
incito['Supplier Inv No'] = incito['Supplier Inv No'].astype(str)
# calculate lengths of splits
lens = incito['Supplier Inv No'].str.split(',').map(len)
# create new dataframe, repeating or chaining as appropriate
dfnew = pd.DataFrame({'Supplier Inv No': chainer(incito['Supplier Inv No']),
                      'Forwarder': np.repeat(incito['Forwarder'], lens),
                      'Mode': np.repeat(incito['Mode'], lens),
                      'File No': np.repeat(incito['File No'], lens),
                      'ETD': np.repeat(incito['ETD'], lens),
                      'Flight No': np.repeat(incito['Flight No'], lens),
                      'Shipped Country': np.repeat(incito['Shipped Country'], lens),
                      'Port': np.repeat(incito['Port'], lens),
                      'Delivered_Country': np.repeat(incito['Delivered_Country'], lens),
                      'AirWeight': np.repeat(incito['AirWeight'], lens),
                      'FREIGHT CHARGE': np.repeat(incito['FREIGHT CHARGE'], lens)})
This is what I tried in PySpark, but I am not getting the expected outcome.
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
import pandas as pd
conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
ddf = spark.createDataFrame(dfnew)
exploded = ddf.withColumn('d', F.explode("Supplier Inv No"))
exploded.show()
Something like this, using repeat?
from pyspark.sql import functions as F
df = (spark
    .sparkContext
    .parallelize([
        ('ABCD',),
        ('EFGH',),
    ])
    .toDF(['col_a'])
)
(df
    .withColumn('col_b', F.repeat(F.col('col_a'), 2))
    .withColumn('col_c', F.repeat(F.lit('X'), 10))
    .show()
)
# +-----+--------+----------+
# |col_a| col_b| col_c|
# +-----+--------+----------+
# | ABCD|ABCDABCD|XXXXXXXXXX|
# | EFGH|EFGHEFGH|XXXXXXXXXX|
# +-----+--------+----------+
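If the goal is one output row per comma-separated value (which is what the pandas code does with chain and np.repeat), F.repeat is probably not the right tool, since it concatenates a string with itself. A possible alternative is F.split followed by F.explode. The sketch below assumes a Spark DataFrame with a string column named Supplier Inv No. The function explode_rows is a hypothetical plain-Python model of the same transformation over dicts, explode_invoices is an illustrative name, and the sample values are made up.

```python
def explode_rows(rows, column, sep=","):
    """Plain-Python model: repeat each row once per separated value."""
    out = []
    for row in rows:
        for value in str(row[column]).split(sep):
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[column] = value
            out.append(new_row)
    return out


def explode_invoices(df, column="Supplier Inv No", sep=","):
    """Spark version: split the string into an array, then explode it."""
    # Imported lazily so the model above runs without Spark installed.
    from pyspark.sql import functions as F

    # F.split treats sep as a regular expression; a plain comma is safe.
    return df.withColumn(column, F.explode(F.split(F.col(column), sep)))


# Made-up sample rows, for illustration only.
sample = [
    {"Supplier Inv No": "A1,A2", "Forwarder": "X"},
    {"Supplier Inv No": "B7", "Forwarder": "Y"},
]
for row in explode_rows(sample, "Supplier Inv No"):
    print(row)
```

With the original pandas frame this would be called as explode_invoices(spark.createDataFrame(incito)), so the remaining columns are repeated automatically. Note that F.explode needs an array or map column, which is why applying it directly to the string column in the attempt above does not work.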
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> object rddTest{
| def main(args: Array[String]) = {
| val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
| val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
| val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(17,"sep",2015)))
| val rdd3 = spark.sparkContext.parallelize(Seq((6,"dec",2011),(16,"may",2015)))
| val rddUnion = rdd1.union(rdd2).union(rdd3)
| rddUnion.foreach(Println)
| }
| }
I am getting this error and I don't know why:
<console>:81: error: not found: value Println
rddUnion.foreach(Println)
You have an extra uppercase letter. Scala identifiers are case-sensitive, so try this:
rddUnion.foreach(println)
I have a csv with a timeseries:
timestamp, measure-name, value, type, quality
1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD
The measure-name should be my partition key, and I would like to calculate a moving average with PySpark. Here is my code that calculates, for instance, the max:
def mysplit(line):
    ll = line.split(",")
    return (ll[1], float(ll[2]))
text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a, b)).foreach(print)
However, for the average I would like to respect the timestamp ordering.
How to order by a second column?
You need to use a window function on pyspark dataframes:
First you should transform your rdd to a dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
df = hc.createDataFrame(text_file.map(lambda l: l.split(',')), ['timestamp', 'measure-name', 'value', 'type', 'quality'])
Or load it directly as a dataframe:
local:
import pandas as pd
df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
from hdfs:
df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
Then use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w)).show()
+----------+------------+--------+-----+-------+-------------------+
| timestamp|measure_name| value| type|quality| value_rol_mean|
+----------+------------+--------+-----+-------+-------------------+
|1503377340| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377400| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377580| x.x-2.A|0.528125|Float| GOOD|0.17604166666666665|
+----------+------------+--------+-----+-------+-------------------+
In .orderBy you can order by as many columns as you want.
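Note that a window ordered only by timestamp gives a running mean over all rows seen so far, across every measure. For a moving average per measure, the window can also be partitioned and given a fixed frame. The sketch below assumes a trailing window of 3 rows (the size is an arbitrary choice, not taken from the question) and that value has been cast to a numeric type. The function moving_average is a hypothetical plain-Python model of the window, and add_moving_average is an illustrative name.

```python
from collections import defaultdict


def moving_average(records, size=3):
    """Plain-Python model of the window: per key, ordered by timestamp,
    average of the current value and up to size-1 preceding values."""
    by_key = defaultdict(list)
    for ts, key, value in records:
        by_key[key].append((ts, value))
    out = []
    for key, series in by_key.items():
        series.sort()  # respect timestamp ordering within each key
        values = [v for _, v in series]
        for i, (ts, _) in enumerate(series):
            frame = values[max(0, i - size + 1): i + 1]
            out.append((ts, key, sum(frame) / len(frame)))
    return sorted(out, key=lambda r: (r[1], r[0]))


def add_moving_average(df, size=3):
    """Spark version: partition by measure, order by time, trailing frame."""
    # Imported lazily so the model above runs without Spark installed.
    from pyspark.sql import Window
    import pyspark.sql.functions as psf

    w = (Window.partitionBy("measure-name")
         .orderBy("timestamp")
         .rowsBetween(-(size - 1), 0))
    return df.withColumn("value_rol_mean", psf.mean("value").over(w))


# The three rows from the question's CSV.
rows = [
    (1503377580, "x.x-2.A", 0.528125),
    (1503377340, "x.x-1.B", 0.0),
    (1503377400, "x.x-1.B", 0.0),
]
for r in moving_average(rows):
    print(r)
```

Using rowsBetween counts rows rather than elapsed time; if the samples are irregularly spaced and the average should cover a time span, rangeBetween over a numeric timestamp would be the variant to look at.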
Could someone help me solve this problem I have with Spark DataFrame?
When I do myFloatRDD.toDF() I get an error:
TypeError: Can not infer schema for type: type 'float'
I don't understand why...
Example:
myFloatRdd = sc.parallelize([1.0,2.0,3.0])
df = myFloatRdd.toDF()
Thanks
SparkSession.createDataFrame, which is used under the hood, requires an RDD / list of Row/tuple/list/dict* or pandas.DataFrame, unless schema with DataType is provided. Try to convert float to tuple like this:
myFloatRdd.map(lambda x: (x, )).toDF()
or even better:
from pyspark.sql import Row
row = Row("val") # Or some other column name
myFloatRdd.map(row).toDF()
To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:
from pyspark.sql.types import FloatType
df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
df.show()
## +-----+
## |value|
## +-----+
## | 1.0|
## | 2.0|
## | 3.0|
## +-----+
but for a simple range it would be better to use SparkSession.range:
from pyspark.sql.functions import col
spark.range(1, 4).select(col("id").cast("double"))
* No longer supported.
** Spark SQL also provides a limited support for schema inference on Python objects exposing __dict__.
*** Supported only in Spark 2.0 or later.
from pyspark.sql import Row
mylist = [1, 2, 3, 4, None]
# wrap each scalar in a Row so that Spark can infer a schema
rows = [Row(x) for x in mylist]
df = spark.createDataFrame(rows, ["id"])
df.where(df.id.isNotNull()).show()
Basically, you need to wrap each int in a Row(); then the column names can be applied as the schema.
Inferring the Schema Using Reflection
from pyspark.sql import Row
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to Row
orders_struct = parts.map(lambda p: Row(order_id=int(p[0]), order_date=p[1], customer_id=p[2], order_status=p[3]))
for i in orders_struct.take(5): print(i)
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
Programmatically Specifying the Schema
from pyspark.sql import Row
# spark - sparkSession
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
orders = sc.textFile("/practicedata/orders")
#Split on delimiters
parts = orders.map(lambda l: l.split(","))
#Convert to tuple
orders_struct = parts.map(lambda p: (p[0], p[1], p[2], p[3].strip()))
#convert the RDD to DataFrame
orders_df = spark.createDataFrame(orders_struct)
# The schema is encoded in a string.
from pyspark.sql.types import StructField, StringType, StructType
schemaString = "order_id order_date customer_id status"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
ordersDf = spark.createDataFrame(orders_struct, schema)
from pyspark.sql import Row
myFloatRdd.map(lambda x: Row(x)).toDF()
I'm using Spark v1.5.2. I wrote a program in Python and I don't understand why it reads the input files twice. The same program written in Scala only reads the input files once.
I use an accumulator to count the number of times that map() is called. From the accumulator value, I infer the number of times the input file is read.
The input file contains 3 lines of text.
Python:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import *
def createTuple(record): # used with map()
    global map_acc
    map_acc += 1
    return (record[0], record[1].strip())
sc = SparkContext(appName='Spark test app') # appName is shown in the YARN UI
sqlContext = SQLContext(sc)
map_acc = sc.accumulator(0)
lines = sc.textFile("examples/src/main/resources/people.txt")
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple) #.cache()
fieldNames = 'name age'
fields = [StructField(field_name, StringType(), True) for field_name in fieldNames.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people_rdd, schema)
print 'record count DF:', df.count()
print 'map_acc:', map_acc.value
#people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 6 ##### why 6 instead of 3??
Scala:
import org.apache.spark._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
object SimpleApp {
  def main(args: Array[String]) {
    def createTuple(record:Array[String], map_acc: Accumulator[Int]) = { // used with map()
      map_acc += 1
      Row(record(0), record(1).trim)
    }
    val conf = new SparkConf().setAppName("Scala Test App")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val map_acc = sc.accumulator(0)
    val lines = sc.textFile("examples/src/main/resources/people.txt")
    val people_rdd = lines.map(_.split(",")).map(createTuple(_, map_acc))
    val fieldNames = "name age"
    val schema = StructType(
      fieldNames.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val df = sqlContext.createDataFrame(people_rdd, schema)
    println("record count DF: " + df.count)
    println("map_acc: " + map_acc.value)
  }
}
$ spark-submit --class SimpleApp --master local[1] test.jar 2> err
record count DF: 3
map_acc: 3
If I remove the comments from the Python program and cache the RDD, then the input files are not read twice. However, I don't think I should have to cache the RDD, right? In the Scala version I don't need to cache the RDD.
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple).cache()
...
people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 3
$ hdfs dfs -cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
It happens because in 1.5 createDataFrame eagerly validates provided schema on a few elements:
elif isinstance(schema, StructType):
    # take the first few rows to verify schema
    rows = rdd.take(10)
    for row in rows:
        _verify_type(row, schema)
In contrast, current versions validate the schema for all elements, but this is done lazily, so you wouldn't see the same behavior. For example, this would fail immediately in 1.5:
from pyspark.sql.types import *
rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
but the 2.0 equivalent would fail only when you try to evaluate the DataFrame.
In general, you shouldn't expect Python and Scala code to behave the same way unless you strictly limit yourself to interactions with the SQL API. PySpark:
Implements almost all RDD methods natively so the same chain of transformations can result in a different DAG.
Interactions with Java API may require an eager evaluation to provide type information for Java classes.
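The count of 6 in the question can be reproduced with a toy model of lazy evaluation. This is plain Python, not Spark code: LazyMapped is a hypothetical class that re-runs its map function on every action unless it is cached, mimicking an uncached RDD. Here take(10) stands in for the eager schema check quoted above and count() for df.count(). The model always evaluates the whole input, which matches the 3-line file but simplifies what take does on larger data.

```python
class LazyMapped:
    """Toy stand-in for an uncached RDD: each action re-runs the map."""

    def __init__(self, source, func, cached=False):
        self.source = source
        self.func = func
        self.cached = cached
        self._data = None

    def _compute(self):
        if self.cached and self._data is not None:
            return self._data  # reuse the result, like rdd.cache()
        data = [self.func(x) for x in self.source]
        if self.cached:
            self._data = data
        return data

    def take(self, n):
        return self._compute()[:n]

    def count(self):
        return len(self._compute())


def run(cached):
    calls = {"n": 0}  # plays the role of the accumulator

    def create_tuple(record):
        calls["n"] += 1
        return (record[0], record[1].strip())

    lines = ["Michael, 29", "Andy, 30", "Justin, 19"]
    people = LazyMapped([l.split(",") for l in lines], create_tuple, cached)
    people.take(10)           # stands in for the eager schema verification
    records = people.count()  # the action the program actually requested
    return records, calls["n"]


print(run(cached=False))  # (3, 6)
print(run(cached=True))   # (3, 3)
```

The first action computes the mapped data and the second computes it again unless the result is cached, which is consistent with the accumulator reporting 6 without cache() and 3 with it.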