Vertica SELECT statement with GROUP BY doesn't work - subquery

The query runs fine directly in Vertica, but it fails when submitted over JDBC from PySpark with:
ERROR: Syntax error at or near "\"
Even after removing the $CONDITIONS placeholder, it returns:
"ERROR: Subquery in FROM must have an alias"
SELECT
min(date(time_stamp)) mindate
,max(date(time_stamp)) maxdate
,count (distinct date(time_stamp)) noofdays
, subscriber
, server_hostname
, sum(bytes_in) DL
, sum(bytes_out) UL
, sum(connections_out) conn
from traffic.stats
where \$CONDITIONS
and SUBSCRIBER like '41601%'
and date(time_stamp) between '2019-01-25' and '2019-01-29'
and signature_service_category = 'Web Browsing'
and (signature_service_name = 'SSL v3'
or signature_service_name = 'HTTP2 over TLS')
and server_hostname not like '%.googleapis.%'
and server_hostname not like '%.google.%'
and server_hostname <> 'doubleclick.net'
and server_hostname <> 'youtube.com'
and server_hostname <> 'googleadservices.com'
and server_hostname <> 'app-measurement.com'
and server_hostname <> 'gstatic.com'
and server_hostname <> 'googlesyndication.com'
and server_hostname <> 'google-analytics.com'
and server_hostname <> 'googleusercontent.com'
and server_hostname <> 'ggpht.com'
and server_hostname <> 'googletagmanager.com'
and server_hostname is not null
group by subscriber, server_hostname
I have tried the above query with PySpark 1.6:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext, Row
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
from pyspark.storagelevel import StorageLevel
from pyspark.streaming import DStream
from pyspark.streaming.dstream import TransformedDStream
from pyspark.streaming.util import TransformFunction
from pyspark.rdd import RDD
from datetime import datetime, timedelta
from dateutil.parser import parse
import string
import re
import sys, os
import pandas
conf = (SparkConf()
.setAppName("hivereader")
.setMaster("yarn-client")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.shuffle.service.enabled", "false")
.set("spark.io.compression.codec", "snappy")
.set("spark.rdd.compress", "true")
.set("spark.executor.instances", 7)
.set("spark.executor.cores" , 7)
.set("spark.sql.inMemoryStorage.compressed", "true")
.set("spark.sql.tungsten.enabled" , 'true')
.set("spark.port.maxRetries" , 200)
)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
url = "jdbc:vertica*****************"
properties = {
"user": "********",
"password": "******",
"driver": "com.vertica.jdbc.Driver"
}
query = "SELECT MIN(date(time_stamp)) mindate, MAX(date(time_stamp)) maxdate,COUNT (distinct date(time_stamp)) noofdays, subscriber, server_hostname, SUM(bytes_in) DL, SUM(bytes_out) UL, SUM(connections_out) conn FROM traffic.stats t WHERE SUBSCRIBER LIKE '41601%' AND date(time_stamp) between '2019-01-25' and '2019-01-29'AND signature_service_category = 'Web Browsing' AND signature_service_name IN ('SSL v3', 'HTTP2 over TLS')AND server_hostname IS NOT NULL AND server_hostname NOT LIKE '%.googleapis.%' AND server_hostname NOT LIKE '%.google.%' AND server_hostname NOT IN ( 'doubleclick.net', 'youtube.com', 'googleadservices.com', 'app-measurement.com', 'gstatic.com', 'googlesyndication.com', 'google-analytics.com', 'googleusercontent.com', 'ggpht.com', 'googletagmanager.com') GROUP BY subscriber, server_hostname"
df = sqlContext.read.format("JDBC").options(
url = url,
dbtable="( " + query + " ) as temp",
**properties
).load()
df.show(50)
At the end of the query I added "as x" and wrapped it at the beginning with "select * from (normal query)"; it ran, but without showing any result. Spark 1.6 doesn't allow passing the query directly, which is why I tried passing it as the aliased subquery temp via dbtable.
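For reference, a minimal sketch of a read that avoids both errors (the shortened query here is illustrative; url and properties are the ones defined above): drop the $CONDITIONS placeholder, which is Sqoop syntax that the Vertica JDBC source does not substitute, and keep an alias on the subquery passed as dbtable.
# Illustrative sketch only: shortened query, reusing the url/properties defined above
subquery = """(SELECT subscriber, server_hostname, SUM(bytes_in) AS DL
               FROM traffic.stats
               WHERE subscriber LIKE '41601%'
               GROUP BY subscriber, server_hostname) AS temp"""

df = sqlContext.read.format("jdbc").options(
    url=url,              # e.g. jdbc:vertica://host:5433/dbname (assumed)
    dbtable=subquery,     # the trailing "AS temp" satisfies "Subquery in FROM must have an alias"
    **properties
).load()
df.show(50)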

Related

How to direct stream (Kafka) a JSON file in Spark and convert it into an RDD?

I wrote code that does a direct-stream (Kafka) word count when a file is given (in the producer).
Code:
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"
##OTHER FUNCTIONS/CLASSES
def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
I need to convert the input JSON file to a Spark DataFrame using the DStream.
This should work:
Once you have your variable kvs containing the TransformedDStream, you can simply map it and pass the data to a handler function like this:
data = kvs.map( lambda tuple: tuple[1] )
data.foreachRDD( lambda yourRdd: readMyRddsFromKafkaStream( yourRdd ) )
You should then define the handler function, which creates the DataFrame from your JSON data:
def readMyRddsFromKafkaStream( readRdd ):
    # Put the RDD into a DataFrame (spark here is an already-created SparkSession)
    df = spark.read.json( readRdd )
    df.registerTempTable( "temporary_table" )
    df = spark.sql( """
        SELECT *
        FROM temporary_table
    """ )
    df.show()
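If spark is not already defined on the driver, a minimal variant of the handler (an illustrative sketch assuming Spark 2.x, where SparkSession is available) can obtain one lazily and skip empty micro-batches:
from pyspark.sql import SparkSession

def readMyRddsFromKafkaStream(readRdd):
    # Sketch: reuse (or lazily create) a SparkSession and only act on non-empty batches
    spark = SparkSession.builder.getOrCreate()
    if not readRdd.isEmpty():
        df = spark.read.json(readRdd)
        df.registerTempTable("temporary_table")
        spark.sql("SELECT * FROM temporary_table").show()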
Hope it helps my friends :)

How to find out the neighbour vertices of a particular vertex in graphframe(pyspark)?

I am trying to find out the neighbouring vertices of a particular vertex using the GraphFrame API available in PySpark. How can I do it? For example, consider the following graph edges (the graph should be treated as bidirectional even though the input edges are directional).
edges = [[4,3],[4,5],[5,6],[3,6],[1,3],[1,0],[0,3]]
vertices = [0,1,3,4,5,6]
g = GraphFrame(vertices, edges)  # this makes the graph directional; is there a way to make it bidirectional?
Now I want to do something like:
degree(3) = 5
neighbour(3) = [4,5,6,1,0]
Here is my code, which takes an input file (edge.txt) like this:
v1 v2
4 3
4 5
5 6
3 6
1 3
1 0
0 3
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('myapp')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
spark = SparkSession(sc)
file_name = sys.argv[1]
log_txt = sc.textFile("/user/rikhan/"+str(file_name))
header = log_txt.first()
log_txt = log_txt.filter(lambda line : line!=header)
temp_var = log_txt.map(lambda k: k.split(" "))
hasattr(temp_var,"toDF")
log_df = temp_var.toDF(header.split(" "))
log_df = log_df.dropDuplicates(['v1','v2'])  # dropDuplicates returns a new DataFrame, so keep the result
from functools import reduce
from pyspark.sql.functions import col,lit,when
from graphframes import *
import networkx as nx
import networkx.generators.small as gs
import matplotlib.pyplot as plt
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql import DataFrame
from pyspark.sql import Column
from pyspark.sql import GroupedData
from pyspark.sql import DataFrameNaFunctions
from pyspark.sql import DataFrameStatFunctions
from pyspark.sql import functions
from pyspark.sql import types
from pyspark.sql import Window
edges = log_df.selectExpr("v1 as src","v2 as dst")
vertices = log_df.toPandas()['v1'].unique()
vertices2 = log_df.toPandas()['v2'].unique()
ver = vertices.tolist() + vertices2.tolist()
vertex = []
for x in ver:
    if x not in vertex:
        vertex.append(x)
rdd1 = sc.parallelize(vertex)
row_rdd = rdd1.map(lambda x: Row(x))
ver = spark.createDataFrame(row_rdd,['id'])
g = GraphFrame(ver,edges)
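One way to treat the graph as bidirectional is to union the edge DataFrame with its reversed copy before building the GraphFrame. A minimal sketch reusing the ver and edges DataFrames built above (the variable names in the sketch are illustrative):
# Make the edge set symmetric so every edge can be followed in both directions
bidirectional_edges = edges.union(edges.selectExpr("dst as src", "src as dst"))
g = GraphFrame(ver, bidirectional_edges)

# Neighbours of vertex 3: distinct destinations reachable over one edge
# (ids are strings here because they were read from the text file)
neighbours_of_3 = g.edges.filter("src = '3'").select("dst").distinct()
neighbours_of_3.show()

# Undirected degree of vertex 3 = number of distinct neighbours
print(neighbours_of_3.count())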

Creating a stream from a text file in Pyspark

I'm getting the following error when I try to create a stream from a text file in Pyspark:
TypeError: unbound method textFileStream() must be called with StreamingContext instance as first argument (got str instance instead)
I don't want to use SparkContext because I get another error, so to remove that error I have to use SparkSession.
My code:
import sys

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat import Statistics

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, 5)
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]
    ds1 = ssc.textFileStream(input_path1)
    lines1 = ds1.map(lambda x1: x1[1])
    windowedds1 = lines1.flatMap(lambda line1: line1.strip().split("\n")).map(lambda strelem1: float(strelem1)).window(5, 10)
    ds2 = ssc.textFileStream(input_path2)
    lines2 = ds2.map(lambda x2: x2[1])
    windowedds2 = lines2.flatMap(lambda line2: line2.strip().split("\n")).map(lambda strelem2: float(strelem2)).window(5, 10)
    result = Statistics.corr(windowedds1, windowedds2, method="pearson")
    if result > 0.7:
        print("ds1 and ds2 are correlated!!!")
    spark.stop()
Thank you!
You have to create a StreamingContext object first and then call textFileStream on that instance.
spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 1)
ds = ssc.textFileStream(input_path)
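A minimal end-to-end sketch of that pattern (the batch interval, the argv-based path and the float parsing are illustrative assumptions):
import sys
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 5)      # 5-second batch interval

input_path = sys.argv[1]
lines = ssc.textFileStream(input_path)             # DStream of lines from new files in the directory
values = lines.map(lambda s: float(s.strip()))     # parse each line as a number
values.pprint()                                    # print a sample of each batch

ssc.start()
ssc.awaitTermination()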

Add extra column for child data frame from parent data frame in nested XML in Spark

I am creating data after loading many XML files.
Each XML file has one unique field, fun:DataPartitionId.
I am creating many rows from each XML file.
Now I want to add this fun:DataPartitionId to each row produced from the XML.
For example, if the first XML yields 100 rows, then all 100 rows will have the same fun:DataPartitionId value.
So fun:DataPartitionId is a header field in each XML.
This is what I am doing:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{explode, udf}
val getDataPartition = udf { (DataPartition: String) =>
  if (DataPartition == "1") "SelfSourcedPublic"
  else if (DataPartition == "2") "Japan"
  else if (DataPartition == "3") "SelfSourcedPrivate"
  else "ThirdPartyPrivate"
}
val getFFActionParent = udf { (FFAction: String) =>
  if (FFAction == "Insert") "I|!|"
  else if (FFAction == "Overwrite") "I|!|"
  else "D|!|"
}
val getFFActionChild = udf { (FFAction: String) =>
  if (FFAction == "Insert") "I|!|"
  else if (FFAction == "Overwrite") "O|!|"
  else "D|!|"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfDataPartition=getDataPartition(dfContentEnvelope("env:Header.fun:DataPartitionId"))
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val df =dfContentItem.withColumn("DataPartition",dfDataPartition)
df.show()
When you read your xml file using
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
the fun:DataPartitionId column is read as Long:
fun:DataPartitionId: long (nullable = true)
so you should change the udf function to:
val getDataPartition = udf { (DataPartition: Long) =>
  if (DataPartition == 1) "SelfSourcedPublic"
  else if (DataPartition == 2) "Japan"
  else if (DataPartition == 3) "SelfSourcedPrivate"
  else "ThirdPartyPrivate"
}
If possible, you should use the when function instead of a udf function to improve processing speed and memory usage.
Now I want to add this fun:DataPartitionId to each row produced from the XML.
Your mistake is that you forgot to select that particular column, so the following code
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
should be
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartitionId"),$"column1.*")
Then you can apply the udf function
val df = dfContentItem.select(getDataPartition($"DataPartitionId"), $"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
So working code as a whole should be
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{explode, udf}
val getDataPartition = udf { (DataPartition: Long) =>
  if (DataPartition == 1) "SelfSourcedPublic"
  else if (DataPartition == 2) "Japan"
  else if (DataPartition == 3) "SelfSourcedPrivate"
  else "ThirdPartyPrivate"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartitionId"),$"column1.*")
val df = dfContentItem.select(getDataPartition($"DataPartitionId"), $"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
df.show(false)
And you can proceed with the rest of the code.

value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]

I am getting a compilation error when converting the pre-LDA transformation to a DataFrame using Scala in Spark 2.0. The specific code that throws the error is below:
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
The complete compilation error is:
Error:(132, 8) value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Here is the complete code:
import java.io.FileInputStream
import java.sql.{DriverManager, ResultSet}
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA => oldLDA}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object MPClassificationLDA {
/*Start: Configuration variable initialization*/
val props = new Properties
val fileStream = new FileInputStream("U:\\JIRA\\MP_Classification\\target\\classes\\mpclassification.properties")
props.load(fileStream)
val mpExtract = props.getProperty("mpExtract").toString
val shard6_db_server_name = props.getProperty("shard6_db_server_name").toString
val shard6_db_user_id = props.getProperty("shard6_db_user_id").toString
val shard6_db_user_pwd = props.getProperty("shard6_db_user_pwd").toString
val mp_output_file = props.getProperty("mp_output_file").toString
val spark_warehouse_path = props.getProperty("spark_warehouse_path").toString
val rf_model_file_path = props.getProperty("rf_model_file_path").toString
val windows_hadoop_home = props.getProperty("windows_hadoop_home").toString
val lda_vocabulary_size = props.getProperty("lda_vocabulary_size").toInt
val pre_lda_model_file_path = props.getProperty("pre_lda_model_file_path").toString
val lda_model_file_path = props.getProperty("lda_model_file_path").toString
fileStream.close()
/*End: Configuration variable initialization*/
val conf = new SparkConf().set("spark.sql.warehouse.dir", spark_warehouse_path)
def main(arg: Array[String]): Unit = {
//SQL Query definition and parameter values as parameter upon executing the Object
val cont_id = "14211599"
val top = "100000"
val start_date = "2016-05-01"
val end_date = "2016-06-01"
val mp_spark = SparkSession
.builder()
.master("local[*]")
.appName("MPClassificationLoadLDA")
.config(conf)
.getOrCreate()
MPClassificationLDACalculation(mp_spark, cont_id, top, start_date, end_date)
mp_spark.stop()
}
private def MPClassificationLDACalculation
(mp_spark: SparkSession
,cont_id: String
,top: String
,start_date: String
,end_date: String
): Unit = {
//DB connection definition
def createConnection() = {
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
DriverManager.getConnection("jdbc:sqlserver://" + shard6_db_server_name + ";user=" + shard6_db_user_id + ";password=" + shard6_db_user_pwd);
}
//DB Field Names definition
def extractvalues(r: ResultSet) = {
Row(r.getString(1),r.getString(2))
}
//Prepare SQL Statement with parameter value replacement
val query = """SELECT docId = audt_id, text = auction_title FROM brands6.dbo.uf_ds_marketplace_classification_listing(#cont_id, #top, '#start_date', '#end_date') WHERE ? < ? OPTION(RECOMPILE);"""
.replaceAll("#cont_id", cont_id)
.replaceAll("#top", top)
.replaceAll("#start_date", start_date)
.replaceAll("#end_date", end_date)
.stripMargin
//Connect to Source DB and execute the Prepared SQL Steatement
val mpDataRDD = new JdbcRDD(mp_spark.sparkContext
,createConnection
,query
,lowerBound = 0
,upperBound = 10000000
,numPartitions = 1
,mapRow = extractvalues)
val schema_string = "docId,text"
val fields = StructType(schema_string.split(",")
.map(fieldname => StructField(fieldname, StringType, true)))
//Create Data Frame using format identified through schema_string
val mpDF = mp_spark.createDataFrame(mpDataRDD, fields)
mpDF.collect()
val mp_listing_tmp = mpDF.selectExpr("cast(docId as long) docId", "text")
mp_listing_tmp.printSchema()
println(mp_listing_tmp.first)
val mp_listing_lda_df = mp_listing_tmp.withColumn("docId", mp_listing_tmp("docId"))
mp_listing_lda_df.printSchema()
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("rawTokens")
.setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
.setInputCol("rawTokens")
.setOutputCol("tokens")
val vocabSize = 4000
val countVectorizer = new CountVectorizer()
.setVocabSize(vocabSize)
.setInputCol("tokens")
.setOutputCol("features")
val PreLDApipeline = new Pipeline()
.setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
val PreLDAmodel = PreLDApipeline.fit(mp_listing_lda_df)
//comment out after saving it the first time
PreLDAmodel.write.overwrite().save(pre_lda_model_file_path)
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
//documents.printSchema()
val numTopics: Int = 20
val maxIterations: Int = 100
//note the FeaturesCol need to be set
val lda = new LDA()
.setOptimizer("em")
.setK(numTopics)
.setMaxIter(maxIterations)
.setFeaturesCol(("_2"))
val vocabArray = PreLDAmodel.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
}
}
I think it is related to conflicts in the imports section of the code. I would appreciate any help.
Two things need to be done:
Import implicits: Note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Move the case class outside of the method: the case class, with which you define the schema of the DataFrame, should be defined outside of the method that needs it. You can read more about it here: https://issues.scala-lang.org/browse/SI-6649
