I have the below PySpark code. I am reading JSON data from a REST API and trying to load it with PySpark, but I couldn't get the data into a DataFrame. Can someone help me with this?
import urllib.request
from pyspark.sql.types import StructType,StructField,StringType
schema = StructType([
    StructField('dropoff_latitude', StringType(), True),
    StructField('dropoff_longitude', StringType(), True),
    StructField('extra', StringType(), True),
    StructField('fare_amount', StringType(), True),
    StructField('improvement_surcharge', StringType(), True),
    StructField('lpep_dropoff_datetime', StringType(), True),
    StructField('mta_tax', StringType(), True),
    StructField('passenger_count', StringType(), True),
    StructField('payment_type', StringType(), True),
    StructField('pickup_latitude', StringType(), True),
    StructField('ratecodeid', StringType(), True),
    StructField('tip_amount', StringType(), True),
    StructField('tolls_amount', StringType(), True),
    StructField('total_amount', StringType(), True),
    StructField('trip_distance', StringType(), True),
    StructField('trip_type', StringType(), True),
    StructField('vendorid', StringType(), True)
])
url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
data = urllib.request.urlopen(url).read().decode('utf-8')
rdd = sc.parallelize(data)
df = spark.createDataFrame(rdd,schema)
df.show()
The error message is: TypeError: StructType can not accept object '[' in type <class 'str'>

I have been able to do this with a Dataset in Scala, but I cannot understand why it does not work in Python:
import spark.implicits._
// Load the data from the New York City Taxi data REST API for 2016 Green Taxi Trip Data
val url="https://data.cityofnewyork.us/resource/pqfs-mqru.json"
val result = scala.io.Source.fromURL(url).mkString
// Create a dataframe from the JSON data
val taxiDF = spark.read.json(Seq(result).toDS)
// Display the dataframe containing trip data
taxiDF.show()
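For comparison, here is a rough PySpark equivalent of the Scala snippet above (an untested sketch). It also shows why the original attempt fails: sc.parallelize(data) on a raw string produces an RDD of single characters, so createDataFrame sees '[' as the first record, whereas parallelizing a one-element list gives Spark the whole JSON document to parse.
import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taxiJson").getOrCreate()

url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
data = urllib.request.urlopen(url).read().decode('utf-8')

# Wrap the JSON string in a list so the RDD has one record (the full document),
# then let spark.read.json infer the schema, mirroring the Scala toDS approach.
rdd = spark.sparkContext.parallelize([data])
taxiDF = spark.read.json(rdd)
taxiDF.show()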
Just for others: here is the code that worked for me. requests.get(url).json() returns a list of dicts.
import requests
import json
from pyspark.sql.types import StructType,StructField,StringType
schema = StructType([
    StructField('dropoff_latitude', StringType(), True),
    StructField('dropoff_longitude', StringType(), True),
    StructField('extra', StringType(), True),
    StructField('fare_amount', StringType(), True),
    StructField('improvement_surcharge', StringType(), True),
    StructField('lpep_dropoff_datetime', StringType(), True),
    StructField('mta_tax', StringType(), True),
    StructField('passenger_count', StringType(), True),
    StructField('payment_type', StringType(), True),
    StructField('pickup_latitude', StringType(), True),
    StructField('ratecodeid', StringType(), True),
    StructField('tip_amount', StringType(), True),
    StructField('tolls_amount', StringType(), True),
    StructField('total_amount', StringType(), True),
    StructField('trip_distance', StringType(), True),
    StructField('trip_type', StringType(), True),
    StructField('vendorid', StringType(), True)
])
url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
r = requests.get(url)
data_json = r.json()
df = spark.createDataFrame(data_json,schema)
display(df)
I am trying to read a schema stored in a text file in HDFS and use it while creating a DataFrame.
schema=StructType([
StructField("col1",StringType(),True),
StructField("col2",StringType(),True),
StructField("col3",TimestampType(),True),
StructField("col4",
StructType([
StructField("col5",StringType(),True),
StructField("col6",
.... and so on
jsonDF = spark.read.schema(schema).json('/path/test.json')
Since the schema is too big, I don't want to define it inside the code. Can anyone suggest the best way to do this?
I tried the approaches below, but they don't work.
schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')
I figured out how to do this.
1. Define the schema of the JSON file
json_schema = StructType([
StructField("col1",StringType(),True),
StructField("col2",StringType(),True),
StructField("col3",TimestampType(),True),
StructField("col4",
StructType([
StructField("col5",StringType(),True),
StructField("col6",
2. Print the schema as JSON
print(json_schema.json())
3. Copy and paste the above output into the file sample.schema
4. In the code, recreate the schema as below
schema_file = 'path/sample.schema'
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))
5. Create a DataFrame using the above schema
jsonDF = spark.read.schema(schema).json('/path/test.json')
6. Insert the data from the DataFrame into the Hive table
jsonDF.write.mode("append").insertInto("hivetable")
Referred to the article - https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
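For completeness, steps 1 to 3 can also be done programmatically instead of copy-pasting. A small sketch (the schema, paths, and the local file write are just examples; on HDFS you would place the file there with whatever tool you normally use):
import json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Step 1: define the schema once
json_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True)
])

# Steps 2-3: serialize the schema to JSON and write it to the schema file
with open('/path/sample.schema', 'w') as f:
    f.write(json_schema.json())

# Step 4: later, read the file back and rebuild the StructType
schema_json = spark.read.text('/path/sample.schema').first()[0]
schema = StructType.fromJson(json.loads(schema_json))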
I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the schema as a dict in a file and then parse it to create the desired schema. I took inspiration from here. Currently it lacks support for nullable fields, and I have not tested it with deeper levels of nested structs.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

# Read the schema file (a JSON dict mapping column names to type names)
with open("/path/schema_file", "r") as f:
    dictString = f.read()

derived_schema = StructType([])
jdata = json.loads(dictString)
def get_type(v):
    # Map a type name from the schema file to the corresponding Spark type
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()

def generate_schema(jdata, derived_schema):
    # Walk the dict: string values are leaf columns, nested dicts become nested structs
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema
generate_schema(jdata, derived_schema)
from datetime import datetime
data = [("first", "the", datetime.utcnow(), ["as", 1])]
input_df = spark.createDataFrame(data, derived_schema)
input_df.printSchema()
With the file being:
{
"col1" : "StringType",
"col2" : "StringType",
"col3" : "TimestampType",
"col4" : {
"col5" : "StringType",
"col6" : "IntegerType"
}
}
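As an aside, one possible way to add the missing nullable support (the extended file format and helper below are purely my own assumption, not part of the answer above) is to store each leaf as an object with "type" and "nullable" keys and branch on that while walking the dict:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

# Hypothetical extended file format, e.g.
# {"col1": {"type": "StringType", "nullable": false},
#  "col4": {"col5": {"type": "StringType", "nullable": true}}}
TYPE_MAP = {"StringType": StringType(), "TimestampType": TimestampType(), "IntegerType": IntegerType()}

def generate_schema_nullable(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if isinstance(v, dict) and "type" in v:
            # leaf field: explicit type name plus a nullable flag
            derived_schema.add(StructField(k, TYPE_MAP[v["type"]], v.get("nullable", True)))
        else:
            # nested struct: recurse into the sub-dict
            derived_schema.add(StructField(k, generate_schema_nullable(v, StructType([])), True))
    return derived_schema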
The Python BigQuery API documentation indicates that arrays are possible; however, when passing a pandas DataFrame to BigQuery there is a pyarrow struct issue.
The only way around it seems to be to drop the struct columns and then use JSON normalisation to build a separate table.
from google.cloud import bigquery
project = 'lake'
client = bigquery.Client(credentials=credentials, project=project)
dataset_ref = client.dataset('XXX')
table_ref = dataset_ref.table('RAW_XXX')
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.write_disposition = 'WRITE_TRUNCATE'
client.load_table_from_dataframe(appended_data, table_ref, job_config=job_config).result()
This is the error received: NotImplementedError: struct
This is currently not supported due to how parquet serialization works.
A feature request to upload pandas DataFrame containing arrays was created at the client library's GitHub:
https://github.com/googleapis/google-cloud-python/issues/8544
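Until that lands, here is a hedged sketch of the workaround mentioned in the question, dropping the struct column and normalising it into a separate flat table. The column name 'nested' and the second table name are made up for illustration only:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(credentials=credentials, project=project)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.write_disposition = 'WRITE_TRUNCATE'

# Split the struct column out before loading, since the parquet
# serialization path rejects struct values.
flat_df = appended_data.drop(columns=['nested'])
nested_df = pd.json_normalize(appended_data['nested'].tolist())

client.load_table_from_dataframe(flat_df, dataset_ref.table('RAW_XXX'), job_config=job_config).result()
client.load_table_from_dataframe(nested_df, dataset_ref.table('RAW_XXX_NESTED'), job_config=job_config).result()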
I'm a complete beginner with PySpark, just trying some code to process my documents in Databricks Community Edition. I have a lot of HTML pages in a DataFrame and need to map a function that cleans all the HTML tags.
from selectolax.parser import HTMLParser
def get_text_selectolax(html):
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    for node in tree.css('body'):
        if node.tag == "strong":
            print( "node.html" )
            print( node.html )
    text = tree.body.text(separator='\n')
    return text
df_10 = df.limit(10) #Out: df_10:pyspark.sql.dataframe.DataFrame
rdd_10_2 = df_10.select("html").rdd.map( get_text_selectolax )
schema = StructType([
StructField("html", StringType()),
])
df_10_2 = spark.createDataFrame(rdd_10_2, schema)
df_10_2.show() # ----> the code fails here
I want to clean all my documents and get a DataFrame to work with.
Thanks.
Here is the complete notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5506005740338231/939083865254574/8659136733442891/latest.html
I managed to get this working, but in Scala, which is fine for me.
val version = "3.9.1"
val baseUrl = s"http://repo1.maven.org/maven2/edu/stanford/nlp/stanford-corenlp"
val model = s"stanford-corenlp-$version-models.jar" //
val url = s"$baseUrl/$version/$model"
if (!sc.listJars().exists(jar => jar.contains(model))) {
import scala.sys.process._
// download model
s"wget -N $url".!!
// make model files available to driver
s"jar xf $model".!!
// add model to workers
sc.addJar(model)
}
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
val df_limpo = ds.select(cleanxml('html).as("acordao"))
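For anyone who wants to stay in PySpark, here is a hedged sketch of the same cleanup wrapped in a UDF instead of a raw rdd.map. The map in the question passes Row objects, not HTML strings, to get_text_selectolax, which is the likely cause of the failure; selectolax must also be installed on the workers.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    if html is None:
        return None
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    return tree.body.text(separator='\n')

# Wrap the cleaner in a UDF so it receives the column value (a string), not a Row
clean_html_udf = udf(get_text_selectolax, StringType())
df_10_2 = df.limit(10).withColumn("clean_text", clean_html_udf("html"))
df_10_2.show()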
I am writing a Spark app where I need to evaluate streaming data against historical data that sits in a SQL Server database.
The idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it.
I am currently getting the streaming data as follows:
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext,functions as func,Row
sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc,10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")
######## Let's get the data from the db which is relevant for streaming ###
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"
########basic data for evaluation purpose ########
files_count = files.flatMap(lambda file: file.split( ))
pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'
tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def pre_parse(logline):
    """
    Read files as rows of SQL in pyspark streaming using the pattern; for use of logging.
    A 0/1 flag is added in case there is any failure in processing by this pattern.
    """
    match = re.search(pattern, logline)
    if match is None:
        return (logline, 0)
    else:
        return (
            Row(
                customer_id = match.group(8),
                trantype = match.group(5),
                amount = float(match.group(2))
            ), 1)

def parse():
    """
    Actual processing happens here.
    """
    parsed_tran = ssc.textFileStream(tranfiles).map(pre_parse)
    success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x: x[0])
    fail = parsed_tran.filter(lambda s: s[1] == 0).map(lambda x: x[0])
    if fail.count() > 0:
        print("no of non-parsed files: %d" % fail.count())
    return success, fail

success, fail = parse()
Now I want to evaluate it against the DataFrame that I get from the historical data:
base_data = sqlContext.read.format("jdbc").options(driver=driver,url=dataurl,database=db,user=credential,password=credential,dbtable=table).load()
Since this is returned as a DataFrame, how do I use it for my purpose?
The streaming programming guide says:
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."
This makes me even more confused about how to use the existing DataFrame with the streaming object. Any help is highly appreciated.
To manipulate DataFrames, you always need a SQLContext, so you can instantiate it like this:
sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)
These two contexts (SQLContext and StreamingContext) can coexist in the same job because they are associated with the same SparkContext.
But keep in mind that you can't instantiate two different SparkContexts in the same job.
Once you have created your DataFrame from your DStreams, you can join your historical DataFrame with the DataFrame created from your stream.
To do that, I would do something like:
yourDStream.foreachRDD(lambda rdd: sqlContext
.createDataFrame(rdd)
.join(historicalDF, ...)
...
)
Think about the amount of streamed data you need for the join when you manipulate streams; you may also be interested in the windowed functions.
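A slightly more concrete sketch of that pattern, reusing the variables from the question (the join key customer_id is an assumption about the stream_helper table; everything else is illustrative only):
# Load the historical data once and cache it so every micro-batch
# reuses it instead of hitting SQL Server again.
sqlc = getSqlContextInstance(sc)
base_data = sqlc.read.format("jdbc").options(
    driver=driver, url=dataurl, database=db,
    user=credential, password=credential, dbtable=table).load()
base_data.cache()

def evaluate(time, rdd):
    if rdd.isEmpty():
        return
    stream_df = getSqlContextInstance(rdd.context).createDataFrame(rdd)
    joined = stream_df.join(base_data, on="customer_id", how="left")
    joined.show()

success.foreachRDD(evaluate)
ssc.start()
ssc.awaitTermination()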
I'm trying to use a Spark SQL DataFrame to read some data in and apply a bunch of text cleanup functions to each row.
import langid
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
from pyspark.sql import HiveContext
hsC = HiveContext(sc)
df = hsC.sql("select * from sometable")
def check_lang(data_str):
    language = langid.classify(data_str)
    # only english
    record = ''
    if language[0] == 'en':
        # probability of correctly id'ing the language greater than 90%
        if language[1] > 0.9:
            record = data_str
    return record
check_lang_udf = udf(lambda x: check_lang(x), StringType())
clean_df = df.select("Field1", check_lang_udf("TextField"))
However when I attempt to run this I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.select.
: java.lang.AssertionError: assertion failed: Unable to evaluate PythonUDF. Missing input attributes
I've spent a good deal of time trying to gather more information on this, but I can't find anything.
As a side note, I know the code below works, but I'd like to stick with DataFrames.
removeNonEn = data.map(lambda record: (record[0], check_lang(record[1])))
I haven't tried this code, but the API docs suggest it should work:
hsC.registerFunction("check_lang", check_lang)
clean_df = df.selectExpr("Field1", "check_lang(TextField)")
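If that route works, the registered function can also be used straight from SQL. A short untested sketch, assuming the same HiveContext and table as in the question:
# registerFunction's return type defaults to StringType, which matches check_lang
hsC.registerFunction("check_lang", check_lang)
clean_df = hsC.sql("select Field1, check_lang(TextField) as TextField_en from sometable")
clean_df.show()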