Arrays not supported in Bigquery Python API - python-3.x

The documentation for the BigQuery Python API indicates that arrays are supported; however, when passing a pandas DataFrame to BigQuery there is a pyarrow struct issue.
The only way around it seems to be to drop the array columns and then use json_normalize to load them into a separate table.
from google.cloud import bigquery

project = 'lake'
client = bigquery.Client(credentials=credentials, project=project)
dataset_ref = client.dataset('XXX')
table_ref = dataset_ref.table('RAW_XXX')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.write_disposition = 'WRITE_TRUNCATE'

client.load_table_from_dataframe(appended_data, table_ref, job_config=job_config).result()
This is the error received: NotImplementedError: struct

This is currently not supported due to how parquet serialization works.
A feature request for uploading pandas DataFrames containing arrays was filed on the client library's GitHub:
https://github.com/googleapis/google-cloud-python/issues/8544
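As a minimal sketch of the workaround mentioned in the question (dropping the array column and loading it as a separate, flattened table), assuming a hypothetical array-of-dicts column named 'items' and a join key 'id' that are not in the original post:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(credentials=credentials, project='lake')
dataset_ref = client.dataset('XXX')

# 'items' (array of dicts) and 'id' (parent key) are hypothetical column names
child = pd.json_normalize(
    appended_data.to_dict('records'),  # one dict per row
    record_path='items',               # explode the array elements into rows
    meta=['id'],                       # carry the parent key for joining later
)
parent = appended_data.drop(columns=['items'])

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.write_disposition = 'WRITE_TRUNCATE'

client.load_table_from_dataframe(parent, dataset_ref.table('RAW_XXX'),
                                 job_config=job_config).result()
client.load_table_from_dataframe(child, dataset_ref.table('RAW_XXX_ITEMS'),
                                 job_config=job_config).result()

Both tables then contain only scalar columns, so the parquet serialization no longer hits the unsupported struct path.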

Related

Using PySpark to read in datalake table and can't parse timestamp column in Synapse Analytics

I can read in the data lake table and print the schema, but if I try to display the data I get the following error. I am working within Synapse Analytics using a PySpark Notebook and an Apache Spark Pool.
See error message:
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '10/27/2022 1:14:31 PM' in the new parser.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
I don't want to use the LEGACY version.
I've tried converting it using the following code:
df = df.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"M/dd/yyyy h:m:s"))
df = df.withColumn("SinkModifiedOn",to_date(col("SinkModifiedOn"),"M/dd/yyyy h:m:s"))
I've also tried converting the suspect columns to StringType() or DateType() but no luck.
Any help appreciated
Thank you
Try the script with the date format below:
df = df1.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"MM/dd/yyyy h:mm:s a"))
I repro'd the same with sample input. Below is the approach.
Code:
from pyspark.sql.functions import to_date, col

df1 = spark.createDataFrame(
    data=[("1", "Arpit", "10/27/2022 1:14:31 PM"),
          ("2", "Anand", "10/28/2022 1:14:31 PM"),
          ("3", "Mike", "10/29/2022 1:14:31 PM")],
    schema=["id", "Name", "SinkCreatedOn"])
df1.printSchema()

df_output = df1.withColumn("SinkCreatedOn", to_date(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:s a"))

df1.show()
df_output.show()
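If the time portion needs to be preserved rather than truncated to a date, a sketch using to_timestamp with an equivalent pattern (my addition, not part of the original answer) would be:

from pyspark.sql.functions import to_timestamp, col

# Keep both date and time instead of truncating to a date
df_ts = df1.withColumn("SinkCreatedOn",
                       to_timestamp(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:ss a"))
df_ts.printSchema()  # SinkCreatedOn should now be of timestamp type
df_ts.show(truncate=False)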

Spark s3 csv files read order

Say there are three files in an S3 folder. Does spark.read.csv('s3://bucketname/folder1/*.csv') read the files in order or not?
If not, is there a way to order the files while reading the whole folder, when the files arrive at different time intervals?
File name | S3 upload / last modified time
s3:bucketname/folder1/file1.csv | 01:00:00
s3:bucketname/folder1/file2.csv | 01:10:00
s3:bucketname/folder1/file3.csv | 01:20:00
You can achieve this as follows.
Iterate over all the files in the bucket and load each CSV, adding a new column modified_date with the file's last-modified time. Keep all the loaded DataFrames in dfs_list. Since PySpark does lazy evaluation, it will not load the data immediately.
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')

dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    df = (spark.read.csv('s3a://' + my_bucket.name + '/' + file_object.key)
          .withColumn("modified_date", lit(file_object.last_modified)))
    dfs_list.append(df)
Now take the union of all the DataFrames using PySpark's unionAll function and then sort the data by modified_date.
from functools import reduce
from pyspark.sql import DataFrame
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('modified_date')
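Alternatively, when the file names themselves sort in arrival order (as file1, file2, file3 do here), a sketch that avoids listing the bucket with boto3 is to tag each row with its source path via input_file_name(); the naming assumption is mine, not part of the original answer:

from pyspark.sql.functions import input_file_name

# Read the whole folder at once, record the source file of each row,
# then sort by the file path (valid only if names sort in arrival order)
df = (spark.read.csv('s3a://bucketname/folder1/*.csv')
      .withColumn('source_file', input_file_name())
      .orderBy('source_file'))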

Error while creating data frame from Rest Api in Pyspark

I have the PySpark code below. I am reading JSON data from a REST API and trying to load it using PySpark.
But I couldn't read the data into a DataFrame in Spark. Can someone help me with this?
import urllib.request
from pyspark.sql.types import StructType,StructField,StringType
schema = StructType([StructField('dropoff_latitude',StringType(),True),\
StructField('dropoff_longitude',StringType(),True),
StructField('extra',StringType(),True),\
StructField('fare_amount',StringType(),True),\
StructField('improvement_surcharge',StringType(),True),\
StructField('lpep_dropoff_datetime',StringType(),True),\
StructField('mta_tax',StringType(),True),\
StructField('passenger_count',StringType(),True),\
StructField('payment_type',StringType(),True),\
StructField('pickup_latitude',StringType(),True),\
StructField('ratecodeid',StringType(),True),\
StructField('tip_amount',StringType(),True),\
StructField('tolls_amount',StringType(),True),\
StructField('total_amount',StringType(),True),\
StructField('trip_distance',StringType(),True),\
StructField('trip_type',StringType(),True),\
StructField('vendorid',StringType(),True)
])
url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
data = urllib.request.urlopen(url).read().decode('utf-8')
rdd = sc.parallelize(data)
df = spark.createDataFrame(rdd,schema)
df.show()
The error message is: TypeError: StructType can not accept object '[' in type <class 'str'>
I have been able to do this using a Dataset in Scala, but I am not able to understand why it's not possible using Python.
import spark.implicits._
// Load the data from the New York City Taxi data REST API for 2016 Green Taxi Trip Data
val url="https://data.cityofnewyork.us/resource/pqfs-mqru.json"
val result = scala.io.Source.fromURL(url).mkString
// Create a dataframe from the JSON data
val taxiDF = spark.read.json(Seq(result).toDS)
// Display the dataframe containing trip data
taxiDF.show()
Just for others:
Here is the code that worked for me. requests.get(url).json() returns a list.
import requests
import json
from pyspark.sql.types import StructType,StructField,StringType
schema = StructType([StructField('dropoff_latitude',StringType(),True),\
StructField('dropoff_longitude',StringType(),True),
StructField('extra',StringType(),True),\
StructField('fare_amount',StringType(),True),\
StructField('improvement_surcharge',StringType(),True),\
StructField('lpep_dropoff_datetime',StringType(),True),\
StructField('mta_tax',StringType(),True),\
StructField('passenger_count',StringType(),True),\
StructField('payment_type',StringType(),True),\
StructField('pickup_latitude',StringType(),True),\
StructField('ratecodeid',StringType(),True),\
StructField('tip_amount',StringType(),True),\
StructField('tolls_amount',StringType(),True),\
StructField('total_amount',StringType(),True),\
StructField('trip_distance',StringType(),True),\
StructField('trip_type',StringType(),True),\
StructField('vendorid',StringType(),True)
])
url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
r = requests.get(url)
data_json = r.json()
df = spark.createDataFrame(data_json,schema)
display(df)
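For what it's worth, the original attempt fails because sc.parallelize(data) splits the JSON string into individual characters, so createDataFrame receives one-character strings (hence the '[' in the error). A sketch of the closest PySpark equivalent of the Scala snippet, letting Spark infer the schema (my adaptation, not from the original post):

import urllib.request

url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
result = urllib.request.urlopen(url).read().decode('utf-8')

# Wrap the whole JSON document as a single RDD element and let Spark parse it
taxi_df = spark.read.json(sc.parallelize([result]))
taxi_df.show()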

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my pyspark sql dataframe to json and then save as a file.
df_final = df_final.union(join_df)
df_final contains the value as such:
I tried something like this, but it created an invalid JSON.
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
  {"Variable":"Col1", "Min":"20", "Max":"30"},
  {"Variable":"Col2", "Min":"25", "Max":"40"}
]
For PySpark you can store your DataFrame directly as a JSON file; there is no need to convert the DataFrame to JSON yourself.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your DataFrame to JSON, then you can use:
df_final.toJSON()
A solution can be using collect and then using json.dump:
import json

collected_df = df_final.collect()
data = [row.asDict() for row in collected_df]  # Row objects are not JSON serializable
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note that this loads the whole DataFrame into driver memory, so it is only recommended for small DataFrames.
If you want to use Spark to process the result as JSON files, I think your output layout in HDFS is right.
I assume you ran into the issue that you cannot read the data smoothly from a normal Python script using:
with open('data.json') as f:
    data = json.load(f)
You should try to read data line by line:
import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and then you can use pandas to create a DataFrame:
import pandas as pd

df = pd.DataFrame(data)
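Building on that, a small sketch (the part-file glob and the local filesystem paths are my assumptions) that turns the line-delimited output written by Spark into the single JSON array the question asks for:

import glob
import json

# Spark writes a directory of part files; pick the single part file after coalesce(1)
part_file = glob.glob(data_output_file + 'createjson.json/part-*')[0]
with open(part_file) as f:
    records = [json.loads(line) for line in f]

# Re-emit the records as one JSON array
with open(data_output_file + 'createjson_array.json', 'w') as out:
    json.dump(records, out)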

Reading Unzipped Shapefiles stored in AWS S3 from AWS EMR Cluster using PySpark in Jupyter Notebook

I'm completely new to AWS EMR and Apache Spark. I'm trying to assign GeoIDs to residential properties using shapefiles. I'm not able to read the shapefiles from my S3 bucket. Please help me understand what is going on, as I couldn't find any answer on the internet that explains the exact problem.
import shapefile
import pandas as pd
from shapely.geometry import shape  # needed for the centroid calculation

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """
    # read file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
Files that I want to read (screenshot in the original question)
The error that I'm getting while reading from the bucket (screenshot in the original question)
I really want to read these shapefiles in AWS EMR cluster, as it's not possible for me to work locally on them individually. Any kind of help is appreciated.
I was able to read my shapefiles from the S3 bucket as binary objects at first, then build a wrapper function around that, and finally pass the individual file objects to the shapefile.Reader() method in the .dbf, .shp and .shx formats separately.
This was happening because PySpark cannot read formats that are not provided in SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
My solution
def read_shapefile(shp_path):
    import io
    import shapefile
    import pandas as pd
    from shapely.geometry import shape

    # read the .shp/.shx/.dbf components from S3 as binary objects
    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())

    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]))

    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df
block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.
