Create dataframe with schema provided as JSON file - apache-spark

How can I create a PySpark data frame from 2 JSON files?
file1: this file has the complete data
file2: this file has only the schema of the data in file1
file1
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}
file2
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},"name":"DESCR","nullable":true,"type":"string"},{"metadata":{},"name":"DESCRSHORT","nullable":true,"type":"string"}],"type":"struct"}]

First read the schema file with Python's json.load, then convert it to a StructType using StructType.fromJson.
import json
from pyspark.sql.types import StructType
with open("/path/to/file2.json") as f:
    json_schema = json.load(f)

schema = StructType.fromJson(json_schema[0])
Now just pass that schema to the DataFrameReader:
df = spark.read.schema(schema).json("/path/to/file1.json")
df.show()
#+---------+----------+----------+-------------------+----------+
#|RESIDENCY| EFFDT|EFF_STATUS| DESCR|DESCRSHORT|
#+---------+----------+----------+-------------------+----------+
#| AUS|01-01-1900| A|Australian Resident|Australian|
#+---------+----------+----------+-------------------+----------+
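As an aside, a schema file in file2's format can be produced from an existing DataFrame, since a StructType exposes its JSON representation; a minimal sketch (the output path is a placeholder):
import json

# df.schema.jsonValue() returns the {"type": "struct", "fields": [...]} dict;
# wrapping it in a list matches the layout of file2 above
with open("/path/to/file2.json", "w") as f:
    json.dump([df.schema.jsonValue()], f)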
EDIT:
If the file containing the schema is located in GCS, you can use Spark or the Hadoop API to get the file content. Here is an example using Spark:
file_content = spark.read.text("/path/to/file2.json").rdd.map(
    lambda r: " ".join([str(elt) for elt in r])
).reduce(
    lambda x, y: "\n".join([x, y])
)
json_schema = json.loads(file_content)
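The Hadoop API route mentioned above can also work; here is a rough sketch that goes through Spark's JVM gateway (not a public API), with a placeholder bucket path, and assumes the GCS connector is configured for the cluster:
# Read the schema file through the Hadoop FileSystem API via py4j
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path("gs://my-bucket/path/to/file2.json")
fs = path.getFileSystem(hadoop_conf)
stream = fs.open(path)
file_content = jvm.org.apache.commons.io.IOUtils.toString(stream, "UTF-8")
stream.close()
json_schema = json.loads(file_content)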

I found the GCSFS package for accessing files in GCS buckets:
pip install gcsfs

import json
import gcsfs

fs = gcsfs.GCSFileSystem(project='your GCP project name')
with fs.open('path/toread/sample.json', 'rb') as f:
    json_schema = json.load(f)

Related

User defined function in Python to read CSV file

'''
Reads in a dataset using pandas.

Parameters
----------
file_path : string containing path to a file

Returns
-------
Pandas DataFrame with data read in from the file path
'''
I have defined the following UDF but it doesn't work.
def read_data(file_path):
    pandas.read_csv('file_path')
Looks like you are missing the return statement, and the variable shouldn't have quotes:
import pandas as pd
def read_data(file_path: str) -> pd.DataFrame:
    return pd.read_csv(file_path)
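A quick sanity check, with a hypothetical CSV path:
df = read_data("data/sample.csv")
print(df.head())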

Python CSV merge issue

I'm new to Python and am currently trying to merge CSV files using Python 3.7.
import pandas as pd
import os
newdir = 'C:\\xxxx\\xxxx\\xxxx\\xxxx'
list = os.listdir(newdir)
writer = pd.ExcelWriter('test.xlsx')
for i in range(0, len(list)):
    data = pd.read_csv(list[i], encoding="gbk", index_col=0)
    data.to_excel(writer, sheet_name=list[i])
writer.save()
When I run it I get the following error:
FileNotFoundError: [Errno 2] File b'a.csv' does not exist: b'a.csv'
The problem is that the CSV files are not merged into one xlsx file. Please let me know the solution.
os.listdir only returns the filenames. You'll need to prepend the folder name to the filename.
import pandas as pd
import os
newdir = 'C:\\xxxx\\xxxx\\xxxx\\xxxx'
names = os.listdir(newdir)
writer = pd.ExcelWriter('test.xlsx')
for name in names:
    path = os.path.join(newdir, name)
    data = pd.read_csv(path, encoding="gbk", index_col=0)
    data.to_excel(writer, sheet_name=name)
writer.save()
Note that I did not bother to check the rest of your code.
Oh, and please avoid using built-ins (like list) to name your variables.
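If by "merge" you mean one combined sheet rather than one sheet per file (an assumption on my part), you could also concatenate the frames and write a single sheet; a sketch reusing the same folder:
import glob
import os
import pandas as pd

newdir = 'C:\\xxxx\\xxxx\\xxxx\\xxxx'
# read every CSV in the folder and stack them into one frame
frames = [pd.read_csv(p, encoding="gbk", index_col=0)
          for p in glob.glob(os.path.join(newdir, '*.csv'))]
pd.concat(frames).to_excel('test.xlsx', sheet_name='merged')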

Datatypes are not preserved when a pandas dataframe is partitioned and saved as a parquet file using pyarrow

Data types are not preserved when a pandas data frame is partitioned and saved as a parquet file using pyarrow.
Case 1: Saving a partitioned dataset - Data Types are NOT preserved
# Saving a pandas DataFrame locally as a partitioned parquet file using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)
# Loading the partitioned parquet dataset from local
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
Output:
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
name object
age category
dtype: object
Case 2: Non-partitioned dataset - Data types are preserved
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77,32,234],'name':['agan','bbobby','test'] })
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)
# Loading the non-partitioned parquet dataset from local
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
Output:
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
age int64
name object
dtype: object
There is no obvious way to do this; please refer to the JIRA issue below.
https://issues.apache.org/jira/browse/ARROW-6114
You can try this:
import pyarrow as pa
import pyarrow.parquet as pq
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
# Write the Arrow table to a single (non-partitioned) parquet file
pq.write_table(table, 'file_name.parquet')
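If you do need the partitioned layout, one workaround (my suggestion, not taken from the JIRA ticket) is to cast the partition column back to its original dtype after reading:
import pyarrow.parquet as pq

df = pq.ParquetDataset('test', filesystem=None).read_pandas().to_pandas()
# partition columns are read back as category; restore the original integer dtype
df['age'] = df['age'].astype('int64')
print(df.dtypes)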

Read bytes file from AWS S3 into AWS SageMaker conda_python3

Good morning,
Yesterday I saved a file from SageMaker conda_python3 to S3 like this:
s3 = boto3.client(
    's3',
    aws_access_key_id='XXXX',
    aws_secret_access_key='XXXX'
)
y = pandas.DataFrame(df.tag_factor, index=df.index)
s3.put_object(Body=y.values.tobytes(), Bucket='xxx', Key='xxx')
Today I am trying to open it with conda_python3 as a pandas.Series or as a numpy.array object, with this code:
s3 = boto3.client(
    's3',
    aws_access_key_id='XXX',
    aws_secret_access_key='XXX'
)
y_bytes = s3.get_object(Bucket='xxx', Key='xxx')
y = numpy.load(io.BytesIO(y_bytes['Body'].read()))
but I am getting this error: OSError: Failed to interpret file <_io.BytesIO object at 0x7fcb0b403258> as a pickle
I tried this:
y = numpy.fromfile(io.BytesIO(y_bytes['Body'].read()))
and I get this error:
UnsupportedOperation: fileno
I tried this:
y = pd.read_csv(io.BytesIO(y_bytes['Body'].read()), sep=" ", header=None)
and I get this error:
EmptyDataError: No columns to parse from file
How can I read this file?
As suggested in a previous comment, you probably want to save your data in a known file format when reading from and writing to S3.
As an example, here is some code that converts a pandas DataFrame to csv, saves it in S3, and reads the file from S3 back into a DataFrame.
import pandas as pd
import boto3
import io
df = pd.DataFrame(...)
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3 = boto3.client('s3')
bucket = 'mybucket'
key = 'myfile.csv'
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key=key)
obj = s3.get_object(Bucket=bucket, Key=key)
df2 = pd.read_csv(io.BytesIO(obj['Body'].read()))
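If you specifically want the NumPy array back (numpy.load failed above most likely because values.tobytes() drops the .npy header it expects), you could serialize with numpy.save instead; bucket and key names are placeholders:
import io
import boto3
import numpy as np

s3 = boto3.client('s3')

# save: write the .npy header plus data into an in-memory buffer, then upload
buf = io.BytesIO()
np.save(buf, y.values)
s3.put_object(Body=buf.getvalue(), Bucket='mybucket', Key='y.npy')

# load: download the object and read it back with numpy.load
obj = s3.get_object(Bucket='mybucket', Key='y.npy')
arr = np.load(io.BytesIO(obj['Body'].read()))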

Pyspark - How can I convert a parquet file to a text file with a delimiter

I have a parquet file with the following schema:
|DATE|ID|
I would like to convert it into a text file with tab delimiters as follows:
20170403 15284503
How can I do this in pyspark?
In Spark 2.0+, use
spark.read.parquet(input_path)
to read the parquet file into a DataFrame (DataFrameReader), then call
df.write.csv(output_path, sep='\t')
on the resulting DataFrame to write it out as tab-delimited (DataFrameWriter).
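Putting the two steps together (paths are placeholders):
df = spark.read.parquet("/path/to/input.parquet")
df.write.csv("/path/to/output", sep="\t")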
You can also read the .parquet file in Python into a DataFrame and, using a list data structure, save it as a text file. The sample code below reads word2vec (word to vector) output, produced by Spark MLlib's WordEmbeddings class, from a .parquet file and converts it to a tab-delimited .txt file.
import pandas as pd
import pyarrow.parquet as pq
import csv

data = pq.read_pandas('C://...//parquetFile.parquet', columns=['word', 'vector']).to_pandas()
df = pd.DataFrame(data)
vector = df['vector'].tolist()
word = df['word'].tolist()

# build one record per word: the word followed by its vector components
k = []
for i in range(len(word)):
    l = []
    l.append(word[i])
    l.extend(vector[i])
    k.append(l)

# you cannot save a data frame directly to a .txt file,
# so write the records to a .csv file first
with open('C://...//csvFile.csv', "w", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in k:
        writer.writerow(row)

outputTextFile = 'C://...//textFile.txt'
with open(outputTextFile, 'w') as f:
    for record in k:
        if len(record) > 0:
            # tab-delimit the elements and end each record with a newline
            f.write("\t".join(str(element) for element in record))
            f.write("\n")
I hope it helps :)
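For what it's worth, the intermediate CSV and the manual write loop can be avoided by expanding the vector column and letting pandas write the tab-delimited file directly; a sketch under the same column assumptions:
import pandas as pd
import pyarrow.parquet as pq

df = pq.read_pandas('C://...//parquetFile.parquet', columns=['word', 'vector']).to_pandas()
# expand each vector into its own columns next to the word column
expanded = pd.concat([df['word'], pd.DataFrame(df['vector'].tolist())], axis=1)
expanded.to_csv('C://...//textFile.txt', sep='\t', index=False, header=False)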
