PySpark - How can I convert a parquet file to a text file with a delimiter - apache-spark

I have a parquet file with the following schema:
|DATE|ID|
I would like to convert it into a text file with tab delimiters as follows:
20170403 15284503
How can I do this in pyspark?

In Spark 2.0+, use
spark.read.parquet(input_path)
to read the parquet file into a DataFrame (DataFrameReader), and
df.write.csv(output_path, sep='\t')
to write that DataFrame out as tab-delimited text (DataFrameWriter).
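Putting the two steps together, a minimal end-to-end sketch (the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the parquet file into a DataFrame
df = spark.read.parquet("/path/to/input.parquet")

# write it out as tab-delimited text (Spark writes one part file per partition)
df.write.csv("/path/to/output_dir", sep="\t")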

You can read your .parquet file in Python into a pandas DataFrame and, using a list data structure, save it to a text file. Sample code is below: it reads word2vec (word-to-vector) output, produced by Spark MLlib's WordEmbeddings class, from a .parquet file and converts it to a tab-delimited .txt file.
import pandas as pd
import pyarrow.parquet as pq
import csv

# read the parquet file into a pandas DataFrame
data = pq.read_pandas('C://...//parquetFile.parquet', columns=['word', 'vector']).to_pandas()
df = pd.DataFrame(data)

vector = df['vector'].tolist()
word = df['word'].tolist()

# build one row per word: [word, v1, v2, ...]
k = []
for i in range(len(word)):
    row = [word[i]]
    row.extend(vector[i])
    k.append(row)

# you cannot save a DataFrame directly to a .txt file,
# so write the rows to a .csv file first
with open('C://...//csvFile.csv', "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    for row in k:
        writer.writerow(row)

# write the same rows as a tab-delimited .txt file
outputTextFile = 'C://...//textFile.txt'
with open(outputTextFile, 'w') as f:
    for record in k:
        if len(record) > 0:
            # tab-delimit the elements of each record
            f.write("\t".join(str(element) for element in record))
            # newline after each record
            f.write("\n")
I hope it helps :)
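A shorter sketch of the same idea (my own variation, not part of the original answer): expand the per-row vector list into columns with pandas and write the tab-delimited file directly.

import pandas as pd
import pyarrow.parquet as pq

data = pq.read_pandas('C://...//parquetFile.parquet', columns=['word', 'vector']).to_pandas()
# expand each row's vector list into its own columns, then write with a tab separator
out = pd.concat([data['word'], pd.DataFrame(data['vector'].tolist())], axis=1)
out.to_csv('C://...//textFile.txt', sep='\t', index=False, header=False)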

Related

How to get specific data from excel

Any idea how I can access or get the box data (see image) under the TI_Binning tab of an Excel file using Python? What module or similar code can you recommend? I just need that specific data and want to append it to another file such as a .txt file.
Getting the data you circled:
import pandas as pd
df = pd.read_excel('yourfilepath', 'TI_Binning', skiprows=2)
df = df[['Number', 'Name']]
To append to an existing text file:
import numpy as np
with open("filetoappenddata.txt", "ab") as f:
    np.savetxt(f, df.values)
More info here on np.savetxt for formats to fit your output need.
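For example, if you want tab-separated text output, a sketch (assuming a text-mode append rather than the binary mode above) could pass fmt and delimiter to np.savetxt:

import numpy as np
import pandas as pd

df = pd.read_excel('yourfilepath', 'TI_Binning', skiprows=2)[['Number', 'Name']]
with open("filetoappenddata.txt", "a") as f:
    # fmt="%s" writes every value as a string; delimiter sets the column separator
    np.savetxt(f, df.values, fmt="%s", delimiter="\t")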

Create dataframe with schema provided as JSON file

How can I create a pyspark data frame with 2 JSON files?
file1: this file has complete data
file2: this file has only the schema of file1 data.
file1
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}
file2
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},"name":"DESCR","nullable":true,"type":"string"},{"metadata":{},"name":"DESCRSHORT","nullable":true,"type":"string"}],"type":"struct"}]
You have to first read the schema file using Python's json.load, then convert it to a DataType using StructType.fromJson.
import json
from pyspark.sql.types import StructType

with open("/path/to/file2.json") as f:
    json_schema = json.load(f)

schema = StructType.fromJson(json_schema[0])
Now just pass that schema to DataFrame Reader:
df = spark.read.schema(schema).json("/path/to/file1.json")
df.show()
#+---------+----------+----------+-------------------+----------+
#|RESIDENCY| EFFDT|EFF_STATUS| DESCR|DESCRSHORT|
#+---------+----------+----------+-------------------+----------+
#| AUS|01-01-1900| A|Australian Resident|Australian|
#+---------+----------+----------+-------------------+----------+
EDIT:
If the file containing the schema is located in GCS, you can use Spark or Hadoop API to get the file content. Here is an example using Spark:
file_content = spark.read.text("/path/to/file2.json").rdd.map(
    lambda r: " ".join([str(elt) for elt in r])
).reduce(
    lambda x, y: "\n".join([x, y])
)
json_schema = json.loads(file_content)
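A possibly simpler variant (my suggestion, not from the original answer) reads the whole file as a single string with wholeTextFiles, which avoids reassembling the lines:

# wholeTextFiles yields (path, content) pairs; take the content of the first (only) file
file_content = spark.sparkContext.wholeTextFiles("/path/to/file2.json").values().first()
json_schema = json.loads(file_content)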
I have found the GCSFS package for accessing files in GCP buckets:
pip install gcsfs
import gcsfs
fs = gcsfs.GCSFileSystem(project='your GCP project name')
with fs.open('path/toread/sample.json', 'rb') as f:
    json_schema = json.load(f)
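Putting it together, whichever way the schema JSON is loaded, the last two steps are the same as above:

schema = StructType.fromJson(json_schema[0])
df = spark.read.schema(schema).json("/path/to/file1.json")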

How to get column names from a dataframe dynamically to geojson properties

I am trying to read the column names from a dataframe and append them to the geojson properties dynamically. It worked by hard-coding the column names, but I want to do it without hard-coding.
Can anyone help me get those values (not by iterating geojson rows)?
import pandas as pd
import geojson

def data2geojson(df):
    #s="name=X[0],description=X[1],LAT-x[2]"
    features = []
    insert_features = lambda X: features.append(
        geojson.Feature(geometry=geojson.Point((float(X["LONG"]), float(X["LAT"]))),
                        properties=dict(name=X[0], description=X[1])))
    df.apply(insert_features, axis=1)
    #with open('/dbfs/FileStore/tables/geojson11.geojson', 'w', encoding='utf8') as fp:
    #    geojson.dump(geojson.FeatureCollection(features), fp, sort_keys=True, ensure_ascii=False)
    print(features)

df = spark.sql("select * from geojson1")
df = df.toPandas()
data2geojson(df)
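One possible approach (a sketch of my own, since no answer is quoted here): build the properties dict from df.columns so the names are not hard-coded.

import geojson

def data2geojson(df, lon_col="LONG", lat_col="LAT"):
    features = []
    # every column except the coordinate columns becomes a property
    prop_cols = [c for c in df.columns if c not in (lon_col, lat_col)]
    def insert_features(X):
        features.append(
            geojson.Feature(
                geometry=geojson.Point((float(X[lon_col]), float(X[lat_col]))),
                properties={c: X[c] for c in prop_cols}))
    df.apply(insert_features, axis=1)
    return geojson.FeatureCollection(features)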

Problem with processing large(>1 GB) CSV file

I have a large CSV file and I have to sort and write the sorted data to another csv file. The CSV file has 10 columns. Here is my code for sorting.
data = [ x.strip().split(',') for x in open(filename+'.csv', 'r').readlines() if x[0] != 'I' ]
data = sorted(data, key=lambda x: (x[6], x[7], x[8], int(x[2])))
with open(filename + '_sorted.csv', 'w') as fout:
    for x in data:
        print(','.join(x), file=fout)
It works fine with file size below 500 Megabytes but cannot process files with a size greater than 1 GB. Is there any way I can make this process memory efficient? I am running this code on Google Colab.
Here is a link to a blog about using pandas for large datasets. In the examples from the link, they analyze datasets of roughly 1 GB in size.
Simply run the following to import your CSV data into Python:
import pandas as pd
gl = pd.read_csv('game_logs.csv', sep = ',')
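If the whole file still does not fit in memory, one common approach (a sketch of mine, not from the linked blog) is to read and process the CSV in chunks with the chunksize argument:

import pandas as pd

# read the file 100,000 rows at a time instead of all at once
for chunk in pd.read_csv('game_logs.csv', sep=',', chunksize=100000):
    # process each chunk here (filter, aggregate, or append to an output file)
    print(chunk.shape)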

read excel file with words and write it to csv file in Python

I have an Excel file with a string stored in each cell:
rtypl srtyn OCVXZ srtyn
KPLNV KLNWZ bdfgh KLNWZ
xcvwh mvwhd WQKXM mvwhd
GYTR xvnm YTZN YTZN
ngws jklp PLNM jklp
I want to read the Excel file and write it to a CSV file, as you can see below:
import pandas as pd
import csv
df = pd.read_excel(file, encoding='utf-16')
words = open("words.csv", 'wb')
wr = csv.writer(words, dialect='excel')
for item in df:
    wr.writerow(item)
But it writes each cell as separate comma-separated characters rather than as a string:
r,t,y,p,l
I am limited to writing the file as CSV because I am going to use the result in a library that has lots of facilities for CSV files. Any advice on how I can read all the rows as strings in each cell is appreciated.
You can try the easiest solution:
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_excel(file, encoding='utf-16')
df.to_csv('words.csv', encoding='utf-16')
Adding to zipa's answer: if the Excel file has multiple sheets, you can also try
import pandas as pd
df = pd.read_excel(file, 'Sheet1')
df.to_csv('words.csv')
Refer to:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
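If you need every sheet, a small sketch (an assumption on my part, not from the answers above): sheet_name=None makes read_excel return a dict of DataFrames keyed by sheet name, which you can write to separate CSV files.

import pandas as pd

sheets = pd.read_excel(file, sheet_name=None)  # dict of {sheet name: DataFrame}
for name, sheet_df in sheets.items():
    sheet_df.to_csv(name + '.csv', index=False)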
