Write parquet from another parquet with a new schema using pyspark - apache-spark

I am using pyspark dataframes. I want to read a parquet file and write it with a different schema from the original file.
The original schema is (it has 9,000 columns, I am just putting the first 5 for the example):
[('id', 'string'),
('date', 'string'),
('option', 'string'),
('cel1', 'string'),
('cel2', 'string')]
And I want to write:
[('id', 'integer'),
('date', 'integer'),
('option', 'integer'),
('cel1', 'integer'),
('cel2', 'integer')]
My code is:
df = sqlContext.read.parquet("PATH")
### SOME OPERATIONS ###
write_schema = StructType([StructField('id' , IntegerType(), True),
StructField('date' , IntegerType(), True),
StructField('option' , IntegerType(), True),
StructField('cel1' , IntegerType(), True),
StructField('cel2' , IntegerType(), True) ])
df.option("schema",write_schema).write("PATH")
After I run it I still have the same schema as the original data; everything is string, the schema did not change.
Also I tried using
df = sqlContext.read.option("schema",write_schema).parquet(PATH)
This option does not change the schema when I read it; it still shows the original one, so I use (as suggested here):
df = sqlContext.read.schema(write_schema).parquet(PATH)
This one works for the reading part; if I check the types I get:
df.dtypes
#>>[('id', 'int'),
# ('date', 'int'),
# ('option', 'int'),
# ('cel1', 'int'),
# ('cel2', 'int')]
But when I try to write the parquet I get an error:
Parquet column cannot be converted. Column: [id], Expected: IntegerType, Found: BINARY

Cast your columns to int and then try writing to another parquet file. No schema specification needed.
df = spark.read.parquet("filepath")
df2 = df.select(*map(lambda col: df[col].cast('int'), df.columns))
df2.write.parquet("filepath")
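If you have many columns (the post mentions about 9,000), a variant of the same idea is to drive the casts from a target schema instead of hard-coding 'int'. This is only a sketch and it assumes the field names in write_schema match the parquet column names:
from pyspark.sql.types import StructType, StructField, IntegerType
# Target types, keyed by column name (only two fields shown here)
write_schema = StructType([StructField('id', IntegerType(), True),
StructField('date', IntegerType(), True)])
df = spark.read.parquet("filepath")
# Cast every column to the type declared for it in write_schema
df2 = df.select([df[f.name].cast(f.dataType) for f in write_schema.fields])
df2.write.parquet("filepath_out")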

For this, you can actually enforce the schema right when reading the data.
You can modify the code as follows:
df = sqlContext.read.schema(write_schema).parquet("PATH")
df.write.parquet("NEW_PATH")

Related

PySpark JSON not able to infer schema for empty element

In PySpark, whenever I read a JSON file with an empty set element, the entire element is ignored in the resultant DataFrame.
Sample JSON:
{logs :[],pagination:{}}
And it only ignores the second element, i.e. pagination in the above example. Is there any way to read the JSON with the proper schema?
Yes, you can do it in two ways: with a schema and without a schema.
Reading JSON with a schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType,LongType
schema = StructType([StructField('email', StringType(), True),
StructField('first_name', StringType(), True),
StructField('gender', StringType(), True),
StructField('id', LongType(), True),
StructField('last_name', StringType(), True)])
df = spark.read.schema(schema).json(r'dbfs:/FileStore/MOCK_DATA__1_.json')
Reading JSON without a schema:
d1 = spark.read.json(r'dbfs:/FileStore/MOCK_DATA__1_.json')
d1.show()
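For the sample in the question itself, a minimal sketch with an explicit schema (the element type of logs is an assumption, since the array is empty, and the path is illustrative):
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
# Schema for {"logs": [], "pagination": {}}; logs is assumed to hold strings
schema = StructType([StructField('logs', ArrayType(StringType()), True),
StructField('pagination', StructType([]), True)])
df = spark.read.schema(schema).json('dbfs:/FileStore/sample_logs.json')
df.printSchema()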

Reading csv files in PySpark

I am trying to read a csv file and convert it into a dataframe.
input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name,Jan,Feb,March
As you can see, the csv file doesn't have trailing "," separators when there are no trailing expenses.
Code:
from pyspark.sql.types import *
input1= sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))
schema = StructType([StructField('Id',StringType(),True), StructField('Name',StringType(),True), StructField('Jan',StringType(),True), StructField('Feb',StringType(),True), StructField('Mar',StringType(),True)])
df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas which should handle everything for you. From there you can then convert the pandas DataFrame to spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:
import pandas as pd
# Reading in txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt',
sep=",")
# Converting to spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
This produced the expected output.
The faster method would be to import through spark:
# Importing csv file using pyspark
csv_import = sqlContext.read\
.format('csv')\
.options(sep = ',', header='true', inferSchema='true')\
.load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
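If you prefer to keep the original textFile/split approach from the question, a minimal sketch (assuming at most five fields per line) is to pad the short rows before applying the schema:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField(name, StringType(), True) for name in ['Id', 'Name', 'Jan', 'Feb', 'Mar']])
# Pad ragged rows with None so every row matches the five-field schema
input1 = sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(",")).map(lambda row: row + [None] * (5 - len(row)))
df3 = sqlContext.createDataFrame(input1, schema)
df3.show()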
Here are a couple of options for you to consider. These use the wildcard character, so you can loop through all folders and sub-folders, look for files with names that match a specific pattern, and merge everything into a single dataframe.
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.head()
myDFCsv.count()
//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
.withColumn("file_name",input_file_name())
myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()

How to create spark dataframe with column name which contains dot/period?

I have data in a list and want to convert it to a spark dataframe with one of the column names containing a "."
I wrote the below code which ran without any errors.
input_data = [('retail', '2017-01-03T13:21:00', 134),
('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True), \
StructField('date', StringType(), True), \
StructField("`US.sales`", FloatType(), True)])
input_mock_df = spark.createDataFrame(input_data, rdd_schema)
The below code returns the column names
input_mock_df.columns
But any operation on this dataframe gives an error, for example
input_mock_df.count()
How do I make a valid spark dataframe which contains a "."?
Note:
If I don't put "." in the column name, the code works perfectly.
I want to solve it using native Spark and not use pandas etc.
I have run the below code
input_data = [('retail', '2017-01-03T13:21:00', 134),
('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True), \
StructField('date', StringType(), True), \
StructField("US.sales", IntegerType(), True)])
input_mock_df = sqlContext.createDataFrame(input_data, rdd_schema)
input_mock_df.count()
and it works fine, returning the count as 2. Please try it and reply.
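As a side note, a column whose name contains a dot can still be referenced if you quote it with backticks, or you can rename it once and avoid the problem entirely; a short sketch (US_sales is just an example name):
# Quote the dotted name with backticks when selecting it
input_mock_df.select('`US.sales`').show()
# Or rename it once and work with a dot-free name afterwards
input_mock_df.withColumnRenamed('US.sales', 'US_sales').count()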

Pyspark: Transforming PythonRDD to Dataframe

Could someone guide me on converting a PythonRDD to a DataFrame?
As per my understanding, reading a file should create a DF, but in my case it has created a PythonRDD. I am finding it hard to convert the PythonRDD to a DataFrame. I could not find createDataFrame() or toDF().
Please find my code below to read a tab separated text file:
rdd1 = sparkCxt.textFile(setting.REFRESH_HDFS_DIR + "/Refresh")
rdd2 = rdd1.map(lambda row: unicode(row).lower().strip()\
if type(row) == unicode else row)
Now, I would want to convert PythonRDD to a DF.
I wanted to convert to DF to map the schema, so that I could do further processing at column level.
Also, please suggest if you think there is a better approach.
Please reply if more details are required.
Thank you.
Spark DataFrames can be created directly from a text file, but you should use sqlContext instead of sc (SparkContext), since sqlContext is an entry point for working with DataFrames.
df = sqlContext.read.text('path/to/my/file')
This will create a DataFrame with a single column named value. You can use UDFs to split it into the required columns.
Another approach would be to read the text files to an RDD, split it into columns using map, reduce, filter and other operations, and then convert the final RDD to a DataFrame.
For example, let's say we have an RDD named my_rdd with the following structure:
[(1, 'Alice', 23), (2, 'Bob', 25)]
We can easily convert it to a DataFrame:
df = sqlContext.createDataFrame(my_rdd, ['id', 'name', 'age'])
where id, name and age are names for our columns.
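Applied to the tab-separated file from the question, a minimal sketch (assuming each line has three fields; the column names are placeholders, not from the original post):
# Split each tab-separated line into fields, then attach column names
rows = rdd2.map(lambda line: line.split('\t'))
df = sqlContext.createDataFrame(rows, ['col1', 'col2', 'col3'])
df.show()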
You can try using toPandas(), although you should be cautious when doing so, since converting to a pandas DataFrame brings all the data into the driver's memory, which might cause an OOM error if your distributed data is large.
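If you do go that route, a minimal sketch that bounds the number of rows first to keep driver memory in check (the limit value is purely illustrative, and df is assumed to be a Spark DataFrame):
# Convert only a bounded sample to pandas instead of the whole dataset
pdf = df.limit(1000).toPandas()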
I would use the Spark-csv package (Spark-csv Github) and import directly into a dataframe after defining the schema.
For example:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([ \
StructField("year", IntegerType(), True), \
StructField("make", StringType(), True), \
StructField("model", StringType(), True), \
StructField("comment", StringType(), True), \
StructField("blank", StringType(), True)])
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.options(header='true') \
.load('cars.csv', schema = customSchema)
This defaults to a comma for the delimiter, but you can change that to a tab with something like:
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.options(header='true', delimiter='\t') \
.load('cars.csv', schema = customSchema)
Note that it is possible to infer the schema using another option, but this does require reading the entire file prior to loading the dataframe.
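For completeness, a sketch of that inference option with the same spark-csv reader (it triggers an extra pass over the file):
df_inferred = sqlContext.read \
.format('com.databricks.spark.csv') \
.options(header='true', inferSchema='true', delimiter='\t') \
.load('cars.csv')
df_inferred.printSchema()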

How to create an empty DataFrame? Why "ValueError: RDD is empty"?

I am trying to create an empty dataframe in Spark (Pyspark).
I am using a similar approach to the one discussed here, but it is not working.
This is my code
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
first = rdd.first()
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
Extending Joe Widen's answer, you can actually create the schema with no fields like so:
schema = StructType([])
so when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
DataFrame[]
>>> empty.schema
StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
At the time this answer was written, it looks like you need some sort of schema:
from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)
sc = spark.sparkContext
sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will work with Spark version 2.0.0 or later:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows then drops the "id" column, leaving you with a truly empty DataFrame.
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])
If you want an empty dataframe based on an existing one, simply limit the rows to 0.
In PySpark :
emptyDf = existingDf.limit(0)
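A closely related sketch that reuses the existing DataFrame's schema without scanning any of its data:
# Empty DataFrame with the same schema as existingDf
emptyDf = spark.createDataFrame([], existingDf.schema)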
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('SparkPractice').getOrCreate()
schema = StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()
This is a roundabout but simple way to create an empty spark df with an inferred schema
from pyspark.sql.functions import col
# Initialize a spark df using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; the schema is kept
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
|-- name: string (nullable = true)
|-- index: long (nullable = true)
|-- seq_#: long (nullable = true)
Seq.empty[String].toDF()
This will create an empty df. Helpful for testing purposes and all. (Scala-Spark)
In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Using the schema, passing an empty list will work:
df = spark.createDataFrame([], schema)
You can do it by loading an empty file (parquet, json etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Python, you can use this method to create one.
You can create an empty data frame by using the following syntax in PySpark:
df = spark.createDataFrame([], ["col1", "col2", ...])
where [] represents the empty values for col1 and col2. Then you can register it as a temp view for your SQL queries:
df.createOrReplaceTempView("artist")
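For completeness, a self-contained sketch (with an assumed two-column string schema) that registers an empty DataFrame and queries it with SQL:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField('col1', StringType(), True),
StructField('col2', StringType(), True)])
empty_df = spark.createDataFrame([], schema)
empty_df.createOrReplaceTempView("artist")
spark.sql("SELECT * FROM artist").show()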
