How to create a schema for a dataset in a Hive table? - apache-spark

I am building a schema for the dataset below, which comes from a Hive table.
After processing, I have to write the data to S3.
I need to restructure the data and group each user ID's interactions by date, producing JSON in the format shown in the attached image.
To build this schema, I have prepared a StructType with an array.
from pyspark.sql.types import (ArrayType, LongType, MapType, StringType,
                               StructField, StructType)

def build_schema():
    # pageViewMap maps a string key to an array of (PageId, count) structs.
    fields = [
        StructField("expUserId", StringType(), True),
        StructField("recordDate", StringType(), True),
        StructField("siteId", StringType(), True),
        StructField("siteName", StringType(), True),
        StructField("itineraryNumber", StringType(), True),
        StructField("travelStartDate", StringType(), True),
        StructField("travelEndDate", StringType(), True),
        StructField("destinationID", StringType(), True),
        StructField("lineOfBusiness", StringType(), True),
        StructField("pageViewMap", MapType(StringType(), ArrayType(StructType([
            StructField("PageId", StringType(), True),
            StructField("count", LongType(), True)]))), True)
    ]
    return StructType(fields)
Is this schema correct? How do I convert the DataFrame to the JSON structure below?

Why not just use a SparkSession to read in the JSON and use schema to show the inferred structure?
spark.read.json(inputPath).schema

If your dataset is in Hive, read it using a JDBC or Hive integration layer (see Hive Tables or JDBC To Other Databases in the official documentation of Spark).
It is as simple as spark.read.format("jdbc")...load() or spark.read.table respectively (see DataFrameReader API in the official documentation).
What's nice about this approach is that Spark can automatically infer the schema for you (so you can leave that out and have more time for yourself!)
Once the dataset is in your hands as a DataFrame or Dataset, you can save it to S3 in JSON format as follows:
inventoryDF.write.format("json").save("s3n://...")
See JSON Datasets and DataFrameWriter API in the official documentation.
I strongly recommend letting Spark do the hard work so you don't have to.

You can create a new DataFrame from JSON with your own schema.
import org.apache.spark.sql.types._
val myManualSchema = new StructType(Array(
new StructField("column1", StringType, true),
new StructField("column2", LongType, false)
))
val myDf = spark.read.format("json")
.schema(myManualSchema)
.load("/x/y/zddd.json")
A DataFrame can also be created without specifying a schema manually; Spark then infers the schema from the input file.
val df = spark.read.format("json").load("/x/y/zddd.json")
To get the inferred schema from the JSON file, use the command below.
val SchJson = spark.read.format("json").load("/x/y/zddd.json").schema

Related

Loading selected column from csv file to dataframe in Spark

I am trying to load a CSV file into a Spark DataFrame. The CSV file has no header, but I know which field corresponds to what.
The problem is that my CSV has about 35 fields and I am interested in only a few of them. Is there a way to load only the selected columns and map them to the corresponding fields defined in my schema?
Let's say we have the following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala, my code is something like this:
val sample_schema = StructType(Array(
StructField("Name", StringType, nullable = false),
StructField("unique_number", StringType, nullable = false),
StructField("state", StringType, nullable = false)))
val blogsDF = sparkSession.read.schema(sample_schema)
.option("header", true)
.csv(file_path)
This will load the data into a dataframe, but it will not be in the order I want.
What I want is for each CSV record to be split and loaded according to the following mapping:
col1 --> Name
col2 --> unique id
col5 --> state
Not sure if we can do this kind of operation before loading data into DataFrame. I know another approach wherein we can load the data into one dataframe, and then select few columns and create another dataframe, just want to check if we can map during data load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit
Have you tried this?
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType, LongType)
schema = StructType([StructField("a", IntegerType(), True),
StructField("b", IntegerType(), True),
StructField("c", StringType(), True),
StructField("d", StringType(), True),
StructField("e", DoubleType(), True),
StructField("f", LongType(), True),
])
df = spark.read.csv('blablabla', schema=schema)
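For the column mapping asked about in the question (col1, col2, col5), one option is to rely on the positional names Spark assigns to a headerless CSV (_c0, _c1, ...) and rename only the wanted columns right after the read. This is a sketch under assumptions: the helper names and the WANTED mapping are hypothetical, positions are zero-based, and spark is an existing SparkSession passed in by the caller.

```python
# Hypothetical mapping from zero-based CSV column position to the wanted name,
# following the question: col1 -> Name, col2 -> unique_number, col5 -> state.
WANTED = {1: "Name", 2: "unique_number", 5: "state"}


def select_exprs(wanted):
    """Build selectExpr strings such as '_c1 AS Name', ordered by position."""
    return [f"_c{pos} AS {name}" for pos, name in sorted(wanted.items())]


def load_selected(spark, path, wanted=WANTED):
    """Read a headerless CSV and keep only the mapped columns.

    `spark` is an existing SparkSession. Without a header or a schema,
    Spark names the columns _c0, _c1, ... by position.
    """
    df = spark.read.option("header", "false").csv(path)
    return df.selectExpr(*select_exprs(wanted))


print(select_exprs(WANTED))
```

Without a schema or inferSchema every column is read as a string, so any numeric column would still need a cast after the select.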

Trouble when writing the data to Delta Lake in Azure databricks (Incompatible format detected)

I need to read a dataset into a DataFrame and then write the data to Delta Lake, but I get the following exception:
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
Here is the code preceding the exception:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
inputSchema = StructType([
StructField("InvoiceNo", IntegerType(), True),
StructField("StockCode", StringType(), True),
StructField("Description", StringType(), True),
StructField("Quantity", IntegerType(), True),
StructField("InvoiceDate", StringType(), True),
StructField("UnitPrice", DoubleType(), True),
StructField("CustomerID", IntegerType(), True),
StructField("Country", StringType(), True)
])
rawDataDF = (spark.read
.option("header", "true")
.schema(inputSchema)
.csv(inputPath)
)
# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
This error message is telling you that there is already data at the destination path (in this case dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that the data is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which, based on the comments above, you seem to have done) or delete that directory and try again.
I found this question by searching for: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case someone else searches for the same thing:
For me the solution was to explicitly write
.write.format("parquet")
because
.format("delta")
has been the default since Databricks Runtime 8.0, and I need "parquet" for legacy reasons.
You can also get this error if you try to read the data in a format that spark.read does not support (or if you do not specify the format).
The file format should be one of the supported formats: csv, text, json, parquet, or avro.
dataframe = spark.read.format('csv').load(path)

Spark read parquet with custom schema

I'm trying to import Parquet data with a custom schema, but it returns:
TypeError: option() missing 1 required positional argument: 'value'
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, FloatType)
ProductCustomSchema = StructType([
    StructField("id_sku", IntegerType(), True),
    StructField("flag_piece", StringType(), True),
    StructField("flag_weight", StringType(), True),
    StructField("ds_sku", StringType(), True),
    StructField("qty_pack", FloatType(), True)])
def read_parquet_(path, schema):
    return spark.read.format("parquet")\
        .option(schema)\
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")\
        .load(path)
product_nomenclature = 'C:/Users/alexa/Downloads/product_nomenc'
product_nom = read_parquet_(product_nomenclature, ProductCustomSchema)
As mentioned in the comments, you should change .option(schema) to .schema(schema). option() requires a key (the name of the option you're setting) and a value (what you want to assign to that option). You are getting the TypeError because you passed only a variable called schema to option(), without naming the option you were trying to set with it.
The QueryExecutionException you posted in the comments is being raised because the schema you've defined in your schema variable does not match the data in your DataFrame. If you're going to specify a custom schema you must make sure that schema matches the data you are reading. In your example the column id_sku is stored as a BinaryType, but in your schema you're defining the column as an IntegerType. pyspark will not try to reconcile differences between the schema you provide and what the actual types are in the data and an exception will be thrown.
To fix your error make sure the schema you're defining correctly represents your data as it is stored in the parquet file (i.e. change the datatype of id_sku in your schema to be BinaryType). The benefit to doing this is you get a slight performance gain by not having to infer the file schema each time the parquet file is read.
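A corrected reader along these lines might look like the sketch below. It is written under assumptions: the schema is given as a DDL-style string (which DataFrameReader.schema accepts in addition to a StructType), id_sku is declared BINARY only because this answer reports that type, spark is an existing SparkSession passed in explicitly, and the timestampFormat option from the question is left out of the sketch.

```python
# DDL-style schema string; id_sku is BINARY here to match what the answer
# reports for the file. Adjust it to whatever the file really contains.
PRODUCT_SCHEMA = (
    "id_sku BINARY, flag_piece STRING, flag_weight STRING, "
    "ds_sku STRING, qty_pack FLOAT"
)


def read_parquet(spark, path, schema=PRODUCT_SCHEMA):
    """Read a Parquet dataset with an explicit schema.

    `spark` is an existing SparkSession. Note .schema(...) rather than
    .option(...): option() always takes a key and a value.
    """
    return spark.read.format("parquet").schema(schema).load(path)
```

A call would then be read_parquet(spark, product_nomenclature), reusing the path variable from the question.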

Specify pyspark dataframe schema with string longer than 256

I'm reading from a source that has descriptions longer than 256 characters. I want to write them to Redshift.
According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns this is only possible in Scala.
According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691
a workaround should be to specify the schema when creating the DataFrame, but I'm not able to get it to work.
How can I specify the schema with varchar(max)?
df = ...from source
schema = StructType([
StructField('field1', StringType(), True),
StructField('description', StringType(), True)
])
df = sqlContext.createDataFrame(df.rdd, schema)
Redshift maxlength annotations are passed in the format
{"maxlength":2048}
so this is the structure you should pass to StructField constructor:
from pyspark.sql.types import StructField, StringType
StructField("description", StringType(), metadata={"maxlength":2048})
or alias method:
from pyspark.sql.functions import col
col("description").alias("description", metadata={"maxlength":2048})
If you use PySpark 2.2 or earlier please check How to change column metadata in pyspark? for workaround.

Can you have a column of dataframes in pyspark?

I am a little new to pyspark/bigdata so this could be a bad idea, but I have about a million individual CSV files each associated with some metadata. I would like a pyspark dataframe with columns for all the metadata fields, but also with a column whose entries are the (whole) CSV files associated with each set of metadata.
I am not at work right now but I remember almost the exact code. I have tried a toy example something like
outer_pandas_df = pd.DataFrame.from_dict({"A":[1,2,3],"B":[4,5,6]})
## A B
## 0 1 4
## 1 2 5
## 2 3 6
And then if you do
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True)
])
outer_spark_df = sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
Then the result is a spark dataframe as expected. But now if you do
inner_pandas_df = pd.DataFrame.from_dict({"W":["X","Y","Z"]})
outer_pandas_df["C"] = [inner_pandas_df, inner_pandas_df, inner_pandas_df]
And make the schema like
inner_schema = StructType([
StructField("W", StringType(), True)
])
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True),
StructField("W", ArrayType(inner_schema), True)
])
then this fails:
sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
with an error related to ArrayType not accepting pandas dataframes. I don't have the exact error.
Is what I'm trying to do possible?
Spark does not support nested dataframes. Why do you want a column that contains the entire CSV to be constantly stored in memory, anyway? It seems to me that if you need that, you are not successfully extracting the data into the other columns.
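If the CSV contents really do need to travel with the metadata, one possible alternative (not something proposed in the answer above) is to store each file's rows as an array of structs instead of a nested DataFrame. The sketch below is a toy-scale illustration under assumptions: the helper names are hypothetical, the schema is a DDL string mirroring the question's A, B and W columns, rows are built on the driver, and spark is an existing SparkSession.

```python
import csv
import io

# DDL schema: metadata columns A and B, plus the CSV rows as an array of structs.
OUTER_SCHEMA = "A INT, B INT, C ARRAY<STRUCT<W: STRING>>"


def csv_to_records(csv_text):
    """Parse CSV text (with a header row) into a list of dicts, one per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def build_rows(metadata, csv_texts):
    """Pair each (A, B) metadata tuple with the parsed rows of its CSV."""
    return [(a, b, csv_to_records(text))
            for (a, b), text in zip(metadata, csv_texts)]


def to_spark(spark, rows):
    """Create the DataFrame; `spark` is an existing SparkSession."""
    return spark.createDataFrame(rows, schema=OUTER_SCHEMA)


rows = build_rows([(1, 4), (2, 5)], ["W\nX\nY\nZ\n"] * 2)
print(rows[0])
```

For a million files, building the rows on the driver like this is unlikely to scale; the sketch only shows the shape of the data that an ArrayType column expects.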
