In the code below I fetch the schema for a table to be created from the Confluent Schema Registry:
private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata("topicName").getSchema
private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
  .dataType.asInstanceOf[StructType]
Now I'm trying to define a Delta Lake table whose structure is based on this schema, but I'm not sure how to go about it. Any help is appreciated.
In Scala you can use the following. For defining a schema:
val customSchema = StructType(
  Array(
    StructField("col1", StringType, true),
    StructField("col2", StringType, true),
    StructField("col3", StringType, true)
  )
)
For reading the table with that schema:
val df = spark.read.format("csv")
  .option("delimiter", "\t") // use a proper delimiter
  .schema(customSchema)
  .load("path")
When writing the table to a particular location, specify .format("delta") to store it as a Delta table.
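Putting the pieces together, here is a minimal sketch of defining a Delta table from a StructType like the one derived above (assuming an active SparkSession named spark with the delta-spark dependency configured; the schema and path are illustrative placeholders):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Illustrative schema standing in for the one converted from the Avro schema
val sparkSchema = StructType(Array(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

// Create an empty DataFrame with that schema and write it in Delta format,
// which lays down the table structure at the target location
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], sparkSchema)
emptyDF.write.format("delta").mode("overwrite").save("/tmp/delta/topic_table")
```

If you also want the table in the metastore, you can register the location afterwards with a CREATE TABLE ... USING DELTA LOCATION statement.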
I am trying to load csv file to a Spark dataframe. The csv file doesn't have any header as such, but I am aware which field corresponds to what.
The problem is my csv has almost 35-odd fields, but I am interested in only a limited set of columns. Is there a way I can load just the selected columns and map them to the corresponding fields defined in my schema?
Let's say we have following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala my code is something like this:
val sample_schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("unique_number", StringType, nullable = false),
  StructField("state", StringType, nullable = false)
))

val blogsDF = sparkSession.read.schema(sample_schema)
  .option("header", true)
  .csv(file_path)
This will load the data into a dataframe, but it will not be in the order I want.
What I want is for each CSV record to be split and the data loaded as per this underlying mapping:
col1 --> Name
col2 --> unique id
col5 --> state
I'm not sure if we can do this kind of operation before loading the data into a DataFrame. I know another approach: load the data into one dataframe, then select a few columns and create another dataframe. I just want to check whether we can map during the data load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit
Have you tried this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, LongType

schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", IntegerType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
    StructField("e", DoubleType(), True),
    StructField("f", LongType(), True),
])

df = spark.read.csv('blablabla', schema=schema)
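Note that the schema is applied positionally to the leading columns, so with ~35 fields you would still need a placeholder for every column before the last one you care about. One workaround sketch in Scala (assuming an active SparkSession; the positions follow the mapping in the question and are illustrative) is to read with Spark's default positional names, then select and rename only the fields you need:

```scala
import org.apache.spark.sql.functions.col

// Without a header, Spark names the columns _c0, _c1, _c2, ... by position.
// Select only the columns of interest and rename them per the desired mapping.
val selectedDF = spark.read
  .option("header", "false")
  .csv("path/to/file.csv")
  .select(
    col("_c0").as("Name"),          // col1 --> Name
    col("_c1").as("unique_number"), // col2 --> unique id
    col("_c4").as("state")          // col5 --> state
  )
```

The selection still happens after parsing, but it is part of the same lazy read plan, and Spark's CSV reader can prune the unused columns before any action materialises the data.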
I'm reading a source that has descriptions longer than 256 characters. I want to write them to Redshift.
According to: https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns it is only possible in Scala.
According to this: https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691
there is supposedly a workaround: specify the schema when creating the dataframe. However, I'm not able to get it to work.
How can I specify the schema with varchar(max)?
df = ...  # from source
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])
df = sqlContext.createDataFrame(df.rdd, schema)
Redshift maxlength annotations are passed in format
{"maxlength":2048}
so this is the structure you should pass to StructField constructor:
from pyspark.sql.types import StructField, StringType
StructField("description", StringType(), metadata={"maxlength":2048})
or, using the alias method:
from pyspark.sql.functions import col
col("description").alias("description", metadata={"maxlength":2048})
If you use PySpark 2.2 or earlier please check How to change column metadata in pyspark? for workaround.
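The same metadata can be attached on the Scala side via MetadataBuilder; a small sketch (field names illustrative):

```scala
import org.apache.spark.sql.types._

// Attach the Redshift maxlength annotation as column metadata
val meta = new MetadataBuilder().putLong("maxlength", 2048).build()

val schema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("description", StringType, nullable = true, metadata = meta)
))
```

A DataFrame created with this schema carries the annotation through to the Redshift writer, which maps the column to VARCHAR(2048) instead of the default.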
I am new to Spark. I have an Excel file that I need to read into a DataFrame. I am using the com.crealytics.spark.excel library to achieve this. The following is my code:
val df = hiveContext.read.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.load("file:///data/home/items.xlsx")
The above code runs without any error. And I am also able to count the number of rows in the df using df.count. But when I try to print the df using df.show, it throws an error saying:
java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
I am using Spark 1.6, Java 1.8 and Scala 2.10.5.
I am not sure why this is happening. How do I solve this error and look at the data in the df?
UPDATE:
I also tried using a StructType to define the schema and impose it while loading the data into the df:
val newschema = StructType(List(
  StructField("1", StringType, nullable = true),
  StructField("2", StringType, nullable = true),
  StructField("3", StringType, nullable = true),
  StructField("4", StringType, nullable = true),
  StructField("5", StringType, nullable = true),
  StructField("6", StringType, nullable = true),
  StructField("7", StringType, nullable = true),
  StructField("8", StringType, nullable = true),
  StructField("9", StringType, nullable = true),
  StructField("10", StringType, nullable = true)
))
val df = hiveContext.read.schema(newschema).format("com.crealytics.spark.excel")...
This doesn't help and I get the same error as before when I try to display the df.
UPDATE-2:
I also tried loading the df using SQLContext. It still gives me the same error.
Any help would be appreciated. Thank you.
So, apparently, com.crealytics.spark.excel works with Spark version 2.0 and above. Updating my dependencies and running the jar with Spark 2.0 gives the expected result without any errors.
I hope this helps somebody in the future.
I am building a schema for the dataset below from a Hive table.
After processing I have to write the data to S3.
I need to restructure and group the user-id interactions by date; the attached JSON image shows the format to be produced.
For building this schema I have prepared a struct type with an array:
fields = [
    StructField("expUserId", StringType(), True),
    StructField("recordDate", StringType(), True),
    StructField("siteId", StringType(), True),
    StructField("siteName", StringType(), True),
    StructField("itineraryNumber", StringType(), True),
    StructField("travelStartDate", StringType(), True),
    StructField("travelEndDate", StringType(), True),
    StructField("destinationID", StringType(), True),
    StructField("lineOfBusiness", StringType(), True),
    StructField("pageViewMap", MapType(StringType(), ArrayType(StructType([
        StructField("PageId", StringType(), True),
        StructField("count", LongType(), True)
    ]))), True)
]
schema = StructType(fields)
return schema
Is this schema correct? How do I convert the DataFrame to the JSON schema type below?
Why wouldn't you just use a SparkSession to read in the JSON, then use schema to show the interpreted structure?
spark.read.json(inputPath).schema
If your dataset is in Hive, read it using a JDBC or Hive integration layer (see Hive Tables or JDBC To Other Databases in the official documentation of Spark).
It is as simple as spark.read.format("jdbc")...load() or spark.read.table respectively (see DataFrameReader API in the official documentation).
What's nice about this approach is that Spark can automatically infer the schema for you (so you can leave that out and have more time for yourself!)
Once the dataset is in your hands as a DataFrame or Dataset, you can save it to S3 in JSON format as follows:
inventoryDF.write.format("json").save("s3n://...")
See JSON Datasets and DataFrameWriter API in the official documentation.
I strongly recommend letting Spark do the hard work so you don't have to.
You can create a new dataframe from JSON with your own defined schema.
val myManualSchema = new StructType(Array(
  new StructField("column1", StringType, true),
  new StructField("column2", LongType, false)
))

val myDf = spark.read.format("json")
  .schema(myManualSchema)
  .load("/x/y/zddd.json")
A dataframe can also be created without specifying the schema manually; Spark will infer the schema by evaluating the input file.
val df = spark.read.format("json").load("/x/y/zddd.json")
Read the schema from the JSON using the command below:
val SchJson = spark.read.format("json").load("/x/y/zddd.json").schema
I use something like this to insert into a table in Spark Cassandra. As you can see, all the columns are hard-coded; is there a good way of handling this dynamically?
val logSchema = StructType(Array(
  StructField("tablename", StringType, true),
  StructField("filename", StringType, true),
  StructField("number_of_rows", StringType, true),
  StructField("loadtime", StringType, true),
  StructField("statusdetail", StringType, true)
))
You can always insert an RDD of CassandraRow objects via saveToCassandra; these don't need an explicit schema.
Something like:
// saveToCassandra is called on the RDD, not on individual rows
val rdd: RDD[Map[String, Any]] = ???
rdd.map(CassandraRow.fromMap).saveToCassandra("keyspace", "table")
http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.CassandraRow