I am new to Spark. I have an Excel file that I need to read into a DataFrame. I am using the com.crealytics.spark.excel library to do this. The following is my code:
val df = hiveContext.read.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.load("file:///data/home/items.xlsx")
The above code runs without any error, and I am also able to count the number of rows in the df using df.count. But when I try to print the df using df.show, it throws this error:
java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
I am using Spark 1.6, Java 1.8 and Scala 2.10.5.
I am not sure why this is happening. How do I solve this error and look at the data in the df?
UPDATE:
I also tried using a StructType to define the schema and apply it while loading the data into the df:
val newschema = StructType(List(StructField("1", StringType, nullable = true),
StructField("2", StringType, nullable = true),
StructField("3", StringType, nullable = true),
StructField("4", StringType, nullable = true),
StructField("5", StringType, nullable = true),
StructField("6", StringType, nullable = true),
StructField("7", StringType, nullable = true),
StructField("8", StringType, nullable = true),
StructField("9", StringType, nullable = true),
StructField("10", StringType, nullable = true)))
val df = hiveContext.read.schema(newschema).format("com.crealytics.spark.excel")...
This doesn't help and I get the same error as before when I try to display the df.
UPDATE-2:
I also tried loading the df using SQLContext. It still gives me the same error.
Any help would be appreciated. Thank you.
So, apparently, com.crealytics.spark.excel works with Spark 2.0 and above. Updating my dependencies and running the jar with Spark 2.0 gives the expected result without any errors.
I hope this helps somebody in the future.
I am trying to load a CSV file into a Spark DataFrame. The CSV file doesn't have a header, but I know which field corresponds to what.
The problem is that my CSV has around 35 fields but I am only interested in a few of them. Is there a way to load only the selected columns and map them to the corresponding fields defined in my schema?
Let's say we have following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala, my code is something like this:
val sample_schema = StructType(Array(StructField("Name", StringType, nullable = false),
StructField("unique_number", StringType, nullable = false),
StructField("state", StringType, nullable = false)))
val blogsDF = sparkSession.read.schema(sample_schema)
.option("header", false) // the file has no header row
.csv(file_path)
This will load the data into a DataFrame, but it will not be in the order I want.
What I want is for each CSV record to be split and the data loaded according to the following mapping:
col1 --> Name
col2 --> unique id
col5 --> state
I am not sure whether this kind of operation is possible before the data is loaded into the DataFrame. I know another approach: load the data into one DataFrame, then select a few columns and create another DataFrame. I just want to check whether the mapping can be done during the load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit
Have you tried this?
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, LongType
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", IntegerType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
    StructField("e", DoubleType(), True),
    StructField("f", LongType(), True),
])
# 'blablabla' stands in for the path to the CSV file
df = spark.read.csv('blablabla', schema=schema)
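One possible way to get the positional mapping from the question is to read the file without a header, so that Spark names the columns `_c0`, `_c1`, and so on, and then project only the wanted positions under new names. The sketch below only builds the projection strings and does not need Spark to run. `build_select_exprs` and `column_mapping` are hypothetical names, and the zero-based positions 1, 2 and 5 are an assumption about what "col1", "col2" and "col5" mean in the sample data.

```python
def build_select_exprs(column_mapping):
    """Turn {position: new_name} into Spark SQL projection strings.

    Assumes the CSV was read without a header, so the columns carry
    Spark's default positional names _c0, _c1, ...
    """
    return [f"_c{pos} AS {name}" for pos, name in sorted(column_mapping.items())]


# Hypothetical mapping: zero-based CSV position -> target field name
column_mapping = {1: "Name", 2: "unique_number", 5: "state"}
exprs = build_select_exprs(column_mapping)
print(exprs)
```

With a DataFrame obtained from `spark.read.csv(file_path)`, the list could then be passed as `df.selectExpr(*exprs)`. This is still a load followed by a select, only written as one chain.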
I fetch the schema for a table I want to create from the Confluent Schema Registry with the code below:
private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata("topicName").getSchema
// toSqlType returns a SchemaType wrapper; its dataType field holds the StructType
private val sparkSchema: StructType = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema)).dataType.asInstanceOf[StructType]
Now I'm trying to define a delta lake table which has the structure that is based on this schema.
However I'm not sure how to go about the same.
Any help appreciated.
In Scala you can use the following.
To define the schema:
val customSchema =
StructType(
Array(
StructField("col1", StringType, true),
StructField("col2", StringType, true),
StructField("col3", StringType, true)
)
)
To read the data using that schema:
val DF =
spark.read.format("csv")
.option("delimiter","\t") //use a proper delimiter
.schema(customSchema)
.load("path")
while writing the table to a particular location you can specify the .format("delta") to
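If the goal is an empty Delta table whose columns follow an existing schema, one option is to generate a `CREATE TABLE ... USING DELTA` statement from the field list. The sketch below is only a string builder under that assumption. `create_delta_table_sql`, the table name `events` and the field list are hypothetical, and the field types must already be valid Spark SQL type names.

```python
def create_delta_table_sql(table_name, fields):
    """Build a CREATE TABLE statement for a Delta table.

    fields is a list of (column_name, spark_sql_type) pairs.
    """
    columns = ", ".join(f"{name} {dtype}" for name, dtype in fields)
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({columns}) USING DELTA"


# Hypothetical table name and fields, mirroring the schema defined above
fields = [("col1", "STRING"), ("col2", "STRING"), ("col3", "STRING")]
statement = create_delta_table_sql("events", fields)
print(statement)
```

The resulting string could be run with `spark.sql(statement)`. In PySpark, the pairs can be derived from a StructType with `[(f.name, f.dataType.simpleString()) for f in schema.fields]`.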
I need to read a dataset into a DataFrame, then write the data to Delta Lake. But I get the following exception:
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
Here is the code preceding the exception:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
inputSchema = StructType([
StructField("InvoiceNo", IntegerType(), True),
StructField("StockCode", StringType(), True),
StructField("Description", StringType(), True),
StructField("Quantity", IntegerType(), True),
StructField("InvoiceDate", StringType(), True),
StructField("UnitPrice", DoubleType(), True),
StructField("CustomerID", IntegerType(), True),
StructField("Country", StringType(), True)
])
rawDataDF = (spark.read
.option("header", "true")
.schema(inputSchema)
.csv(inputPath)
)
# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
This error message is telling you that there is already data at the destination path (in this case dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that that data is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which based on the comments above, it seems like you did) or delete that directory and try again.
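A Delta table keeps its transaction log in a `_delta_log` directory under the table's base path, so the situation described above can be checked before writing. The sketch below is a rough check under the assumption that the target is reachable as a local filesystem path (on Databricks, DBFS locations are commonly exposed under `/dbfs`, but treat that as an assumption for your environment). `classify_target` is a hypothetical helper name.

```python
import os
import tempfile


def classify_target(path):
    """Classify a directory as 'missing', 'empty', 'delta' or 'non-delta'."""
    if not os.path.isdir(path):
        return "missing"
    # A Delta table stores its transaction log in <path>/_delta_log
    if os.path.isdir(os.path.join(path, "_delta_log")):
        return "delta"
    return "non-delta" if os.listdir(path) else "empty"


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    print(classify_target(tmp))  # empty
    open(os.path.join(tmp, "part-0000.parquet"), "w").close()
    print(classify_target(tmp))  # non-delta: data present, no transaction log
    os.mkdir(os.path.join(tmp, "_delta_log"))
    print(classify_target(tmp))  # delta
```

A result of "non-delta" corresponds to the case in the error message: data already exists at the path but there is no transaction log, so either pick a new path or clear the directory first.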
I found this question by searching for: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case someone else searches for the same thing:
For me, the solution was to explicitly write
.write.format("parquet")
because
.format("delta")
is the default since Databricks Runtime 8.0, and I need "parquet" for legacy reasons.
This error can also occur when reading data in a format that spark.read does not support, or when no format is specified.
Specify one of the supported formats, such as csv, text, json, parquet or avro:
dataframe = spark.read.format('csv').load(path)
I use something like this to insert into a Cassandra table from Spark. As you can see, all the columns are hard-coded. Is there a good way of handling this dynamically?
val logSchema = StructType(Array(
  StructField("tablename", StringType, true),
  StructField("filename", StringType, true),
  StructField("number_of_rows", StringType, true),
  StructField("loadtime", StringType, true),
  StructField("statusdetail", StringType, true)))
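You can always use saveToCassandra to insert an RDD of CassandraRow objects, which don't need an explicit schema.
Something like:
import com.datastax.spark.connector._
// rdd: RDD[Map[String, Any]], one map of column name -> value per row
// "my_keyspace" and "my_table" are placeholder names
rdd.map(row => CassandraRow.fromMap(row)).saveToCassandra("my_keyspace", "my_table")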
http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.CassandraRow
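If an explicit schema is still wanted, it does not have to be hard-coded: it can be generated from a list of column names. The sketch below builds the JSON form of a struct schema in which every column is a string, as in the question. `string_schema_json` is a hypothetical helper name, and treating every column as a string is an assumption carried over from the question's schema.

```python
import json


def string_schema_json(column_names, nullable=True):
    """Build the JSON form of a struct schema with one string field per name."""
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": "string", "nullable": nullable, "metadata": {}}
            for name in column_names
        ],
    }


# Column names taken from the hard-coded schema in the question
columns = ["tablename", "filename", "number_of_rows", "loadtime", "statusdetail"]
schema_json = string_schema_json(columns)
print(json.dumps(schema_json, indent=2))
```

In PySpark, `StructType.fromJson(schema_json)` turns this dictionary into a schema object. In Scala the same idea is shorter, since the list of names can be mapped directly to StructField values inside a StructType.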
I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation, one can set up a Spark DataFrame and schema like this:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, TimestampType,
                               StructType, StructField)
spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])
# create the dataframe by reading the CSV with the explicit schema
df3 = spark.read.csv('./some csv_to_play_around.csv', schema=schema)
My question is, what does the True stand for in the schema list above? I can't seem to find it in the documentation. Thanks in advance
It indicates whether the column allows null values: True means nullable, False means not nullable.
StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this fields can have null values.
Refer to the Spark SQL and DataFrame Guide for more information.
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but the docs do mention them. They're much more compact and readable than StructTypes.
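A datatype string like the one above can also be assembled from a list of fields, which helps when the column list lives in configuration. This is a minimal sketch, and `ddl_schema` is a hypothetical helper name.

```python
def ddl_schema(fields):
    """Join (name, type) pairs into a datatype string such as 'a INT, b STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


fields = [("Name", "STRING"), ("DateTime", "TIMESTAMP"), ("Age", "INTEGER")]
schema = ddl_schema(fields)
print(schema)  # Name STRING, DateTime TIMESTAMP, Age INTEGER
```

The resulting string can be used where PySpark accepts a schema, for example `spark.read.csv(path, schema=schema)`.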