Why can't Spark properly load columns from HDFS? [duplicate] - apache-spark

This question already has answers here:
What is going wrong with `unionAll` of Spark `DataFrame`?
(5 answers)
Closed 4 years ago.
Below are my schema and the code I use to read partitioned data from HDFS.
An example of a partition could be this path: /home/maria_dev/data/key=key/date=19 jan (and of course inside this folder there is a CSV file that contains cnt).
So, the data I have is partitioned by the key and date columns.
When I read it as shown below, the columns are not mapped properly: cnt gets read into date and vice versa.
How can I resolve this?
private val tweetSchema = new StructType(Array(
  StructField("date", StringType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("cnt", IntegerType, nullable = true)
))
// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
  val df = spark.read
    .schema(tweetSchema)
    .format(format)
    .option("basePath", basePath)
    .load(path)
  df
}
I tried changing their order in the schema from (date, key, cnt) to (cnt, key, date) but it does not help.
My problem is that when I call union, it appends 2 dataframes:
df1: {(key: 1, date: 2)}
df2: {(date: 3, key: 4)}
into the final dataframe like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns are messed up.

The schema should list the fields in the following order:
Columns present in the data files themselves - for CSV, in their natural left-to-right order.
Columns used for partitioning, in the same order as they appear in the directory structure.
So in your case the correct order will be:
new StructType(Array(
  StructField("cnt", IntegerType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("date", StringType, nullable = true)
))

It turns out that everything was being read properly.
So now, instead of doing df1.union(df2), I do df1.select("key", "date").union(df2.select("key", "date")) and it works.
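For reference, if you are on Spark 2.3 or later, unionByName resolves columns by name instead of by position, which avoids the extra select calls. A minimal sketch, assuming both dataframes expose the same column names:
// union() pairs columns purely by position; unionByName() (Spark 2.3+) pairs them by name,
// so a different column order in df1 and df2 no longer matters.
val combined = df1.unionByName(df2)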

Related

Loading selected column from csv file to dataframe in Spark

I am trying to load a CSV file into a Spark dataframe. The CSV file doesn't have any header as such, but I know which field corresponds to what.
The problem is that my CSV has almost 35-odd fields, but I am only interested in a very limited set of columns. Is there a way to load just the selected columns and map them to the corresponding fields as defined in my schema?
Let's say we have following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala my code is something like this:
val sample_schema = StructType(Array(StructField("Name", StringType, nullable = false),
  StructField("unique_number", StringType, nullable = false),
  StructField("state", StringType, nullable = false)))
val blogsDF = sparkSession.read.schema(sample_schema)
  .option("header", true)
  .csv(file_path)
This will load the data into a dataframe, but it will not be in the order I want.
What I want is for each CSV record to be split and the data loaded as per the underlying mapping:
col1 --> Name
col2 --> unique id
col5 --> state
I'm not sure whether we can do this kind of operation before loading the data into a DataFrame. I know another approach where we load the data into one dataframe and then select a few columns to create another dataframe; I just want to check whether we can do the mapping during the data load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit
Have you tried this:
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType, LongType)

schema = StructType([StructField("a", IntegerType(), True),
                     StructField("b", IntegerType(), True),
                     StructField("c", StringType(), True),
                     StructField("d", StringType(), True),
                     StructField("e", DoubleType(), True),
                     StructField("f", LongType(), True),
                     ])
df = spark.read.csv('blablabla', schema=schema)
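Coming back to the original question of keeping only a few of the ~35 columns: one common approach is to declare the full schema in the file's left-to-right order, load, and then select just the columns you need. Below is a minimal Scala sketch under that assumption; the placeholder fields skip3, skip4 and skip6 are illustrative names for columns you don't care about, not from the original post.
import org.apache.spark.sql.types._

// A user-supplied CSV schema is applied positionally (first field to first column,
// and so on), so list a placeholder for every field in the file.
val fullSchema = StructType(Array(
  StructField("Name", StringType, nullable = false),          // col1
  StructField("unique_number", StringType, nullable = false), // col2
  StructField("skip3", StringType, nullable = true),
  StructField("skip4", StringType, nullable = true),
  StructField("state", StringType, nullable = false),         // col5
  StructField("skip6", StringType, nullable = true)))

val selectedDF = sparkSession.read
  .schema(fullSchema)
  .csv(file_path)
  .select("Name", "unique_number", "state")  // keep only the mapped columns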

Unable to infer schema for CSV in pyspark

I'm using databricks and trying to read in a csv file like this:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_my_file)
)
and I'm getting the error:
AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'
I've checked that my file is not empty, and I've also tried to specify schema myself like this:
schema = "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, a INT"
df = (spark.read
.option("header", "true")
.schema(schema)
.csv(path_to_my_file)
)
But when I try to see it using display(df), it just gives me the output below; I'm totally lost and don't know what to do.
df.show() and df.printSchema() give the following (screenshots not reproduced here; per the answer below, printSchema() reported every column as datatype string):
It looks like the data is not being read into the dataframe.
Note, this is an incomplete answer, as there isn't enough information about what your file looks like to understand why inferSchema did not work. I've posted this as an answer because it is too long for a comment.
That said, to specify a schema programmatically you would use StructType().
Using your example of
datetime timestamp, id STRING, zone_id STRING, name INT, time INT, mod_a INT
it would look something like this:
# Import data types
from pyspark.sql.types import *
schema = StructType(
[StructField('datetime', TimestampType(), True),
StructField('id', StringType(), True),
StructField('zone_id', StringType(), True),
StructField('name', IntegerType(), True),
StructField('time', IntegerType(), True),
StructField('mod_a', IntegerType(), True)
]
)
Note how df.printSchema() showed that all of the columns were of datatype string.
I discovered that the problem was caused by the filename.
Spark (via the underlying Hadoop input formats) treats files whose names begin with an underscore or a dot as hidden metadata files and skips them, which is presumably why Databricks could not read a file starting with '_' (underscore).
I had the same problem, and when I uploaded the file without the first character (i.e. the underscore), I was able to process it.
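If renaming in place is easier than re-uploading, and assuming the file lives on DBFS, the Databricks dbutils helper can move it to a name without the leading underscore (the path below is illustrative):
// Files whose names start with '_' or '.' are treated as hidden by Spark's file readers.
dbutils.fs.mv("dbfs:/FileStore/tables/_myfile.csv", "dbfs:/FileStore/tables/myfile.csv")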

HBase + Spark : Dataframe put not replacing existing column values with null values for same RowKey

We are persisting a Spark dataframe into HBase.
We are facing an issue when overwriting data in HBase: a column in the updated row has a null value where it previously had a non-null one.
The issue we are facing is as below:
First we insert a dataframe into HBase like this:
val rowsList = Seq(Row("Acct1", "100", "1"), Row("Acct2", "200", "2")).asJava
val schema: StructType =
StructType(List(StructField("a", StringType, true),
StructField("b", StringType, true),
StructField("c", StringType, true)))
val df: DataFrame = sparkSession.createDataFrame(rowsList, schema)
Then we put this dataframe into HBase, which works as expected.
When we are overwriting an existing rowKey as below:
val rowsList = Seq(Row("Acct2", null, "3")).asJava
val df: DataFrame = sparkSession.createDataFrame(rowsList, schema)
Then the value of column 'c' gets changed from '2' to '3'.
But this row still keeps column 'b' with the value '200'.
How to resolve this issue?
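No answer was posted here, but the behaviour follows from HBase semantics: a Put only writes the cells it actually contains and never removes existing cells, so a null column in the dataframe simply leaves the old value in place. One possible workaround, sketched below with the plain HBase client API rather than whichever connector performs the bulk write (the table name "my_table", the column family "cf" and the row-key handling are illustrative assumptions), is to issue an explicit Delete for the columns that are null:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.Row

df.foreachPartition { (rows: Iterator[Row]) =>
  // One connection per partition; executors cannot reuse a driver-side connection.
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("my_table"))
  val cf = Bytes.toBytes("cf")
  rows.foreach { row =>
    val rowKey = Bytes.toBytes(row.getString(0))        // column "a" is used as the row key
    val put = new Put(rowKey)
    val delete = new Delete(rowKey)
    Seq("b", "c").foreach { col =>
      val i = row.fieldIndex(col)
      if (row.isNullAt(i)) delete.addColumns(cf, Bytes.toBytes(col))   // null -> drop the old cell
      else put.addColumn(cf, Bytes.toBytes(col), Bytes.toBytes(row.getString(i)))
    }
    if (!put.isEmpty) table.put(put)
    if (!delete.isEmpty) table.delete(delete)
  }
  table.close()
  connection.close()
}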

withColumn in spark dataframe inserts NULL in SaveMode.Append

I have a Spark application that creates a Hive external table, which works fine the first time, i.e. while creating the table in Hive with partitions. I have three partition columns, namely event, CenterCode and ExamDate.
var sqlContext = spark.sqlContext
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
import org.apache.spark.sql.functions._
val candidateList = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "null")
  .option("quote", "\"")
  .option("dateFormat", "dd/MM/yyyy")
  .schema(StructType(Array(
    StructField("RollNo/SeatNo", StringType, true),
    StructField("LabName", StringType, true),
    StructField("Student_Name", StringType, true),
    StructField("ExamName", StringType, true),
    StructField("ExamDate", DateType, true),
    StructField("ExamTime", StringType, true),
    StructField("CenterCode", StringType, true),
    StructField("Center", StringType, true))))
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .load(filePath(0))
val nef = candidateList.withColumn("event", lit(eventsId))
The partition column event will not be present in the input CSV file, so I'm adding that column to the dataframe candidateList using withColumn("event", lit(eventsId)).
When I write it to the Hive table the first time, it works fine: the withColumn-added event column gets the value, say "ABCD", and the partitions are created as expected.
nef.repartition(1).write.mode(SaveMode.Overwrite).option("path", candidatePath).partitionBy("event", "CenterCode", "ExamDate").saveAsTable("sify_cvs_output.candidatelist")
candidateList.show() gives:
+-------------+--------------------+-------------------+----------+----------+--------+----------+--------------------+-----+
|RollNo/SeatNo| LabName| Student_Name| ExamName| ExamDate|ExamTime|CenterCode| Center|event|
+-------------+--------------------+-------------------+----------+----------+--------+----------+--------------------+-----+
| 80000077|BUILDING-MAIN FLO...| ABBAS MOHAMMAD|PGECETICET|2018-07-30|10:00 AM| 500098A|500098A-SURYA TEC...| ABCD|
| 80000056|BUILDING-MAIN FLO...| ABDUL YASARARFATH|PGECETICET|2018-07-30|10:00 AM| 500098A|500098A-SURYA TEC...| ABCD|
But the second time, when I try to append data to the already-created Hive table with a new event "EFGH", the column added using withColumn gets inserted as NULL:
nef.write.mode(SaveMode.Append).insertInto("sify_cvs_output.candidatelist")
The partitions also don't come out properly, as one of the partition columns becomes NULL. So I tried adding one more new column to the dataframe, .withColumn("sample", lit("sample")). Again, the first time it writes all the extra added columns to the table, but the next time, on SaveMode.Append, both the event column and the sample column get inserted into the table as NULL.
The SHOW CREATE TABLE output is below:
CREATE EXTERNAL TABLE `candidatelist`(
`rollno/seatno` string,
`labname` string,
`student_name` string,
`examname` string,
`examtime` string,
`center` string,
`sample` string)
PARTITIONED BY (
`event` string,
`centercode` string,
`examdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='hdfs://172.16.2.191:8020/biometric/sify/cvs/output/candidate/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://172.16.2.191:8020/biometric/sify/cvs/output/candidate'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='parquet',
'spark.sql.sources.schema.numPartCols'='3',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"RollNo/SeatNo\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"LabName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Student_Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamTime\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Center\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sample\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"event\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CenterCode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ExamDate\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}',
'spark.sql.sources.schema.partCol.0'='event',
'spark.sql.sources.schema.partCol.1'='CenterCode',
'spark.sql.sources.schema.partCol.2'='ExamDate',
'transient_lastDdlTime'='1536040545')
Time taken: 0.025 seconds, Fetched: 32 row(s)
hive>
What am I doing wrong here..!
UPDATE
#pasha701, below is my sparkSession
val Spark=SparkSession.builder().appName("splitInput").master("local").config("spark.hadoop.fs.defaultFS", "hdfs://" + hdfsIp)
.config("hive.metastore.uris", "thrift://172.16.2.191:9083")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.enableHiveSupport()
.getOrCreate()
and if I add partitionBy to the insertInto write:
nef.write.mode(SaveMode.Append).partitionBy("event", "CenterCode", "ExamDate").option("path", candidatePath).insertInto("sify_cvs_output.candidatelist")
it throws an exception: org.apache.spark.sql.AnalysisException: insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;
Second time "partitionBy" also have to be used. Also maybe option "hive.exec.dynamic.partition.mode" will be required.

Syntax while setting schema for Pyspark.sql using StructType

I am new to spark and was playing around with Pyspark.sql. According to the pyspark.sql documentation here, one can go about setting the Spark dataframe and schema like this:
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.types import (StringType, IntegerType, TimestampType,
                               StructType, StructField)
rdd = sc.textFile('./some csv_to_play_around.csv')
schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])
# create dataframe
df3 = sqlContext.createDataFrame(rdd, schema)
My question is, what does the True stand for in the schema list above? I can't seem to find it in the documentation. Thanks in advance
It indicates whether the column allows null values: true for nullable, false for not nullable.
StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate whether values of this field can be null.
Refer to the Spark SQL and DataFrame Guide for more information.
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but they are mentioned in the docs. They're much more compact and readable than StructTypes.
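For reference, since Spark 2.3 the DataFrameReader also accepts such a DDL string directly (Scala shown; the same works in PySpark, and the file name below is just an example):
// The DDL-style string is parsed into the same StructType under the hood.
val df = spark.read
  .schema("Name STRING, DateTime TIMESTAMP, Age INTEGER")
  .csv("people.csv")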
