Loading selected column from csv file to dataframe in Spark - apache-spark

I am trying to load csv file to a Spark dataframe. The csv file doesn't have any header as such, but I am aware which field corresponds to what.
The problem is my csv has almost 35 odd fields but I am interested in very limited columns so is there a way by which I can load the selected columns and map them to corresponding fields as defined in my schema.
Let's say we have following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala my Code is something like this .
val sample_schema = StructType(Array(StructField("Name", StringType, nullable = false),
StructField("unique_number", StringType, nullable = false),
StructField("state", StringType, nullable = false))
val blogsDF = sparkSession.read.schema(sample_schema)
.option("header", true)
.csv(file_path)
This will load the data into a dataframe, but it will not be in the order I want.
What I want is for csv record to be split and data is loaded as per underlying mapping
col1 --> Name
col2 --> unique id
col5 --> state
Not sure if we can do this kind of operation before loading data into DataFrame. I know another approach wherein we can load the data into one dataframe, and then select few columns and create another dataframe, just want to check if we can map during data load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit

Have you tried it:
schema = StructType([StructField("a", IntegerType(), True),
StructField("b", IntegerType(), True),
StructField("c", StringType(), True),
StructField("d", StringType(), True),
StructField("e", DoubleType(), True),
StructField("f", LongType(), True),
])
df = spark.read.csv('blablabla', schema=schema)

Related

Trouble when writing the data to Delta Lake in Azure databricks (Incompatible format detected)

I need to read dataset into a DataFrame, then write the data to Delta Lake. But I have the following exception :
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
Here is the code preceding the exception :
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
inputSchema = StructType([
StructField("InvoiceNo", IntegerType(), True),
StructField("StockCode", StringType(), True),
StructField("Description", StringType(), True),
StructField("Quantity", IntegerType(), True),
StructField("InvoiceDate", StringType(), True),
StructField("UnitPrice", DoubleType(), True),
StructField("CustomerID", IntegerType(), True),
StructField("Country", StringType(), True)
])
rawDataDF = (spark.read
.option("header", "true")
.schema(inputSchema)
.csv(inputPath)
)
# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
This error message is telling you that there is already data at the destination path (in this case dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that that data is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which based on the comments above, it seems like you did) or delete that directory and try again.
I found this Question with this search: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case someone searches for the same:
For me the solution was to explicitly code
.write.format("parquet")
because
.format("delta")
is the dafault since Databricks Runtime 8.0 and above and I need "parquet" for legacy reasons.
One can get this error if also tries to read the data in a format that is not supported by spark.read (or if does not specify the format).
The file format should be specified along the supported formats: csv, txt, json, parquet or arvo.
dataframe = spark.read.format('csv').load(path)

Specify pyspark dataframe schema with string longer than 256

I'm reading a source that got descriptions longer then 256 chars. I want to write them to Redshift.
According to: https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns it is only possible in Scala.
According to this: https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691
it should be a workaround to specify the schema when creating the dataframe. I'm not able to get it to work.
How can I specify the schema with varchar(max)?
df = ...from source
schema = StructType([
StructField('field1', StringType(), True),
StructField('description', StringType(), True)
])
df = sqlContext.createDataFrame(df.rdd, schema)
Redshift maxlength annotations are passed in format
{"maxlength":2048}
so this is the structure you should pass to StructField constructor:
from pyspark.sql.types import StructField, StringType
StructField("description", StringType(), metadata={"maxlength":2048})
or alias method:
from pyspark.sql.functions import col
col("description").alias("description", metadata={"maxlength":2048})
If you use PySpark 2.2 or earlier please check How to change column metadata in pyspark? for workaround.

How to create a schema for dataset in Hive table?

I am building a schema for the dataset below from a hive table.
After processing I have to write the data to S3.
I need to restructure and group the user id interaction based on date attached json image format to be prepared.
For building this schema i have prepared a struct type with array.
fields = [
StructField("expUserId", StringType(), True),
StructField("recordDate", StringType(), True),
StructField("siteId", StringType(), True),
StructField("siteName", StringType(), True),
StructField("itineraryNumber", StringType(), True),
StructField("travelStartDate", StringType(), True),
StructField("travelEndDate", StringType(), True),
StructField("destinationID", StringType(), True),
StructField("lineOfBusiness", StringType(), True),
StructField("pageViewMap", MapType(StringType(),ArrayType(StructType([
StructField("PageId", StringType(), True),
StructField("count", LongType(), True)]))), True)
]
schema = StructType(fields)
return schema
Is this schema correct? How to convert the DataFrame to the below json schema type.
Why wouldn't you just use a SparkSession to read in the json use schema to show the interpreted structure?
spark.read.json(inputPath).schema
If your dataset is in Hive, read it using a JDBC or Hive integration layer (see Hive Tables or JDBC To Other Databases in the official documentation of Spark).
It is as simple as spark.read.format("jdbc")...load() or spark.read.table respectively (see DataFrameReader API in the official documentation).
What's nice about this approach is that Spark can automatically infer the schema for you (so you can leave that out and have more time for yourself!)
Once the dataset is in your hands as a DataFrame or Dataset, you can save it to S3 in JSON format as follows:
inventoryDF.write.format("json").save("s3n://...")
See JSON Datasets and DataFrameWriter API in the official documentation.
I strongly recommend letting Spark do the hard work so you don't have to.
You can create new dataframe from json with your own defined schema.
val myManualSchema = new StructType(Array(
new StructField("column1", StringType, true),
new StructField("column2", LongType, false)
))
val myDf = spark.read.format("json")
.schema(myManualSchema)
.load('/x/y/zddd.json')
dataframe can be created without specifying schema manually. So spark will generate schema by evaluating input file.
val df = spark.read.format("json").load("/x/y/zddd.json")
read the schema from json using below command.
val SchJson = spark.read.format("json").load("/x/y/zddd.json").schema

Can you have a column of dataframes in pyspark?

I am a little new to pyspark/bigdata so this could be a bad idea, but I have about a million individual CSV files each associated with some metadata. I would like a pyspark dataframe with columns for all the metadata fields, but also with a column whose entries are the (whole) CSV files associated with each set of metadata.
I am not at work right now but I remember almost the exact code. I have tried a toy example something like
outer_pandas_df = pd.DataFrame.from_dict({"A":[1,2,3],"B":[4,5,6]})
## A B
## 0 1 4
## 1 2 5
## 2 3 6
And then if you do
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True)
])
outer_spark_df = sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
Then the result is a spark dataframe as expected. But now if you do
inner_pandas_df = pd.DataFrame.from_dict({"W":["X","Y","Z"]})
outer_pandas_df["C"] = [inner_pandas_df, inner_pandas_df, inner_pandas_df]
And make the schema like
inner_schema = StructType([
StructField("W", StringType(), True)
])
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True),
StructField("W", ArrayType(inner_schema), True)
])
then this fails:
sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
with an error related to ArrayType not accepting pandas dataframes. I don't have the exact error.
Is what I'm trying to do possible?
Spark does not support nested dataframes. Why do you want a column that contains the entire CSV to be constantly stored in memory, anyway? It seems to me that if you need that, you are not successfully extracting the data into the other columns.

How to unstack dataset (using pivot)?

I tried the new "pivot" function of 1.6 on a larger stacked dataset. It has 5,656,458 rows and the IndicatorCode column has 1344 different codes.
The idea was to use pivot to "unstack" (in pandas terms) this data set and have a column for each IndicatorCode.
schema = StructType([ \
StructField("CountryName", StringType(), True), \
StructField("CountryCode", StringType(), True), \
StructField("IndicatorName", StringType(), True), \
StructField("IndicatorCode", StringType(), True), \
StructField("Year", IntegerType(), True), \
StructField("Value", DoubleType(), True) \
])
data = sqlContext.read.load('hdfs://localhost:9000/tmp/world-development-indicators/Indicators.csv',
format='com.databricks.spark.csv',
header='true',
schema=schema)
data2 = indicators_csv.withColumn("IndicatorCode2", regexp_replace("indicatorCode", "\.", "_"))\
.select(["CountryCode", "IndicatorCode2", "Year", "Value"])
columns = [row.IndicatorCode2 for row in data2.select("IndicatorCode2").distinct().collect()]
data3 = data2.groupBy(["Year", "CountryCode"])\
.pivot("IndicatorCode2", columns)\
.max("Value")
While this returned successfully, data3.first() never returned a result (I interrupted on my standalone using 3 cores after 10 min).
My approach using RDD and aggregateByKey worked well, so I'm not searching for a solution about how to do it, but whether pivot with DataFrames can also do the trick.
Well, pivoting is not a very efficient operation in general and there is not much you can do about it using DataFrame API. One thing you can try though is to repartition your data:
(data2
.repartition("Year", "CountryCode")
.groupBy("Year", "CountryCode")
.pivot("IndicatorCode2", columns)
.max("Value"))
or even aggregate:
from pyspark.sql.functions import max
(df
.groupBy("Year", "CountryCode", "IndicatorCode")
.agg(max("Value").alias("Value"))
.groupBy("Year", "CountryCode")
.pivot("IndicatorCode", columns)
.max("Value"))
before applying pivot. The idea behind both solutions is the same. Instead of moving large expanded Rows move narrow dense data and expand locally.
Spark 2.0 introduced SPARK-13749 an implementation of pivot that is faster for a large number of pivot column values.
Testing with Spark 2.1.0 on my computer, your example now runs in 48 seconds.

Resources