PySpark not picking the custom schema in csv - apache-spark

I am struggling with a very basic PySpark example. I don't know what is going on and would really appreciate it if someone could help me out.
Below is my PySpark code to read a CSV file which contains three columns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Sample App").getOrCreate()
child_df1 = spark.read.csv("E:\\data\\person.csv",inferSchema=True,header=True,multiLine=True)
child_df1.printSchema()
Below is the output of above code
root
|-- CPRIMARYKEY: long (nullable = true)
|-- GENDER: string (nullable = true)
|-- FOREIGNKEY: long (nullable = true)
child_df1.select("CPRIMARYKEY","FOREIGNKEY","GENDER").show()
Output
+--------------------+----------------------+------+
| CPRIMARYKEY |FOREIGNKEY |GENDER|
+--------------------+----------------------+------+
| 6922132627268452352| -4967470388989657188| F|
|-1832965148339791872| 761108337125613824| F|
| 7948853342318925440| -914230724356211688| M|
The issue comes when I provide the custom schema
import pyspark.sql.types as T
child_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType())
    ]
)
child_df2 = spark.read.csv("E:\\data\\person.csv",schema=child_schema,multiLine=True,header=True)
child_df2.show()
+--------------------+----------------------+
| CPRIMARYKEY |FOREIGNKEY|
+--------------------+----------------------+
| 6922132627268452352| null|
|-1832965148339791872| null|
| 7948853342318925440| null|
I am not able to understand why Spark can recognize the long values when inferring the schema, but puts null values in the FOREIGNKEY column when I provide the schema myself. I have been struggling with this simple exercise for a long time with no luck. Could someone please point out what I am missing? Thank you.

As far as I understand, your schema declares only 2 columns while the CSV has 3, so the FOREIGNKEY and GENDER columns are effectively treated as one.
Spark then tries to parse -4967470388989657188,F as a long and returns null because it's not a valid long.
Can you add the GENDER column to the schema and see if that fixes FOREIGNKEY?
If you don't want the GENDER column, instead of removing it from the schema, just .drop('GENDER') after reading the CSV.
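A minimal sketch of that suggestion (same file path as in the question; GENDER is declared so the schema matches the file's three columns, then dropped after the read):
import pyspark.sql.types as T

# Declare all three columns so the schema lines up with the CSV,
# then drop the one that isn't needed.
full_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType()),
        T.StructField("GENDER", T.StringType()),
    ]
)

child_df2 = spark.read.csv(
    "E:\\data\\person.csv",
    schema=full_schema,
    multiLine=True,
    header=True,
).drop("GENDER")

child_df2.printSchema()
child_df2.show()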

Related

Do I need to use "mergeSchema" option in spark with parquet if I am passing in a schema explicitly?

From the Spark documentation:
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or setting the global SQL option spark.sql.parquet.mergeSchema to true.
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)
My understanding from the documentation is that if I have multiple parquet partitions with different schemas, Spark will be able to merge these schemas automatically if I use spark.read.option("mergeSchema", "true").parquet(path).
This seems like a good option if I don't know at query time what schemas exist in these partitions.
However, consider the case where I have two partitions, one using an old schema, and one using a new schema that differs only in having one additional field. Let's also assume that my code knows the new schema and I'm able to pass this schema in explicitly.
In this case, I would do something like spark.read.schema(my_new_schema).parquet(path). What I'm hoping Spark would do in this case is read in both partitions using the new schema and simply supply null values for the new column to any rows in the old partition. Is this the expected behavior? Or do I also need to use option("mergeSchema", "true") in this case?
I'm hoping to avoid using the mergeSchema option if possible in order to avoid the additional overhead mentioned in the documentation.
I've tried extending the example code from the Spark documentation linked above, and my assumptions appear to be correct. See below:
// This is used to implicitly convert an RDD to a DataFrame.
scala> import spark.implicits._
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF: org.apache.spark.sql.DataFrame = [value: int, square: int]
scala> squaresDF.write.parquet("test_data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
scala> val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF: org.apache.spark.sql.DataFrame = [value: int, cube: int]
scala> cubesDF.write.parquet("test_data/test_table/key=2")
// Read the partitioned table
scala> val mergedDF = spark.read.option("mergeSchema", "true").parquet("test_data/test_table")
mergedDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 2 more fields]
scala> mergedDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- cube: integer (nullable = true)
|-- key: integer (nullable = true)
// Read without mergeSchema option
scala> val naiveDF = spark.read.parquet("test_data/test_table")
naiveDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 1 more field]
// Note that cube column is missing.
scala> naiveDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- key: integer (nullable = true)
// Take the schema from the mergedDF above and use it to read the same table with an explicit schema, but without the "mergeSchema" option.
scala> val explicitSchemaDF = spark.read.schema(mergedDF.schema).parquet("test_data/test_table")
explicitSchemaDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 2 more fields]
// Spark was able to use the correct schema despite not using the "mergeSchema" option
scala> explicitSchemaDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- cube: integer (nullable = true)
|-- key: integer (nullable = true)
// Data is as expected.
scala> explicitSchemaDF.show()
+-----+------+----+---+
|value|square|cube|key|
+-----+------+----+---+
| 3| 9|null| 1|
| 4| 16|null| 1|
| 5| 25|null| 1|
| 8| null| 512| 2|
| 9| null| 729| 2|
| 10| null|1000| 2|
| 1| 1|null| 1|
| 2| 4|null| 1|
| 6| null| 216| 2|
| 7| null| 343| 2|
+-----+------+----+---+
As you can see, Spark appears to correctly supply null values for any columns missing from the Parquet partitions when an explicit schema is used to read the data.
This makes me feel fairly confident that I can answer my question with "no, the mergeSchema option is not necessary in this case," but I'm still wondering if there are any caveats that I should be aware of. Any additional help from others would be appreciated.
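For anyone doing the same check from PySpark, a rough equivalent of the last step (deriving the union schema once via mergeSchema and then reusing it for plain reads, against the same test_data/test_table layout as above) might look like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-off read with mergeSchema to obtain the union schema (as mergedDF.schema above).
merged_df = spark.read.option("mergeSchema", "true").parquet("test_data/test_table")

# Subsequent reads reuse that schema explicitly, without the mergeSchema overhead;
# columns missing from a partition come back as null.
explicit_df = spark.read.schema(merged_df.schema).parquet("test_data/test_table")
explicit_df.printSchema()
explicit_df.show()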

Is there any way to handle time in pyspark?

I have a string with 6 characters which should be loaded into SQL Server as the TIME data type.
But Spark doesn't have a TIME data type. I have tried a few ways, but the resulting data type is not a timestamp.
I am reading the data as a string, converting it to a timestamp, and then trying to extract the time values, but it comes back as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
And the data looks like this but in a different data type.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
| time_col| tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way I can get the tim2 column as a timestamp, or as a column equivalent to SQL Server's TIME data type?
I think you won't get what you are trying to do; there's no type in PySpark to handle "HH:mm:ss" (see: What data type should be used for a time column).
I'd suggest you keep it as a string.
In my case I converted it to a timestamp in Spark and, just before sending it to SQL Server, made it a string again; that worked fine for me.
Maybe this will help you, but it seems to me that this turns the column into a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))
In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
| c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)
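For completeness, a PySpark sketch of the same cast chain (assuming a SparkSession named spark; the column name c is just for the example):
from pyspark.sql import functions as F

df = spark.createDataFrame([("144451",), ("144837",)], ["c"])

# Cast the digit string to INT (interpreted as seconds since the epoch)
# and then to TIMESTAMP, mirroring the Scala example above.
df2 = df.select(F.col("c").cast("int").cast("timestamp").alias("c"))
df2.show()
df2.printSchema()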

Spark not recognizing question mark (?) as nullValue parameter when reading from a csv

I am using PySpark 2.3.1 with Python 3.6.6 at the moment.
I need to work with a .csv file where ? is used as NA. I want to make PySpark recognize ? as NA directly, so I can treat those values accordingly.
I have tried the nullValue= argument of spark.read.csv without success, and I am not sure whether the argument is being used improperly or the ? character is the problem (I have tried both nullValue='?' and nullValue='\?').
Having read the PySpark API documentation, and having tried pandas' pd.read_csv with na_values= with the same outcome, I would say there is something about ? that makes it not work, but feel free to tell me if I am wrong about that.
What should I do?
The file is the adult dataset from UCI: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The problem is likely caused by spaces around your null value. The easiest situation would be if the number of leading/trailing spaces was fixed (i.e. if it's always one space followed by the question mark: " ?"). In that case, just set nullValue=' ?'.
If the number of spaces is not fixed, a possible solution for this is to use the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace flags. (Assuming you're okay with ignoring leading/trailing whitespace for all values, including non-nulls).
For example, if your file were the following:
col1,col2,col3,col4
1, ?,a,xxx
? ,5,b,yyy
7,8,?,zzz
where the ? is the null character, but it can have either trailing or leading spaces, you could read it as follows:
df = spark.read.csv(
    "path/to/my/file",
    header=True,
    nullValue='?',
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True,
    inferSchema=True
)
This results in the following DataFrame:
df.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| 1|null| a| xxx|
#|null| 5| b| yyy|
#| 7| 8|null| zzz|
#+----+----+----+----+
As you can see, the null values are in the correct places.
Additionally, since we set inferSchema=True, the data types are also correct:
df.printSchema()
#root
# |-- col1: integer (nullable = true)
# |-- col2: integer (nullable = true)
# |-- col3: string (nullable = true)
# |-- col4: string (nullable = true)

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

I'm running PySpark SQL code on the Hortonworks sandbox.
18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3
# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();
I see that the desc column is empty; I'm not sure whether an empty column needs to be handled differently when creating the DataFrame and calling methods on it.
The same error occurs when running the SQL query. The SQL error seems to be due to the "order by" clause; if I remove order by, the query runs successfully.
Please let me know if you need more info; I'd appreciate an answer on how to handle this error.
I tried to see whether the name field contains any commas, as suggested by Chandan Ray.
There is no comma in the name field.
rdd1.count()
=> 1345
rdd2.count()
=> 1345
# clipping id and name column from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]) )
rdd_name.count()
=>1345
rdd_name_comma = rdd_name.filter (lambda x : True if x[1].find(",") != -1 else False )
rdd_name_comma.count()
==> 0
I found the issue: it was due to one bad record where a comma was embedded in a string. Even though the string was double quoted, Python's split breaks it into extra columns, as the sketch below illustrates.
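To illustrate why, Python's plain str.split ignores the double quotes, while the csv module (and the spark-csv reader used below) honours them. A small standalone sketch with a made-up record:
import csv

# Hypothetical record: the name field contains a quoted comma.
line = '123,2,"Glove, Batting",,59.98,http://example.com/img.jpg'

# Naive split: the quoted comma produces 7 values instead of 6.
print(len(line.split(",")))            # 7

# csv module: quotes are honoured, so we get the expected 6 fields.
print(len(next(csv.reader([line]))))   # 6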
I then tried using the Databricks spark-csv package:
# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
# on pyspark
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])
df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
+---+------+--------------------+----+-----+--------------------+
| id|cat_id| name|desc|price| url|
+---+------+--------------------+----+-----+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 60|http://images.acm...|
| 2| 2|Under Armour Men'...| | 130|http://images.acm...|
| 3| 2|Under Armour Men'...| | 90|http://images.acm...|
| 4| 2|Under Armour Men'...| | 90|http://images.acm...|
| 5| 2|Riddell Youth Rev...| | 200|http://images.acm...|
df1.printSchema()
root
|-- id: integer (nullable = true)
|-- cat_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: decimal(10,0) (nullable = true)
|-- url: string (nullable = true)
df1.count()
1345
I suppose your name field has a comma in it, so it gets split too, which is why 7 values are provided where 6 fields are expected.
There might be some malformed lines.
Please try the code below to route bad records out of the file:
val df = spark.read.format("csv").option("badRecordsPath", "/tmp/badRecordsPath").load("csvpath")
// It will read the CSV and create a DataFrame; any malformed records will be moved to the path you provided.
// Please read:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
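If badRecordsPath isn't available in your environment (it is described in the Databricks docs linked above), a possible alternative on Spark 2.x is the standard CSV reader's mode option, which silently drops rows that don't fit the schema; a sketch reusing schema1 from the question:
# Sketch, assuming Spark 2.x: rows whose values don't match the schema are dropped.
df = (
    spark.read
    .schema(schema1)
    .option("mode", "DROPMALFORMED")
    .csv("/user/maria_dev/spark_data/products.csv")
)
df.show()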
Here is my take on cleaning such records; we normally encounter situations like this:
a. An anomaly in the data: when the file was created, nobody checked whether "," was the best delimiter for the columns.
Here is my solution for this case:
Solution a: In such cases, we want the data-cleansing step to identify whether a record is a qualified record. Routing the remaining records to a bad file/collection gives us the opportunity to reconcile them.
Below is the structure of my dataset (product_id,product_name,unit_price)
1,product-1,10
2,product-2,20
3,product,3,30
In the above case, product,3 was probably meant to be product-3, likely a typo made when the product was registered. In such a case, the sample below would work.
>>> tf = open("C:/users/ip2134/pyspark_practice/test_file.txt")
>>> trec = tf.read().splitlines()
>>> trec_clean, trec_bad = [], []   # initialise the good/bad record lists
>>> for rec in trec:
...     if rec.count(",") == 2:
...         trec_clean.append(rec)
...     else:
...         trec_bad.append(rec)
...
>>> trec_clean
['1,product-1,10', '2,product-2,20']
>>> trec_bad
['3,product,3,30']
>>> trec
['1,product-1,10', '2,product-2,20','3,product,3,30']
The other alternative for dealing with this problem would be to see whether skipinitialspace=True would help parse out the columns.
(Ref:Python parse CSV ignoring comma with double-quotes)

Aggregating tuples within a DataFrame together [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I am currently trying to do some aggregation on the Services column. I would like to group all the similar services, sum the values, and if possible flatten this into a single row.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()
pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)
df2 = df.select('cid',functions.explode_outer(df.nodes.services))
finaldataFrame = df2.select('cid',(functions.explode_outer(df2.col)).alias('Services'))
finaldataFrame.show()
I am quite new to PySpark and have been looking at resources and trying to create a UDF to apply to that column, but the map function in PySpark only works for RDDs and not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)
First, extract the service and the value from the Services column by position. Note this assumes that the value is always in position 0 and the service is always in position 1 (as shown in your example).
import pyspark.sql.functions as f
df2 = df.select(
    'cid',
    f.col("Services").getItem(0).alias('value').cast('integer'),
    f.col("Services").getItem(1).alias('service')
)
df2.show()
#+------------------+-------+-------+
#| cid| value|service|
#+------------------+-------+-------+
#|845124826013182686| 112931| serv1|
#|845124826013182686| 146936| serv1|
#|845124826013182686| 32718| serv2|
#|845124826013182686| 28839| serv2|
#|845124826013182686| 8710| serv2|
#|845124826013182686|2093140| serv3|
#+------------------+-------+-------+
Note that I cast the value to integer, but it may already be an integer depending on how your schema is defined.
Once the data is in this format, it's easy to pivot() it. Group by the cid column, pivot the service column, and aggregate by summing the value column:
df2.groupBy('cid').pivot('service').sum("value").show()
#+------------------+------+-----+-------+
#| cid| serv1|serv2| serv3|
#+------------------+------+-----+-------+
#|845124826013182686|259867|70267|2093140|
#+------------------+------+-----+-------+
Update
Based on the schema you provided, you will have to get the value and service by name, rather than by position:
df2 = df.select(
    'cid',
    f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
    f.col("Services").getItem("name").alias('service')
)
The rest is the same. Also, no need to cast to integer as cpuCoreInSeconds is already a long.
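Putting the update together with the pivot step, the end-to-end flow would look roughly like this (a sketch using the column names from your schema; cid is assumed to be the grouping column as in your example):
import pyspark.sql.functions as f

# Extract value and service by field name, then pivot as before.
df2 = df.select(
    'cid',
    f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
    f.col("Services").getItem("name").alias('service')
)

result = df2.groupBy('cid').pivot('service').sum('value')
result.show()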
