How to convert GUID into integer in pyspark - apache-spark

Hi Stackoverflow fams:
I am new to PySpark and trying to learn as much as I can, but for now I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert a GUID into an int:
CHECKSUM(HASHBYTES('sha2_512',GUID)) AS int_value_wanted
I wanted to do the same thing in PySpark, so I created a temporary table from a Spark DataFrame and put the above statement in the SQL query, but the code keeps throwing "Undefined function: 'CHECKSUM'". Is there a way to add the CHECKSUM function to PySpark, or another PySpark way to do the same thing?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark_session = glueContext.spark_session
sqlContext = SQLContext(spark_session.sparkContext, spark_session)

spark_df = spark_session.createDataFrame(
    [("2540f487-7a29-400a-98a0-c03902e67f73", "1386172469"),
     ("0b32389a-ce01-4e6a-855c-15940cc91e9e", "-2013240275")],
    ("GUDI", "int_value_wanted")
)
spark_df.show(truncate=False)

spark_df.registerTempTable('temp')
new_df = sqlContext.sql("SELECT *, CHECKSUM(HASHBYTES('sha2_512', GUDI)) AS detail_id FROM temp")
new_df.show(truncate=False)
+------------------------------------+----------------+
|GUDI |int_value_wanted|
+------------------------------------+----------------+
|2540f487-7a29-400a-98a0-c03902e67f73|1386172469 |
|0b32389a-ce01-4e6a-855c-15940cc91e9e|-2013240275 |
+------------------------------------+----------------+
Thanks

There is a sha2 built-in function, which returns the checksum for the SHA-2 family as a hex string. SHA-512 is also supported.
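For example, a minimal sketch in PySpark (the GUID column name and sample values here are taken from the question just for illustration). Note that none of these reproduce the exact values of SQL Server's CHECKSUM(HASHBYTES(...)); they are Spark's own hash functions:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2540f487-7a29-400a-98a0-c03902e67f73",),
     ("0b32389a-ce01-4e6a-855c-15940cc91e9e",)],
    ["GUID"],
)

# sha2 returns the SHA-2 digest of the column as a hex string; 512 selects SHA-512.
with_digest = df.withColumn("guid_sha512", F.sha2(F.col("GUID"), 512))

# Spark also ships integer-valued hash functions:
# hash() gives a 32-bit Murmur3 hash, xxhash64() a 64-bit hash (Spark 3.0+).
with_ints = with_digest \
    .withColumn("int32_hash", F.hash(F.col("GUID"))) \
    .withColumn("int64_hash", F.xxhash64(F.col("GUID")))
with_ints.show(truncate=False)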

Related

I can't see the values that the Imputer applied

I was trying to handle null values with Imputer, but I have a problem.
I have some null values, and to fill them I used Imputer. Then I wanted to check the new values produced by the Imputer, so I used the filter function and SQL to look at them, but both methods return 0 rows.
My question is: why can't I see these rows?
from pyspark.ml.feature import Imputer
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

imputer = Imputer()
imputer.setInputCols(["Life_expectancy", "Adult_Mortality", "Hepatitis_B", "GDP", "BMI", "Polio"]) \
    .setOutputCols(["Life_expectancy_Mean", "Adult_Mortality_Mean", "Hepatitis_B_Mean", "GDP_Mean", "BMI_Mean", "Polio_Mean"])
model = imputer.fit(df)

sqlContext = SQLContext(spark)
spark_df = spark.createDataFrame(df2)
spark_df.createTempView("table")
spark.sql("""
    select * from table where Country = "Bahamas"
    order by "Bahamas"
""").toPandas().head()  # pyspark says 0 rows, 24 columns when I use spark.sql
spark_df.filter(col("Country") == "Bahamas").toPandas().head(50)  # pyspark says 0 rows, 28 columns when I use filter
Code screenshots: https://i.stack.imgur.com/Bs1AK.png and https://i.stack.imgur.com/jYqQh.png
I tried using both a pandas DataFrame and a Spark DataFrame, but neither worked, and I haven't found a solution to this problem.
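For reference, a minimal Imputer round trip looks roughly like this (a sketch; the column names come from the question and df is assumed to hold the dataset). The fitted ImputerModel only produces the *_Mean columns on the DataFrame returned by transform():
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

# df is assumed to be the health DataFrame from the question
imputer = Imputer(
    inputCols=["Life_expectancy", "Adult_Mortality", "Hepatitis_B", "GDP", "BMI", "Polio"],
    outputCols=["Life_expectancy_Mean", "Adult_Mortality_Mean", "Hepatitis_B_Mean", "GDP_Mean", "BMI_Mean", "Polio_Mean"],
)
model = imputer.fit(df)           # learns the imputation values (column means by default)
imputed = model.transform(df)     # adds the *_Mean output columns
imputed.filter(F.col("Country") == "Bahamas").show()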

Why is Pandas-API-on-Spark's apply on groups way slower than the PySpark API?

I'm seeing strange performance results when comparing the two APIs in PySpark 3.2.1 that provide the ability to run a pandas UDF on grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - a new way of apply introduced in Pandas-API-on-Spark AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps
spark = SparkSession.builder \
.config("spark.sql.execution.arrow.pyspark.enabled", True) \
.getOrCreate()
ps.set_option("compute.default_index_type", "distributed")
spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
.write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf
df = spark.read.parquet(f'/tmp/sample_input')
output_schema = types.StructType(
df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
.write.parquet('/tmp/schematest', mode='overwrite')
The code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and, while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf
df = ps.read_parquet(f'/tmp/sample_input')
df.groupby('group').apply(getsum) \
.to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution takes at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently with no pandas involvement, but this is just a small minimal example; any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown would be? Maybe I'm missing some configuration parameter that would make these execute in a similar time?
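For comparison, the pure-Spark equivalent hinted at in the question (no pandas involved) is a window aggregate; a rough sketch over the same sample data:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('/tmp/sample_input')

# Same sum_in_group column, computed entirely inside the JVM,
# with no serialization of each group to pandas and back.
w = Window.partitionBy('group')
df.withColumn('sum_in_group', F.sum('id').over(w)) \
    .write.parquet('/tmp/schematest', mode='overwrite')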

AWS Glue Job creates new column in Redshift if found different datatype for same column in parquet file from S3

I am trying to load a parquet file that is in S3 into Redshift using a Glue job. When I run the Glue job the first time, it creates the table and loads the data. But when I run it a second time after changing the datatype of one column, the job does not fail; instead, it creates a new column in Redshift and appends the data.
For example, here I am changing the datatype of the numeric Amount column.
FileName **abc**
Code,Name,Amount
'A','XYZ',200.00
FileName **xyz**
Code,Name,Amount
'A','XYZ',200.00
In Redshift, the output after processing both of the above files:
Code  Name  Amount  Amount_String
A     XYZ   200.00
A     XYZ   200.00
Code
import os
import sys
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import SQLContext
from datetime import date
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
## #params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
spark = SparkSession.builder.getOrCreate()
glueContext = GlueContext(SparkContext.getOrCreate())
spark.conf.set('spark.sql.session.timeZone', 'Europe/London')
#sc = SparkContext()
data_source = "s3://bucket/folder/data/"
#read delta and source dataset
employee = spark.read.parquet(data_source)
sq_datasource0 = DynamicFrame.fromDF(employee, glueContext, "new_dynamic_frame")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=sq_datasource0,
    catalog_connection="redshiftDB",
    connection_options={"dbtable": "employee", "database": "dbname"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink4"
)
I want the Glue job to fail if a datatype mismatch is found in the file.
I would appreciate any guidance on how to resolve this issue.
The above issue is caused by the Redshift writer used by the Glue DynamicFrame. It adds a new column to the table in Redshift using an ALTER TABLE query if there are empty records present in a specific column of the input data.
To avoid this behavior, convert the Glue DynamicFrame to a Spark DataFrame and write it out to Redshift, as in the Scala snippet below.
val amDf = am.toDF()
amDf.write.format("com.databricks.spark.redshift")
.mode(SaveMode.Overwrite)
.option("url", JDBC_URL)
.option("dbtable", TABLE_NAME)
.option("user", USER)
.option("password", PASSWORD)
.option("aws_iam_role", IAM_ROLE)
.option("tempdir", args("TempDir"))
.save()
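Since the question uses PySpark, a rough equivalent of the Scala snippet above might look like this (a sketch; JDBC_URL, TABLE_NAME, USER, PASSWORD and IAM_ROLE are placeholders you would have to define, and sq_datasource0 is the DynamicFrame from the question's job):
am_df = sq_datasource0.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
(am_df.write
    .format("com.databricks.spark.redshift")
    .mode("overwrite")
    .option("url", JDBC_URL)
    .option("dbtable", TABLE_NAME)
    .option("user", USER)
    .option("password", PASSWORD)
    .option("aws_iam_role", IAM_ROLE)
    .option("tempdir", args["TempDir"])
    .save())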
A long time has passed since the question was posted, but I recently had the same issue, and here is how I solved it.
Create a new user in Redshift and grant it only SELECT, INSERT, UPDATE, and DELETE permissions. Then, when you use this user in Glue, you will receive an error if the columns do not match and Glue tries to create new columns.
The only thing to remember is that if you want to truncate a table, you need to use DELETE FROM <table_name>; instead, because TRUNCATE requires ALTER TABLE permissions and would fail.
Your crawler configuration might be set to the first or second option. If you do not want your table to be modified when your S3 file structure changes, edit your crawler and set the 'Configuration options' to the third option, "Ignore the change and don't update the table in the data catalog".

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark.
I used the code snippet below:
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime = hashNonKeyData.withColumn(
    "RECORD_START_DATE_TIME",
    lit(strRecordStartTime).cast("timestamp")
)
It gives me the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointers are appreciated.
Try using a native Python datetime with lit; sorry, I don't have access to a machine right now.
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add a new column to the existing DataFrame:
from pyspark.sql.functions import lit
import dateutil.parser
yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # print the dataframe
You can validate your schema by using the command below.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV; however, I have issues with two parts of its documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than re-downloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv")
Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? And how do I simply load a local file on Linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available and are described in the DataFrameReader documentation.
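For example (header and inferSchema are optional keyword arguments of the csv() reader):
df = sqlContext.read.csv("/path/to/your.csv", header=True, inferSchema=True)
df.show()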
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv") \
    .map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()
For PySpark, assuming the first row of the CSV file contains a header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
Read the CSV file into an RDD and then generate a RowRDD from the original RDD.
Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, to be more conscious of memory, you can chunk the data into a Spark RDD and then a DataFrame:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1', 'column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into a similar problem. The solution is to add an environment variable named PYSPARK_SUBMIT_ARGS and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
Based on the answer by Aravind, but much shorter; for example:
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in CSV implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the CSV that way, all fields are marked as strings, and when you enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
