#Spark #Python
Objective:
Read the location of the log files, extract the CSV text table data from the logs, and print the JSON of the table data (table columns = the columns retrieved from the CSV + serial no + timestamp).
Read serial_no, time, s3_path from the database.
s3_path points to CSV files.
The output needs to be a dataframe of the columns in the tables + primary_key + timestamp.
Current code (pseudo):
df = sparkSession.read \
    .format("com.databricks.spark.redshift") \
    .option("url",
            "some url with id{}&password={}".format(
                redshift_user, redshift_pass)) \
    .option("query", query) \
    .option("tempdir", s3_redshift_temp_dir) \
    .option("forward_spark_s3_credentials", True) \
    .load()
+-------------+-------------------+--------------------+
|serial_number| test_date| s3_path|
+-------------+-------------------+--------------------+
| A0123456|2019-07-10 04:11:52|s3://test-bucket-...|
| A0123456|2019-07-24 23:48:03|s3://test-bucket-...|
| A0123456|2019-07-22 20:56:57|s3://test-bucket-...|
| A0123456|2019-07-22 20:56:57|s3://test-bucket-...|
| A0123456|2019-07-22 20:58:36|s3://test-bucket-...|
+-------------+-------------------+--------------------+
Since we cannot pass the Spark context to the worker nodes, I used boto3 to read the text file and processed the text to fetch the CSV table structure.
I am not sharing the proprietary code for retrieving the table from the log here.
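Purely for illustration, a minimal sketch of what such a function might look like; the bucket/key handling is simplified and extract_table_csv is a hypothetical placeholder for the proprietary parsing logic:
import boto3

def read_s3_file(s3_path):
    # download the log file behind s3_path; boto3 credentials must be available on the executors
    bucket, key = s3_path.replace("s3://", "").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # extract_table_csv (hypothetical) turns the raw log text into the CSV table string
    return extract_table_csv(body)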
read_s3_file_udf = spark.udf.register("read_s3_file", read_s3_file)
df_with_string_csv = df.withColumn('table_csv_data', read_s3_file_udf(df.s3_path))
df_with_string_csv now contains the sample below:
+-------------+-------------------+--------------------+----------------------+
|serial_number| test_date| s3_path| table_csv_data |
+-------------+-------------------+--------------------+----------------------+
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1BE|2019-05-08 09:26:55|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 06:54:28|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 00:19:52|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-24 22:24:40|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-09-12 22:15:19|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:27:56|s3://test-bucket-...|col1,col2,col3,col4...|
+-------------+-------------------+--------------------+----------------------+
A sample table_csv_data value contains:
timestamp,partition,offset,key,value
1625218801350,97,33009,2CKXTKAT_20210701193302_6400_UCMP,458969040
1625218801349,41,33018,3FGW9S6T_20210701193210_6400_UCMP,17569160
I am trying to achieve the final dataframe below; please help.
+-------------+-------------------+-------------+---------+------+---------------------------------+---------+
|serial_number|          test_date|    timestamp|partition|offset|                              key|    value|
+-------------+-------------------+-------------+---------+------+---------------------------------+---------+
|     1050D1B0|2019-05-07 15:41:11|1625218801350|       97| 33009|2CKXTKAT_20210701193302_6400_UCMP|458969040|
|     1050D1B0|2019-05-07 15:41:11|1625218801349|       41| 33018|3FGW9S6T_20210701193210_6400_UCMP| 17569160|
|          ...|                ...|          ...|      ...|   ...|                              ...|      ...|
+-------------+-------------------+-------------+---------+------+---------------------------------+---------+
For Spark 2.4.0+ you just need a combination of split, explode, and array_except.
Consider using repartition to optimize, as explode can create a lot of rows.
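A minimal sketch of that repartition step, applied before the transformations below (the partition count 200 is an arbitrary illustrative value to tune for your data volume):
df = df.repartition(200)  # spread rows across partitions before explode multiplies them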
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import split, explode, col, array_except, array, trim

spark = SparkSession.builder.getOrCreate()

df = (
    df
    # split the raw CSV text into individual lines
    .withColumn('table_csv_data', split(col('table_csv_data'), '\n'))
    # drop the header line (the first element of the array)
    .withColumn('table_csv_data', array_except(col('table_csv_data'), array([col('table_csv_data')[0]])))
    # one output row per remaining CSV line
    .withColumn('table_csv_data', explode(col('table_csv_data')))
    # split each line into its comma-separated fields
    .withColumn('table_csv_data', split(col('table_csv_data'), ','))
    .withColumn('timestamp', trim(col('table_csv_data')[0]))
    .withColumn('partition', trim(col('table_csv_data')[1]))
    .withColumn('offset', trim(col('table_csv_data')[2]))
    .withColumn('key', trim(col('table_csv_data')[3]))
    .withColumn('value', trim(col('table_csv_data')[4]))
    .drop('table_csv_data')
)
df.show(truncate=False)
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
|serial_number|test_date |s3_path |timestamp |partition|offset|key |value |
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801350|97 |33009 |2CKXTKAT_20210701193302_6400_UCMP|458969040|
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801349|41 |33018 |3FGW9S6T_20210701193210_6400_UCMP|17569160 |
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801350|97 |33009 |2CKXTKAT_20210701193302_6400_UCMP|458969040|
|1050D1B0 |2019-05-07 15:41:11|s3://test-bucket-|1625218801349|41 |33018 |3FGW9S6T_20210701193210_6400_UCMP|17569160 |
+-------------+-------------------+-----------------+-------------+---------+------+---------------------------------+---------+
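One hedged caveat: array_except has set semantics, so in addition to removing the header line it also de-duplicates identical CSV lines within a single cell. If duplicate lines must be preserved, a sketch of dropping only the first array element with slice (also available in Spark 2.4) could replace that step before the explode:
from pyspark.sql.functions import split, col, expr

df = df \
    .withColumn('table_csv_data', split(col('table_csv_data'), '\n')) \
    .withColumn('table_csv_data', expr("slice(table_csv_data, 2, size(table_csv_data) - 1)"))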
Related
I'm using PySpark 2.4.
I have a dataframe like below as input:
+-------+-------+---------+
| ceci_p| ceci_l|ceci_stok|
+-------+-------+---------+
|SFIL401| BPI202|   BPI202|
| BPI202| CDC111|   BPI202|
| LBP347|SFIL402|  SFIL402|
| LBP347|SFIL402|   LBP347|
+-------+-------+---------+
I want to detect which ceci_stok values exist in both ceci_l and ceci_p columns using a join (maybe a self join).
For example: ceci_stok = BPI202 exists in both ceci_l and ceci_p.
As a result, I want to create a new dataframe that contains the ceci_stok values which exist in both ceci_l and ceci_p.
# create data for testing
data = [("SFIL401","BPI202","BPI202"),
        ("BPI202","CDC111","BPI202"),
        ("LBP347","SFIL402","SFIL402"),
        ("LBP347","SFIL402","LBP347")]
data_schema = ["ceci_p","ceci_l","ceci_stok"]
df = spark.createDataFrame(data=data, schema=data_schema)

df.cache()  # don't forget to cache a table you reference multiple times

# rename each column to a common key so the two sides can be joined
ceci_p = df.select(df.ceci_p.alias("join_key")).distinct()
ceci_l = df.select(df.ceci_l.alias("join_key")).distinct()

vals = ceci_l.join(ceci_p, "join_key").distinct()  # unique values present in both columns you're interested in

df.join(vals, df.ceci_stok == vals.join_key).show()
+-------+-------+---------+--------+
| ceci_p| ceci_l|ceci_stok|join_key|
+-------+-------+---------+--------+
|SFIL401| BPI202| BPI202| BPI202|
| BPI202| CDC111| BPI202| BPI202|
+-------+-------+---------+--------+
The following seems to work in Spark 3.0.2. Please try it.
from pyspark.sql import functions as F
df2 = (
df.select('ceci_stok').alias('_stok')
.join(df.alias('_p'), F.col('_stok.ceci_stok') == F.col('_p.ceci_p'), 'leftsemi')
.join(df.alias('_l'), F.col('_stok.ceci_stok') == F.col('_l.ceci_l'), 'leftsemi')
.distinct()
)
df2.show()
# +---------+
# |ceci_stok|
# +---------+
# | BPI202|
# +---------+
You're right, that can be done using a self-join. If you have a dataframe
>>> df.show(truncate=False)
+-------+-------+---------+
|ceci_p |ceci_l |ceci_stok|
+-------+-------+---------+
|SFIL401|BPI202 |BPI202 |
|BPI202 |CDC111 |BPI202 |
|LBP347 |SFIL402|SFIL402 |
|LBP347 |SFIL402|LBP347 |
+-------+-------+---------+
...then the following pair of joins (with "leftsemi" to drop the right-hand side) should produce what you need:
>>> df.select("ceci_stok") \
.join(df.select("ceci_p"),df.ceci_stok == df.ceci_p,"leftsemi") \
.join(df.select("ceci_l"),df.ceci_stok == df.ceci_l,"leftsemi") \
.show(truncate=False)
+---------+
|ceci_stok|
+---------+
|BPI202 |
|BPI202 |
+---------+
You can dedup the result if you're just interested in unique values.
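For example, a minimal follow-up that appends distinct() to the same chain (the variable name unique_stok is just illustrative):
unique_stok = df.select("ceci_stok") \
    .join(df.select("ceci_p"), df.ceci_stok == df.ceci_p, "leftsemi") \
    .join(df.select("ceci_l"), df.ceci_stok == df.ceci_l, "leftsemi") \
    .distinct()
unique_stok.show(truncate=False)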
When reading data from a text file with PySpark using the following code,
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')
My data text file looks like,
col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
But the output I got was,
col1|cpl2|col3|col4
112 |4344|fn1 | home_a
Is there a way to add those missing columns to the dataframe?
Expecting,
col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
You can explicitly specify the schema instead of inferring it.
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
schema = StructType() \
.add("col1",StringType(),True) \
.add("col2",StringType(),True) \
.add("col3",StringType(),True) \
.add("col4",StringType(),True) \
.add("col5",StringType(),True) \
.add("col6",StringType(),True) \
.add("col7",StringType(),True) \
.add("col8",StringType(),True)
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')
Output
+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3| col4| col5| col6| col7| col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
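As a side note, an equivalent sketch uses a DDL-style schema string instead of building a StructType (DataFrameReader.schema accepts such strings; the file name is the same test file as above):
ddl_schema = "col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING, col6 STRING, col7 STRING, col8 STRING"
df = spark.read.option("sep", "|").option("header", "true").schema(ddl_schema).csv('70475571_data.txt')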
I have a hive table tableA with the following format:
> desc tableA;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| statementid | string | |
| batchid | string | |
| requestparam | map<string,string> | |
+--------------------------+-----------------------+-----------------------+--+
I tried to load the table with the following code:
val tempdf= spark.read.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://localhost:10000/tempdb")
.option("user","user1")
.option("password","password1")
.option("query","select statementid, batchid, requestparam from tempdb.tableA")
.load()
And my second attempt:
val tempdf = spark.read.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://localhost:10000/tempdb")
.option("user","user1")
.option("password","password1")
.option("dbtable","tempdb.tableA")
.load()
But the map<string,string> column causes an issue while loading the source Hive table into a Spark dataset.
Exception in thread "main" java.sql.SQLException: Unsupported type JAVA_OBJECT
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:247)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:312)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:312)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
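The root cause is that the JDBC data source cannot translate the JAVA_OBJECT type that HiveServer2 reports for map<string,string> into a Catalyst type. A common workaround, sketched here as a hedged PySpark example under the assumption that the Spark application can reach the Hive metastore directly, is to read the table through Spark's native Hive support instead of JDBC, which does handle complex types:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("read-hive-map-column") \
    .enableHiveSupport() \
    .getOrCreate()

# the map<string,string> column is read as a proper MapType instead of failing as JAVA_OBJECT
tempdf = spark.table("tempdb.tableA")
tempdf.printSchema()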
For example:
I have two dataframes in PySpark.
A_dataframe (table name: link_data_test), which is very large, about 1 billion rows:
+-----+--------------------+--------------+
|   id|           link_date|      tuch_url|
+-----+--------------------+--------------+
|day_1|2020-01-01 06:00:...|www.google.com|
|day_2|2020-01-01 11:00:...|www.33e.......|
|day_3|2020-01-03 22:21:...|www.3tg.......|
|day_4|2019-01-04 20:00:...|www.96g.......|
|  ...|                 ...|           ...|
+-----+--------------------+--------------+
B_dataframe (table name: url_data_test):
+--------------+----------+
|           url|extra_date|
+--------------+----------+
|www.google.com|2019-02-01|
|www.23........|2020-01-02|
|www.hsi.......|2020-01-03|
|www.cc........|2020-01-05|
|           ...|       ...|
+--------------+----------+
I can use spark.sql() to run a query:
sql_str="""
select
t1.*,t2.*
from
link_data_test as t1
inner join
url_data_test as t2
on
t1.link_date> t2.extra_date and t1.link_date< date_add(t2.extra_date,8)
where
t1.tuch_url like "%t2.url%"
"""
spark.sql(sql_str).write.saveAsTable("xxxx", mode="overwrite")
For some other tests I would like to replace the SQL above with the DataFrame API, along the lines below, but I don't know how to write it:
A_dataframe.join(B_dataframe, ......,'inner').select(....).saveAsTable("xxxx",mode="overwrite")
Thank you for your help!
Here is the way.
from pyspark.sql.functions import broadcast, col, date_add

df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df1.show(10, False)
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")
df2.show(10, False)
+-----+-------------------+--------------+
|id |link_date |tuch_url |
+-----+-------------------+--------------+
|day_1|2020-01-08 23:59:59|www.google.com|
+-----+-------------------+--------------+
+--------------+----------+
|url |extra_date|
+--------------+----------+
|www.google.com|2020-01-01|
+--------------+----------+
df1.join(broadcast(df2),
         col('link_date').between(col('extra_date'), date_add('extra_date', 7))
         & col('tuch_url').contains(col('url')), 'inner') \
   .show(10, False)
+-----+-------------------+--------------+--------------+----------+
|id |link_date |tuch_url |url |extra_date|
+-----+-------------------+--------------+--------------+----------+
|day_1|2020-01-08 23:59:59|www.google.com|www.google.com|2020-01-01|
+-----+-------------------+--------------+--------------+----------+
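To persist the result as the question intends, a minimal follow-up sketch (the table name "xxxx" is carried over from the question):
result = df1.join(broadcast(df2),
                  col('link_date').between(col('extra_date'), date_add('extra_date', 7))
                  & col('tuch_url').contains(col('url')), 'inner')
result.write.mode("overwrite").saveAsTable("xxxx")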
I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario where some finance data comes from a Kafka topic. The data (base dataset) contains the fields companyId, year, and prev_year.
If year === prev_year, then I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, then I need to return the base dataset itself.
How can I do this in spark-sql?
You can refer to the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2016| 2017| 12|
| 1|2017| 2017|21.4|
| 2|2018| 2017|11.7|
| 2|2018| 2018|44.6|
| 3|2016| 2017|34.5|
| 4|2017| 2017| 56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
| 1|12.3|
| 2|12.5|
| 3|22.3|
| 4|34.6|
| 5|45.2|
+---------+----+
scala> import org.apache.spark.sql.functions.col
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2017| 2017|21.4|
| 2|2018| 2018|44.6|
| 4|2017| 2017| 56|
| 1|2016| 2017|12.3|
| 2|2018| 2017|12.5|
| 3|2016| 2017|22.3|
+---------+----+---------+----+
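For reference, a hedged PySpark equivalent of the same filter/join/union approach (the dataframe names input_df and exch_rates simply mirror the Scala session above):
from pyspark.sql.functions import col

equaldf = input_df.filter(col("year") == col("prev_year"))
notequaldf = input_df.filter(col("year") != col("prev_year"))

# for the mismatched rows, drop the incoming rate and pick it up from exchange_rates instead
joindf = notequaldf.drop("rate").join(exch_rates, ["companyId"], "left")
final_df = equaldf.union(joindf)
final_df.show()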