Spark SQL - Regex to find 4 consecutive numbers - apache-spark

I am trying to write a regex check for 4 consecutive digits, calling it as rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true', but I am getting an error.
My Spark SQL is:
df_email_cleaned=spark.sql("select *,case when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[0] \
when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[1] \
when postal_code rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' and length(postal_code)>2 and length(postal_code)<6 then postal_code \
else '' \
end sd_ps_clean \
from df_email_cleaned_sql ")
The error I get is:
pyspark.sql.utils.AnalysisException: cannot resolve '(named_struct('postal_code',
df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')'
due to data type mismatch: differing types in '(named_struct('postal_code', df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')' (struct<postal_code:string,col2:string> and string).; line 1 pos 343;
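For what it's worth, the type mismatch is visible in the error itself: writing postal_code rlike(postal_code, '...') makes Spark parse the parenthesised pair as a struct, which is then compared to the string 'true'. rlike is a predicate that already returns a boolean, so no == 'true' comparison is needed, and backslashes inside Spark SQL string literals are escape characters, so they need to be doubled (the error shows \D collapsing to D). A condensed sketch of the query with those two changes, not a verified drop-in replacement for the original CASE logic:
# Condensed sketch: rlike used as an infix predicate (it already yields a boolean),
# regex backslashes doubled so the SQL string literal keeps them.
df_email_cleaned = spark.sql(r"""
    SELECT *,
           CASE
             WHEN postal_code LIKE '%-%'
                  AND postal_code RLIKE '(?:^|\\D)(\\d{4})(?!\\d)'
               THEN split(postal_code, '-')[0]
             WHEN postal_code RLIKE '(?:^|\\D)(\\d{4})(?!\\d)'
                  AND length(postal_code) > 2 AND length(postal_code) < 6
               THEN postal_code
             ELSE ''
           END AS sd_ps_clean
    FROM df_email_cleaned_sql
""")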

How to add hours as variable to timestamp in Pyspark

Dataframe schema is like this:
["id", "t_create", "hours"]
string, timestamp, int
Sample data is like:
["abc", "2022-07-01 12:23:21.343998", 5]
I want to add hours to the t_create and get a new column t_update: "2022-07-01 17:23:21.343998"
Here is my code:
df_cols = ["id", "t_create", "hour"]
df = spark.read.format("delta").load("blablah path")
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL 5 HOURS"))
It works with no problem. However, the hours column should be a variable. I could not figure out how to put the variable into the expr, the f-string and the INTERVAL function; I tried things like:
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {df.hours} HOURS"))
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {col(df.hours)} HOURS"))
etc... They don't work. Need help here.
Another way is to write a udf and wrap the whole expr string in the udf return value:
@udf
def udf_interval(hours):
    return "INTERVAL " + str(hours) + " HOURS"
Then:
df = df.withColumn("t_update", df.t_create + expr(udf_interval(df.hours)))
Now I get TypeError: Column is not iterable.
Stuck. Need help in either the udf or non-udf way. Thanks!
You can do this without the fiddly unix_timestamp route by utilising make_interval within Spark SQL.
SparkSQL - TO_TIMESTAMP & MAKE_INTERVAL
sql.sql("""
WITH INP AS (
SELECT
"abc" as id,
TO_TIMESTAMP("2022-07-01 12:23:21.343998","yyyy-MM-dd HH:mm:ss.SSSSSS") as t_create,
5 as t_hour
)
SELECT
id,
t_create,
t_hour,
t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0) HOURS as t_update
FROM INP
""").show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
Pyspark API
s = StringIO("""
id,t_create,t_hour
abc,2022-07-01 12:23:21.343998,5
"""
)
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)\
.withColumn('t_create'
,F.to_timestamp(F.col('t_create')
,'yyyy-MM-dd HH:mm:ss.SSSSSS'
)
).withColumn('t_update'
,F.expr('t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0) HOURS')
).show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
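Applied to the question's original DataFrame, which already holds a t_create timestamp and an integer hours column, the same idea collapses to a single expression. A minimal sketch, assuming Spark 3.0+ where make_interval is available:
from pyspark.sql import functions as F

# Sketch for the question's df: the hours column is referenced inside the
# SQL expression itself, so no Python-side variable or UDF is needed.
df = df.withColumn(
    "t_update",
    F.expr("t_create + make_interval(0, 0, 0, 0, hours, 0, 0)")
)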
A simple way would be to cast the timestamp to bigint (or decimal if dealing with fraction of second) and add the number of seconds to it. Here's an example where I've created columns for every calculation for detailed understanding - you can merge all the calculations into a single column.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([("2022-07-01 12:23:21.343998",)]).toDF(['ts_str']). \
    withColumn('ts', func.col('ts_str').cast('timestamp')). \
    withColumn('hours_to_add', func.lit(5)). \
    withColumn('ts_as_decimal', func.col('ts').cast('decimal(20, 10)')). \
    withColumn('seconds_to_add_as_decimal',
               func.col('hours_to_add').cast('decimal(20, 10)') * 3600
               ). \
    withColumn('new_ts_as_decimal',
               func.col('ts_as_decimal') + func.col('seconds_to_add_as_decimal')
               ). \
    withColumn('new_ts', func.col('new_ts_as_decimal').cast('timestamp')). \
    show(truncate=False)
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |ts_str |ts |hours_to_add|ts_as_decimal |seconds_to_add_as_decimal|new_ts_as_decimal |new_ts |
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |2022-07-01 12:23:21.343998|2022-07-01 12:23:21.343998|5 |1656678201.3439980000|18000.0000000000 |1656696201.3439980000|2022-07-01 17:23:21.343998|
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
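As noted, the intermediate columns can be merged into one; a minimal sketch of the collapsed form, assuming a DataFrame df that already has the ts and hours_to_add columns from the example above:
from pyspark.sql import functions as func

# Same arithmetic folded into a single expression:
# timestamp -> decimal seconds, add hours * 3600, cast back to timestamp.
df = df.withColumn(
    'new_ts',
    (func.col('ts').cast('decimal(20, 10)')
     + func.col('hours_to_add').cast('decimal(20, 10)') * 3600
     ).cast('timestamp')
)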

DataBricks/Spark: How to create and overwrite/append to table with periods in the name?

Some raw data that I want to capture in Delta tables has periods in the column names.
My strategy has been to create a table that uses backticks, like so:
CREATE TABLE TestMe (
testMeKey bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
id bigint,
rev bigint,
`System.WorkItemType` string,
sourceFile string
)
USING DELTA
OPTIONS (PATH "/mnt/TestMe")
Then, when I get new data, I do something like this:
spark.read.format("parquet") \
.load("/mnt/TestMeData/") \
.withColumn("sourceFile", input_file_name()) \
.write.option("mergeSchema", "true") \
.format("delta").mode("overwrite") \
.save("/mnt/TestMe")
The problem is, Spark throws name match errors for the columns with periods in them, saying:
Cannot resolve column name System.WorkItemType
I have tried to manually recode the dataframe column names, with something like this, but this also fails:
l1 = [col for col in df.columns]
l1 = [f'`{x}`' if ('.' in x) else x for x in l1]
# make the names match
df = df.toDF(*l1)
How can I create a table that is consistent with source data if it contains characters like periods? I would rather not alter the table and recode periods to underscores or something, but rather accept the data as-is.
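One detail worth checking first (a sketch of the usual pattern, not a verified fix for the Delta write itself): backticks are only quoting syntax, so they should not be baked into the column names via toDF, which creates names that literally contain backtick characters. A dotted name is normally kept as-is and only quoted where it is referenced, for example:
from pyspark.sql import functions as F

df = spark.read.format("parquet").load("/mnt/TestMeData/") \
    .withColumn("sourceFile", F.input_file_name())

# The stored column name stays "System.WorkItemType"; the backticks appear
# only at reference time, so Spark does not read the dot as a struct access.
df.select(F.col("`System.WorkItemType`")).show(5, truncate=False)
For the Delta table itself, allowing such names usually comes down to either renaming (which the question wants to avoid) or Delta's column mapping feature where the runtime supports it.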

spliced and unspliced sequence alignment using STAR

I am working with single cell sequencing data, and want to run this through RNA velocity (https://www.nature.com/articles/s41586-018-0414-6). For that, I need to map both spliced and unspliced reads. The dataset I am working with is a SMARTseq dataset, without UMIs or barcodes (https://www.embopress.org/doi/full/10.15252/embj.2018100164). Data are provided as a single sra file per cell, which can be fetched and unpacked into a single-ended fastq file through GEO using the SRAtoolkit.
What I aimed to do was map the data twice, using STAR alignment. First I wanted to map the reads to exonic regions, then again to intronic regions. To do this, I downloaded the fasta and gtf file from the ensembl reference genome GRCm38 build 100 (https://www.ensembl.org/Mus_musculus/Info/Index). The gtf file does not by itself contain intronic info, which I added in R using the following code:
library(gread) # https://rdrr.io/github/asrinivasan-oa/gread/
library(rtracklayer)
dir <- "/to/reference/genome"
gtf_file <- file.path(dir, "GRCm38build100.gtf")
gtf <- read_format(gtf_file)
gtf.new <- construct_introns(gtf, update = TRUE)[]
export(gtf.new, "GRCm38build100withIntrons.gtf", format = "gtf")
Then, I generated a STAR genome as follows:
STAR \
--runMode genomeGenerate \
--runThreadN 8 \
--sjdbOverhang 50 \
--genomeDir GRCm38build100/50bp/ \
--genomeFastaFiles GRCm38build100/GRCm38build100.fa \
--sjdbGTFfile GRCm38build100/GRCm38build100withIntrons.gtf
The code for mapping reads against exonic sequences was as follows:
STAR \
--runMode alignReads \
--runThreadN 8 \
--genomeDir GRCm38build100/50bp \
--readFilesIn $R1 \
--outSAMtype None \
--twopassMode Basic \
--sjdbGTFfeatureExon exon \
--quantMode GeneCounts \
--outFileNamePrefix output/${SAMPLE}_exon_
This worked just fine. For unspliced reads, I wanted to do the following:
STAR \
--runMode alignReads \
--runThreadN 8 \
--genomeDir GRCm38build100/50bp \
--readFilesIn $R1 \
--outSAMtype None \
--twopassMode Basic \
--sjdbGTFfeatureExon intron \
--quantMode GeneCounts \
--outFileNamePrefix output/${SAMPLE}_intron_
Strangely, this gave me the exact same result as mapping against the exonic regions. I'm confused; I'm sure I'm doing something wrong, but I cannot figure out what. Any help would be much appreciated!
Best regards,
Leon

PySpark first and last function over a partition in one go

I have PySpark code like this:
spark_df = spark_df.orderBy('id', 'a1', 'c1')
out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
    F.first('c1').alias('c1'),
    F.last('c2').alias('c2'),
    F.first('c3').alias('c3'))
I need to keep the data ordered by id, a1 and c1, and then select the columns as shown above over the group defined by the keys id, a1 and a2.
Because first and last are non-deterministic, I changed the code to the ugly-looking version below, which works, but I'm not sure it is efficient.
w_first = Window.partitionBy('id', 'a1', 'a2').orderBy('c1')
w_last = Window.partitionBy('id', 'a1', 'a2').orderBy(F.desc('c1'))

out_first = spark_df.withColumn('Rank_First', F.rank().over(w_first)) \
    .filter(F.col('Rank_First') == 1).drop('Rank_First')
out_last = spark_df.withColumn('Rank_Last', F.rank().over(w_last)) \
    .filter(F.col('Rank_Last') == 1).drop('Rank_Last')

out_first = out_first.withColumnRenamed('c1', 'First_c1') \
    .withColumnRenamed('c2', 'First_c2') \
    .withColumnRenamed('c3', 'First_c3')
out_last = out_last.withColumnRenamed('c1', 'Last_c1') \
    .withColumnRenamed('c2', 'Last_c2') \
    .withColumnRenamed('c3', 'Last_c3')

out_df = out_first.join(out_last, ['id', 'a1', 'a2']) \
    .select('id', 'a1', 'a2',
            F.col('First_c1').alias('c1'),
            F.col('Last_c2').alias('c2'),
            F.col('First_c3').alias('c3'))
I was trying for a better, more efficient alternative; I run into performance bottlenecks when the data size is huge.
Is there a better way to do first and last over a window ordered in a specific order, in one go?
When using orderBy with a Window you need to specify the frame boundaries as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, otherwise the last function will only get the last value between UNBOUNDED PRECEDING and CURRENT ROW (the default frame bounds when an ORDER BY is specified).
Try this:
from pyspark.sql import Window
from pyspark.sql.functions import first, last

w = Window.partitionBy('id', 'a1', 'a2').orderBy('c1') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = df.withColumn("First_c1", first("c1").over(w)) \
    .withColumn("First_c3", first("c3").over(w)) \
    .withColumn("Last_c2", last("c2").over(w))

df.groupby("id", "a1", "a2") \
    .agg(first("First_c1").alias("c1"),
         first("Last_c2").alias("c2"),
         first("First_c3").alias("c3")
         ).show()
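If the Spark version allows it, the same result can also be computed in a single aggregation without a window, using the SQL min_by/max_by functions. A sketch, assuming Spark 3.0+ (where these exist as SQL functions) and that ties on c1 are not a concern:
from pyspark.sql import functions as F

# "first ordered by c1" = value at the minimum c1; "last ordered by c1" = value at the maximum c1.
out_df = spark_df.groupBy('id', 'a1', 'a2').agg(
    F.expr("min_by(c1, c1)").alias('c1'),
    F.expr("max_by(c2, c1)").alias('c2'),
    F.expr("min_by(c3, c1)").alias('c3'),
)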

Pandas groupby: aggregate only on partial records

I have the following data frame:
id src target duration
001 A C 4
001 B C 3
001 C C 2
002 B D 5
002 C D 2
and I used the following code to do some aggregations, which works fine.
df_new = df.groupby(['id', 'target']) \
    .apply(lambda x: pd.Series({'min_duration': min(x['duration']),
                                'total_duration': sum(x['duration']),
                                'all_src': list(x['src'])
                                })).reset_index()
Now I want to compute the sum only for src != target records. I modified my code like below:
df_new = df.groupby(['id','target']) \
.apply(lambda x: pd.Series({'min_duration': min(x['duration']), \
'total_duration':sum(x['duration']), \
'total_duration_condition':sum(x['duration']) if x['src'] != x['target'], \
'all_src':list(x['src'])
})).reset_index()
But then I got an invalid syntax error on my new line:
'total_duration_condition':sum(x['duration']) if x['src'] != x['target']
What is the proper way to do the sum for only part of the records? Thanks!
Try writing your code like below:
df.groupby(['id', 'target']).apply(
    lambda x: pd.Series({'min_duration': min(x['duration']),
                         'total_duration': sum(x['duration']),
                         # changed part: restrict to rows where src != target before summing
                         'total_duration_condition': sum(x['duration'][x['src'] != x['target']]),
                         'all_src': list(x['src'])
                         })).reset_index()
That is, change the line
'total_duration_condition': sum(x['duration']) if x['src'] != x['target']
to
'total_duration_condition': sum(x['duration'][x['src'] != x['target']])
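An equivalent way to get the conditional total is to aggregate the filtered rows separately and join the result back. A sketch of that alternative, assuming pandas 0.25+ for named aggregation:
import pandas as pd

# Unconditional aggregations per (id, target).
base = df.groupby(['id', 'target']).agg(
    min_duration=('duration', 'min'),
    total_duration=('duration', 'sum'),
    all_src=('src', list),
).reset_index()

# Conditional sum over rows where src != target, merged back;
# groups with no such rows end up as NaN.
cond = (df[df['src'] != df['target']]
        .groupby(['id', 'target'])['duration'].sum()
        .rename('total_duration_condition')
        .reset_index())

df_new = base.merge(cond, on=['id', 'target'], how='left')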
