I have the following data frame:
id src target duration
001 A C 4
001 B C 3
001 C C 2
002 B D 5
002 C D 2
and I used the following code to do some aggregations, which works fine.
df_new = df.groupby(['id','target']) \
.apply(lambda x: pd.Series({'min_duration': min(x['duration']), \
'total_duration':sum(x['duration']), \
'all_src':list(x['src'])
})).reset_index()
Now I want to compute the sum only for src != target records. I modified my code like below:
df_new = df.groupby(['id','target']) \
.apply(lambda x: pd.Series({'min_duration': min(x['duration']), \
'total_duration':sum(x['duration']), \
'total_duration_condition':sum(x['duration']) if x['src'] != x['target'], \
'all_src':list(x['src'])
})).reset_index()
But then I got an invalid syntax error on my new line:
'total_duration_condition':sum(x['duration']) if x['src'] != x['target']
I am wondering what the proper way is to do the sum for only part of the records? Thanks!
Try writing your code like below:
df.groupby(['id','target']).apply(lambda x: pd.Series({'min_duration': min(x['duration']), \
'total_duration':sum(x['duration']), \
'total_duration_condition':sum(x['duration'][x['src'] != x['target']]),  # I changed this part
'all_src':list(x['src'])
})).reset_index()
That is, change the line
'total_duration_condition':sum(x['duration']) if x['src'] != x['target']
to
sum(x['duration'][x['src'] != x['target']])
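For what it's worth, the same conditional sum can also be written with .loc inside the lambda, which some find easier to read. A minimal sketch, reusing the df from the question:
import pandas as pd

df_new = df.groupby(['id', 'target']) \
    .apply(lambda x: pd.Series({
        'min_duration': x['duration'].min(),
        'total_duration': x['duration'].sum(),
        # sum only the rows where src differs from target
        'total_duration_condition': x.loc[x['src'] != x['target'], 'duration'].sum(),
        'all_src': list(x['src'])
    })).reset_index()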
This is my code:
x = list(coll.find({"activities.flowCenterInfo": {
'$exists': True
}},{'activities.activityId':1,'activities.flowCenterInfo':1,'_id':0}).limit(5))
for row in x:
print(row)
This is the result of x for one sample:
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}
I want to convert it to a DataFrame so that I can write it to an Oracle table. How can I convert it to a DataFrame properly? I can't find a way.
This image shows the MongoDB structure of one sample.
Assuming that the activities key contains a list with a single dict, each field within the flowCenterInfo key is prefixed with fcinfo_:
import pandas as pd

# sample list
l = [{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}]
df = pd.DataFrame.from_records([dict(**{'activityId': r['activities'][0]['activityId']}, \
**dict(zip(map('fcinfo_{}'.format, r['activities'][0]['flowCenterInfo'].keys()), \
r['activities'][0]['flowCenterInfo'].values()))) for r in l])
print(df)
activityId fcinfo_processId ... fcinfo_demandComplaintDetailSubject fcinfo_demandComplaintId
0 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
1 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
2 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
[3 rows x 5 columns]
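As an aside, if the documents get more varied, pd.json_normalize can do the flattening for you; nested keys come out as dotted column names (e.g. flowCenterInfo.processId), which you can then rename to the fcinfo_ prefix. A minimal sketch using the same sample list l as above:
import pandas as pd

# each element of the 'activities' list becomes a row; nested dicts are
# flattened into dotted column names such as flowCenterInfo.processId
df2 = pd.json_normalize(l, record_path='activities')

# optionally rename the dotted columns to the fcinfo_ prefix used above
df2.columns = [c.replace('flowCenterInfo.', 'fcinfo_') for c in df2.columns]
print(df2)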
I am trying to write a regex which matches 4 consecutive digits, checking it as rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true', but I am getting the below error.
My Spark SQL is:
df_email_cleaned=spark.sql("select *,case when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[0] \
when postal_code like '%-%' and rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' then split(postal_code, '-')[1] \
when postal_code rlike(postal_code,'(?:^|\D)(\d{4})(?!\d)')=='true' and length(postal_code)>2 and length(postal_code)<6 then postal_code \
else '' \
end sd_ps_clean \
from df_email_cleaned_sql ")
I am getting the below error:
pyspark.sql.utils.AnalysisException: cannot resolve '(named_struct('postal_code',
df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')'
due to data type mismatch: differing types in '(named_struct('postal_code', df_email_cleaned_sql.`postal_code`, 'col2', '(?:^|D)(d{4})(?!d)') = 'true')' (struct<postal_code:string,col2:string> and string).; line 1 pos 343;
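The error comes from how rlike is invoked: in Spark SQL it is normally used as an infix operator that already yields a boolean, so calling it like a function after a column name and comparing the result to the string 'true' is what produces the struct-vs-string mismatch. Below is a condensed sketch of a couple of the branches rewritten with the operator form; the doubled backslashes are needed because Spark SQL string literals strip single backslashes by default, which is also why the pattern in the error message appears without them:
df_email_cleaned = spark.sql(r"""
    select *,
           case
             when postal_code like '%-%'
                  and postal_code rlike '(?:^|\\D)(\\d{4})(?!\\d)'
               then split(postal_code, '-')[0]
             when postal_code rlike '(?:^|\\D)(\\d{4})(?!\\d)'
                  and length(postal_code) > 2 and length(postal_code) < 6
               then postal_code
             else ''
           end as sd_ps_clean
    from df_email_cleaned_sql
""")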
The DataFrame schema is like this:
["id", "t_create", "hours"]
string, timestamp, int
Sample data is like:
["abc", "2022-07-01 12:23:21.343998", 5]
I want to add hours to the t_create and get a new column t_update: "2022-07-01 17:23:21.343998"
Here is my code:
df_cols = ["id", "t_create", "hour"]
df = spark.read.format("delta").load("blablah path")
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL 5 HOURS"))
It works with no problem. However, the hours should come from the hours column rather than being a hard-coded value. I could not figure out how to put that variable into the expr, the f-string and the INTERVAL function; something like:
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {df.hours} HOURS"))
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {col(df.hours)} HOURS"))
etc... They don't work. Need help here.
Another way is to write a udf and wrap the whole expr string to the udf return value:
@udf
def udf_interval(hours):
return "INTERVAL " + str(hours) + " HOURS"
Then:
df = df.withColumn("t_update", df.t_create + expr(udf_interval(df.hours)))
Now I get TypeError: Column is not iterable.
Stuck. Need help in either the udf or non-udf way. Thanks!
You can do this without using the fiddly unix_timestamp, and instead utilise make_interval within Spark SQL.
Spark SQL - TO_TIMESTAMP & MAKE_INTERVAL
spark.sql("""
WITH INP AS (
SELECT
"abc" as id,
TO_TIMESTAMP("2022-07-01 12:23:21.343998","yyyy-MM-dd HH:mm:ss.SSSSSS") as t_create,
5 as t_hour
)
SELECT
id,
t_create,
t_hour,
t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0) as t_update
FROM INP
""").show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
PySpark API
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
id,t_create,t_hour
abc,2022-07-01 12:23:21.343998,5
"""
)
df = pd.read_csv(s,delimiter=',')
sparkDF = spark.createDataFrame(df)\
.withColumn('t_create'
,F.to_timestamp(F.col('t_create')
,'yyyy-MM-dd HH:mm:ss.SSSSSS'
)
).withColumn('t_update'
,F.expr('t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0)')
).show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
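Applied back to the question's DataFrame, where the offset lives in the hours column, the same make_interval expression can go straight into withColumn. A sketch assuming df is loaded as in the question and Spark 3.0+, where make_interval is available:
from pyspark.sql import functions as F

# build the interval from the per-row 'hours' column instead of a literal
df = df.withColumn(
    "t_update",
    F.expr("t_create + make_interval(0, 0, 0, 0, hours, 0, 0)")
)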
A simple way would be to cast the timestamp to bigint (or decimal if dealing with fractions of a second) and add the number of seconds to it. Here's an example where I've created a column for every calculation for detailed understanding - you can merge all the calculations into a single column.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([("2022-07-01 12:23:21.343998",)]).toDF(['ts_str']). \
withColumn('ts', func.col('ts_str').cast('timestamp')). \
withColumn('hours_to_add', func.lit(5)). \
withColumn('ts_as_decimal', func.col('ts').cast('decimal(20, 10)')). \
withColumn('seconds_to_add_as_decimal',
func.col('hours_to_add').cast('decimal(20, 10)') * 3600
). \
withColumn('new_ts_as_decimal',
func.col('ts_as_decimal') + func.col('seconds_to_add_as_decimal')
). \
withColumn('new_ts', func.col('new_ts_as_decimal').cast('timestamp')). \
show(truncate=False)
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |ts_str |ts |hours_to_add|ts_as_decimal |seconds_to_add_as_decimal|new_ts_as_decimal |new_ts |
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |2022-07-01 12:23:21.343998|2022-07-01 12:23:21.343998|5 |1656678201.3439980000|18000.0000000000 |1656696201.3439980000|2022-07-01 17:23:21.343998|
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
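Merged into a single column, as mentioned above, the calculation could look something like this (a sketch assuming the DataFrame with the ts and hours_to_add columns is held in a variable named df, using the same func alias as above):
# single expression: shift the timestamp by hours_to_add * 3600 seconds
df = df.withColumn(
    'new_ts',
    (func.col('ts').cast('decimal(20, 10)')
     + func.col('hours_to_add').cast('decimal(20, 10)') * 3600
     ).cast('timestamp')
)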
Some raw data that I want to capture in delta tables have periods in the column names.
My strategy had been to create a table that uses the backticks like so:
CREATE TABLE TestMe (
testMeKey bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
id bigint,
rev bigint,
`System.WorkItemType` string,
sourceFile string
)
USING DELTA
OPTIONS (PATH "/mnt/TestMe")
Then when I get new data, to do something like this:
spark.read.format("parquet") \
.load("/mnt/TestMeData/") \
.withColumn("sourceFile", input_file_name()) \
.write.option("mergeSchema", "true") \
.format("delta").mode("overwrite") \
.save("/mnt/TestMe")
The problem is, Spark throws name match errors for the columns with periods in them, saying:
Cannot resolve column name System.WorkItemType
I have tried to manually recode the dataframe column names, with something like this, but this also fails:
l1 = [col for col in df.columns]
l1 = [f'`{x}`' if ('.' in x) else x for x in l1]
# make the names match
df = df.toDF(*l1)
How can I create a table that is consistent with source data if it contains characters like periods? I would rather not alter the table and recode periods to underscores or something, but rather accept the data as-is.
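One observation that may help (a partial note rather than a full answer to the Delta write issue): backticks are part of how a dotted column is referenced, not part of the stored name, so wrapping the names in backticks via toDF bakes literal backtick characters into the schema. Referencing the column usually looks like this, assuming df is the DataFrame read from /mnt/TestMeData/:
from pyspark.sql import functions as F

# backticks belong in the reference, not in the column name itself
df.select(F.col("`System.WorkItemType`")).show()

# if renaming turns out to be acceptable after all, the source name is given without backticks
df = df.withColumnRenamed("System.WorkItemType", "System_WorkItemType")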
Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
file name
/level1/level2/level3/file1.ext
/level1/file1000.ext
/level1/level2/file20.ext
Output:
file name     path
file1.ext     /level1/level2/level3/
file1000.ext  /level1/
file20.ext    /level1/level2/
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
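For example, with a small test DataFrame holding the paths in a column named raw (the column name the snippet below assumes):
df = spark.createDataFrame(
    [('/level1/level2/level3/file1.ext',),
     ('/level1/file1000.ext',),
     ('/level1/level2/file20.ext',)],
    ['raw']
)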
from pyspark.sql.functions import split, element_at, regexp_extract
df \
.withColumn("file_name", element_at(split("raw", "/"), -1) ) \
.withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
.withColumn("path", regexp_extract("raw", "^.*/", 0)) \
.show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw |file_name |file_name2 |path |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext |file1.ext |/level1/level2/level3/|
|/level1/file1000.ext |file1000.ext|file1000.ext|/level1/ |
|/level1/level2/file20.ext |file20.ext |file20.ext |/level1/level2/ |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df=spark.createDataFrame(
[('/level1/level2/level3/file1.ext',),
('/level1/file1000.ext',),
('/level1/level2/file20.ext',)],
['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name |file |path |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext |/level1/level2/level3/|
# |/level1/file1000.ext |file1000.ext|/level1/ |
# |/level1/level2/file20.ext |file20.ext |/level1/level2/ |
# +-------------------------------+------------+----------------------+